Politecnico di Torino Facoltà di Ingegneria dell’Informazione
Corso di Laurea in Ingegneria Elettronica
An Architecture
for Unified Packet Filtering
Advisors: Prof. Patricia Lago, Prof. Fulvio Risso
Candidate: Gianluca Varenni
November 2001
Table of contents
1 INTRODUCTION
1.1 THE CONCEPT OF PACKET FILTERING
1.2 APPLICATIONS THAT EXPLOIT PACKET FILTERING
1.2.1 Capture systems
1.2.2 Monitoring systems and statistical engines
1.2.3 Firewalls
1.2.4 Intrusion Detection Systems (IDS)
1.3 A COMMON FEATURE: TABLE MANAGEMENT
1.4 THE WINPCAP CAPTURE SYSTEM
1.5 THE TABLE MANAGEMENT ENGINE (TME)
1.5.1 Interaction with the user
1.6 INTERACTION BETWEEN THE TME AND THE LIBPCAP LIBRARY
1.7 IMPLEMENTATION
1.8 PERFORMANCE
1.9 FUTURE WORK
2 INTRODUCTION
2.1 GOALS OF THIS WORK
2.2 CHAPTERS' ORGANIZATION
3 MONITORING NETWORKS
3.1 GENERAL FEATURES
3.2 TYPES OF MONITORING
3.2.1 Single aspects of traffic
3.2.2 Traffic matrices
3.2.3 Multiple analyses
3.2.4 Logging capabilities
3.2.5 Flow analyses
3.3 REALIZATION APPROACHES
3.3.1 N-dimensional matrices
3.3.2 Multiple analyses
3.3.3 General logging capabilities
3.3.4 Analyses on flows
3.4 CONCLUSIONS
4 FIREWALLS
4.1 INTRODUCTION
4.2 PACKET FILTERING APPROACH
4.3 STATEFUL INSPECTION APPROACH
4.4 ACCESS CONTROL LISTS
4.5 REQUIREMENTS
4.6 NETWORK INTRUSION DETECTION SYSTEMS
4.7 NETWORK ADDRESS TRANSLATORS
4.8 ACLS AND ROUTERS
5 THE ORIGINAL BPF ARCHITECTURE AND WINPCAP
5.1 COMPONENTS OF THE BPF ARCHITECTURE
5.1.1 Network tap and BPF virtual machine
5.1.2 Kernel packet buffer
5.1.3 LIBPCAP library
5.2 INTERACTION BETWEEN LIBPCAP AND THE BPF VIRTUAL MACHINE
5.3 IMPLEMENTATION UNDER MICROSOFT OPERATING SYSTEMS: WINPCAP
5.3.1 Differences from the original BPF architecture
6 THE TABLE MANAGEMENT ENGINE (TME)
6.1 LIMITS OF THE ORIGINAL BPF ARCHITECTURE
6.1.1 Small and limited memory
6.1.2 Impossibility to modify the packet
6.1.3 Simple instruction set
6.2 EXTENSIONS TO THE ORIGINAL BPF ARCHITECTURE
6.3 EXTENSION REQUIREMENTS
6.3.1 Proposed extensions
6.4 EXTENDED MEMORY
6.5 TME
6.5.1 A general overview of the LOOKUP operation
6.5.2 A general overview of the EXECUTE instruction
6.5.3 Lookup tables
6.5.4 LOOKUP functions
6.5.5 EXECUTE functions
6.5.6 TME data blocks
6.5.7 What happens when a LOOKUP is performed
6.5.8 What happens when an EXECUTE is performed
6.6 INTERACTION OF THE EXTENSIONS WITH THE ORIGINAL BPF ARCHITECTURE
6.6.1 Interaction with the extended memory
6.6.2 Interaction with the TME
6.7 INTERACTION WITH APPLICATIONS
6.8 INITIALIZATION OF THE TME
6.9 THIRD-PARTY EXTENSIONS
6.10 KNOWN LIMITATIONS
7 TME – THE IMPLEMENTATION
7.1 THE MODE_MON WORKING MODE
7.2 DATA STRUCTURES
7.2.1 TME_CORE
7.2.2 TME_DATA
7.2.3 RECORD
7.2.4 MEM_TYPE
7.3 NEW BPF INSTRUCTIONS
7.3.1 INIT
7.3.2 VALIDATE
7.3.3 LOOKUP
7.3.4 EXECUTE
7.4 LOOKUP CALLBACKS
7.4.1 Two examples: default_lut_w_insert and default_lut_wo_insert
7.4.2 Some tips for writing new LOOKUP callbacks
7.5 EXECUTE CALLBACKS
7.5.1 An example: count_packets
7.5.2 Some tips for writing new EXECUTE callbacks
7.6 INTERACTION WITH THE USER
7.6.1 Passing data to the user
7.6.2 Filter injection
7.7 IMPLEMENTATION ISSUES
8 INTERACTING WITH THE TME
8.1 THE SWITCH STATEMENT
8.2 INTERPRETING DATA RECEIVED BY THE TME COPROCESSOR
8.3 THE PCAP_KEYPARSER() FUNCTION
8.4 THE KEY PARSING PROCESS
9 A CASE STUDY
9.1 A SYSTEM TO MONITOR THE TRAFFIC SENT BY EACH HOST IN A NETWORK
9.2 A SYSTEM TO MONITOR WEB, FTP, SMTP AND TELNET TRAFFIC
9.3 A SYSTEM TO MONITOR IP, TCP AND UDP TRAFFIC
9.4 PERFORMANCE
9.4.1 The monitor system under test
9.4.2 Test bed
9.4.3 Test 1
9.4.4 Test 2
9.4.5 Test 3
9.4.6 Test 4
10 FUTURE WORK
11 RELATED WORK
11.1 THE STATISTICAL MODE OF WINPCAP DRIVERS
11.2 NETRAMET
11.3 NTOP
12 TME EXTENSION PROGRAMMING MANUAL
12.1 EXTENDED MEMORY MANAGEMENT INSTRUCTIONS
12.1.1 SETMEMORY
12.1.2 EX_LDx
12.1.3 EX_LDx_REL
12.1.4 EX_LDXx
12.1.5 EX_STx
12.1.6 EX_STx_REL
12.1.7 EX_STXx
12.2 TME OPERATIVE INSTRUCTIONS
12.2.1 LOOKUP
12.2.2 EXECUTE
12.2.3 SET_ACTIVE
12.3 TME INITIALIZATION INSTRUCTIONS
12.3.1 SET_WORKING
12.3.2 INIT
12.3.3 VALIDATE
12.3.4 GET_REGISTER_VALUE
12.3.5 SET_REGISTER_VALUE
12.3.6 SET_ACTIVE_READ
13 BIBLIOGRAPHY
1 Introduction

In recent years computer networks, and the Internet in particular, have spread worldwide; as a consequence, the traffic carried by computer networks has grown dramatically, and with it the demand for systems to build and maintain those networks, such as packet capture tools, statistical engines, firewalls and intrusion detection systems.

The market offers two kinds of solutions: systems based on dedicated hardware platforms, and software systems, i.e. tools that run on ordinary PCs or workstations and use their standard network connection to carry out their tasks. Hardware solutions generally deliver better performance, but they are usually limited in expandability and upgradability, and their cost is often too high for the actual needs of users. Software solutions, although they generally do not reach the performance of hardware ones, have great advantages, among them easy development, low cost, and very often an extreme versatility in adapting to the various network types on the market.

Most systems for building and analyzing computer networks share a common set of functions that goes under the name of packet filtering.

The goals of this thesis are:
• to identify the functions shared by network devices such as capture systems, monitoring and statistical tools, firewalls and intrusion detection systems;
• to propose a software architecture, integrated into an existing packet capture system – WinPcap – that allows the development of new network tools through an interface common to the various kinds of applications;
• to implement this architecture inside the packet capture driver at the core of the WinPcap architecture.
1.1 The concept of packet filtering

Packet filtering is the process by which the data traveling on a computer network, grouped into packets (i.e. blocks), are captured and analyzed, generally by checking the value of particular fields in the headers, the initial part of each packet; the headers carry the important information about the packet's content, for example who sent it and where it is going. The process is called filtering because the packets that do not pass the checks on the headers are discarded, while the others undergo further processing that is specific to each particular tool.
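The header checks just described can be sketched as a small C routine. This is a minimal sketch, not part of the architecture: the offsets assume Ethernet II framing and an IPv4 packet, and the function name and accepted address are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Return 1 (accept) if the frame carries IPv4 and its source address
 * matches `wanted`; return 0 (discard) otherwise. Offsets assume
 * Ethernet II framing: EtherType at byte 12, IP source at byte 26. */
static int filter_ipv4_src(const uint8_t *pkt, size_t len, uint32_t wanted)
{
    if (len < 30)                        /* too short to hold the fields */
        return 0;
    uint16_t ethertype = (uint16_t)(pkt[12] << 8 | pkt[13]);
    if (ethertype != 0x0800)             /* not an IPv4 packet */
        return 0;
    uint32_t src = (uint32_t)pkt[26] << 24 | (uint32_t)pkt[27] << 16 |
                   (uint32_t)pkt[28] << 8  | (uint32_t)pkt[29];
    return src == wanted;
}
```

A filter of this shape (check EtherType, then check an IP address) is exactly what the BPF programs shown later in this chapter express in virtual machine code.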
1.2 Applications that exploit packet filtering

1.2.1 Capture systems

These are usually the simplest systems: tools able to capture the traffic flowing on the network and hand it to the user for further processing or display. Software capture solutions usually consist of a low-level module, inserted into the operating system kernel, that sits close to the network adapter and obtains the packets from it, plus a module that acts as the interface between the first module and the applications.

To obtain good performance, the number of packets transferred to user level is usually kept as low as possible: packet filtering is used here to skim the packets crossing the network and pass to the user-level application only the packets of real interest. This approach works because an application rarely needs every packet for its processing.
1.2.2 Monitoring systems and statistical engines

These tools provide information about network traffic from several points of view, normally in terms of bytes and packets per second. We can identify four broad categories of statistics:
• simple statistics: those involving a single aspect of the network traffic, for example the number of IP packets or the bandwidth used by the Web;
• traffic matrices: for instance a matrix of the traffic exchanged among the hosts of a LAN, or the distribution of the protocols carried over TCP (e.g. Web, Telnet and FTP);
• multiple analyses: for example obtaining the total traffic, the IP traffic, and in particular the UDP traffic within the same chart;
• analyses on flows (e.g. TCP sessions).

It is fairly clear that, apart from the first, all these kinds of analysis need tables to represent the data, in which each cell represents a particular aspect of the traffic:
• for traffic matrices and multiple analyses, each cell of the table represents a single aspect of the traffic, as shown in Figure 1;
• for flow analyses, each cell represents a different flow, for example a particular TCP session.
Web (80):    packets=134  bytes=296485
Mail (25):   packets=93   bytes=59642
Telnet (23): packets=24   bytes=19215
FTP (21):    packets=16   bytes=4898

Figure 1 - A simple table for TCP traffic
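As a hedged sketch of what lies behind a table like Figure 1, each cell can be modeled as a record pairing a key (the TCP port) with its counters, updated by a classification routine called once per packet. All names here are illustrative, not taken from the actual tools.

```c
#include <stdint.h>

/* One table cell: a traffic aspect identified by its TCP port (the key)
 * plus the per-cell data (packet and byte counters). */
struct cell {
    uint16_t port;      /* key: TCP server port      */
    uint64_t packets;   /* data: packets seen so far */
    uint64_t bytes;     /* data: bytes seen so far   */
};

/* The four services of Figure 1, with counters starting at zero. */
static struct cell table[] = {
    {80, 0, 0}, {25, 0, 0}, {23, 0, 0}, {21, 0, 0},
};

/* Classify one packet: find the cell whose key matches the port and
 * update its counters. Returns 1 if a cell matched, 0 otherwise. */
static int account(uint16_t port, uint64_t len)
{
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (table[i].port == port) {
            table[i].packets += 1;
            table[i].bytes   += len;
            return 1;
        }
    }
    return 0;   /* traffic we are not monitoring */
}
```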
Existing tools are usually built on top of a packet capture system, from which they receive all the network traffic, and then perform the statistical processing entirely at user level. This solution, although simpler to implement, usually does not perform satisfactorily, because transferring all the network packets to user level is always very expensive: all capture systems are so-called 2-copy architectures, i.e. packets are copied twice before reaching the user, and this is generally a heavy cost.
1.2.3 Firewalls

These are systems placed between two computer networks, acting as the only point of contact between the two LANs. They are designed to isolate and protect each network from the other; not by chance, the term firewall means a wall against fire.

In general there are two kinds of firewalls: packet filtering firewalls, which work on single packets and are usually transparent to applications, and application gateways, which work at a higher level, essentially at session level, and are not transparent to applications (i.e. the applications must be modified to work with these devices).

Packet filtering firewalls analyze every packet arriving on a network interface and decide, according to user-defined rules, whether the packet must be retransmitted on another network interface or discarded. The user-defined rules are essentially conditions on the content of the packet headers, so the process just described is nothing but packet filtering.

The user-defined rules form the so-called ACLs, Access Control Lists: a list of simple, more or less homogeneous filters, each followed by an action to take if the packet satisfies that filter. An example of a simple ACL is the following:

1. ip source 130.192.3.0/24 accept
2. ip dest 130.192.44.25 accept
3. ip source 130.192.2.44 and dest 130.192.11.43 accept
4. all deny

This ACL accepts the packets coming from particular source IP addresses or directed to particular destination addresses, and discards all the others.
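The first-match evaluation of such an ACL can be sketched in C. This is a minimal model, assuming rules encoded as address/mask pairs (a mask of zero meaning "any"); the structure and function names are illustrative, and the sample list encodes the four rules above.

```c
#include <stdint.h>

/* One ACL entry: optional conditions on IP source/destination (a mask
 * of zero means "match any") followed by the action on a match. */
struct acl_rule {
    uint32_t src, src_mask;   /* source condition      */
    uint32_t dst, dst_mask;   /* destination condition */
    int      accept;          /* 1 = accept, 0 = deny  */
};

/* The four-rule example ACL from the text, addresses as 32-bit values. */
static const struct acl_rule example_acl[] = {
    {0x82C00300u, 0xFFFFFF00u, 0,           0,           1}, /* 130.192.3.0/24       */
    {0,           0,           0x82C02C19u, 0xFFFFFFFFu, 1}, /* dest 130.192.44.25   */
    {0x82C0022Cu, 0xFFFFFFFFu, 0x82C00B2Bu, 0xFFFFFFFFu, 1}, /* src+dest pair        */
    {0,           0,           0,           0,           0}, /* all deny             */
};

/* Evaluate the ACL top-down: the first matching rule decides.
 * If no rule matches at all, deny by default. */
static int acl_check(const struct acl_rule *acl, int n,
                     uint32_t src, uint32_t dst)
{
    for (int i = 0; i < n; i++) {
        if ((src & acl[i].src_mask) == acl[i].src &&
            (dst & acl[i].dst_mask) == acl[i].dst)
            return acl[i].accept;
    }
    return 0;
}
```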
1.2.4 Intrusion Detection Systems (IDS)

These systems are halfway between statistical engines and firewalls: they are used to protect computer networks, in particular by reporting attempted attacks against the hosts, like firewalls do, but they are not able to discard packets, and from the network's point of view they look more like very advanced monitoring systems. They can usually spot attack attempts through so-called signatures, i.e. predefined byte sequences that appear in 'malicious' packets. For example, an IDS can report the presence of a virus in a file attached to an e-mail message by searching for the virus signatures inside the payloads of e-mail packets (i.e. the TCP packets on the ports of the POP, SMTP and IMAP services).

These systems maintain a table of 'dangerous' signatures and check every packet for one of these sequences through pattern matching, i.e. a mechanism able to report whether a particular string is present inside a buffer of bytes.

From the user's point of view, an IDS can report the presence of 'suspicious' traffic in various ways, for example by sending a WinPopup message to a Windows station, or by keeping a log file of these attack attempts.
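The pattern-matching primitive mentioned above can be sketched as a naive byte-by-byte scan; this is an illustrative minimum, not what production IDSs do (they use faster multi-pattern algorithms such as Aho-Corasick), but the contract is the same: report whether a signature occurs inside a payload buffer.

```c
#include <stddef.h>

/* Report whether `sig` (length m) occurs anywhere inside the payload
 * buffer `buf` (length n). Naive O(n*m) scan; an IDS applies this
 * check for each signature in its table. */
static int sig_present(const unsigned char *buf, size_t n,
                       const unsigned char *sig, size_t m)
{
    if (m == 0 || m > n)
        return 0;
    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        while (j < m && buf[i + j] == sig[j])
            j++;
        if (j == m)
            return 1;   /* signature found */
    }
    return 0;
}
```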
1.3 A common feature: table management

One of the features shared by the tools mentioned above (apart from pure capture systems) is table management: monitoring systems need matrices to store the statistics about network traffic; firewalls need a way to keep the ACLs in memory, and these can be represented as a table; intrusion detection systems need a database in which to store the various attack signatures. Another important point is that each tool performs a different action after having classified the packet in the table: statistical systems must increment the counters for that kind of packet (e.g. the byte counter for Web packets), firewalls retransmit the packet on another interface or discard it, and IDSs must report the presence of 'suspicious' packets to the user.
1.4 The WinPcap capture system

WinPcap is a packet capture architecture for 32-bit Windows operating systems developed at the NetGroup of the Politecnico di Torino. It is based on the BPF architecture – the Berkeley Packet Filter – and on the LIBPCAP library, both developed at the University of California, Berkeley and included in the BSD operating systems.

It is a system that allows selective packet capture, i.e. a packet filtering mechanism transfers only the useful packets to the user-level applications.

The WinPcap architecture consists of three fundamental parts: a kernel driver and two dynamic-link libraries (DLLs).

At the lowest level, the system consists of a kernel driver that takes care of filtering the packets and storing them in a kernel buffer; the filtering system works through a virtual processor, the BPF processor, which runs a program, written in assembler-like code, on every packet received; this program analyzes the packet through its headers and decides whether the packet is accepted, and therefore must be transferred to the application, or must be discarded. The program run by the BPF machine, the so-called filter, is defined by the user.

At user level, WinPcap provides a dynamic-link library (wpcap.dll) that exports a set of APIs used for the interaction with the kernel driver; in particular, there are functions to open and close capture instances, to program the BPF machine with the appropriate filter, and of course to obtain the packets captured by the driver.

The library also includes a very powerful compiler that can generate the filters for the BPF machine, which are written in machine language, starting from text strings describing the traffic of interest; Figure 2 shows an example of a filter in 'human-readable' language and the corresponding filter for the BPF virtual machine.
"ip and src host 10.10.10.10"

(000) ldh [12]
(001) jeq #0x800 jt 2 jf 5
(002) ld [26]
(003) jeq #0x0A0A0A0A jt 4 jf 5
(004) ret #1514
(005) ret #0

[Figure: pcap_compile translates the text string into the BPF program above; pcap_setfilter loads it into the BPF virtual processor, which produces the packet output.]

Figure 2 - An example of a filter.
The second user-level library is packet.dll: it is an interface between the driver and wpcap.dll, used to make the latter independent of the particular version of Windows.

Figure 3 shows the architecture as a whole.
[Figure: the WinPcap architecture. At kernel level, the network tap feeds the packet capture driver, which runs the filter and stores accepted packets in the kernel buffer; at user level, Packet.dll and Wpcap.dll sit between the driver (accessed through read/write and IOCTL calls) and the application.]

Figure 3 - The WinPcap architecture
The BPF virtual machine has:
• two 32-bit registers, A and X;
• a small 32-bit RAM of 16 words;
• an implicit program counter.

Its instruction set allows:
• loads (LOAD) from the packet and from the RAM;
• stores (STORE) into the RAM;
• conditional and unconditional jumps (JUMP), forward only;
• arithmetic operations on integers;
• return instructions (RET); these instructions stop the execution of the BPF filter, and the returned value indicates whether the packet is accepted or not.
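To make this execution model concrete, here is a hedged miniature of a BPF-style interpreter, together with the Figure 2 filter encoded for it. The opcode names and encoding are illustrative, not the real BPF ones; as in BPF, jump offsets are relative and only point forward, and RET returns the snapshot length to accept (0 = discard).

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative opcodes, not the real BPF encoding. */
enum { OP_LDH, OP_LD, OP_JEQ, OP_RET };

struct insn { int op; uint32_t k; int jt, jf; };

/* Run a filter over a packet: A is the accumulator; pc only moves
 * forward, so every program terminates at a RET. */
static uint32_t run_filter(const struct insn *prog,
                           const uint8_t *pkt, size_t len)
{
    uint32_t A = 0;
    for (size_t pc = 0; ; pc++) {
        const struct insn *i = &prog[pc];
        switch (i->op) {
        case OP_LDH:  /* load 16-bit word at offset k (network order) */
            if (i->k + 2 > len) return 0;
            A = (uint32_t)pkt[i->k] << 8 | pkt[i->k + 1];
            break;
        case OP_LD:   /* load 32-bit word at offset k (network order) */
            if (i->k + 4 > len) return 0;
            A = (uint32_t)pkt[i->k] << 24 | (uint32_t)pkt[i->k + 1] << 16 |
                (uint32_t)pkt[i->k + 2] << 8 | pkt[i->k + 3];
            break;
        case OP_JEQ:  /* conditional forward jump, relative offsets */
            pc += (A == i->k) ? (size_t)i->jt : (size_t)i->jf;
            break;
        case OP_RET:
            return i->k;
        }
    }
}

/* The Figure 2 filter ("ip and src host 10.10.10.10"), with the
 * absolute jump targets of the listing converted to relative offsets. */
static const struct insn example_prog[] = {
    {OP_LDH, 12,         0, 0},  /* (000) ldh [12]                  */
    {OP_JEQ, 0x800,      0, 3},  /* (001) jeq #0x800 jt 2 jf 5      */
    {OP_LD,  26,         0, 0},  /* (002) ld [26]                   */
    {OP_JEQ, 0x0A0A0A0A, 0, 1},  /* (003) jeq #0x0A0A0A0A jt 4 jf 5 */
    {OP_RET, 1514,       0, 0},  /* (004) ret #1514 (accept)        */
    {OP_RET, 0,          0, 0},  /* (005) ret #0    (discard)       */
};
```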
1.5 The Table Management Engine (TME)

The BPF architecture has a rather limited instruction set when complex operations are needed; for example, the lack of backward jumps puts serious constraints on the implementation of tables. Another limitation concerns the memory: it is very small, and moreover it is addressed in 32-bit words, so it is not suitable for storing tables.

As a consequence, the original architecture of the virtual machine has been extended with two blocks: the extended_memory and the TME.

The extended_memory is a memory of user-definable size (64 kilobytes by default, but it can be changed at run time), addressable in 8-bit words. To manage this memory, new LOAD and STORE instructions have been added; in particular, there are instructions working at 8, 16 and 32 bits; the 16- and 32-bit values are represented in network byte order, to preserve the order in which the bytes appear in the packets.
The Table Management Engine (TME) is a coprocessor of the original BPF processor; its main function is to manage multidimensional tables in a completely generic, user-programmable way.

The two extensions are shown graphically in Figure 4.
[Figure: the BPF core, with its registers and basic memory, connected to the packet, to the extended_memory, and through a control interface to the TME coprocessor (the TME core and its registers).]

Figure 4 - The extensions to the BPF virtual machine
The TME coprocessor uses the extended memory to store lookup tables; each cell of a table is identified by its key, i.e. an identifier that distinguishes it uniquely. Each cell is associated with a block of memory, of the same size for every cell, in which data characteristic of that cell can be stored: for example, a statistical system stores in this block the packet and byte counters for the traffic identified by that key.

For example, suppose we want to build a system to obtain the traffic of the Web, Telnet, FTP and Mail services. The table we want to build is shown in Figure 5.

In this case the key is the TCP port of the desired service, and each block associated with a cell contains the two counters packets and bytes. To handle this situation, the BPF core builds the key by extracting the TCP port from the packet, then invokes the coprocessor through the LOOKUP instruction.
Web (80):    packets=134  bytes=296485
Mail (25):   packets=93   bytes=59642
Telnet (23): packets=24   bytes=19215
FTP (21):    packets=16   bytes=4898

Figure 5 - A simple table
This instruction takes the key and searches the table, returning to the original BPF machine the index of the table cell whose key equals the one supplied to the LOOKUP.

At this stage the TME coprocessor does not increment the counters; that is done through the EXECUTE instruction: when this instruction is executed, the BPF core hands control to the TME, which increments the counters of the cell found during the last execution of the LOOKUP instruction.
The LOOKUP instruction: the way the TME handles the LOOKUP instruction is actually quite particular: the table is managed through callbacks, i.e. user-defined functions. These are simply functions written in ANSI C that are transparently inserted into the sources of the TME extension and compiled together with the kernel driver (WinPcap) that hosts the coprocessor. Predefined functions can be used, or new functions can be created and added to the architecture in a completely transparent way. The choice of the specific function the TME will use when the LOOKUP instruction is executed is made in a preliminary phase, before the system is put into operation. If no specific callback is chosen during the initialization phase, the architecture uses a predefined function that manages the table as an open-addressing hash.
During the initialization phase it is also possible to define other table-management parameters, such as the table size and the key length.
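The default open-addressing behavior just described can be sketched in ANSI C as follows. This is a minimal illustration assuming linear probing and a trivial hash: the names tme_entry and tme_lookup, the hash function and the probing scheme are invented for the example, not taken from the actual TME sources.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 32007u   /* an odd table size improves the hash distribution */
#define KEY_WORDS  2        /* key length: two 32-bit words (8 bytes) */

typedef struct {
    uint32_t key[KEY_WORDS];
    int      used;          /* 0 = void entry */
    uint64_t packets;       /* data block: default counters */
    uint64_t bytes;
} tme_entry;

/* Returns the index of the entry whose key matches, claiming the first
 * void slot met while probing if the key is new; -1 if the table is full. */
static int tme_lookup(tme_entry *table, const uint32_t *key)
{
    uint32_t h = (key[0] ^ key[1]) % TABLE_SIZE;   /* trivial hash, for illustration */
    for (uint32_t i = 0; i < TABLE_SIZE; i++) {
        uint32_t idx = (h + i) % TABLE_SIZE;       /* linear probing */
        if (!table[idx].used) {                    /* void entry: insert the key */
            memcpy(table[idx].key, key, sizeof(table[idx].key));
            table[idx].used = 1;
            return (int)idx;
        }
        if (memcmp(table[idx].key, key, sizeof(table[idx].key)) == 0)
            return (int)idx;                       /* keys match */
    }
    return -1;                                     /* table full */
}
```

A second lookup with the same key returns the same index, which is what allows the subsequent EXECUTE to act on the right cell.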
The EXECUTE instruction: like LOOKUP, this instruction is also handled through callbacks; in this way different actions can be associated with the table, depending on its specific use.
In practice, after invoking the LOOKUP instruction to search the table managed by the TME, the EXECUTE instruction is invoked to perform a specific operation on the cell found by the LOOKUP.
The simplest EXECUTE, which is the predefined one, uses the memory block associated with the key as two 64-bit counters representing the bytes and the packets, corresponding to that key, that have crossed the network.
As with the LOOKUP instruction, the callback to use is chosen during the initialization phase; otherwise the predefined function is used, which maintains the packet and byte counters (as in the example of Figure 5).
The process carried out by the LOOKUP and EXECUTE instructions is shown in Figure 6 and Figure 7.
Figure 6 – The LOOKUP instruction. (1) The BPF core copies the key from the packet into the extended_memory; (2) it issues the LOOKUP call; (3) the TME core searches the lookup table, comparing the key against each entry ('right' means the keys match, 'wrong' that they do not; VOID marks a void entry); (4) the TME core saves the address of the found entry in a register of the TME; (5) the TME core gives control back to the BPF core.
Figure 7 – The EXECUTE instruction. (1) The BPF core issues the EXECUTE call; (2) the TME core reads the lookup-table entry address from a register of the TME; (3) it executes the exec_fcn on the data present in that entry (in the example, Packets goes from 175 to 176 and Bytes from 55434 to 55758); (4) the TME core gives control back to the BPF core.
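The default EXECUTE action shown in Figure 7 reduces to a couple of counter updates. A minimal sketch in ANSI C follows; tme_counters, tme_execute_default and last_found are illustrative names standing in for the TME's internal state, not the actual driver API.

```c
#include <stdint.h>

typedef struct {
    uint64_t packets;   /* 64-bit packet counter */
    uint64_t bytes;     /* 64-bit byte counter   */
} tme_counters;

/* last_found mimics the TME register holding the index returned by the
 * most recent LOOKUP; a negative value means the LOOKUP did not succeed. */
static void tme_execute_default(tme_counters *table, int last_found,
                                uint32_t packet_len)
{
    if (last_found < 0)
        return;                        /* nothing to update */
    table[last_found].packets += 1;
    table[last_found].bytes   += packet_len;
}
```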
The choice of separating the table search, i.e. the task performed by the LOOKUP, from the action performed on the single cell, i.e. what happens when the EXECUTE is invoked, is deliberate: in this way the same table-management algorithm can be reused with different EXECUTE functions, and vice versa: if the table management changes (for example, if the hash algorithm is modified), the EXECUTE functions do not need to change.
The system can handle multidimensional tables by collapsing them into one-dimensional tables: in this case the labels identifying each dimension of the table are concatenated to form a single label. For example, a two-dimensional table representing the traffic between the hosts of a computer network can be collapsed into a one-dimensional table in which each cell has a key representing the ordered sequence of the source and destination addresses. This technique also makes it easier to handle sparse tables (i.e. tables with many empty cells), since only the cells actually in use need to be stored.
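The collapsing of a two-dimensional label into a single key amounts to a simple concatenation; a minimal illustration follows, where flat_key and make_key are invented names for the example.

```c
#include <stdint.h>

/* One 8-byte key obtained by concatenating the two dimension labels. */
typedef struct {
    uint32_t words[2];
} flat_key;

static flat_key make_key(uint32_t src_addr, uint32_t dst_addr)
{
    flat_key k;
    k.words[0] = src_addr;   /* label of the first dimension  */
    k.words[1] = dst_addr;   /* label of the second dimension */
    return k;
}
```

The resulting key is what the LOOKUP hashes, so only the (src, dst) pairs actually seen on the network occupy a cell.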
Another peculiarity of the architecture is that the TME can manage several tables at the same time; the only limit is the size of the extended_memory, since the tables are stored in this memory.
1.5.1 Interaction with the user
The interaction between the TME and the user can be divided into two phases: on one side the TME must be properly initialized and a suitable BPF filter must be used; on the other side the data contained in the table must be transferred to the user for subsequent post-processing.
As far as initialization is concerned, the TME is pre-programmed through a set of new BPF instructions that tell the TME which table it has to manage; this initialization code is passed to the BPF virtual machine in the same way as traditional BPF filters. A very simple example of initialization code is the following:
1. SET_MEMORY 1500000
2. INIT 0
3. LD 2
4. SET_REGISTER_VALUE TME_KEY_LEN
5. LD 8
6. VALIDATE 0
7. SET_ACTIVE_READ 0
8. RET 1
This initialization code creates a table with 32007¹ cells, managed with the default LOOKUP and EXECUTE functions; the key is 2 words of 32 bits (= 8 bytes) long.
The BPF filter, instead, must build the key and then invoke the LOOKUP and EXECUTE instructions. The key is built by extracting some fields from the packet and saving them in a particular memory location in the extended memory. An example is the following:
1. LD P[6]
2. EX_ST MEM_EX[0]
3. LDH P[10]
4. EX_STH MEM_EX[4]
5. LOOKUP
6. EXECUTE
7. RET 1
This filtering code uses the source MAC address as the key.
As far as the transfer of data to user level is concerned, the TME passes the whole table to the applications, without modifications; if several tables are used at the same time, during the initialization phase it is necessary to choose which one must be transferred to the user (it is not possible to transfer more than one table to the user).
1.6 Interaction between the TME and the LIBPCAP library
The LIBPCAP library, and the corresponding DLL in the Windows environment, wpcap.dll, includes a compiler that can generate BPF packet-capture filters starting from text strings, e.g. 'ip'; to generate BPF filters, and the corresponding initialization code, that exploit the TME extension, the compiler must be modified.
Since the TME architecture is generic, we limited ourselves to proposing an extension to the compiler that allows complex monitoring systems to be created.
In particular, the extended compiler can translate into BPF instructions a new construct, switch, which is used to declare a monitoring system.
1 The value is not random: it is an odd number, which guarantees better performance when the table is managed as a hash.
An example of the switch construct, declaring the statistical system presented in Figure 5, is the following:
switch(tcp src port or tcp dst port= 80, 21, 23, 25)
This construct, fed to the extended compiler, creates a table of 4 cells, in which each cell represents the traffic of the TCP ports of the Web, FTP, Telnet and SMTP services. Upon receiving this string, the extended compiler generates the BPF filter (which builds the key and invokes the LOOKUP and the EXECUTE) and the TME initialization code.
As far as the transfer of the table to user level is concerned, each cell is uniquely identified by its key, which is just a sequence of bytes; an API is therefore needed that can interpret this sequence of bytes, in practice translating it into a readable format: pcap_parsekey(). For this reason the extended compiler, besides generating BPF code, also provides a data structure that pcap_parsekey uses to translate the keys into a readable format.
Such a solution is necessary because the keys are generated directly by the BPF code produced by the compiler, which is consequently the only one that knows how to translate the keys back into a readable format.
A graphical description of the key-interpretation system is shown in Figure 8.
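Conceptually, the translation performed by pcap_parsekey() can be sketched like this. The actual layout of struct bpf_keyparser is not shown in this chapter, so the descriptor below (key_descriptor, carrying a printf-style format) is purely an assumption for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative stand-in for the structure emitted by the extended compiler. */
struct key_descriptor {
    const char *format;   /* e.g. "tcp src port or tcp dst port %u" */
};

/* Turn the raw key bytes into readable text using the descriptor. */
static void parse_key(const struct key_descriptor *d,
                      uint32_t key, char *out, size_t outlen)
{
    snprintf(out, outlen, d->format, key);
}
```

Under this assumption, the key 0x00000050 combined with the descriptor produced for the switch example above would yield the readable string "tcp src port or tcp dst port 80".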
Figure 8 – Decoding the keys. The switch construct (e.g. SWITCH(tcp src port=21,23,25,80)) is fed to the extended compiler, which produces both the BPF filter executed by the BPF + TME machine and a struct bpf_keyparser; pcap_parsekey() uses this structure to parse the keys of the TME data returned to the user.
For example, in the case above, pcap_parsekey() can perform the following translation (the key is the number 80 in hexadecimal on 32 bits):
key=0x00000050 → “tcp src port or tcp dst port 80”
1.7 Implementation
The TME code was written entirely in ANSI C, since this is in practice the only programming language usable to write code that runs at kernel level; specifically, the code was integrated into the WinPcap protocol driver, which hosts the BPF virtual machine.
A new working mode – MODE_MON – was added to allow the TME tables to be transferred to user level; from the point of view of the BPF machine, in fact, the TME coprocessor can always be used, but its data is brought to user level only in MODE_MON mode.
The transfer of data from the TME coprocessor to the applications uses a 1-copy approach: the kernel-level data contained in the TME table is copied into a buffer allocated by the application. Another possible implementation was to use a buffer shared between the TME and the application; this solution was discarded because its performance advantages were far smaller than the implementation and concurrency problems it caused.
The main implementation problems were caused by the particular family of operating systems, Windows; the main problem we had to face concerns timestamps: each cell of the TME tables contains a timestamp of the last time the cell was updated. The only API provided by the Microsoft development tools that guarantees microsecond-level precision is KeQueryPerformanceCounter, which however has a very heavy drawback: it takes about 3 µs to obtain the result. As a consequence, this function is a very tight bottleneck for the performance of the architecture. An alternative solution is to use an Assembler instruction available on all 32-bit Intel processors, RDTSC; this instruction returns the number of clock cycles elapsed since the machine was switched on, so it is extremely precise, and moreover it executes in a single clock cycle, so it is also extremely fast. The use of this Assembler instruction, however, poses some problems:
• The processor frequency must be known with good precision, and Windows systems do not provide this information.
• This instruction is available only on Intel processors (and on AMD 'clones' such as the Athlon processor).
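The conversion from a raw RDTSC tick count to the usual microsecond timestamps can be sketched as follows. This is a simplified illustration: cpu_hz is exactly the value that is hard to obtain precisely under Windows, and the 64-bit integer division below is the operation that makes timestamp generation costly on 32-bit processors.

```c
#include <stdint.h>

/* Convert a raw tick count into microseconds, given the CPU frequency.
 * The 64-bit division is correct but slow on 32-bit CPUs. */
static uint64_t ticks_to_us(uint64_t ticks, uint64_t cpu_hz)
{
    return ticks / (cpu_hz / 1000000u);
}
```

For example, 500 ticks on a 500 MHz processor correspond to 1 µs.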
1.8 Performance
To measure the performance of the proposed architecture, a monitoring system based on the TME extension was built; it manages a table whose key is the ordered sequence of the source and destination IP addresses and the source and destination TCP ports.
The table was initialized with the default values: in practice it is a table of 32007 cells managed as a hash, and the EXECUTE function is the one that counts the packets and bytes of each single cell.
The test platform is based on two PCs with Intel processors, running the English version of Windows 2000 Service Pack 2. Both PCs mount a copper Gigabit Ethernet network adapter, connected through a crossover UTP cable. The TCP/IP protocol was disabled on both network adapters.
This configuration is shown in Figure 9.
Figure 9 – Test platform. Two PCs connected by a full-duplex 1 Gbps Ethernet link, one acting as traffic generator (sender) and one as receiver: a Pentium III 500 MHz (256 MB RAM, 13 GB HD, 3Com Gigabit Ethernet 3C996T, Windows 2000 Professional, Eng) and a Pentium II 400 MHz (128 MB RAM, 6.4 GB HD, 3Com Gigabit Ethernet 3C996T, Windows 2000 Server, Eng).
The machine used as packet generator can transmit a stream of identical IPv4 packets carrying a TCP payload, at a constant, predefined packet rate, through a modified version of the WinPcap driver itself. The packets have a fixed size of 100 bytes.
The measurements were made using the Windows 2000 Task Manager for the CPU-load figures, and some special counters available on Intel Pentium processors for the processor-cycle figures.
Test 1: this first test aims to give an overall picture of the performance of the system, in terms of the maximum packet rate it can handle.
In this case bursts of several million identical packets were sent at different packet rates, and the CPU load was measured. The result is shown in Figure 10. The test was run using the version of the drivers that uses the RDTSC instruction to generate the timestamps.
First of all, it must be noted that the packet generation system could not generate more than about 150,000 packets per second, even though the CPU load never reached 100%; this may be due to some limit of the network adapters, or to some problem in the adapter drivers or in WinPcap.
It is interesting to note that the CPU load grows linearly with the packet rate; through a simple theoretical calculation it can be deduced that the test machine would be able to handle a maximum packet rate of about 225,000 packets per second.
Figure 10 – CPU load vs. packet rate. The experimental curve stops at about 149,300 packets/s, the maximum rate the generator could sustain; the theoretical linear extrapolation reaches 100% CPU load at about 225,000 packets/s.
Test 2: the second test aims to highlight the contribution of each component to the performance of the TME coprocessor. Using the counters of the Intel Pentium processors, the clock cycles required by each packet-processing phase were measured. The tests were run both with the driver version that uses RDTSC to generate the timestamps and with the version that uses the KeQueryPerformanceCounter API.
The results are shown in Figure 11 and Figure 12.
Figure 11 – Clock cycles of each packet-processing phase, with timestamps generated through the RDTSC instruction. Phases: underlying drivers, key generation, LOOKUP without timestamp, timestamp, EXECUTE; the measured slices are 418, 146, 296, 124 and 1350 cycles. Total CPU ticks = 2334.
Figure 12 – Clock cycles of each packet-processing phase, with timestamps generated through the KeQueryPerformanceCounter API. Phases: underlying drivers, key generation, LOOKUP without timestamp, timestamp, EXECUTE; the measured slices are 418, 146, 2156, 124 and 1350 cycles. Total CPU ticks = 4194.
Some considerations can be drawn from these charts:
• Computing the timestamps with KeQueryPerformanceCounter is almost ten times slower than computing them with the RDTSC instruction; the use of the Windows API is therefore definitely a bottleneck for the performance of the system.
• Even when the RDTSC instruction is used, timestamp generation takes about 10% of the total; this is because some 64-bit integer divisions are needed anyway to obtain the timestamps in the usual format used by WinPcap/LIBPCAP, and such operations are in general very slow on 32-bit processors.
• It is interesting to note that, in the case of Figure 11, more than half of the clock cycles needed to process the packet are used by the drivers underlying WinPcap.
• With a 500 MHz processor (like the one of the test machine), the minimum packet inter-arrival time must be at least 4.7 µs (using the 'RDTSC' version), so the maximum packet rate the system can process is about 215,000 packets per second, a result comparable to the one obtained theoretically in Test 1.
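As a sanity check of those figures, a minimal calculation with the Test 2 numbers (2334 cycles per packet on a 500 MHz CPU) reproduces both the minimum inter-arrival time and the maximum packet rate quoted above:

```c
#include <stdint.h>

/* Minimum packet inter-arrival time, in microseconds:
 * 2334 cycles / 500 MHz = 4.668 us, i.e. the ~4.7 us quoted above. */
static double min_interarrival_us(uint64_t cycles_per_packet, double cpu_mhz)
{
    return (double)cycles_per_packet / cpu_mhz;
}

/* Maximum sustainable packet rate, in packets per second:
 * 500e6 Hz / 2334 cycles ~= 214,000 packets/s, i.e. the ~215,000 above. */
static double max_packet_rate(uint64_t cycles_per_packet, double cpu_hz)
{
    return cpu_hz / (double)cycles_per_packet;
}
```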
1.9 Future developments
At the time of writing this thesis, the TME architecture is in some respects still immature: the TME core has been written in full, but other features still need to be refined.
In particular, the following changes and additions are being studied:
• New LOOKUP and EXECUTE callbacks; in particular, it would be interesting to implement a new LOOKUP that handles the ACLs of a firewall efficiently. As for the EXECUTE callbacks, a function currently exists that can track TCP sessions, recognizing the bytes transmitted and received in the TCP payload and the connection state through the sequence numbers and the flags. This function is however rather rough (it cannot handle all the possible cases of session management, and it does not comply with the latest versions of TCP, such as TCP SACK). Moreover, a LOOKUP function, and possibly an EXECUTE, still have to be added to perform pattern matching, which is necessary to implement IDS systems.
• Building a version of the driver that buffers packets at kernel level. This solution could guarantee better performance during network bursts.
• Solving the timestamp problems once and for all: at the moment the best solution is certainly the one using the RDTSC Assembler instruction; the only remaining problem is computing the processor frequency in a precise and elegant way.
• Implementing logging functions inside the extended WinPcap architecture (the one using the TME).
• Allowing 'on-the-fly' modification of the tables managed by the TME extension: at the moment the tables created by the TME cannot be modified; the system must be stopped and the TME coprocessor re-initialized.
2 Introduction
In recent years we have witnessed an enormous spread of computer networks, in particular of the Internet, which has entered every environment, from industry to schools.
In this scenario, the complexity of networks grows every day to meet the needs of end users; as a consequence, there is an increasing need for network tools and components to monitor, protect and enhance the functionality and security of networks.
The main tools required can be summarized in three categories:
• statistical and monitoring tools: systems able to give information about network traffic at all levels, from the overall throughput to the number of TCP sessions opened per second.
• components to protect the network: tools able to filter the network traffic directed to a LAN in order to prevent attacks against its hosts. The main tools in this category are firewalls and Intrusion Detection Systems (IDS).
• general capturing tools: components able to capture all the traffic flowing through the network, transferring the data to user applications for subsequent processing.
Two approaches are usually available for these network tools: the former is based on special-purpose hardware, while the latter uses standard PCs or workstations to perform its tasks, so the processing is mainly done in software.
The hardware solution usually has better performance, since it is a dedicated architecture, but has the disadvantage of being hard to upgrade and modify.
The software solution performs worse, particularly on slow machines, but it is cheaper, faster to develop, and easy to upgrade. For this reason most network tools use a software approach to perform these tasks.
2.1 Goals of this work
This work originates from the observation that the above-mentioned network tools share a common part, so-called packet filtering, that is to say the process by which each network packet is analyzed by looking at the values contained in the various protocol headers, processing the packets that satisfy some criteria, that is to say that contain some particular values in their headers.
The goals of this thesis are to:
• point out which features are needed by each network tool
• propose an architecture to perform all these tasks in a coherent and unified way
• develop a high-performance architecture able to perform these tasks.
The first goal aims to point out that many network tools share common features, among which the most outstanding are packet filtering and lookup-table management.
The second goal proposes an extension to the original BPF/LIBPCAP architecture (in particular its Windows porting, WinPcap), so that these features can be obtained in an easy way, without caring about the low-level details (for example the specific algorithms used to handle the lookup tables). Moreover, the proposed architecture aims to be fully integrated with the existing WinPcap API, so that programmers can integrate these new features into their tools seamlessly.
Last but not least, the implementation of such an architecture must be realized in such a way that performance is good even over high-throughput networks, for example up to Gigabit Ethernet LANs.
2.2 Chapters’ organization
This work is organized into several chapters plus one appendix. The following list gives a concise description of each of them:
• Chapter 2 (this chapter): gives a brief overview of the subject of this thesis and its goals.
• Chapter 3: gives a detailed description of all the issues related to monitoring and statistical tools; it also gives some hints on how to realize such systems efficiently under WinPcap.
• Chapter 4: gives an overview of the problems that arise when dealing with this sort of network component; it also describes what intrusion detection systems are.
• Chapter 5: describes the original BPF architecture, developed by S. McCanne and V. Jacobson, and the related high-level library, LIBPCAP. There is also a fairly detailed description of the Windows porting of this architecture, particularly highlighting the differences from the original system.
• Chapter 6: presents the proposed extension to the BPF, the Table Management Engine (TME), a virtual co-processor that classifies packets using lookup tables.
• Chapter 7: describes the actual implementation of the proposed extension in the existing WinPcap architecture, dealing with data structures, main functions, and implementation issues.
• Chapter 8: describes the way user-level applications interact with the TME, in particular how they create the BPF filters and how they interpret the data received from the TME.
• Chapter 9: gives some examples of the use of the proposed extension to perform so-called kernel monitoring, that is, some examples of monitoring/statistical tools that work mainly at kernel level using the TME. This chapter also discusses the performance of the extension when used to monitor network traffic.
• Chapter 10: describes the possible future developments of the proposed architecture.
• Chapter 11: describes some existing tools that perform network traffic monitoring.
• Chapter 12: this appendix is the programming manual of the extension; it describes the BPF instruction set used to make the TME extension work.
3 Monitoring networks
In recent years the enormous spread of the Internet has produced an increasing demand for systems devoted to network statistics; by this term we mean every sort of information regarding what happens on the network cable, from the number of collisions on an Ethernet LAN to the throughput of a Web session. In this context we will deal with the statistics that can be inferred from the data-link layer up to the application layer, in particular so-called 'traffic monitoring', that is, statistics regarding the bytes transferred; we will not analyze information regarding the physical layer of the OSI model.
These systems are needed to perform various tasks, among which:
• improving network topology and services, on the basis of the bandwidth occupation of each network protocol;
• pointing out abnormal situations, for example a large amount of broadcast traffic;
• finding suspicious traffic generated by some hosts; for example, a host that generates a large amount of outbound traffic may be the symptom of a 'hacked' station trying to perform attacks on other machines²;
• generating traffic logs for billing purposes;
• pointing out the bottlenecks of the network.
In the following paragraphs we will deal with the requirements of such systems and how they can be met.
3.1 General features
Monitoring systems must be able to perform a set of ordinary tasks to achieve the functionalities previously highlighted; they all share a common trait: they need to classify packets depending on some fields of the protocol headers. For example, if we want to create a table of network usage divided by application protocol (e.g. Web, Telnet, FTP and Mail traffic), we need to classify packets on the basis of the TCP port and then maintain a byte counter for each protocol.
Therefore we have a common part, in which packets must be interpreted, parsed, in order to decide which protocols they are carrying.
After being parsed, packets must be further classified and assigned to a precise category, through the value contained in a particular header field (in the example above, the TCP
2 Usually, clients receive more data than they transmit, while servers show the opposite behavior.
port). After this, an action is taken: for example, a counter relative to that particular value of the field is incremented (in the example above, we increment the byte counter relative to the particular TCP port).
The former part (parsing) is the one in common with usual packet-filtering systems; the latter is specific to monitoring systems.
As a consequence, these systems usually need to manage matrices, in which each cell corresponds to a particular value of a field in the packet; in the example above, a 1-dimensional matrix is needed, and each cell in the matrix is described (labeled) by the TCP port number. This is shown in Figure 1.
Figure 1 – A monitoring system using matrices. The packet goes through two checks ('Is IP?', 'Is TCP?') and is discarded on a NO answer; if both succeed, the TCP port is extracted and used to select the matrix cell: 80 (Web), 21 (FTP), 23 (Telnet), 25 (Mail).
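The parse-classify-act pipeline of Figure 1 can be sketched in C. The fixed header offsets below assume plain Ethernet + IPv4 with no IP options, purely for illustration; in the actual architecture these steps are performed inside the BPF/TME machine, not in user C code.

```c
#include <stdint.h>
#include <string.h>

struct proto_counter { uint16_t port; uint64_t bytes; };

static struct proto_counter table[] = {
    {80, 0}, {21, 0}, {23, 0}, {25, 0},   /* Web, FTP, Telnet, Mail */
};

static void classify(const uint8_t *pkt, uint32_t len)
{
    if (len < 38) return;                            /* too short to parse */
    if (pkt[12] != 0x08 || pkt[13] != 0x00) return;  /* parse: is it IP?   */
    if (pkt[23] != 6) return;                        /* parse: is it TCP?  */
    uint16_t dport = (uint16_t)(pkt[36] << 8 | pkt[37]);  /* get port */
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].port == dport)
            table[i].bytes += len;                   /* act: count bytes   */
}
```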
Other features depend on what the monitoring system has to do: for example, after classifying a packet, the action to be taken can vary. We may be interested in counting the packets with a particular feature (in the example above, the feature is the specific application protocol); in other cases we need to perform some additional processing. For example, if we want to know the number of bytes transferred in a TCP session, first we need to classify packets on the basis of the session ID³, and then we have to properly manage other fields of the TCP header, like the sequence number and the flags.
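The session ID mentioned here is simply a four-field key; a minimal C illustration (the type name session_id is invented for the example):

```c
#include <stdint.h>

/* The TCP session ID: IP source and destination addresses plus
 * TCP source and destination ports, packed as one 12-byte lookup key. */
typedef struct {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
} session_id;
```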
Another requirement of such systems regards their performance-versatility 'ratio': usually systems characterized by great flexibility show lower performance than special-purpose ones.
The optimal system must be very fast, even on ordinary machines (intended as the physical computers running these applications), and yet versatile.
3.2 Types of monitoring
In the previous paragraphs we dealt with the basics of monitoring systems, briefly pointing out what this class of systems is used for and what characteristics they have in common. Now it is time to list what a monitoring system must be able to offer, and then describe how to realize it.
The most valuable features these systems must be able to deal with are:
• single aspects of traffic, for example the traffic directed to a precise host;
3 The session ID of a TCP connection is formed by the IP source and destination addresses and the TCP source and destination ports.
• n-dimensional traffic matrices, for example the traffic matrix between each pair of hosts in a LAN, or, more simply, the protocol distribution in a network;
• multiple analyses of traffic load, for example when we want to know the amount of IP traffic, of TCP traffic and of TCP port 21 traffic at the same time;
• general logging capabilities;
• analyses of flows, rather than packets.
3.2.1 Single aspects of traffic
This type of analysis involves only one sort of packet at a time; in this category we can put all those statistics like 'IP traffic' or 'Web traffic' or 'packets directed to host BART'.
This kind is the simplest one, and can be realized with any packet-capture system; for example, BPF⁴ usually captures packets on the net after having filtered⁵ them, and the filter processes each packet looking at its headers. This process works correctly but has very poor performance. If we want to monitor the packets efficiently, we have to filter them and increment some counters (e.g. a byte counter) when the filter accepts the packet, instead of passing it to the user. This is shown in Figure 2.
Figure 2 – A simple monitor. In plain packet capture, each packet goes through the filter and, if accepted, is passed to the user (otherwise it is dropped); in the simple monitor, an accepted packet only increments the counters (packets++) instead of being passed to the user.
3.2.2 Traffic matrices
This is an extension of the previous type of statistics: we want to create an n-dimensional matrix, where each dimension represents a particular field in the packet header, and each 'position' in a dimension stands for a value of that field. A good example can be found in the next table, which shows a 2-dimensional matrix where each cell represents a different IP source/destination address couple.
4 Berkeley Packet Filter.
5 Filtering is the process by which a packet is analyzed through its headers and accepted or rejected depending on a user-defined set of conditions.
Source \ Destination   1.4.5.2   2.8.7.10   1.8.5.22   9.8.7.6   1.9.8.5
1.4.5.2                      0         75          5       865         0
1.8.5.22                    12          5          0         3        99
8.57.22.1                  142         78         67        44        35
9.8.7.6                      1         56        445         0        45
1.9.8.5                     47       3778          4         0         0
The numbers in the cells represent the packets sent from host Source to host Destination.
One of the main characteristics of this type of monitoring is that the label identifying each cell is a particular combination of values of the same header fields; in the example above, all the cells represent IP source and destination addresses. This feature will become important when we deal with the realization of this class of monitoring.
Moreover, this type of analysis can be divided into three groups:
• the so-called closed matrices, in which the size, intended as the number of elements in each dimension, is fixed (e.g. sixteen rows and thirteen columns); this is the case of statistics relative to particular values of the fields, for example the particular TCP ports on which a precise application communicates (e.g. “monitor traffic on ports 10, 11 and 12”).
• the so-called open matrices, in which the size is not defined. This kind of matrix can be used to create statistical data relating to traffic that is either not pre-definable or simply too expensive to define. An example is a system monitoring the traffic generated, or received, by each station on a LAN: it would be too expensive to determine all the hosts on the net (especially for large ones) and then create a closed matrix. It is more useful to have a system that adds entries to the matrix whenever a new host is discovered (e.g. when a packet from/to that host is seen).
• the so-called bucket matrices. This type of matrix differs from the previous ones since each cell corresponds to an interval, rather than a single value. An example of this type is a matrix in which each cell corresponds to a different packet size range, as shown in the following table:
PACKET SIZE PACKETS
1 ÷ 64 243
65 ÷ 128 56
129 ÷ 256 3
257 ÷ 512 66
513 ÷ 1024 44
1025 ÷ 1514 7755
3.2.3 Multiple analyses
In this class we can put those statistics that consider dissimilar aspects of network
traffic, which cannot be put in a matrix like the ones described previously. A good example
is a monitoring system that gives the amount of IP, TCP and web traffic; it is clear that
these three types of traffic are totally different: they even belong to different protocol
layers. It is quite usual to have analyses of traffic at different protocol layers, all at the
same time, since this can give the bandwidth occupation of a protocol in relationship with the
underlying protocols.
3.2.4 Logging capabilities
In this class we can put all those statistics that simply need to log some network
activity, that is to say those that do not need any processing such as
counting packets. For example, suppose we are interested in receiving a log message whenever a
TCP packet on port 4995 is seen on the network; the message could be something like
“Alert! TCP packet on port 4995”.
Another example of this type of monitoring, totally different from the previous one,
is an application that wants to receive only some excerpts of the packets (without any
further processing), for example a particular field in the packet, or a definite header (e.g.
the IP header).
3.2.5 Flow analyses
This class of analysis involves the reconstruction of traffic flows before analyzing
the data and obtaining statistical information. The previous types of monitoring involved
simple mechanisms to create statistics because each packet is managed independently
from the others; in fact, the usual operations of such monitors are to classify packets (e.g.
IP packets, identified by their source address) and perform an action (e.g. increment
some counters). Here we have one more step: classify packets (e.g. by their TCP session
identification), reassemble packets to create a flow (e.g. by looking at the sequence
numbers in the TCP headers of the packets), and then perform an action (e.g. look whether the
stream contains an FTP file request, and then increment some counters).
3.3 Realization approaches
Several techniques can be used to perform these classes of analyses, all of them with
strengths and drawbacks. First of all we must distinguish between off-line and real-time
systems:
Off-line systems. They compute statistics on network traffic after having captured the
packets to a file. This type of system is not time-critical, since there are no timing issues
(we process one packet after another). As a consequence, they are
always made up of three components: a packet capture program, that just dumps the
packets to a file, a statistical engine, that reads packets from the dump files and
generates statistical information, and a GUI that shows the results produced by the
statistical engine; we will not deal with this type of system.
Real-time systems. They compute statistics while packets are flowing through the net;
these systems process packets on-the-fly, and consequently their timing
constraints are critical; several approaches can be followed here, and
they will be the topic of the next paragraphs.
Real-time systems with packet capture. One possible approach to realize real-time
monitoring systems is to simply put together a packet capture program and a statistical
engine, like the ones described for off-line systems, as shown in Figure 3; in practice,
we do not dump packets to a file, but rather feed the statistical engine with them.
Although very simple, this technique proves to be very inefficient: every packet is
copied several times before it reaches the statistical engine; in fact, usual packet capture
systems copy packets at least twice (2-copy systems) before delivering them to the
application. Moreover, statistical engines usually need only the first bytes of the packets
(the protocol headers), so there is a waste of memory and CPU resources if we give the
whole packet to the statistical engine.
Figure 3 – A statistical system running at user level
We can solve the last problem by capturing only the first bytes of each packet (if the
capture system is able to do so); this trick greatly decreases the number of bytes copied.
This solution is, however, only a patch over the real problem: multiple copies are
useless and expensive; we should compute statistics without copying the packet, that is to
say while the packet is still in the network card.
Real-time systems at kernel level without buffering. The above mentioned result can
be achieved by moving the statistical engine down to the kernel, combining it with the low
part of the capture system, the one that sits at kernel level and is closely coupled with
the network card. At user level, where we used to have the statistical engine, we will
have only a GUI that displays monitoring data received from the statistical engine, as shown
in Figure 4. It is quite convenient to move the statistical engine down to the kernel for
several reasons:
• tasks running at kernel level have a higher priority than user level ones, so packet
processing takes precedence over user level applications;
• a statistical GUI needs to be refreshed quite seldom, so the context switches6 involved
whenever the GUI retrieves data from the engine happen rarely;
• it is easier to interface directly with the packet capture system, since this last block
works mostly at kernel level (having to work tightly coupled with network adapters).
If we are able to combine the statistical engine with the capture system so that packets
are never copied, performance will surely increase.
This approach has its drawbacks: avoiding copies before feeding the statistical
engine means that each packet must be processed before the next one arrives, and this
requirement becomes critical when network bursts occur; in other words, the packet
processing phase must be very fast. The first mentioned technique (statistical engine at
user level), although time and CPU consuming because of the copies and context
switches, can show a better behavior during bursts, since the copy itself guarantees a sort
of buffering when several packets arrive in a short amount of time; however, this holds
only if the buffer is able to contain several packets at the same time.
Figure 4 – A statistical engine working at kernel level
Real-time systems at kernel level with buffering. There is also a third possible
approach, which mediates between the two previously described: it is possible to put the statistical
engine at kernel level, chaining it to the packet capture system through a kernel buffer.
This solution has the drawback that the packet is copied once, but the buffer itself
‘softens’ network bursts, a situation in which buffer-less solutions usually fail. This
technique is shown in Figure 5.
6 A context switch is involved whenever some data is passed across the user/kernel boundary.
Figure 5 – A statistical engine running at kernel level with packet buffering
Now that we have discussed where to put the statistical engine, it is time to look inside
this block: what does it have to do?
3.3.1 N-dimensional matrices
One of the needs of such a system is to handle matrices of statistical data; this could be
achieved by putting several simple blocks in parallel, each block monitoring only one
particular feature of the net, like the ones described in paragraph 3.2.1. For example,
let’s suppose we want to monitor the traffic of some specific applications, like Web,
FTP, Telnet and mail. In practice we are interested in creating a matrix like the
following:
Traffic   Bytes
Web       345,433
FTP       256,788
Telnet    12,777
Mail      8,908
We can put up four simple monitors, one for each service, as shown in
Figure 6; each of them must perform the following actions for every packet:
• parse protocols up to level 4 (TCP);
• extract the field containing the TCP port, and check whether it corresponds to the
service of interest, for example 80 for the Web.
Figure 6 – Statistical engine realized with several simple monitors in parallel
This approach has a lot of drawbacks: the parsing phase is repeated four times for
each packet, even though its result is always the same, so this solution does not scale when the
number of elements grows; moreover, this technique does not fit open matrices,
because we do not know in advance how many monitoring blocks are needed.
A better approach is to merge the blocks into a macro-block, in which the first phase,
the so called parsing, is performed only once per packet; afterwards, we can increment the
right counters by switching (i.e. choosing the right cell) on the field that discriminates
the various cells of the matrix.
With this technique, the example previously described could be realized in this way: in
the common (parsing) phase we check whether the packet is a TCP one, then we extract the
TCP port number and increment the counters in the cell of the matrix corresponding
to that port, as shown in Figure 7.
Figure 7 – A statistical engine that uses tables
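The single-parse, table-driven scheme just described can be sketched as follows. This is only a simplified user-level model, not the actual WinPcap/TME code: the fixed header offsets (Ethernet II framing, no IP options) and the `update_stats` name are assumptions made for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* One cell of the statistical table: a TCP port and a byte counter. */
struct cell {
    uint16_t port;      /* label of the cell (e.g. 80 for Web) */
    uint64_t bytes;     /* traffic accounted to this port      */
};

/* Fixed "closed matrix": one cell per monitored service. */
static struct cell table[] = {
    { 80, 0 }, { 21, 0 }, { 23, 0 }, { 25, 0 },
};

/* Common parsing phase, performed once per packet: check that the
 * packet is IPv4/TCP, extract the destination port, then switch on
 * it (i.e. pick the right cell) and increment the counter. */
void update_stats(const uint8_t *pkt, size_t len)
{
    if (len < 14 + 20 + 20) return;              /* too short  */
    uint16_t ethertype = (uint16_t)((pkt[12] << 8) | pkt[13]);
    if (ethertype != 0x0800) return;             /* not IPv4   */
    if (pkt[14 + 9] != 6) return;                /* not TCP    */
    uint16_t dport = (uint16_t)((pkt[14 + 20 + 2] << 8) | pkt[14 + 20 + 3]);

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].port == dport) {            /* cell found */
            table[i].bytes += len;
            return;
        }
    /* ports outside the matrix are simply ignored */
}
```

Note that the parsing is done exactly once per packet, regardless of how many services are monitored; adding a fifth service only adds one cell to the table.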
There is another problem related to matrices: when their dimension is bigger than
one (two-, three-dimensional, and even more), matrices tend to become sparse. If we
implement such structures in the ‘normal’ way, memory occupation tends to grow
quadratically. Here is a small example of why this happens: let’s suppose we want to
know the amount of traffic between each pair of hosts on a network; usually a
machine does not communicate with all the others, but rather with a subset of them, for
example servers or routers. Nevertheless, if we create a ‘standard’ matrix, we will have
all the N² possible combinations, where N is the number of hosts. If we have 100 hosts,
we will create 10,000 cells in the table.
A feasible solution to the problem can be the use of open matrices: in the example
above a row or a column is created only when a new source or destination is seen (in a
packet); however this technique proves to be inefficient, as shown in the following
table.
SOURCE
DEST. A B C D E F G H
A 0 3 0 14 0 17 0 0
D 0 0 58 0 0 2 0 0
E 57 0 0 0 0 0 4 0
F 0 0 87 0 863 0 0 0
G 90 877 24 7 24 0 0 12
H 0 0 0 0 0 0 87 0
I 0 0 0 0 0 0 9 0
A packet from host G to host I arrives; since destination host I did not exist, a row
is added. The problem is that to add data relating to traffic from G to I (which
corresponds to only one cell in the matrix) we added eight cells.
A better behavior can be obtained by modifying the matrix in two steps:
• first we transform the matrix into a 1-dimensional one, i.e. a single column; this can
be done by labeling each cell of the new matrix with the ordered concatenation of the
coordinate values of the corresponding cell in the old matrix. The tables
below show a small example of this technique:
      A    B
A1    1    2
B1    6    7
C1    4    3

becomes

A1-A  1
A1-B  2
B1-A  6
B1-B  7
C1-A  4
C1-B  3

• the second step to optimize memory occupation is to store only those cells that
contain valid data.
When a new value, for example a source address, must be inserted, only one cell is
added to the matrix, instead of an entire row, as shown in the previous example.
The drawback of such an approach is that we need an algorithm to find a cell in this
‘compressed’ table; since the labels that identify the cells are byte strings, we can store this
column in a hash table; this method assures a fast search, in theory like a direct lookup in
a table (provided that some conditions regarding the hash function are satisfied).
The above described technique is not able to deal with bucket matrices; for this type
of statistics it is necessary to create an ad-hoc search algorithm, for example using the
dichotomic (binary search) method; in practice we need to establish a relationship between each packet
size and the corresponding interval (the ‘bucket’); this bucket is the label that identifies
each cell of the table. This is shown in Figure 8. At this point we can use the bucket
label, together with other header fields, to create another N-dimensional matrix, and use
the normal search algorithms, as shown in Figure 9.
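A minimal sketch of such a ‘compressed’ open matrix follows, using the concatenated label as the key of a chained hash table. The hash function, the key format and the sizes are all illustrative choices, not the thesis’ actual data structures; error handling is omitted for brevity.

```c
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

#define NBUCKETS 1024

/* One cell of the compressed column: the concatenated label
 * (e.g. "G-I" for source G, destination I) and its counter,
 * chained on hash collisions. */
struct entry {
    char key[32];
    uint64_t packets;
    struct entry *next;
};

static struct entry *buckets[NBUCKETS];

/* Toy string hash (djb2); a real engine would hash the raw
 * header bytes directly. */
static unsigned hash(const char *s)
{
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Find the cell labeled `key`, creating it on the fly if it does
 * not exist yet: this is exactly the behavior of an open matrix,
 * where inserting a new value costs one cell, not a whole row. */
struct entry *lookup(const char *key)
{
    unsigned h = hash(key);
    for (struct entry *p = buckets[h]; p; p = p->next)
        if (strcmp(p->key, key) == 0)
            return p;
    struct entry *e = calloc(1, sizeof *e);   /* allocation check omitted */
    strncpy(e->key, key, sizeof e->key - 1);
    e->next = buckets[h];
    buckets[h] = e;
    return e;
}
```

With this structure, accounting the packet from G to I of the example reduces to `lookup("G-I")->packets++;` and adds exactly one cell.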
Figure 8 – How buckets can be managed: each packet size (1–1514 bytes) is mapped to one of six buckets (1–64, 65–128, 129–256, 257–512, 513–1024, 1025–1514)
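The dichotomic search that maps a packet size to its bucket label can be sketched as follows; the interval bounds are the ones of the table above, and the function name is an example, not part of any real API.

```c
/* Upper bound (inclusive) of each bucket, in ascending order. */
static const int upper[] = { 64, 128, 256, 512, 1024, 1514 };

/* Dichotomic (binary) search: returns the index of the bucket
 * whose interval contains `size`, or -1 if out of range. */
int bucket_of(int size)
{
    if (size < 1 || size > 1514) return -1;
    int lo = 0, hi = 5;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (size <= upper[mid])
            hi = mid;        /* the bucket is at mid or below */
        else
            lo = mid + 1;    /* the bucket is above mid */
    }
    return lo;
}
```

The returned index is the label used, possibly together with other header fields, to enter the N-dimensional matrix.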
Figure 9 – The buckets system: the bucket label obtained from the packet size is used to enter the table
3.3.2 Multiple analyses
Another requirement of the statistical engine is the ability to manage statistics on
different aspects of network traffic at the same time.
This feature could be implemented by putting several simple monitor blocks in parallel, in a
way similar to the one depicted in Figure 6; this approach has a big drawback: if the
data of the different blocks are not retrieved by the GUI at the same time, the resulting
information can be incoherent. Here is a small example of this problem: let’s suppose
we want to monitor IP and TCP traffic by counting packets of these types; at t=t0
packetsIP and packetsTCP both equal 59; the GUI begins to transfer statistical data, and
receives packetsIP=59; then a packet arrives, the counters corresponding to IP and TCP
are updated8, and they both contain 60. The GUI continues to transfer data, retrieves
8 The GUI is interrupted since the blocks run at kernel level, hence with a higher priority.
packetsTCP=60, which is clearly incoherent, since TCP is transported over IP (and
consequently packetsTCP cannot be greater than packetsIP).
The solution is to put the simple blocks into a bigger one, like we did for traffic
matrices, and then retrieve data from it ‘atomically’, that is to say while the
monitoring engine is blocked; this solution cannot be applied if we keep several simple
monitor blocks in parallel, since they are completely independent. The difference is that
here the various blocks are heterogeneous, while before they shared the same protocol header
field to select the right cell (in the example shown in Figure 7 it was the TCP port).
A method to solve this problem is to generate the label that identifies each cell ‘by
hand’: let’s suppose we want to perform an analysis on IP, TCP and WEB traffic:
• first we check whether it is IP; if so, we increment the counters of the cell labeled with the
string “IP”;
• then we check whether it is TCP; if so, we increment the counters of the cell labeled with
the string “TCP”;
• finally we check whether it is also part of WEB traffic, and if so increment the counters of the
entry labeled “WEB”.
The flow diagram of this technique is shown in Figure 10.
Figure 10 – How to perform multiple analyses using a lookup table
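This flow could be sketched as follows. The `is_ip`/`is_tcp`/`is_web` classifiers are reduced to stubs working on a toy 3-byte ‘packet’ (an assumption made purely for illustration; a real engine would parse the actual headers), while the hand-made labels “IP”, “TCP” and “WEB” are taken from the text.

```c
#include <stdint.h>

/* Counters for the three heterogeneous analyses; the cell labels
 * 'IP', 'TCP', 'WEB' are generated by hand, not read from the packet. */
struct { const char *key; uint64_t packets; } cells[] = {
    { "IP", 0 }, { "TCP", 0 }, { "WEB", 0 },
};

/* Placeholder classifiers on a toy 3-flag packet. */
static int is_ip(const uint8_t *p)  { return p[0] == 1; }
static int is_tcp(const uint8_t *p) { return p[0] == 1 && p[1] == 1; }
static int is_web(const uint8_t *p) { return p[0] == 1 && p[1] == 1 && p[2] == 1; }

/* The flow of Figure 10: each successful test increments the
 * counters of the corresponding hand-made label; a failed test
 * ends the classification (WEB implies TCP implies IP). */
void classify(const uint8_t *p)
{
    if (!is_ip(p))  return;
    cells[0].packets++;
    if (!is_tcp(p)) return;
    cells[1].packets++;
    if (is_web(p))
        cells[2].packets++;
}
```

Because all three counters live in the same table and are updated in one pass, the GUI can retrieve a coherent snapshot by blocking the engine during the copy.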
The difference between this example and the one shown in Figure 7 is that there the
label identifying a cell is taken directly from the packet (it is the TCP port), while here
we must generate it manually.
With this technique, we can entirely reuse the same system proposed for traffic
matrices.
3.3.3 General logging capabilities
It is quite simple to perform logging, especially if we want messages like “a
packet of type XYZ was seen”. In practice, we need a mechanism to signal and deliver
these messages to the user; this can be done with a buffer in which the messages are
stored, and a system to alert the user application that something new was put in it; this
alert could be a select() on a file under Unix platforms, or a named event under
Windows platforms.
Moreover, the buffer we need can be the one usually shipped with the packet capture
system9, the one used to store the packets to be passed to the user; in practice the user
application, which usually receives packets in a buffer provided by the capture engine,
now receives log messages in the same buffer.
Another feature is the possibility of passing some excerpts of packets to the user: again
we can use the packet capture buffer to store these excerpts, as if they were log messages;
we only have to modify the capture system, so that it is able to save not only the whole
packet, but also a fragment of it (e.g. from byte 13 to byte 17).
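A toy model of reusing a single capture buffer for log records is sketched below. The record format (a 2-byte length followed by the message text) and the buffer size are invented for illustration and do not reflect the actual WinPcap buffer layout.

```c
#include <string.h>
#include <stddef.h>

/* The shared buffer normally used for captured packets; here it
 * receives variable-length log records instead: a 2-byte length
 * followed by the message text. */
static unsigned char buf[4096];
static size_t head;               /* next free byte */

/* Append a log record; returns 0 on success, -1 if the buffer is
 * full. A real system would also signal the user application here
 * (select() on Unix, a named event on Windows). */
int log_msg(const char *msg)
{
    size_t n = strlen(msg);
    if (head + 2 + n > sizeof buf) return -1;
    buf[head]     = (unsigned char)(n >> 8);
    buf[head + 1] = (unsigned char)(n & 0xff);
    memcpy(buf + head + 2, msg, n);
    head += 2 + n;
    return 0;
}
```

The user application drains the buffer exactly as it would drain captured packets, reading the length prefix and then the record body.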
3.3.4 Analyses on flows
Here the problem is only partially related to monitoring; in practice, we need a sort
of preprocessor that reconstructs flows in some manner, and then feeds the statistical
engine with these data. A good starting point is to reassemble the packets of a flow to fill a
buffer of a certain fixed size, for example 16 kilobytes, and then pass it to the statistical
engine10; this is not a true reconstruction (we subdivide the flow into big chunks), but this
approach has proved to be very useful in lots of situations. In fact, this is the technique
used by Snort, one of the most used non-intrusive Intrusion Detection Systems for
small-medium networks.
In practice, the method used to reconstruct flows depends on two considerations:
1. what we want to do with the flows: if we want to catch the requests sent to a server, for
example a “GET / HTTP/1.0” request to a Web server, the above technique
proves to be good, since these requests are usually at the beginning of the flow;
2. the fact that it is impossible to realize a system that analyzes continuous flows of
bytes; the only possible manner is to split the stream into overlapping11 blocks of
data, and then feed the statistical engine with them.
After the reconstruction has been done in some manner, the previously mentioned
analyses (matrices, logging…) can be used.
9 Actually, there are some capture systems without buffering capabilities, whose performances are, however, very poor.
10 These 16 kB blocks must be overlapping, to avoid that a searched string is split into two blocks.
11 Overlapping is needed to maintain some sort of continuity between the various blocks.
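The overlapping-block splitting mentioned above can be sketched like this; the chunk and overlap sizes are arbitrary parameters (a real system would use something like 16 kB blocks), and the callback-based interface is just one possible design.

```c
#include <stddef.h>

/* Split a reconstructed stream into chunks of `chunk` bytes that
 * overlap by `overlap` bytes, calling `emit` on each one; the
 * overlap guarantees that a searched string shorter than the
 * overlap can never be split across two chunks. */
void split_stream(const char *data, size_t len,
                  size_t chunk, size_t overlap,
                  void (*emit)(const char *, size_t))
{
    if (chunk <= overlap) return;    /* would never advance */
    size_t pos = 0;
    for (;;) {
        size_t n = len - pos < chunk ? len - pos : chunk;
        emit(data + pos, n);
        if (pos + n >= len) break;
        pos += chunk - overlap;      /* advance, keeping the tail */
    }
}
```

Each emitted chunk can then be fed to any of the previously mentioned analyses (matrices, logging, pattern matching).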
The real problem in dealing with flows rather than packets is that things become a lot
more complicated: in a packet we can identify a header, and therefore some fixed
fields on which the analyses can act; in a flow, it is almost impossible to
identify anything fixed: it can be a file transfer, and therefore raw data, or an
application protocol, which is very difficult to recognize. In fact, it is no accident that
the usual analysis made on flows is pattern matching, that is to say scanning the
flow to find whether a particular string is present.
3.4 Conclusions
After this discussion on the requirements of monitoring systems, it is clear that three
main elements are needed to build a powerful statistical system:
• something to handle matrices, from now on called tables; with the proposed
approach, they must offer a fast search algorithm, and they must be dynamic, in the
sense that new elements can be added at any time (to realize the so called open
matrices);
• a mechanism to offer logging capabilities; as stated before, the proposed approach
tends to reuse the buffering system of the packet capture system to perform this
task;
• a mechanism to reconstruct flows; actually, as stated before, this task is quite
independent from statistics, in the sense that reconstruction alone does not give any
statistical information; rather, it is a basis on which to perform a more specific monitoring.
4 Firewalls
4.1 Introduction
A firewall is a network component that sits between two LANs, with the aim of
selectively bridging packets from one LAN to the other and vice versa.
Literally, a firewall is a brick wall that isolates and protects something from fire; in
the network context, a firewall is something that separates two or more LANs with
different security levels; in practice a firewall protects a more secured LAN from
malicious traffic originated in the less secured LAN.
In order to obtain this result, a firewall must be the only means by which the two
LANs communicate, as shown in Figure 11; being a single point of failure12, much care
must be taken in designing these systems.
Figure 11 - Placement of a firewall
In general, firewalls can be divided into two categories: packet filtering firewalls,
which operate on packets and therefore are transparent to applications, and application
gateways, which are not transparent to hosts.
Packet filtering firewalls analyze each packet that has to go from one LAN to the other,
deciding whether it is ‘permitted’ or not; application gateways are systems that work at
application level: a host (A) that needs to open a connection with another one (B) must
open a connection to the gateway instead; the gateway, in turn, opens a connection
with host B, thus enabling the communication between A and B.
There is also an intermediate class of systems, the so called Level 4-7 Switches13: in fact
they are transparent to host applications, like packet filtering firewalls, but they do
not work on packets; rather they work in a manner which is similar to the application
gateways (i.e. they do not limit themselves to analyzing packets, they can respond to hosts).
In this chapter we will deal only with packet filtering firewalls.
From the network point of view, for every packet received on a NIC interface, a
packet filtering firewall checks whether the packet is ‘good’ or ‘bad’, that is to say whether it can
pass or not; if the packet is ‘good’, it is sent out on an interface different from the one
it was received on, otherwise it is discarded.
12 A network component is a single point of failure when all the traffic must pass through it, so if it fails, the two parts are isolated.
13 Level refers to the layers of the OSI model; in particular these switches work at the transport, session, presentation and application levels.
From the user level point of view, a firewall can perform a different action for every
‘good’ or ‘bad’ packet; for example it can capture, or simply log, ‘bad’ packets,
maintain a counter of ‘good’ packets, or respond to a host with a control message (e.g. a
TCP reset, or an ICMP xxx unreachable message).
We can subdivide packet filtering firewalls into two more categories:
• systems that analyze only the packet headers, usually up to the transport layer of the
OSI model; since we will mainly deal with TCP/IP protocols, they evaluate up to the
TCP or UDP header;
• systems that also analyze the packet payload, and are usually able to deal with
flows too, thus working with all the layers of the OSI stack; this technique is usually
known as stateful inspection.
Before discussing these two kinds of firewalls, let’s have a quick look at what we
have to protect. In general a firewall can be used to perform different tasks, such as preventing
attacks and blocking some services available on a LAN from the outside world.
Attack prevention: in general an attack on a machine happens when some entity
tries to gain access to the resources of the machine, for several possible reasons, ranging from
theft of information up to forcing the machine to behave in a wrong manner (for
example generating network traffic until the bandwidth is saturated).
Attacks on machines connected to a network can be divided into two big categories:
• attacks that exploit the TCP/IP protocols themselves; in this category we can put
attacks like port scanning, SYN-flooding, hijacking… In practice they usually abuse
the lack of controls and security in the protocols to hack systems.
• attacks that exploit bugs in network applications. A common bug is an
application that is not able to deal with all the possible messages it can receive from the
network, and thus behaves strangely; the consequences are unpredictable: they range
from a crash of the application to the complete bandwidth saturation caused by an
application gone ‘crazy’. A good example of this type of attack is the one exploited
by the Nimda virus: this virus, hidden in an HTTP request to an IIS14 web server, is able
to make the web server itself generate an enormous amount of HTTP requests to
randomly selected hosts, thus saturating the entire network. This type of attack is
usually recognizable through the so called signatures: every attack is characterized
by a precise message that the attacker sends to its victim, and the signature is the part of
this message that uniquely identifies the attack.
14 Internet Information Server, the Web Server developed by Microsoft.
Service blocking: in this case we want a service, available in one of the LANs, not to
be used from the other LAN. For example, let’s suppose we have a network formed by
several hosts, connected to the Internet, and we want something that prevents
some hosts from accessing web servers outside the network.
Network fingerprinting: we want to identify a system (in terms of operating system and
specific release) on the basis of the packets it sends. It is an important feature for
preventing future attacks.
4.2 Packet filtering approach
As said before, these types of firewalls analyze traffic by inspecting the packet
headers up to level 4 of the OSI model, that is to say up to the TCP or UDP headers.
The usual approach followed by these components is to extract some fields of the
packets and decide if they are legal or not; in practice they perform a sort of packet
classification, depending on these fields, as shown in Figure 12
.
Figure 12 - Packet Filtering approach
For example, let’s suppose we have a LAN, connected to the Internet, and we want
to prevent some hosts of the LAN from accessing the Internet; we can place a firewall
between the LAN and the Internet, and then create a rule like “drop packets from hosts
X, Y, Z going outside”. This rule can be realized by classifying packets through their
source IP address, thus deciding whether a packet can pass or has to be dropped.
In practice, every packet coming in from a physical interface is classified through some
fields of its headers, which are usually:
• IP source and destination addresses,
• transport protocol source and destination ports, in practice the TCP or UDP ports,
• connection state information, through the flags of the TCP header,
• sometimes the packet length.
This kind of firewall is usually sufficient to perform service blocking and to block
simple attacks, but it is not able to deal with ‘application level’ attacks, which are recognizable
only by inspecting the whole content of the packets.
4.3 Stateful inspection approach
These types of firewalls inspect whole packets, not only the headers. The usual
analysis they perform is pattern matching, that is to say they search the packets for a particular
given string, the above mentioned signature.
In practice, a firewall implementing this method is formed by a simple packet
filtering ‘engine’, that pre-filters packets, and then a pattern matching engine that
works on those packets, as shown in Figure 13. It is important to ‘pre-filter’ packets for a
simple reason, that is speed: in fact, pattern matching is a quite complex task, so its use
must be limited to restricted groups of packets.
For example, let’s suppose we want to block packets trying to attack web servers with
virus XYZ; moreover, we know that the signature of the attack is the string “I am
virus XYZ”.
One approach is to inspect all the packets searching for that string; this is too expensive
when we are working on a heavily loaded network. Instead, we can pre-filter all the
packets going to a web server, and then perform pattern matching only on those.
Figure 13 - Stateful inspection approach
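The two-stage scheme can be sketched as follows. `SIG` is the example signature from the text; the pre-filter is reduced to a destination-port check and the payload scan to a plain substring search, which are simplifying assumptions (a real engine would use a length-bounded search on raw bytes rather than `strstr`).

```c
#include <string.h>

static const char SIG[] = "I am virus XYZ";

/* Stage 1: cheap pre-filter; only packets directed to a web
 * server (TCP destination port 80) are worth the expensive scan. */
static int prefilter(int dst_port)
{
    return dst_port == 80;
}

/* Stage 2: pattern matching on the payload, run only on the
 * pre-filtered packets; returns 1 if the signature is found.
 * (strstr requires a NUL-terminated payload.) */
int is_attack(int dst_port, const char *payload)
{
    if (!prefilter(dst_port)) return 0;
    return strstr(payload, SIG) != NULL;
}
```

The pre-filter cuts the expensive scan down to a small fraction of the traffic, which is exactly the point made above.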
This type of firewall is used whenever one wants to protect a network from attacks on
applications; in fact, in this case the only method to recognize attacks is to search
the packets for the signature, and then drop those packets, possibly logging them or
raising an alarm to the user.
Another detail must be pointed out about these firewalls: usually application attacks are
conducted through TCP connections, so the signature has to be searched for in the flows,
rather than in the packets. Moreover, attackers can force the signature to be split into
two or more packets (for example by forcing the MTU15 down to very small values); as a
consequence, it is sometimes not enough to look at the payloads of the single packets, and it
would be very useful to have a system that reconstructs flows before feeding the pattern
matching engine.
15 Maximum Transmission Unit: the maximum number of bytes that can be carried in a single link-layer frame.
4.4 Access Control Lists
In the above paragraphs, we said that packets must satisfy certain user-defined rules in
order to pass the firewall. In the firewall world, rules are often called Access Control
Entries (ACEs), and a group of ACEs is called an Access Control List (ACL).
Every ACE corresponds to a particular type of packet, and it is composed of two main
parts: a key part and an action; the key part uniquely identifies a class of packets, and
the action is what we have to do if a given packet matches that key.
For example, an ACE can be
ip source 130.192.3.0 netmask=255.255.255.0 accept
The first part identifies all the packets originated from the subnet 130.192.3; if the
packet arrives from one of the hosts of that subnet, it must be accepted, that is to say it can
pass through the firewall.
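Matching the key part of such an ACE comes down to a masked comparison, as sketched below; addresses are held as 32-bit integers, and the structure and function names are illustrative.

```c
#include <stdint.h>

/* An ACE key with a mask: the packet matches when its address,
 * restricted to the mask, equals the rule's network. */
struct ace_key {
    uint32_t net;    /* e.g. 130.192.3.0   = 0x82C00300 */
    uint32_t mask;   /* e.g. 255.255.255.0 = 0xFFFFFF00 */
};

int key_matches(const struct ace_key *k, uint32_t src_ip)
{
    return (src_ip & k->mask) == (k->net & k->mask);
}
```

With net 130.192.3.0 and a /24 mask, every source address in 130.192.3.x matches, which is exactly the interval semantics of masked ACE keys.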
As said before, every ACL is formed by several ACEs; two approaches can be followed
in creating ACLs: the order-dependent mode and the order-independent one.
In an order-dependent ACL, ACEs must be evaluated in the order they are listed; when a packet
does not match an ACE, the evaluation skips to the next ACE; the process ends when
the given packet matches an ACE. Here is a small example:
1. ip source 130.192.3.0/24 accept
2. ip dest 130.192.44.25 accept
3. ip source 130.192.2.44 and dest 130.192.11.43 accept
4. all deny
Let’s suppose we have a packet to IP 130.192.44.25; the packet does not match ACE 1,
so ACE 2 is evaluated, and the packet is accepted.
The main feature of this type of ACL is that ACEs are not mutually exclusive, that is to
say a packet can match more than one ACE, but the action taken is always the one of
the first matching ACE. As a consequence, it is not possible to change the order of the
ACEs without modifying the behavior of the ACL.
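Order-dependent evaluation is then a linear scan that returns the action of the first matching ACE. The sketch below is a simplified model with source/destination keys only; the ACCEPT/DENY constants and the zero-mask convention for “any” are assumptions made for the example.

```c
#include <stdint.h>

enum action { DENY = 0, ACCEPT = 1 };

struct ace {
    uint32_t src, src_mask;   /* mask 0 means wildcard ("any") */
    uint32_t dst, dst_mask;
    enum action act;
};

/* First-match semantics: ACEs are tried in order and the action of
 * the first one matched is taken; reordering the list may change
 * the result, which is why these ACLs are order-dependent. */
enum action acl_eval(const struct ace *acl, int n,
                     uint32_t src, uint32_t dst)
{
    for (int i = 0; i < n; i++)
        if ((src & acl[i].src_mask) == (acl[i].src & acl[i].src_mask) &&
            (dst & acl[i].dst_mask) == (acl[i].dst & acl[i].dst_mask))
            return acl[i].act;
    return DENY;               /* implicit final "all deny" */
}
```

Replaying the example ACL of the text: a packet to 130.192.44.25 fails ACE 1 but matches ACE 2, so it is accepted without ever evaluating ACE 3.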
On the contrary, order-independent ACLs are characterized by the fact that ACEs are
mutually exclusive, that is to say a packet always matches one and only one ACE. As a
consequence, ACEs can be scrambled without modifying the behavior of the ACL.
In practice, existing firewalls always use order-dependent ACLs.
A typical feature of ACEs is that they usually work with masks: the key does not
correspond to a fixed value, but rather to an interval; since each ACE usually has a
different mask, this can cause big problems for the implementation of such systems.
4.5 Requirements
As stated before, there are two key elements to realize firewalls: a system to handle
ACLs, if we want to build a packet filtering engine, and something for pattern matching,
if we want to deal with stateful inspection.
First of all, we have to analyze each packet to see if it carries a protocol of interest,
and then we must extract those header fields that are used to match the ACEs of an
ACL. Usually, we extract source and destination addresses at data-link, network and
transport layer of the OSI model, in practice MAC and IP addresses and TCP or UDP
ports. It is not unusual to extract some flags of the headers, too, for example the
connection state flags of the TCP header. We can call these extracted fields the key of
the packet.
This task is quite simple to perform, since packet headers are usually in well known
positions, and it is the usual job of selective packet capture systems. An example, using
the ACL described in the previous paragraph, is shown in Figure 14.
[Figure 14 – Working principle of a packet filtering firewall: the firewall extracts the IP source (130.192.2.44 = 0x82C0022C) and the IP destination (130.192.11.43 = 0x82C00B2B) from the packet and evaluates the four-entry ACL of the previous paragraph; the packet matches ACE n. 3, so it is accepted.]
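The key-extraction step of Figure 14 can be sketched as follows. This is an illustrative sketch with hypothetical helper names; the offsets assume a plain Ethernet II frame (EtherType at offset 12, IPv4 source and destination addresses at offsets 26 and 30) with no VLAN tags.

```python
# Illustrative sketch: extracting the "key" (IP source and destination)
# from a raw Ethernet + IPv4 frame, as a packet filtering firewall would.
import struct

def extract_key(frame: bytes):
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype != 0x0800:                           # not an IPv4 packet
        return None
    ip_src = struct.unpack_from("!I", frame, 26)[0]   # 14 (Ethernet) + 12
    ip_dst = struct.unpack_from("!I", frame, 30)[0]   # 14 (Ethernet) + 16
    return ip_src, ip_dst

# Build a minimal frame carrying 130.192.2.44 -> 130.192.11.43,
# i.e. the hex values 0x82C0022C and 0x82C00B2B of Figure 14.
frame = bytearray(34)
struct.pack_into("!H", frame, 12, 0x0800)
struct.pack_into("!I", frame, 26, 0x82C0022C)
struct.pack_into("!I", frame, 30, 0x82C00B2B)
src, dst = extract_key(bytes(frame))
print(hex(src), hex(dst))   # 0x82c0022c 0x82c00b2b
```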
The problem is to realize a fast system to deal with ACLs, that is to say we need
something that is fast to match the ACEs.
The simplest possible approach is to keep ACLs as they are defined by the user, and
check each ACE in order, stopping at the first matched ACE. This approach can be used
with ordered and unordered ACLs, since unordered ones are just a special case of the
ordered ones. An approach of this type has the big advantage that user defined ACLs
can be used without any modification, but it has also two big disadvantages:
4.5 Requirements
47
• it does not scale when the number of ACEs grows, since in the worst case each ACE
must be checked;
• performance highly depends on the order of the ACEs: if an ACE matching 90% of
the keys16 is put at the end of the ACL, 9 keys out of 10 have to be checked against all
the previous ACEs before a successful match. It is no coincidence that it is usually
advised to put ACEs that match a large percentage of packets at the beginning of the
ACL.
Another idea could be to manage ACLs as lookup tables, considering each entry as
an ACE. This approach has two big problems:
• if the ACL is order-dependent, several ACEs can match a key; in this case we have
to consider only the first matching ACE, in order; in a lookup table, instead, there is a
1-to-1 correspondence between each key (and so each packet) and each entry in the
table;
• each ACE is formed by two parts, a value and a mask, which vary among ACEs: the
matching algorithm must apply the mask to the given key, and then check whether the
result is equal to the value associated with the mask.
As a consequence, because of the lack of a 1-to-1 correspondence between keys
(extracted from the packet headers) and ACEs, it is impossible to manage a lookup table
made of ACEs with the usual search algorithms.
The first problem can be solved by transforming ordered ACLs into unordered
ACLs; let’s suppose we have an ordered ACL with three ACEs,
ACE1 → A
ACE2 → B
ACE3 → C
where A, B and C represent a type of packets.
We can transform this ACL into the following one
ACE'1 → A
ACE'2 → (A')B
ACE'3 → (A')(B')C
where (A') means packets not of type A, and the product means the logical AND.
In practice, each new ACE is the logical AND of the complements of all the previous
ACEs with the original ACE. The new ACL is equivalent to the original one for this
reason: in the
16 Remember that there is an almost 1-to-1 correspondence between a packet and a key.
original ACL, ACEi was evaluated only if the i−1 preceding ones were not satisfied,
that is to say if their complements were satisfied.
The three new ACEs are mutually exclusive, so the new ACL is order independent.
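The transformation ACE'i = (A'1)(A'2)…(A'i−1)Ai can be sketched as follows. This is an illustrative sketch: ACEs are modeled as plain boolean predicates, and the helper name `make_order_independent` is our own.

```python
# Illustrative sketch: turning an order-dependent ACL into an equivalent
# order-independent one. Each new ACE matches only the packets that its
# original matched AND that none of the previous ACEs matched, so the new
# ACEs are mutually exclusive.
def make_order_independent(aces):
    new_aces = []
    for i, (pred, action) in enumerate(aces):
        previous = [p for p, _ in aces[:i]]
        def new_pred(x, pred=pred, previous=previous):  # bind loop variables
            return pred(x) and not any(p(x) for p in previous)
        new_aces.append((new_pred, action))
    return new_aces

A = lambda x: x < 10          # "type A" packets (toy predicates over integers)
B = lambda x: x % 2 == 0      # "type B" packets
ordered = [(A, "a"), (B, "b"), (lambda x: True, "c")]   # catch-all last ACE
unordered = make_order_independent(ordered)

# Exactly one new ACE matches any given key, whatever the evaluation order:
for x in range(50):
    assert sum(1 for p, _ in unordered if p(x)) == 1
```

Since the last original ACE is a catch-all, every key matches exactly one transformed ACE, so the transformed ACL can be evaluated in any order.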
The second problem is much more complicated to solve; a good starting point is to
group the ACEs that have the same mask, and then apply a normal search algorithm on
each group. Let’s suppose we have the following ACEs (they are IPv4 subnets), in an
order independent ACL.
ACE1 → 130.192.28.0 mask 255.255.255.0
ACE2 → 130.192.134.24 mask 255.255.255.248
ACE3 → 130.193.233.0 mask 255.255.255.0
ACE4 → 10.11.128.0 mask 255.255.255.248
ACE5 → 130.192.16.128 mask 255.255.255.248
ACE6 → 84.112.11.0 mask 255.255.255.0
ACE7 → 130.192.65.0 mask 255.255.255.0
Since ACE1, ACE3, ACE6 and ACE7 (and also ACE2, ACE4 and ACE5) have the same
mask, we can group them into a ‘super’ ACE, as shown below
ACE'1 → mask=255.255.255.0 value1= 130.192.28.0
value2= 130.193.233.0
value3= 84.112.11.0
value4= 130.192.65.0
ACE'2 → mask=255.255.255.248 value1= 130.192.134.24
value2= 10.11.128.0
value3= 130.192.16.128
Then, we analyze each super ACE by first masking the given key, and then
performing an ordinary search among the values associated with that super ACE. The
process is repeated for each super ACE, until a match happens.
This approach gives good results only if an ACL contains few different masks repeated
several times in the ACEs; unfortunately, usual ACLs do not have this property.
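The super-ACE technique can be sketched as follows; it is an illustrative sketch in which IPv4 addresses and masks are plain 32-bit integers and the helper names (`ip`, `super_aces`, `lookup`) are our own.

```python
# Illustrative sketch: grouping ACEs that share the same mask into a "super
# ACE", so that each group can be searched with an ordinary hash lookup on
# the masked key.
def ip(s):
    a, b, c, d = map(int, s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

aces = [                                   # (value, mask), from the example above
    (ip("130.192.28.0"),   ip("255.255.255.0")),
    (ip("130.192.134.24"), ip("255.255.255.248")),
    (ip("130.193.233.0"),  ip("255.255.255.0")),
    (ip("10.11.128.0"),    ip("255.255.255.248")),
    (ip("130.192.16.128"), ip("255.255.255.248")),
    (ip("84.112.11.0"),    ip("255.255.255.0")),
    (ip("130.192.65.0"),   ip("255.255.255.0")),
]

# Build one set of values per distinct mask: two "super ACEs" in this example.
super_aces = {}
for value, mask in aces:
    super_aces.setdefault(mask, set()).add(value)

def lookup(key):
    """Mask the key with each super ACE's mask, then do an ordinary search."""
    for mask, values in super_aces.items():
        if (key & mask) in values:
            return True
    return False

print(lookup(ip("130.192.28.77")))    # True: falls in ACE1's /24 subnet
print(lookup(ip("130.192.134.30")))   # True: falls in ACE2's /29 subnet
print(lookup(ip("8.8.8.8")))          # False
```

With few distinct masks, each lookup costs one masked hash probe per super ACE; with one mask per ACE the method degenerates to the linear scan it was meant to replace, which is exactly the limitation noted above.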
The second requirement of a firewall is the ability to perform pattern matching:
the task is quite simple, but it is necessary to implement some sort of flow
reconstruction, due to the problems described in paragraph 4.3. It is not necessary to
have a true flow reconstruction; rather, we need a part of the flow large enough
to permit the pattern matching operation, that is to say at least the length of the
signature. The flow reconstruction must also satisfy another requirement: we have said
that it is enough to receive a block of the flow, but it is also necessary that the various
blocks of a flow overlap, otherwise the pattern matching system cannot recognize the
signature when it is split into two blocks, as shown in Figure 15 and Figure 16.
[Figure 15 - Flow reconstruction without overlapping: each block of the flow is compared with the signature, but the signature is split across two non-overlapping blocks, so it is not recognized by the pattern matching system.]
[Figure 16 - Flow reconstruction with overlapping blocks: the signature falls entirely inside one of the overlapping blocks, so it is recognized by the pattern matching system.]
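The situation of Figures 15 and 16 can be sketched as follows. This is an illustrative sketch with hypothetical names: the flow, the block size and the ATTACK signature are made up, and an overlap of at least len(signature) − 1 bytes is what guarantees the signature falls entirely inside some block.

```python
# Illustrative sketch: scanning a flow in fixed-size blocks. Without overlap,
# a signature straddling a block boundary is missed; with enough overlap it
# is always found.
def scan(flow: bytes, signature: bytes, block: int, overlap: int):
    step = block - overlap                 # distance between block starts
    for start in range(0, len(flow), step):
        if signature in flow[start:start + block]:
            return True
        if start + block >= len(flow):     # last block already examined
            break
    return False

flow = b"x" * 29 + b"ATTACK" + b"x" * 29   # signature straddles offset 32
assert scan(flow, b"ATTACK", block=32, overlap=0) is False          # Figure 15
assert scan(flow, b"ATTACK", block=32,
            overlap=len(b"ATTACK") - 1) is True                     # Figure 16
```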
4.6 Network Intrusion Detection Systems
Network intrusion detection systems, usually known as IDS or NIDS, are network
components halfway between firewalls and monitoring systems; they are used to detect
network attacks, like firewalls, but they are not able to drop packets; in fact they work
like a monitoring system: they analyze the traffic present on a LAN, and warn the user
about malicious packets seen on the wire.
The most famous one is snort, a freeware program based on LibPcap; it is able to
detect a variety of network attacks by analyzing packets seen on the wire.
First of all, it is able to discover attacks through signatures; in practice, it is able to
reconstruct flows into blocks, and then perform pattern matching on them;
basically, it behaves as a stateful inspection firewall, deprived of the ability to drop
packets.
It is also able to simply filter packets on the basis of their headers; in this case it behaves
as a packet filtering system, or a monitoring system.
Moreover it is capable of signaling the user of malicious traffic in several ways:
• by logging a message to the system logger17;
• by sending a WinPopup message to a Windows station;
• by writing a message to a log file.
The main difference between a monitoring system and an IDS is that the former
gives the user statistical data relative to the network traffic, while the latter has to signal
the user of malicious packets seen on the wire.
As a consequence, the requirements of such systems are almost the same as the ones
needed by monitoring systems and firewalls:
• classify packets on the basis of some fields of the headers;
• perform a sort of flow reconstruction to do pattern matching;
• a logging system to give the user some messages relating to malicious traffic;
• a means to signal the user of malicious traffic seen on the network.
4.7 Network Address Translators
A network address translator, usually known as NAT, is a network component that is
used to masquerade the hosts of a LAN to the outside world, usually Internet, by
changing the source and destination addresses of the packets. It always works at level 3,
that is to say it deals with IP addresses. As a consequence it is not a method for
protecting a network from external attacks, like firewalls do; nevertheless, it acts in a
way that is similar to firewalls: it captures network packets from an interface, analyzes
(and modifies) them and then it sends them out of another network interface. Moreover,
it is not unusual to have commercial firewalls that implement this feature.
In practice, each machine of the LAN is assigned an IP address, which is usually
taken from the private ranges18, for example 192.168.1.5; when a packet originating
from this host has to go outside the LAN, the NAT replaces the source IP address with
another one, which has to be public.
The NAT can use two techniques to map the internal (private) addresses to the public ones:
• a 1-to-1 correspondence; in this case a different public IP address corresponds to
each private IP address;
17 Snort works under *nix operating systems, and it is therefore able to log messages in the syslog.
18 See RFC xxxx for more details on private IP addresses.
• a 1-to-n correspondence; all the private IP addresses correspond to a single public IP
address.
The first approach is quite simple to realize, since it is enough to keep a table of the
corresponding private and public addresses, like the one in the table below
Public IP Private IP
130.192.12.55 192.168.10.1
130.192.12.56 192.168.10.2
130.192.12.57 192.168.10.4
130.192.12.58 192.168.10.5
The second approach is a lot more complicated; this is because a packet arriving from
outside the LAN always has the same destination IP address, so there is no means to
know to which particular host of the LAN the packet is directed. To solve the problem,
the NAT also modifies the level 4 header, by changing the TCP source port of outgoing
packets, and the destination port of incoming packets; in practice it is the TCP port that
identifies the host originating the packet. It is clear that such an approach works only if
the transport layer is TCP (or UDP, with some limitations).
This second technique has a particular feature that can be very useful: it is impossible
to open a TCP connection from outside the LAN to the inside; this characteristic
implicitly guarantees that internal servers cannot be accessed, and consequently
attacked, from the outside.
It is quite simple to realize a NAT: the requirements are a lookup table to store the
private–public address pairs, and the ability to modify packets rather than only
analyzing them. The former constraint is the same that was needed by firewalls and
monitoring systems; the latter, indeed, is unique to NATs.
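The two mapping techniques can be sketched as follows. This is an illustrative sketch, not a real NAT implementation: the class name, the port range starting at 40000 and the table layout are our own assumptions.

```python
# Illustrative sketch of the two NAT mapping techniques described above:
# a 1-to-1 address table, and a 1-to-n table that tells internal hosts apart
# by the TCP source port rewritten on the way out.
one_to_one = {                       # private -> public, as in the table above
    "192.168.10.1": "130.192.12.55",
    "192.168.10.2": "130.192.12.56",
}

class Masquerade:
    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 40000       # hypothetical port range for rewriting
        self.by_port = {}            # public port -> (private ip, private port)

    def outgoing(self, src_ip, src_port):
        """Rewrite (private ip, port) -> (public ip, fresh public port)."""
        port = self.next_port
        self.next_port += 1
        self.by_port[port] = (src_ip, src_port)
        return self.public_ip, port

    def incoming(self, dst_port):
        """Map the public destination port back to the internal host."""
        return self.by_port.get(dst_port)

nat = Masquerade("130.192.12.55")
pub = nat.outgoing("192.168.1.5", 1025)
print(pub)                  # ('130.192.12.55', 40000)
print(nat.incoming(40000))  # ('192.168.1.5', 1025)
print(nat.incoming(50000))  # None: no connection was opened from the inside
```

The final lookup failing for an unknown port illustrates the implicit protection noted above: an unsolicited connection from the outside maps to no internal host.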
4.8 ACLs and routers
There are also other network components that make use of the ACLs, in a slightly
different way: routers. In this case the ACLs can be used to route packets out of a
particular interface on the basis of their source and destination addresses.
This is done for several reasons, one of them is Quality of Service: for example we
want to use a faster outbound interface for packets coming from a particular subnet.
5 The original BPF architecture and WinPcap
The Berkeley Packet Filter architecture was presented about ten years ago by Van
Jacobson and Steven McCanne of the Lawrence Berkeley Laboratory. The rationale of
the project was to realize an efficient system for user-level packet filtering under BSD
Unix, that is to say a system to acquire packets from the network and deliver them to
user-level applications.
The project pursued several requirements, among which:
• performance;
• protocol and data-link independence;
• high programmability.
The adopted philosophy is to minimize the number of packet copies by means of a
kernel agent, the kernel filter, which discards uninteresting packets without copying
them. This approach also minimizes the number of context switches19, since the number
of packets passed to user level is limited to those of interest.
5.1 Components of the BPF architecture
The so-called kernel filter is a system that analyzes each packet traversing the net,
deciding through a series of user-defined rules whether it must be copied to the user or
discarded; an example rule can be ‘accept IP packets’. It is realized through the so-
called network tap, which collects packets from the net, and a virtual register-based
processor, the BPF virtual machine, fed by the network tap, which executes user-defined
code, the BPF filter (the processed version of the user-defined rules), for each packet
seen on the wire; an example of a BPF filter is shown in Figure 17.
[Figure 17 – How the BPF compiler works: pcap_compile translates the filter string "ip and src host 10.10.10.10" into BPF code, which pcap_setfilter loads into the BPF virtual processor:
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 5
(002) ld [26]
(003) jeq #0x0A0A0A0A jt 4 jf 5
(004) ret #1514
(005) ret #0]
19 Context switch is the process by which data traverse the user/kernel boundary, and it is performed whenever a packet must be transferred from the network to a user application.
The second component is a kernel buffer in which packets, accepted by the kernel
filter, are stored before being passed to the user.
The third component is a user library, exporting a set of APIs20, that hides from the
user the low-level details of the architecture. For example, this library is shipped
with a compiler that creates the code used by the BPF virtual machine from a human-
readable string describing the wanted packets (e.g. “ip”).
A graphical representation of such components is shown in Figure 18.
In the next paragraphs we will give a brief description of each of the components.
[Figure 18 - The original BPF architecture: packets from the network traverse the NIC driver and the network tap inside the BPF device driver, alongside the other protocol stacks; at kernel level, each capture instance has its own filter and its own pair of kernel buffers (a store buffer and a hold buffer); at user level, applications access the architecture either through calls to the libpcap library (usually included during compilation), each with its own user buffer, or through user code with direct access to the BPF device.]
5.1.1 Network tap and BPF virtual machine
The network tap is closely linked with the network cards (NICs), in particular with the
data-link layer; it collects all the packets from the net and feeds the BPF virtual
machine. In practice, when a packet is seen by the network card, the NIC triggers the
tap21, which in turn feeds the BPF machine.
The BPF virtual machine is the core of the whole filtering mechanism. It is
composed of two 32-bit registers, the accumulator A and the index register X, plus an
implicit program counter; it is equipped with a memory of 16 32-bit words. The packet
to be processed is seen by the machine as a ROM.
Each instruction is composed of four elements:
code: it contains the opcode of the instruction;
k: it contains a value passed to the instruction, for example in LOAD instructions it can
contain a constant or an address;
20 Application Programming Interface.
21 The NIC triggers the usual protocol stacks too, for example the TCP/IP stack.
jt: it contains the number of instructions to be skipped forward; it is used only in
conditional JUMP instructions; the program counter is incremented by this value
when the condition coupled with the instruction is true;
jf: it is almost the same as jt, but the program counter is incremented when the
condition is false.
jt and jf are positive values, so, since they are relative skips of the program
counter, backward jumps are not realizable.
The instruction set is composed of six classes of instructions:
• LOAD instructions
- from the packet: it loads a byte, a half-word22 or a word to register A, in direct
mode (i.e. k contains the offset to load from in the packet) or indirect mode
(i.e. register X contains the base address, and k the offset); a special LOAD
instruction is also available to load the HLEN23 field of the IP header into register X
with a single instruction;
- immediate: it loads k value directly to register A or X;
- immediate packet size: there is a special load instruction to put the packet length in
either register A or X;
- from memory: it loads a 32-bit word from memory to either register A or X, only in
direct mode.
• STORE instructions permit storing a value from register A or X to a location in
memory. They work only in direct mode.
• JUMP instructions: BPF makes available a non-conditional jump and a series of
conditional jumps; the conditions compare register A with either k or register X; as
stated before, it is not possible to perform backward jumps.
• Arithmetic instructions: it is possible to perform a series of arithmetic (“+” “-”
“×” “/”) and logical (“AND” “OR” “XOR” “NOT”) operations on the
accumulator; the other operand can be either k or register X, the result is always
stored in register A. The processor supports only integer values.
• RETURN instructions: it is possible to return either k or A. This class of instructions
interrupts the program flow immediately; the returned value has a special meaning:
0 means that the packet must be dropped;
-1 means that the entire packet must be passed to the user (by copying it to the
kernel buffer);
n>0 means that the first n bytes of the packet must be passed to the user.
22 A half-word is 16 bits wide.
23 Header Length. This instruction was added because otherwise calculating the header length required a lot of BPF instructions.
• Other instructions: two other instructions are defined, TXA and TAX, that transfer
the value of a register (A or X) to the other.
If the machine fails to execute an instruction, for example because a division by zero
is attempted, it implicitly executes a “RET 0” instruction; if no code is
provided to the virtual machine, it implicitly executes a “RET -1” instruction.
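The semantics described above can be sketched with a tiny interpreter running the filter of Figure 17 ("ip and src host 10.10.10.10"). This is an illustrative sketch, not BPF's real binary encoding: the kernel encodes each instruction as a struct with numeric code, jt, jf and k fields, while here we use string opcodes; jt and jf are forward offsets added to the program counter, as the text explains.

```python
# Illustrative sketch: a minimal interpreter for a subset of the BPF virtual
# machine (accumulator A, implicit pc; X, memory and the other instruction
# classes are omitted for brevity).
def run_filter(program, packet):
    A = 0
    pc = 0
    while True:
        op, k, jt, jf = program[pc]
        pc += 1
        if op == "ldh":                  # load a half-word from the packet
            A = int.from_bytes(packet[k:k + 2], "big")
        elif op == "ld":                 # load a word from the packet
            A = int.from_bytes(packet[k:k + 4], "big")
        elif op == "jeq":                # conditional forward jump
            pc += jt if A == k else jf
        elif op == "ret":                # end of program: 0 = drop
            return k

program = [                              # the filter of Figure 17;
    ("ldh", 12, 0, 0),                   # jt/jf here are relative skips
    ("jeq", 0x800, 0, 3),                # true -> instr 2, false -> instr 5
    ("ld", 26, 0, 0),
    ("jeq", 0x0A0A0A0A, 0, 1),           # true -> instr 4, false -> instr 5
    ("ret", 1514, 0, 0),
    ("ret", 0, 0, 0),
]

ip_packet = bytearray(34)
ip_packet[12:14] = (0x0800).to_bytes(2, "big")      # EtherType = IPv4
ip_packet[26:30] = (0x0A0A0A0A).to_bytes(4, "big")  # src host 10.10.10.10
print(run_filter(program, bytes(ip_packet)))        # 1514: packet accepted

other = bytearray(ip_packet)
other[26:30] = (0x0A0A0A0B).to_bytes(4, "big")      # different source host
print(run_filter(program, bytes(other)))            # 0: packet dropped
```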
5.1.2 Kernel packet buffer
This is the buffer in which packets are stored after being accepted by the virtual
machine, and before being passed to the user. This is one of the components that make
the whole system efficient: during network bursts, the application is usually put in a
wait state, since the tap has to process packets and runs at a higher priority, so
packets must be stored somewhere at kernel level.
The BSD realization of this buffer is based on two buffers, the store and the hold
buffer; packets accepted by the virtual machine are put in the store buffer, while user
applications receive the hold buffer; when the store buffer is full, and the hold one is
empty, they are exchanged, so that the application can receive new packets and the
virtual machine is able to store new accepted packets. The default size of each of the
two buffers is 4 kilobytes.
Buffer size is one of the most critical details of the whole BSD architecture:
• a bigger buffer generally guarantees fewer lost packets and fewer context switches;
nevertheless, applications will receive packets less promptly, since packets tend to
remain a long time in the store buffer;
• a smaller buffer solves the above-mentioned problem of prompt receiving, but
packet losses and context switches generally increase, since the buffer is able to store
a smaller number of packets at a time, and consequently it fills up earlier during
network bursts.
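The store/hold exchange can be sketched as follows. This is an illustrative sketch with hypothetical names, storing whole packets in Python lists rather than bytes in fixed buffers, and rotating on read as the BSD implementation does when its read timeout expires.

```python
# Illustrative sketch of the BSD double-buffer scheme: the tap fills the
# store buffer, the application drains the hold buffer, and the two are
# exchanged when the store buffer is full and the hold buffer is empty.
class DoubleBuffer:
    def __init__(self, size):
        self.size = size
        self.store = []      # filled by the kernel filter
        self.hold = []       # drained by the application

    def tap_deliver(self, packet):
        if len(self.store) == self.size:
            if self.hold:                # both buffers full: packet lost
                return False
            self.store, self.hold = self.hold, self.store   # exchange
        self.store.append(packet)
        return True

    def app_read(self):
        """Drain the hold buffer, rotating first if it is empty."""
        if not self.hold:
            self.store, self.hold = self.hold, self.store
        drained, self.hold = self.hold, []
        return drained

buf = DoubleBuffer(size=2)
for p in ("p1", "p2", "p3"):
    buf.tap_deliver(p)       # delivering p3 triggers the exchange
first = buf.app_read()
second = buf.app_read()
print(first, second)         # ['p1', 'p2'] ['p3']
```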
As described in a following paragraph, the Win32 porting of the architecture uses a
totally different realization approach of this buffer, to obtain better performances.
5.1.3 LIBPCAP library
This is the user-level component of the architecture; this ANSI-C library provides a
series of APIs to interact with the kernel part of BPF. It hides from the user details like
the system calls used to set and retrieve data to and from the kernel part of the architecture,
offering functions and data structures that are totally independent from the operating
system24 and from the various data-link layers the library is able to work on.
These APIs can be divided into three main groups:
• control APIs: there is a series of functions to open a capture instance, close it and set
various parameters;
• compiling APIs: the library is shipped with a special compiler that translates rules to
filter packets from a semi-human readable format to the code for the BPF machine.
Examples of these user rules are “ip”, “arp”, “tcp src port (21 or
80)”, “ether 1:2:3:4:5:6”; the generated code, the so called BPF filter, is
injected in the kernel through a control API;
• APIs to receive packets: the process of receiving packets is built around the so-
called dispatcher_handler. This function is a callback, that is to say a function
defined by the user, whose prototype is fixed by the library itself; it is called
whenever a packet accepted by the BPF virtual machine is transferred to user level.
The dispatcher_handler receives as parameters a buffer containing the packet (or a
fragment of it, whose size is defined by the user) and its timestamp. For example, a
dispatcher_handler can print out the whole packet on screen, or perform some
special processing. An application that wants to receive packets must define this
function and then call one of the two library functions devoted to packet receiving,
pcap_loop() and pcap_dispatch(), passing them a pointer to the defined callback. In
turn, the receive function continuously transfers packets from kernel to user level and
calls the callback.
A detail must be pointed out: the library allocates a user-level buffer to store packets
at user level; in practice, packets are first copied from the kernel buffer to the user
buffer, and then the dispatcher_handler works on the packets stored in the user-level
buffer.
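The callback pattern just described can be sketched as follows. This is not the real libpcap C API (pcap_loop() takes a pcap_t handle, a packet count, the callback and a user argument); it is a toy model of the same pattern, with hypothetical names, where each batch stands for one kernel-to-user buffer copy.

```python
# Illustrative sketch of the pcap_loop() callback pattern: packets arrive in
# batches (one kernel -> user copy each), and the user-defined
# dispatcher_handler is invoked once per packet with its timestamp and data.
def pcap_loop_sketch(batches, dispatcher_handler):
    """batches: iterable of lists of (timestamp, packet) tuples."""
    processed = 0
    for user_buffer in batches:                 # one READ per batch
        for timestamp, packet in user_buffer:
            dispatcher_handler(timestamp, packet)   # user callback
            processed += 1
    return processed

seen = []
n = pcap_loop_sketch(
    [[(0.001, b"pkt1"), (0.002, b"pkt2")], [(0.010, b"pkt3")]],
    lambda ts, pkt: seen.append(pkt),
)
print(n, seen)    # 3 [b'pkt1', b'pkt2', b'pkt3']
```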
5.2 Interaction between LIBPCAP and the BPF virtual machine
The typical method by which a kernel component communicates with a user
application is a device, on which it is possible to perform OPEN, CLOSE, READ,
WRITE and IOCTL system calls. The virtual machine, running at kernel level, exports a
device (/dev/bpf) on which it is possible to perform OPENs to create a capture instance,
READs to retrieve packets from the kernel buffer, and IOCTLs, to set and query some
parameters, for example to inject the BPF code in the kernel.
24 During the years the architecture was gradually ported to a variety of operating systems, but the APIs exported by LIBPCAP remained the same, so that all applications written upon this library could be used without modifications.
Therefore, LIBPCAP control APIs are wrapper functions to IOCTLs and OPEN
system calls.
The above-mentioned pcap_loop() and pcap_dispatch(), used to receive packets,
perform two tasks: at first, they call READ to transfer data from the kernel buffer to the
user buffer, then they call the dispatcher_handler as long as the user buffer contains
some packets.
5.3 Implementation under Microsoft Operating Systems: WinPcap
The porting of the BPF architecture and LIBPCAP to Microsoft 32-bit operating
systems was started about three years ago by the NetGroup at Politecnico di Torino,
under the project name WinPcap.
Microsoft Windows provides an architecture, called NDIS25, for developing new
network components that interact with existing ones, as shown in Figure 19. NDIS
gives the possibility to create three kinds of components (more precisely, drivers):
miniport, intermediate and protocol drivers.
Miniport drivers are those that communicate directly with the NICs, thus they depend
on the particular network card; a driver of this type is, for example, the one
for the 3COM 3C509 NICs.
[Figure 19 - The NDIS framework: packets enter from the network through the Network Interface Card; inside kernel space, NDIS stacks the NIC driver (miniport driver) at the NDIS lower edge, an optional intermediate driver at the NDIS intermediate interface, and the protocol driver at the NDIS upper edge, just below user space.]
Protocol drivers are those that implement a particular protocol stack, for example
the TCP/IP stack.
Intermediate drivers are a sort of mixture of the previous ones, and are used for
special purposes, for example to build personal firewalls26.
25 Network Driver Interface Specification.
26 Personal firewalls are agents that filter packets ingoing and outgoing from a host on the basis of several user-defined ACLs. They are used to protect PCs from attacks coming from the net.
WinPcap is composed of three components: a protocol driver (packet.sys) and two
user level libraries, implemented as DLLs27, packet.dll and wpcap.dll.
The protocol driver contains the tap, the BPF virtual machine and the kernel buffer;
the first DLL, packet.dll, is OS-dependent, and exports a series of OS-independent APIs
used by the other DLL, wpcap.dll. It is actually a wrapper around the various OS-specific
system calls used to communicate with the protocol driver.
wpcap.dll is the implementation of the LIBPCAP library, and it is OS-independent
since it relies on the APIs exported by packet.dll.
A graphical description of WinPcap is shown in Figure 20.
[Figure 20 - WinPcap architecture: packets flow from the network into the protocol driver containing the BPF machine, then up through Packet.dll and Wpcap.dll to the application.]
The architecture is similar to that of the TCP/IP stack; in fact, that stack is
implemented through two components: a protocol driver (tcpip.sys) and a user-level
DLL (winsock.dll) that communicates with the protocol driver and exports a series of
APIs, the so-called sockets, used by applications based on TCP/IP.
5.3.1 Differences from the original BPF architecture
WinPcap has some differences from the original architecture; however, all of them
are either implementation modifications, thus transparent to the user level APIs, or new
features.
The main implementation difference regards the kernel buffer: WinPcap does not
implement two kernel buffers, store and hold, but a circular buffer; the user application
(through the LIBPCAP APIs) retrieves data from the head of the buffer, and the BPF
virtual machine inserts packets at the tail of the buffer, as shown in Figure 21. The
default size of the buffer is 1 Megabyte.
27 Dynamic link libraries.
[Figure 21 – Interaction between the driver and the overlying DLLs: the network tap feeds the filter inside the packet capture driver, which inserts accepted packets into the kernel buffer; at user level, Packet.dll retrieves them through read/write/IOCTL system calls and hands them to Wpcap.dll and the application.]
A feature that is closely linked to this buffer is the delayed write: packets are
transferred to user level only when the kernel buffer contains enough data (usually
some kilobytes).
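The circular buffer with its delivery threshold can be sketched as follows. This is an illustrative sketch with hypothetical names and toy sizes; the real WinPcap buffer is 1 megabyte, holds variable-length packets with headers, and is filled and drained concurrently.

```python
# Illustrative sketch of WinPcap's circular kernel buffer with "delayed
# write": the tap inserts bytes at the tail, and a read delivers data to
# user level only once the buffer holds at least `threshold` bytes.
class CircularBuffer:
    def __init__(self, size, threshold):
        self.buf = bytearray(size)
        self.size = size
        self.threshold = threshold
        self.head = 0            # next byte the application will read
        self.tail = 0            # next free byte for the tap
        self.used = 0

    def insert(self, data: bytes):
        if self.used + len(data) > self.size:
            return False                       # buffer full: data lost
        for b in data:                         # wrap around at the end
            self.buf[self.tail] = b
            self.tail = (self.tail + 1) % self.size
        self.used += len(data)
        return True

    def read(self):
        if self.used < self.threshold:
            return b""                         # delayed: not enough data yet
        out = bytearray()
        while self.used:
            out.append(self.buf[self.head])
            self.head = (self.head + 1) % self.size
            self.used -= 1
        return bytes(out)

cb = CircularBuffer(size=16, threshold=8)
cb.insert(b"abcd")
empty = cb.read()
cb.insert(b"efghij")
data = cb.read()
print(empty, data)    # b'' b'abcdefghij'
```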
WinPcap offers a series of enhancements over the original BPF/LIBPCAP
architecture:
• the ability to send packets; it is also possible to instruct the driver to send multiple
copies of the same packet, for testing purposes;
• statistical mode; it is possible not to retrieve packets to user level, but rather to count
packets and bytes accepted by the BPF machine, and then deliver only these counters
to applications. This operating mode can be used to perform statistics at kernel level,
i.e. without copying packets to user level and then processing them, thus obtaining
better results28.
The last difference regards the APIs themselves: as stated before, WinPcap is formed by
two user-level DLLs, each one exporting a set of APIs. As a consequence, it is possible
to use the usual functions of the LIBPCAP library, exported by wpcap.dll, or a set of
lower-level APIs, exported by packet.dll. Both sets of APIs are independent
of the various Windows flavors: at the moment of writing they support Windows 95,
98, ME, NT4, 2000 and XP.
28 Packets are not copied in statistical mode (0-copy system).
6 The Table Management Engine (TME)
The goal of this chapter is to point out why the original BPF architecture is
insufficient to perform new tasks, like high performance monitoring, IDS29 or firewalls
and how it can be extended in order to offer these new functionalities, maintaining
backward compatibility with the original BPF system. Then this chapter will present a
detailed description of the proposed architecture.
6.1 Limits of the original BPF architecture
The BPF architecture has proved to be very efficient, and yet quite simple to
use (this is also due to the excellent compiler, which translates filters from ‘human’
language, e.g. tcp port 80, to BPF assembly code); however, it was created with
the main purpose of selective packet capture; although it is possible to use it for other
tasks, it has some limits, among which:
• a small and limited amount of memory in which to store intermediate results;
• the impossibility of modifying the packet;
• a simple instruction set, not tailored for complex tasks like packet classification.
In the following paragraphs we will explain why these limits represent a bottleneck
for the extended use of the architecture, and why an extension is needed.
6.1.1 Small and limited memory
Small RAM: the machine has 16 32-bit words of RAM, which is not enough for
any interesting task; this limit could be removed by increasing the number of words of
memory.
Limited RAM management: the machine has big limitations in the way memory is
managed, that restrict its use:
• it’s not possible to address it per byte (and consequently only 32 bit LOADs and
STOREs exist30);
• data is stored in host-byte-order, which is little-endian31 on INTEL processors;
unfortunately network-byte-order (the order in which bytes are transmitted on the
wire) is big-endian32; so, if we save parts of the original packet to RAM, it is always
necessary to swap bytes in order to obtain the original flow. The problem is
29 Intrusion Detection Systems.
30 However, it is possible to perform 8- and 16-bit loads from the packet, which is contained in an ad-hoc ROM.
31 Bytes are saved in consecutive memory locations from the least to the most significant byte.
32 It is the opposite of little-endian: the MSB comes first.
important because the user expects to receive data extracted from the packet in the
order in which bytes were stored in the packet.
Another issue is that, whenever a field extracted from a packet that is not 32-bit
aligned is saved to memory, bytes are stored in an unusual way. For example, this is
how the MAC address 01:02:03:04:05:06 is (usually) saved from the packet to
memory:
packet (network byte order): 01 02 03 04 05 06
RAM (INTEL host byte order): 04 03 02 01 06 05 00 00
(the first four bytes pass through accumulator A with a 32-bit LOAD and STORE,
the last two with a 16-bit LOAD and STORE)
• STORE instructions work only in immediate mode (i.e. it’s not possible to use a register, e.g. X, as base address);
• whenever the filter code is executed (i.e. for every packet received), this memory is recreated (i.e. physically reallocated), so it is stateless (its content is not maintained across filter runs).
These limits can be removed only by adding some more powerful memory.
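To make the byte-order problem concrete, here is a minimal C sketch (the helper names read_be32 and swap32 are ours, not part of the BPF code) showing how a 32-bit word is reassembled from network byte order and the byte swap needed to convert between the two orders:

```c
#include <stdint.h>

/* Reassemble a 32-bit value from 4 bytes as they appear on the wire
 * (network byte order: most significant byte first). */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Swap the four bytes of a word: converts between little-endian
 * host order and big-endian network order (in both directions). */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}
```

On a little-endian host, storing a word and reading its bytes back in address order yields exactly the reversed sequence that swap32 undoes.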
6.1.2 Impossibility to modify the packet
There are some tasks that may need to modify the packet: for example, a NAT needs to rewrite the source or destination IP address; the original instruction set does not allow such operations.
6.1.3 Simple instruction set
Actually all the limitations of the architecture for complex tasks is its simplicity: the
instruction set was thought for ‘load and compare’ operations (the basic modus-
operandi of a selective packet capture system), but it fails when one wants to use it for
different tasks33. 33 However, this simplicity of the instruction set favors the code-check.
Here are just some limitations of the instruction set:
• it’s not possible to create loops, since backward JUMP instructions are not available;
• there is no flag register, so it is not possible, for instance, to add two 64-bit integers using the 32-bit registers;
• complex operations usually require more than two registers (A and X);
• there is no stack (no PUSH and POP instructions).
As a consequence, it is very difficult to handle, in a simple way, the tasks required for an extended use of the BPF architecture; for example, one of the requirements highlighted in the previous chapters is the ability to manage lookup tables. In theory, it is possible to realize some types of tables in a direct manner, that is to say, implementing the lookup as a series of predetermined compare operations, as shown in this pseudo-code:
if (x == a0) increment counter0
else if (x == a1) increment counter1
else if (x == a2) increment counter2
else ...
where x is the value to be searched in the table, and the ai are the various values contained in the table.
Such an approach can be used only in some cases, for example with order-dependent ACLs, but cannot be applied in others, for instance if we have to deal with the open matrices described in chapter 3 - Monitoring networks.
The problem is that without backward jumps, or special instructions for this particular
purpose, there is no way to manage tables.
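The ‘series of predetermined compare operations’ above can be sketched in C as follows (table values and counter layout are illustrative only):

```c
#include <stdint.h>

enum { TABLE_SIZE = 3 };
static const uint32_t table_keys[TABLE_SIZE] = { 80, 23, 21 }; /* a0, a1, a2 */
static uint32_t counters[TABLE_SIZE];

/* Unrolled series of compares: each value in the table becomes one
 * fixed branch, exactly as in the pseudo-code above; since the
 * branches are fixed in advance, no loop (and hence no backward
 * jump) is needed. */
static void classify(uint32_t x)
{
    if (x == table_keys[0])      counters[0]++;
    else if (x == table_keys[1]) counters[1]++;
    else if (x == table_keys[2]) counters[2]++;
    /* ... one branch per table value ... */
}
```

The drawback is evident: the table contents are frozen into the code, which is exactly why this approach cannot cover dynamic structures such as the open matrices.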
6.2 Extensions to the original BPF architecture
As a consequence of the above-mentioned limitations of the original BPF architecture, we need to extend the system in some manner.
The proposed technique is to ship the original BPF processor with some extensions, that is to say, new parts that make it possible to perform the tasks dealt with in the previous chapters, realizing the so-called ‘architecture for unified packet filtering’.
In the following paragraphs we will first highlight the requirements of such extensions, and then present the proposed architecture.
6.3 Extension requirements
Since it must be used for various different tasks, the extension of the BPF machine needs to satisfy, and preserve, some requirements, which are basically:
• some stateful memory;
• the ability to manage search tables;
• high programmability.
Other requirements, which make the architecture practical to use, are:
• ease of use;
• extensibility by third parties.
Stateful memory: we need to add some more powerful memory, shared between the original machine and the new extensions.
This memory will be used to store lookup tables and to exchange data between the two components (BPF machine and extension).
Search tables: since almost all packet filtering tasks require classifying packets, it is very useful to have a lookup table coupled with a fast search algorithm.
Here are just some examples of how this structure can be used efficiently:
• multi-dimensional statistics: packets must be counted on the basis of some header fields (e.g. the IP source address, or the TCP port). If there are many different categories (e.g. each host on the network is a different entry), the table becomes large, so the search has to be implemented efficiently, for example using a hash;
• firewalls: basically, these systems determine whether a packet can go forward or has to be dropped by looking at the headers, through ACLs. Moreover, a good firewall must be able to execute a different operation for each different packet (e.g. the user must be warned that a particular type of packet was seen, or another type of packet must be analyzed deeply in order to modify the table itself), so it is very useful to couple each entry (which corresponds to an ACE) with some code that is executed whenever a packet ‘satisfies’ that entry.
Highly programmable: one of the goals of this extension is that it must be open, in the sense that it must be possible to program the machine to perform the widest variety of tasks. As a consequence, it must not be a special-purpose system, but rather an open architecture, a framework.
Ease of use: although it has to be highly programmable, it must still be simple to use, and also safe: if a mistake is made in using it, the extension must either signal the error or, at worst, behave incorrectly, but the machine (including the workstation running the system) must never hang.
Extensibility by third parties: last but not least, third parties must have the chance to extend the architecture with new features without modifying the original framework, i.e. they must be able to add new functions to process and/or classify data without modifying the basic structure of the architecture.
6.3.1 Proposed extensions
The proposed architecture of the extensions is shown in Figure 22; it is composed of
two parts:
• the extended_memory and
• the TME (Table Management Engine).
In the following paragraphs we will deal with the two blocks, describing how they are realized, how they work, and how they interact with each other and with the original BPF engine.
[Diagram: the BPF core (registers, basic memory, control logic) processes the packet; beside it sit the extended_memory and the TME co-processor, composed of the TME core and its registers.]

Figure 22 - The BPF machine with the new extensions
6.4 Extended memory
This new block consists of an 8-bit-word addressable memory with 32-bit-wide addresses; it is stateful (i.e. its content is maintained throughout the life of the processor), and it is directly accessible by all the cores (BPF core and TME core).
We added some new instructions to the original set of the BPF machine: basically, 8-, 16- and 32-bit LOADs and STOREs, in immediate mode (to/from registers A and X) and in relative mode (using the X register as base address and A as destination/source). All the data in this memory is saved in network byte order (big-endian), to preserve the order in which bytes are represented on the wire. When 8- and 16-bit LOADs are used, data is saved in the least significant part of register A (or X); when 8- or 16-bit STOREs are involved, only the one or two least significant bytes of A (or X) are saved to memory.
A more detailed description of the new instructions can be found in 12 - TME
Extension Programming Manual.
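The semantics just described can be sketched in C as follows (names and layout are ours; the real implementation differs): a 16-bit STORE writes only the two least significant bytes of the register, in big-endian order, and a 16-bit LOAD fills only the least significant part of the register.

```c
#include <stdint.h>

static uint8_t ext_mem[65536];  /* 64 KB, byte addressable */

/* STORE of the 16 least significant bits of register A, big-endian:
 * the more significant byte of the half-word goes to the lower address. */
static void st16(uint32_t addr, uint32_t a)
{
    ext_mem[addr]     = (uint8_t)(a >> 8);
    ext_mem[addr + 1] = (uint8_t)a;
}

/* LOAD of a 16-bit half-word into the least significant part of the
 * register; the upper 16 bits are cleared. */
static uint32_t ld16(uint32_t addr)
{
    return ((uint32_t)ext_mem[addr] << 8) | ext_mem[addr + 1];
}
```

Keeping the memory in network byte order means a field copied from the packet can be handed to the user without any further swapping.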
6.5 TME
This is the main part of the extension; it’s implemented as a co-processor that
communicates with the BPF core, the main processor.
It’s composed of a core, some general registers, and some ‘blocks’ of special registers,
as shown in Figure 23.
[Diagram: the TME core, its global registers (validated_blocks, active, working, active_read) and the data blocks data_block0 ... data_blockN.]

Figure 23 – The TME co-processor
The main purpose of the TME is the handling of lookup tables, with a special key feature: each entry in the table is associated with a different action (intended as a processing task), executed after the search (the LOOKUP) is performed; this action is the so-called exec_fcn. Moreover, each entry is coupled with some memory (usually 64 bytes) that can be used to store data useful to the processing.
Let us suppose we want to acquire some statistical data on particular classes of traffic, for example Web, Telnet, FTP and Mail traffic. To achieve this, we create a table in which each entry corresponds to a particular TCP port.
For every packet received, we first check whether it is TCP, then we search the table for the entry corresponding to the TCP port of the packet34 (the TCP port is the key to enter the table); when the LOOKUP returns, we run the code associated with that entry by calling the EXECUTE35 instruction: if we want to count packets, it will be something like “increment counters”, and these counters will be stored in the block of memory associated with that entry.
Web (80)     ->  packets=134, bytes=296485
Mail (25)    ->  packets=93,  bytes=59642
Telnet (23)  ->  packets=24,  bytes=19215
FTP (21)     ->  packets=16,  bytes=4898

Figure 24 – A simple lookup table
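The per-port example can be simulated in plain C as follows; this is only an illustration of the LOOKUP/EXECUTE pair (the table here is a linear array, whereas the TME can use a hash, and all names are ours):

```c
#include <stdint.h>

struct entry { uint16_t port; uint32_t packets; uint64_t bytes; };

/* One entry per traffic class, keyed by TCP port. */
static struct entry table[] = {
    { 80, 0, 0 }, { 25, 0, 0 }, { 23, 0, 0 }, { 21, 0, 0 },
};
enum { N_ENTRIES = sizeof table / sizeof table[0] };

/* LOOKUP: search for the entry whose key matches the port; returns
 * the entry index, or -1 on a miss (the role of LAST_FOUND = 0xFFFFFFFF). */
static int lookup(uint16_t port)
{
    for (int i = 0; i < N_ENTRIES; i++)
        if (table[i].port == port)
            return i;
    return -1;
}

/* EXECUTE: the per-entry action; here it is "increment counters". */
static void execute(int idx, uint32_t pkt_len)
{
    if (idx < 0) return;          /* unsuccessful LOOKUP: do nothing */
    table[idx].packets++;
    table[idx].bytes += pkt_len;
}
```

Running `execute(lookup(port), len)` for every packet fills the counters exactly as in Figure 24.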
The advantage of the architecture is that the code associated with each entry can perform, in theory, any sort of processing, even modifying the original packet (e.g. a NAT36 needs to modify some fields, like the IP source address) or signaling to the user that a special packet (or a sequence of packets) was received. Moreover, EXECUTEs, and even LOOKUPs, can be chosen from a set of predefined ones, or they can be defined by the user, since they are ANSI-C callback functions. This gives great flexibility to the TME extension.
As seen, the process is always split into two parts, LOOKUP and EXECUTE; although they could have been merged, the approach we followed has some benefits:
• the algorithm used to perform LOOKUPs is independent of the EXECUTEs, so it is possible to change it (for example, to increase performance) without modifying the latter;
• usually a LOOKUP is quite general and can be used with a variety of EXECUTEs, and this approach avoids code duplication;
34 Actually, the packet contains two ports, source and destination; we suppose we choose the one whose number is below 1024 (i.e. a reserved port).
35 This instruction executes the exec_fcn.
36 Network Address Translator.
• there are some cases in which a LOOKUP is enough, particularly when we only want to know whether an entry is present in the table or not (e.g. a bridge only needs to check whether a MAC address is present in the lookup table).
6.5.1 A general overview of the LOOKUP operation
Figure 25 gives a graphical explanation of what happens when a BPF program calls the LOOKUP instruction.
[Diagram: 1. the BPF core copies the key from the packet to the extended memory; 2. LOOKUP call; 3. the TME core searches the lookup table; 4. the TME core saves the address of the found entry in a register of the TME; 5. the TME core gives control back to the BPF core. In the table, ‘right’ marks an entry whose key matches, ‘wrong’ an entry whose key does not match, and VOID a void entry.]

Figure 25 - How a LOOKUP is performed
Here is a brief description of each step:
1. prior to the call to LOOKUP, the BPF core has to generate the key to enter the lookup table, so it parses the packet (i.e. it recognizes the type of the packet) and copies some header fields from it to the extended memory;
2. the BPF core executes LOOKUP: this triggers the TME core, which is given control;
3. the TME core, in turn, performs the search, and
4. stores the pointer to the matching entry in a register of the TME (as discussed later, this is recommended, but not compulsory);
5. the TME core gives control back to the BPF core.
6.5.2 A general overview of the EXECUTE instruction
Figure 26 illustrates what happens when an EXECUTE instruction is performed.
[Diagram: 1. EXECUTE call; 2. the TME core reads the LUT entry pointed to by a register of the TME; 3. it executes the exec_fcn on the data present in that entry (in the example, Packets goes from 175 to 176 and Bytes from 55434 to 55758); 4. the TME core gives control back to the BPF core.]

Figure 26 - How the EXECUTE instruction works
1. The BPF core executes the EXECUTE instruction, giving control to the TME core;
2. the TME core, in turn, reads a register of the TME that contains a pointer to an entry in the lookup table; this register is the one used by the LOOKUP instruction;
3. the TME core executes the code associated with the EXECUTE (the exec_fcn); this function works on the data of that entry; an ID of the function to be executed is stored in the entry itself;
4. when the TME core has finished executing the code associated with the EXECUTE instruction, it gives control back to the BPF core.
6.5.3 Lookup tables
In the preceding paragraphs we gave a brief overview of the whole architecture, pointing out what LOOKUP and EXECUTE are and, in short, how they work.
So far we have assumed that the data associated with each entry is stored in the entry itself; the proposed architecture is slightly different, because the lookup table is actually split into two parts: the lut_segment38 and the shared_memory_segment39. The lut_segment is divided into entries (which can be considered the true entries of the lookup table); the shared_memory_segment is divided into blocks.
38 LUT is an abbreviation for lookup table.
39 Its name derives from the fact that its content is shared between the TME and user applications.
[Diagram: each entry of the lut_segment holds an EXECUTE ID and a block offset; the offset points to a block of the shared_memory_segment, which holds the key of the entry and its data.]

Figure 27 - Relationship between an entry and a block
The lut_segment contains the entries of the lookup table, but each entry contains only a pointer to a block in the shared_memory_segment and the ID of the EXECUTE associated with that entry.
A shared_memory_segment block contains the key of the entry that points to it, a timestamp and the data associated with that entry (in the example above, it contains the counters bytes and packets).
User applications are interested in receiving the data contained in the blocks, so they will receive the shared_memory_segment.
We followed this approach for a particular reason: an efficient lookup table requires a good search algorithm, and an excellent choice is to manage the table as an open-addressed hash; this technique proves to be very fast if the table is sparsely filled (i.e. if the number of free entries is at least half the size of the table). Having a single table (like the one described in the example) has two drawbacks:
1. in a hash, the filled entries are not consecutive, that is to say, they are interleaved with empty entries. Passing the table to the user involves copying the entries into a user-level buffer, and there are two possible approaches: copy all the entries of the table, disregarding their fill state (i.e. whether they are filled or not), or check each entry before copying it to the user buffer. Neither approach performs well: the former copies useless entries, which are usually more than 50% of the table; the latter requires a fill-state check for each entry, which is quite CPU consuming;
2. as stated before, good performance is achieved if the table is sparsely filled; if we have only one table, the waste of memory is not negligible. For example, let us suppose we have a table of about 32,000 entries, each entry occupying 64 bytes; if we keep the table at least half free, the table occupies about 2 megabytes of memory, but the amount of memory ‘used’ by free entries is at least 1 megabyte. Moreover, the problem gets more serious if several tables are managed at the same time.
Splitting the lookup table in two solves the problem in this manner: if we create fewer blocks (in the shared_memory_segment) than entries (in the lut_segment), we implicitly guarantee that the lookup table is never fully occupied, since the number of occupied entries is always less than or equal to the number of blocks. Two consequences follow from this approach: we obtain better performance with hashes, since the table is never totally occupied, and we use less memory; for example, a table of about 32,000 entries, but with only 16,000 blocks of 64 bytes, occupies about 1 megabyte, which is half the memory occupied by the same entries in a single table. Moreover, if we store the filled blocks at the beginning of the shared_memory_segment, passing data to the user is much more straightforward, since we only have to copy consecutive bytes.
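The memory figures above can be checked with a few lines of C, assuming (the thesis does not fix this size here) that a lut_segment entry occupies 8 bytes, a 32-bit block offset plus a 32-bit exec_fcn ID:

```c
#include <stddef.h>

/* Assumed sizes: an 8-byte lut_segment entry (block offset plus
 * exec_fcn ID) and a 64-byte shared_memory_segment block. */
enum { ENTRY_SIZE = 8, BLOCK_SIZE = 64 };

/* Memory for a single monolithic table of `entries` 64-byte entries. */
static size_t single_table_bytes(size_t entries)
{
    return entries * BLOCK_SIZE;
}

/* Memory for the split layout: small entries plus fewer data blocks. */
static size_t split_table_bytes(size_t entries, size_t blocks)
{
    return entries * ENTRY_SIZE + blocks * BLOCK_SIZE;
}
```

With 32,000 entries and 16,000 blocks, the split layout takes 1,280,000 bytes against 2,048,000 for the monolithic table, in line with the estimate above.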
Since EXECUTEs may need to store some specific, yet general-purpose, data, there is also another segment, the extra_segment, which is totally general.
Another key feature of the architecture is that the segments used to store the table (lut_segment, shared_memory_segment and extra_segment) are located in the extended memory, thus permitting the BPF core to modify data in the table (although this behaviour is strongly discouraged).
6.5.4 LOOKUP functions
The LOOKUP instruction is a front-end to a callback function, generally called lookup_fcn, which is, as discussed above, either written by the user or chosen from a series of standard ones. The callbacks are called by the architecture itself when a LOOKUP instruction is processed. Several lookup_fcns are needed, since each of them can serve a different purpose: for example, one to deal with masks, one to deal with buckets, and a general one, useful for common lookups.
These lookup_fcns receive some information on the table they are working on, for example the key length and the table size (they receive the currently active TME data block, described in a following paragraph), and a pointer to the location in the extended memory at which the key (to be searched in the table) is stored. Only this last parameter is passed to the function by the LOOKUP (as a parameter of the LOOKUP instruction call); the others are passed by the TME architecture itself.
Usually, a lookup_fcn stores a pointer to the found entry in the LAST_FOUND register of the TME, or 0xFFFFFFFF if the search was not successful; this is the recommended modus operandi of these callbacks, but it is not compulsory (i.e. one can create a lookup_fcn that does not use the LAST_FOUND register, bearing in mind that the EXECUTE instruction cannot be used afterwards, since the latter always refers to this register).
6.5.5 EXECUTE functions
The EXECUTE instruction is a front-end to a callback function, generally called exec_fcn, which is, as discussed above, either written by the user or chosen from a set of standard ones, and called by the architecture itself when an EXECUTE instruction is processed. These exec_fcns receive some information on the table they are working on (the same information received by lookup_fcns), a pointer to the block in the shared_memory_segment to deal with (the block associated with the entry found by the last call to LOOKUP), some information on the packet the machine is currently processing, and a pointer to a location in the extended memory, used to pass additional information to the function. While this last parameter is passed through the parameters of the front-end EXECUTE (i.e. the EXECUTE instruction takes as parameter a pointer to a location in the extended memory), the other parameters are passed to the exec_fcn by the TME architecture.
The interaction between LOOKUPs and EXECUTEs is achieved via the LAST_FOUND register, which contains the pointer (in the extended memory) to the entry that matched the key during the last call to LOOKUP41. In practice, the lookup_fcn stores the pointer to the found entry in this register, and the exec_fcn retrieves it to know which block it has to deal with.
When the EXECUTE instruction is processed, the TME reads the LAST_FOUND register and runs, on the block (in the shared_memory_segment) pointed to by the entry at that address, the exec_fcn whose ID is stored in that entry42.
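The hand-off through LAST_FOUND can be sketched in C as follows (signatures are simplified and all names except LAST_FOUND are ours; real exec_fcns receive more parameters, as described above):

```c
#include <stdint.h>

#define TME_NOT_FOUND 0xFFFFFFFFu

typedef void (*exec_fcn)(uint8_t *block);

/* Register shared between the two front-ends: the lookup side writes
 * it, the execute side reads it. Here it holds a block index rather
 * than a pointer, for simplicity. */
static uint32_t last_found = TME_NOT_FOUND;

/* A trivial "function mapper": given an ID, return the callback. */
static void count_packets(uint8_t *block) { block[0]++; }
static exec_fcn map_exec(uint32_t id)
{
    return id == 0 ? count_packets : 0;
}

/* The lookup_fcn stores the index of the found entry in LAST_FOUND... */
static void do_lookup(uint32_t found_index)
{
    last_found = found_index;
}

/* ...and EXECUTE retrieves it to know which block to work on. */
static void do_execute(uint32_t exec_id, uint8_t *blocks, uint32_t blk_size)
{
    if (last_found == TME_NOT_FOUND)
        return;                      /* the OUT_LUT_EXEC case is omitted */
    exec_fcn f = map_exec(exec_id);
    if (f)
        f(blocks + last_found * blk_size);
}
```

This also shows why decoupling pays off: a new exec_fcn only needs an entry in the mapper, without touching the lookup side.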
6.5.6 TME data blocks
In the above paragraphs we dealt with the working principle of the architecture, limiting ourselves to one lookup table. Actually, the architecture is able to deal with several tables or, to be more precise, several TME data blocks. Each TME data block contains one entity of the type described previously, and blocks are totally independent (i.e. they do not share anything), as shown in Figure 28. Each block can be activated and used independently of the others; since only one block can be used at a time, there is a special instruction to switch among a pool of usable ones (the validated blocks).
41 If the search was not successful, this register contains a special value, 0xFFFFFFFF.
42 If the last search was not successful (i.e. the LAST_FOUND register contains 0xFFFFFFFF), the ID of the exec_fcn to be executed is stored in another register, OUT_LUT_EXEC, and this function always works on the first block in the shared_memory_segment.
[Diagram: several TME data blocks, each composed of a LUT segment, a shared segment and an extra segment located in the extended_memory; the TME core holds the global registers validated_blocks, active, working and active_read, plus the per-block register sets data_block0 ... data_blockN.]

Figure 28 - TME data blocks
Some global TME registers are used to manage the blocks:
• active: it contains the index of the currently active TME data block (i.e. the block used when a LOOKUP or an EXECUTE is performed);
• active_read: it contains the index of the TME data block to be passed to the user; as stated before, only the shared_memory_segment is passed to the user;
• working: it contains the index of the TME data block on which the initialization instructions work, before the block is validated;
• validated_blocks: it is a flag register; if bit n is set, then TME data block n is validated, and can be used to perform LOOKUPs and EXECUTEs.
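The validated_blocks flag register lends itself to the usual bit-mask manipulation; here is a sketch (function names are ours) of how validation could gate a SET_ACTIVE-style operation:

```c
#include <stdint.h>

static uint32_t validated_blocks; /* bit n set => TME data block n validated */

static void validate_block(unsigned n) { validated_blocks |= 1u << n; }
static int  block_is_valid(unsigned n) { return (validated_blocks >> n) & 1u; }

/* Switching the active block should succeed only on a validated one. */
static int set_active(unsigned n, uint32_t *active)
{
    if (!block_is_valid(n))
        return -1;   /* refuse non-validated blocks */
    *active = n;
    return 0;
}
```

Using a single 32-bit flag register keeps the validity test to one shift and one mask per instruction.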
Each TME data block is associated with a collection of registers containing data specific to that block, for example the number of entries or blocks in the tables coupled with it.
[Diagram: the registers of a data_blockX, listed and described below.]

Figure 29 - TME data block registers
Figure 29 shows a detailed view of all these registers; here is a description of each one:
• LUT_ENTRIES: it contains the total number of entries in the lookup table (which is stored in the lut_segment); it can be set only during the initialization phase of the TME;
• MAX_FILL_STATE: it is an upper bound on the fill rate of the lookup table; for example, if this value is set to 1500, the number of filled entries must remain under 1500. It could be used by the LOOKUP instruction, but at the moment it is not;
• REHASHING_VALUE: it contains a value that can be used by the LOOKUP instruction in the search algorithm; if the lookup table is managed as a hash, this value is used when a collision happens;
• KEY_LEN: it contains the length of the key used to identify the entries in the lookup table; it is expressed in 32-bit words43 (e.g. KEY_LEN=3 means 3*4=12 bytes);
• SHARED_MEMORY_BLOCKS: it contains the total number of blocks in the shared_memory_segment; the first block is always reserved, and contains data relative to unsuccessful LOOKUPs;
• FILLED_ENTRIES: it contains the number of filled entries in the lookup table;
43 The key length is expressed in 32-bit words to allow 32-bit comparisons during access to the table. This is an optimization for 32-bit processors, which can compare words up to 32 bits long with a single assembly instruction.
• BLOCK_SIZE: it contains the size of each block in the shared_memory_segment; it takes into account the memory occupied by the key, too;
• EXTRA_SEGMENT_SIZE: it contains the size, in bytes, of the extra_segment;
• FILLED_BLOCKS: it contains the number of occupied blocks in the shared_memory_segment; it is initialized to 1, since the first block, as stated before, is reserved;
• LOOKUP_CODE: it contains the ID of the lookup function (the so-called lookup_fcn) called by the front-end instruction LOOKUP in that TME data block;
• DEFAULT_EXEC: it contains the ID of the exec_fcn that is coupled to an entry when the entry is created by a LOOKUP;
• OUT_LUT_EXEC: it contains the ID of the exec_fcn that is executed when an EXECUTE is processed and the LAST_FOUND register contains the special value 0xFFFFFFFF (i.e. when the last LOOKUP was not successful);
• LUT_BASE_ADDRESS: it contains the base address of the lut_segment; it is a read-only register;
• SHARED_MEMORY_BASE_ADDRESS: it contains the base address of the shared_memory_segment; it is a read-only register;
• EXTRA_SEGMENT_BASE_ADDRESS: it contains the base address of the extra_segment; it is a read-only register;
• LAST_FOUND: it contains a pointer to an entry in the lookup table, or 0xFFFFFFFF if the last LOOKUP was not successful.
6.5.7 What happens when a LOOKUP is performed
Figure 30 gives a detailed explanation of what sits behind the LOOKUP front-end.
[Diagram: 1. the BPF core copies the key from the packet to the extended memory; 2. LOOKUP call; 3. the TME core searches the LUT segment of the TME data block; 4. it saves the address of the found entry in LAST_FOUND; 5. it gives control back to the BPF core with return value TME_TRUE (entry found). In the table, ‘right’ marks an entry whose key matches, ‘wrong’ an entry whose key does not match, and VOID a void entry.]

Figure 30 - How a LOOKUP is performed
Here is a brief description of each step:
1. prior to the call to LOOKUP, the BPF core has to generate the key to enter the lookup table, so it parses the packet (i.e. it recognizes the type of the packet) and copies some header fields from it to the extended memory, in a location of its choosing (i.e. the key need not be stored in any particular memory location, since a parameter of the LOOKUP call contains the base address of the key); the starting location is decided when the running filter code is generated (either by hand or by a compiler);
2. the BPF core executes LOOKUP: this triggers the TME core, which gains control;
3. the TME core, in turn, executes the lookup_fcn whose ID44 is stored in the LOOKUP_CODE register; this lookup_fcn performs the search, and
4. stores45 the pointer to the matching entry in the LAST_FOUND register, and returns a status code, which can be TME_TRUE, if the search was successful, TME_FALSE, if the search was unsuccessful, or TME_ERROR, if an error occurred;
5. the TME core gives control back to the BPF core, together with the return value of the lookup_fcn; since the LOOKUP instruction is implemented as a conditional JUMP, the BPF core jumps to a program instruction depending on this value.
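Since LOOKUP behaves as a conditional JUMP, the filter effectively performs a multi-way branch on the status code; a C sketch of this dispatch (target addresses and the function name are illustrative):

```c
enum tme_status { TME_FALSE = 0, TME_TRUE = 1, TME_ERROR = 2 };

/* Pick the next "program address" depending on the lookup outcome,
 * the way the BPF core picks the jump target after a LOOKUP. */
static int next_instruction(enum tme_status s, int on_hit, int on_miss,
                            int on_error)
{
    switch (s) {
    case TME_TRUE:  return on_hit;    /* entry found: e.g. go EXECUTE  */
    case TME_FALSE: return on_miss;   /* not found: e.g. add the entry */
    default:        return on_error;  /* abort the filter run          */
    }
}
```

Keeping the branch in the BPF program, rather than inside the callback, lets the same lookup_fcn serve filters that react differently to a miss.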
6.5.8 What happens when an EXECUTE is performed
Figure 31 gives a graphical explanation of what sits behind the EXECUTE front-end.
44 The conversion between IDs and functions is realized through a “function mapper”, that is to say, a function that, given an ID, returns the corresponding function pointer.
45 As stated before, this is recommended, but not compulsory.
[Diagram: 1. EXECUTE call; 2. the TME core reads the LUT entry at the address pointed to by LAST_FOUND, obtaining an exec_fcn ID and a block address; 3. it executes the exec_fcn with that ID on the data at that block address (in the example, Packets goes from 175 to 176 and Bytes from 55434 to 55758); 4. it gives control back to the BPF core.]

Figure 31 - How the EXECUTE instruction works
1. The BPF core executes the EXECUTE instruction, giving control to the TME core;
2. the TME core, in turn, reads the LAST_FOUND register, which contains a pointer to an entry in the lookup table;
3. the TME core executes, on the block in the shared_memory_segment whose pointer is stored in that entry, the exec_fcn whose ID46 is stored in the same entry; in the example in Figure 31, the exec_fcn increments two counters representing the packets and bytes associated with that entry;
4. when the exec_fcn completes, the TME core gives control back to the BPF core.
6.6 Interaction of the extensions with the original BPF architecture
The interaction between the original BPF core and the extensions (extended memory and TME) is achieved through some new instructions added to the original instruction set. Moreover, since the TME architecture needs some initialization before it can be used, a new phase, initialization, was added; this phase is performed when the filter is injected into the BPF processor, but before the BPF machine is put into the state of receiving packets.
46 Again a function mapper is used for the conversion between IDs and functions.
6.6.1 Interaction with the extended memory
The new instructions that deal with this extension permit resizing47 this memory (which is 64 kilobytes by default), plus LOADs and STOREs.
In particular:
• LOAD instructions can transfer data from the extended memory to registers A and X of the BPF core, in chunks of 8, 16 and 32 bits (usually referred to as bytes, half-words and words), in immediate mode (the instruction contains the absolute address from which to load) and in relative mode (the instruction contains an offset to the location from which to load, and X contains the base address);
• STORE instructions can transfer data from registers A and X of the BPF core to the extended memory, in chunks of bytes, half-words and words, in immediate and relative mode.
6.6.2 Interaction with the TME
The interaction with the TME is quite complex, mainly because the architecture is very powerful. The instructions that were added can be divided into two main groups: operative and initialization instructions.
Operative instructions: there are three of them, LOOKUP, EXECUTE and SET_ACTIVE. LOOKUP and EXECUTE are the front-ends described in the previous paragraphs; SET_ACTIVE is used to switch between validated TME data blocks.
Initialization instructions: these instructions basically allow the BPF core to initialize a TME data block with default values, to get and set the values of the registers of a block, and to validate a block. This last operation is the most important: it checks whether the registers in a block are coherent, initializes other registers (e.g. the XXX_BASE_ADDRESS registers), and finally marks the block as validated (by setting the appropriate bit in the global register VALIDATED_BLOCKS). A more detailed description of the initialization phase can be found in paragraph 6.8, while details on the instructions can be found in 12 - TME Extension Programming Manual.
6.7 Interaction with applications
The interaction with applications can be divided into two parts: users need to instruct the BPF core (the so-called filter injection) and to receive data from it.
Filter injection is almost the same as in the original BPF architecture, with only one difference: the initialization code must be injected into the system, too.
Data, instead, is passed to the user in a ‘raw’ mode: the application receives a copy of the blocks in the shared_memory_segment without any further formatting (a 1-to-1 copy is performed). As stated before, it is possible to handle several TME data blocks
47 This operation is allowed only during the initialization phase.
at the same time; however, the application must receive the data corresponding to only
one TME data block; the TME data block that must be passed to the user is set in the
global TME register ACTIVE_READ. Therefore, during the initialization phase it is
necessary to issue a special instruction, SET_ACTIVE_READ, to instruct the TME core
on which block has to be passed to the user; if this instruction is not issued, the TME
core passes block #0 to the user.
Furthermore, the TME copies only those blocks in the shared_memory_segment
that contain valid data; in particular, it copies the first n blocks, where n is the value
contained in the FILLED_BLOCKS register of the ACTIVE_READ TME data block.
Since, as pointed out before, the data passed to the application highly depends on the
filter injected in the filtering machine, the only way to interpret it correctly is through
knowledge of the filter itself. The usual process to parse the information received is
shown in Figure 32:
• the application receives a buffer of raw bytes from the TME,
• since it knows the dimension of the blocks, it can divide the buffer into blocks,
• then, through the knowledge of the key length, it can interpret the key associated
with each block,
• and finally the data associated with each block.
It must be pointed out that the application knows the size of the blocks and the length of
the key, since the initialization code, generated by the application itself, contained that
data.
(Figure 32 depicts this pipeline: the raw buffer is split into blocks by the block size,
the key of each block is isolated by the key length, and the remaining bytes are the
data associated with each block, filled in by the exec_fcns.)
Figure 32 - How applications parse data received by the filtering machine
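The parsing steps above can be sketched in C; the layout (key first, then the rest of the block) follows the description in the text, while the structure and function names are illustrative, and the application is assumed to know block_size and key_len from the initialization code it generated itself.

```c
#include <stdint.h>

/* Sketch of the application-side parsing of the raw buffer
   received from the TME. The application knows block_size and
   key_len because it generated the initialization code itself;
   structure and function names here are illustrative. */
typedef struct {
    const uint8_t *key;   /* key_len 32-bit words, network byte order */
    const uint8_t *data;  /* timestamp and exec_fcn data follow the key */
} parsed_block;

static uint32_t parse_blocks(const uint8_t *raw, uint32_t raw_len,
                             uint32_t block_size, uint32_t key_len,
                             parsed_block *out, uint32_t max_out)
{
    uint32_t i;
    uint32_t n = raw_len / block_size;   /* split the buffer into blocks */
    if (n > max_out)
        n = max_out;
    for (i = 0; i < n; i++) {
        const uint8_t *block = raw + i * block_size;
        out[i].key  = block;               /* the key opens each block */
        out[i].data = block + key_len * 4; /* the rest is the block data */
    }
    return n;
}
```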
6.8 Initialization of the TME
This is the most important and critical phase in using the TME architecture. It can be
divided into three main parts:
1. a TME data block must be set as working and initialized with proper values in the
registers (e.g. number of entries and blocks in the table);
2. then it must be validated, that is to say the TME core has to perform a sanity check
on the values inserted in the TME data block registers, and, on success, it marks the
block as validated (i.e. usable by LOOKUPs and EXECUTEs);
3. an optional part, in which it is possible to force the insertion of entries in the lookup
table.
Before dealing with these parts, it is necessary to point out what a WORKING TME data
block is and what an ACTIVE TME data block is. When the BPF machine is ‘turned on’,
that is to say when it starts to process the initialization code, none of the TME data
blocks is ready to be used: they are not ‘validated’. The initialization phase works on the
WORKING TME data block; a TME data block can always be set as WORKING, even if
it is not ready. When the block has been initialized and validated (that is to say, it is
ready to be used), it can be set as ACTIVE, which means operative instructions can
work on it.
A final note: if a block is ACTIVE, it is also WORKING (i.e. if global TME register
active contains value n, register working contains value n, too).
TME data block initialization: first, we need to set a block as WORKING; this is
done by issuing instruction SET_WORKING. Then we need to put some values in the
registers of the TME data block we chose with SET_WORKING; this can be done in
two ways: either by initializing each register manually, with the instruction
SET_REGISTER_VALUE, or by using the special instruction INIT. This last
instruction inserts some default values in the TME data block registers; details on the
default values can be found in 12 - TME Extension Programming Manual. Two other
facts must be taken into account:
• it is always possible to issue INIT, and then change a register value (that was
initialized by INIT) through SET_REGISTER_VALUE;
• INIT instruction never initializes register KEY_LEN, so it is always necessary to set
this register manually (this is due to the fact that we cannot decide what is a default
value for the key length).
INIT accepts as argument the index of the TME data block to be initialized; it also sets
this block as WORKING. This command is the preferred method to initialize a block. It
is important to note that INIT sets a default lookup_fcn and a default exec_fcn. The
complete list of default values set by INIT is shown in Figure 33.
SET_REGISTER_VALUE accepts as argument the ID of the register to be written;
register A of the BPF core contains the value to be written.
Default values written by INIT into data_blockX:
FILLED_BLOCKS = 0
FILLED_ENTRIES = 0
OUT_LUT_EXEC = count_packets
KEY_LEN (never set by INIT)
MAX_FILL_STATE = 15000
REHASHING_VALUE = 1
SHARED_MEMORY_BLOCKS = 16000
LOOKUP_CODE = normal_lut_w_insert
BLOCK_SIZE = 64
EXTRA_SEGMENT_SIZE = 0
LUT_ENTRIES = 32007
DEFAULT_EXEC = count_packets
LAST_FOUND, SHARED_MEMORY_BASE_ADDRESS, EXTRA_SEGMENT_BASE_ADDRESS,
LUT_BASE_ADDRESS (set later, not by INIT)
Figure 33 – Default values of a TME data block
TME data block validation: in this part a TME data block, previously initialized, is
submitted to the TME core; this is done by issuing the instruction VALIDATE. In turn,
this command checks whether the registers contain proper values. In particular, the most
important controls are:
• number of entries in the lut_segment, blocks in the
shared_memory_segment and key length must be greater than zero;
• IDs of lookup_fcns, and of exec_fcns must be valid;
• there must be enough memory to store the TME segments in the extended memory.
If the registers ‘pass’ the checks, the block is marked as validated (i.e. the appropriate
bit in the global TME register validated_blocks is set), and some read-only
registers are set to the appropriate values (the XXX_BASE_ADDRESS registers).
VALIDATE also sets the validated TME data block as ACTIVE.
VALIDATE accepts as argument the index of the TME data block to be validated;
register A of the BPF core contains the base address, in the extended memory, from
which the segments will be stored.
It is the only instruction that is able to set a TME data block as validated.
Optional part: in this part it is possible to perform additional operations to properly
initialize a TME data block; for example, this is where some LOOKUPs can be issued to
force the insertion of entries in the lookup table. In paragraph 6.5 we described a
system to monitor TCP traffic on some specific ports; this behavior can be achieved by
creating an entry for each service (FTP, Web, Telnet…) in this last phase of
initialization, having set a lookup_fcn that is able to insert new entries. When all the
entries corresponding to the interesting TCP ports have been inserted, we change the
lookup_fcn to one that does not permit the insertion of new entries; this is done through
a SET_REGISTER_VALUE instruction.
The filter code will then generate the key with the TCP port and perform a LOOKUP.
If the TCP port is not one of the interesting ones, the lookup_fcn will return
TME_FALSE (since it cannot insert new entries).
In theory it is possible to assign a different execute_fcn to every entry of the lookup
table; in practice, this leads to a lookup table that is almost impossible to interpret at
user level: how could the application distinguish the various execute_fcns used for each
block in the shared_memory_segment?
A detailed overview of the techniques to properly use lookup_fcns can be found in 7.4 -
LOOKUP callbacks.
Example 1: a system to monitor the traffic outgoing from each host in a LAN
This type of analysis can be done by creating a table in which the key is represented by
the Ethernet MAC source address of each packet, the EXECUTE must increment two
counters representing the packets and bytes for each entry.
1. SET_MEMORY 1500000
It resizes extended memory to 1.5 Megabytes, enough to store a TME
data block initialized with default values.
2. INIT 0
It initializes TME data block 0 with default values; the default
lookup_fcn allows the insertion of new entries, and the default exec_fcn48 is
the one that manages the packet and byte counters. We didn’t issue
SET_WORKING since this is done by the INIT command.
3. LD 2
It loads 2 in register A; this is the key length (2 32-bit words).
4. SET_REGISTER_VALUE TME_KEY_LEN
It sets register KEY_LEN to the value stored in register A, which is 2.
5. LD 8
It loads 8 in register A; this is the base address from which the TME
data block will be stored. It starts from byte 8, since the first 8 bytes are
used to store the key passed to the LOOKUP.
6. VALIDATE 0
It validates TME data block 0.
7. SET_ACTIVE_READ 0
48 The one that is assigned to an entry when this last one is created, stored in register DEFAULT_EXEC.
It instructs the TME that the application has to receive data from TME
data block 0.
8. RET 1
It returns success.
Some notes on this initialization code:
• the lookup_fcn and exec_fcn are set by the INIT, with default values;
• if some initialization instruction fails, the machine implicitly executes a ‘RET 0’
instruction;
• it is always necessary to resize memory to about 1.5 Megabytes if we use the
default values set by INIT instruction49.
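The 1.5-Megabyte figure can be cross-checked with a few lines of C, reading the figures of Figure 33 as 32007 lookup-table entries (RECORDs of 8 bytes, per paragraph 7.2.3) and 16000 shared-memory blocks of 64 bytes; this reading of the thousands separators is an assumption.

```c
#include <stdint.h>

/* Memory budget of a TME data block initialized with the default
   values of Figure 33: 32007 lookup-table entries of 8 bytes each
   (the size of a RECORD) plus 16000 shared-memory blocks of
   64 bytes; EXTRA_SEGMENT_SIZE defaults to 0. */
static uint32_t default_block_budget(void)
{
    uint32_t lut_bytes    = 32007u * 8u;   /* lut_segment:   256,056 B   */
    uint32_t shared_bytes = 16000u * 64u;  /* shared memory: 1,024,000 B */
    uint32_t extra_bytes  = 0u;            /* extra_segment: none        */
    return lut_bytes + shared_bytes + extra_bytes; /* about 1.28 MB */
}
```

The total, 1,280,056 bytes, is below but close to 1.5 Megabytes, which is why the SET_MEMORY 1500000 of Example 1 is sufficient for a default-initialized block.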
The corresponding filter program is
1. LD P[6]
It loads the first 4 bytes of the source MAC address from the packet
into register A.
2. EX_ST MEM_EX[0]
It stores register A to extended memory, starting at location 0.
3. LDH P[10]
It loads the last 2 bytes of the source MAC address from the packet
into register A.
4. EX_STH MEM_EX[4]
It stores the 2 least significant bytes of register A to extended memory,
starting at location 4.
5. LOOKUP jt=6 jf=6 k=0
It performs the lookup in the table, k contains the base address of the
key in the extended_memory. jf and jt have the same value,
since we are not interested in performing a different action depending
on the result of the search.
6. EXECUTE k=0
It calls the exec_fcn; k contains the base address, in the
extended_memory, of some data that the exec_fcn may need to
receive.
7. RET 1
It returns success.
49 This is because the extended_memory is always allocated at its default size (64 kilobytes) when the architecture is created, even if a BPF program does not use the TME co-processor; allocating an initial amount of 1.5 Megabytes by default would waste memory.
Example 2: a system to monitor the traffic directed to three specific hosts
This type of analysis can be done by creating a table in which the key is represented by
the destination IP address of each packet, again the EXECUTE must increment two
counters representing the packets and bytes for each entry.
In this example two lookup_fcns are used: the default one,
default_lut_w_insert, and default_lut_wo_insert; these two functions
implement a hash and are almost identical; the only difference is that the former can
insert new elements in the table, while the latter cannot.
1. INIT 0
It initializes TME data block 0 with default values; the default
lookup_fcn allows the insertion of new entries, and the default exec_fcn is
the one that manages the packet and byte counters. We didn’t issue
SET_WORKING since this is done by the INIT command.
2. LD 1
It loads 1 in register A; this is the key length (1 32-bit word).
3. SET_REGISTER_VALUE TME_KEY_LEN
It sets register KEY_LEN to the value stored in register A, which is 1.
4. LD 7
It loads 7 in register A50; this is the number of entries in the lookup table.
5. SET_REGISTER_VALUE TME_LUT_ENTRIES
It sets register LUT_ENTRIES to the value stored in register A, which is 7.
6. LD 4
It loads 4 in register A; this is the number of blocks in the
shared_memory_segment.
7. SET_REGISTER_VALUE TME_SHARED_MEMORY_BLOCKS
It sets register SHARED_MEMORY_BLOCKS to the value stored in
register A, which is 4.
8. LD 4
It loads 4 in register A; this is the base address from which the TME
data block will be stored. It starts from byte 4 since the first 4 bytes are
used to store the key passed to the LOOKUP.
9. VALIDATE 0
It validates TME data block 0.
10. SET_ACTIVE_READ 0
50 Actually we need only 3 entries, but using 7 guarantees that the table always stays rather empty.
It instructs the TME that the application has to receive data from TME
data block 0.
11. LD 0x82C01405
It loads an IPv4 address (130.192.20.5) in register A.
12. EX_ST MEM_EX[0]
It stores that IP address to extended memory, at address 0.
13. LOOKUP jt=14 jf=14 k=0
We perform a LOOKUP to force the insertion of that IP address in the
lookup table.
14. LD 0x82C01719
It loads an IPv4 address (130.192.23.25) in register A.
15. EX_ST MEM_EX[0]
It stores that IP address to extended memory, at address 0.
16. LOOKUP jt=17 jf=17 k=0
We perform a LOOKUP to force the insertion of that IP address in the
lookup table.
17. LD 0x82C028a3
It loads an IPv4 address (130.192.40.163) in register A.
18. EX_ST MEM_EX[0]
It stores that IP address to extended memory, at address 0.
19. LOOKUP jt=20 jf=20 k=0
We perform a LOOKUP to force the insertion of that IP address in the
lookup table.
20. LD default_lut_wo_insert
It loads a lookup_fcn ID in register A; this function is almost the same as
the one used by default, except for the fact that it cannot insert new
entries in the table, and therefore returns TME_FALSE if an insertion
would be needed.
21. SET_REGISTER_VALUE TME_LOOKUP_CODE
It sets register LOOKUP_CODE to the ID stored in register A.
22. RET 1
It returns success.
The corresponding filter program is
1. LDH P[12]
It loads bytes 12 and 13 of the packet, in which the level-3 protocol type
is stored.
2. JE 0x800 jt=3 jf=7
It checks if protocol type (stored in register A) is IP (0x800). If it is IP, then we
can perform LOOKUP and EXECUTE, otherwise we skip the packet.
3. LD P[30]
It loads the destination IP address from the packet to register A.
4. EX_ST MEM_EX[0]
It stores register A to extended memory, starting at location 0.
5. LOOKUP jt=6 jf=6 k=0
It performs the lookup in the table.
6. EXECUTE k=0
It calls the exec_fcn.
7. RET 1
It returns success.
6.9 Third-party extensions
The TME architecture has been created in such a way that third parties can create
add-ons to exploit special features; in particular, developers can create their own
lookup_fcns and exec_fcns, and then include them in the original TME architecture.
6.10 Known limitations
At the moment each TME function checks every single parameter passed by the BPF
virtual processor, to avoid erroneous behavior. What is lacking is a system that
validates the code executed by the BPF virtual processor itself; for example, we should
check that STORE instructions always use valid addresses. The original BPF
architecture ships with a function that validates the original BPF filters,
bpf_validate, but this module is not able to check the newly added instructions (it
simply skips them).
Another limitation of the architecture is due to the fact that only two lookup_fcns and
two execute_fcns are defined at the moment. It would be very interesting to add new
callbacks to deal with more situations.
The initialization phase is quite complex; it would be interesting to simplify it.
When we create open-addressed tables, one of the main issues is that the number of
occupied entries always grows, in theory up to the maximum size of the table. Indeed,
there are a lot of cases in which it is important to keep only the newer entries, while the
older ones can be deleted; at the moment, however, there is no such feature in the
architecture, for a particular reason: the technique for deleting the older entries depends
strictly on the algorithm used to manage the lookup table. In particular, if the table is
handled as an open-addressed hash, the entries cannot simply be deleted; the possible
ways to clean the table are either to recreate a new table out of the existing one, or to
overwrite the older entries when a collision occurs (that is to say, if a key collides with
an older entry, that entry is overwritten rather than performing a rehashing).
Each of the solutions has its pros and cons:
• the former has the advantage that the table is cleaned of all the older entries, but it
has the big disadvantage that while the table is being recreated, the TME architecture
cannot access the existing one, either to update an entry or to create a new one. This
problem has the practical consequence that when the recreation occurs, the entire
BPF virtual processor cannot process packets;
• the latter has the advantage that the table is automatically cleaned when a new entry
must be added, so the BPF virtual processor is never blocked; the drawback is that
an older entry is deleted only if a to-be-inserted entry collides with it; as a
consequence, if a to-be-inserted entry collides with a non-erasable entry and there
are no free entries, the entry will not be inserted.
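The second strategy can be sketched in a few lines of C; this is purely illustrative code (the slot layout and names are invented for the example) and is not part of the TME.

```c
#include <stdint.h>

/* Purely illustrative sketch of the "overwrite the older entry on
   collision" strategy discussed above; the slot layout and names
   are invented for the example and are not part of the TME. */
typedef struct {
    uint32_t key;
    uint32_t value;
} slot;

static void insert_overwrite(slot *table, uint32_t entries,
                             uint32_t key, uint32_t value)
{
    uint32_t index = key % entries;  /* same modulo hashing as the TME */
    /* Whether the slot is free, holds the same key, or holds an
       older colliding entry, the new entry simply takes its place:
       no rehashing is performed, so insertion can never block. */
    table[index].key   = key;
    table[index].value = value;
}
```

The price, as noted above, is that an older entry survives until some new key happens to collide with it.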
7 TME – The implementation
The Table Management Engine architecture was developed using the C language, for
two reasons: first, the BPF architecture was written in this programming language;
second, C is one of the most powerful and performant languages for developing code
that works at kernel level.
Moreover, the TME extension was inserted in the Windows porting of the BPF
architecture, WinPcap.
The modifications involved several parts of WinPcap, particularly the BPF machine,
the modules dealing with the READ and OPEN calls requested by the user, and some
IOCTLs.
7.1 The MODE_MON working mode
A new working mode, MODE_MON, was added to the original set51, in order to deal
with the TME. This mode can be set through the usual APIs of wpcap.dll and packet.dll,
namely pcap_setmode for wpcap.dll and PacketSetMode for packet.dll.
When WinPcap works in this mode, the user receives the data processed by the TME
core; in practice, as stated in the chapter on the TME, the user receives the
shared_memory_segment of the ACTIVE_READ TME data block in a user buffer.
However, it must be noted that the TME architecture can also be used in the other
working modes without problems: for example, the TME extension can be used to
classify packets prior to capturing them.
7.2 Data structures
Before dealing with the structures used by the TME, an important detail must be
pointed out. As described in the previous chapter, the extended_memory is managed
in network byte order, and the data associated with each TME data block (the
XXX_segments) are stored in this memory. A particular choice was made to balance
performance and coherence: the lut_segment is represented in network byte order,
while the shared_memory_segment is represented in a hybrid way, in the sense
that the key is represented in network byte order, to preserve the order in which bytes
are stored in the packets, but the other parts of each block (for example timestamps and
counters) are represented in host byte order. The extra_segment has no
51 The original working modes were MODE_CAPT, to capture packets, and MODE_STAT, corresponding to the statistical mode.
particular byte order. As a consequence, the data contained in the
shared_memory_segment should be considered opaque by the BPF virtual
processor.
In particular, each entry in the lut_segment contains a block offset and an
exec_fcn ID, while each block in the shared_memory_segment contains the key,
a timestamp and the data.
The main data structures used by the TME extensions are defined in the header file
<tme.h>; the most important ones are:
• TME_CORE
• TME_DATA
• RECORD
Another structure, that contains the extended_memory, is MEM_TYPE, defined in
the header file <mem_type.h>.
7.2.1 TME_CORE
This structure contains all the information of the TME extensions; it is used by all
the kernel driver functions of WinPcap that deal with the TME.
typedef struct __TME_CORE
{
uint32 working;
uint32 active;
uint32 validated_blocks;
uint32 active_read;
TME_DATA block_data[MAX_TME_DATA_BLOCKS];
} TME_CORE;
The four unsigned integers working, active, validated_blocks and
active_read correspond to the analogous registers of the TME extensions;
block_data is an array of structures, each containing the registers of one TME data
block.
7.2.2 TME_DATA
This structure contains the registers of a TME data block.
typedef struct __TME_DATA
{
uint32 lut_entries;
uint32 max_fill_state;
uint32 rehashing_value;
uint32 key_len;
uint32 shared_memory_blocks;
uint32 filled_entries;
uint32 block_size;
uint32 extra_segment_size;
uint32 filled_blocks;
lut_fcn lookup_code;
uint32 default_exec;
uint32 out_lut_exec;
uint8 *lut_base_address;
uint8 *shared_memory_base_address;
uint8 *extra_segment_base_address;
uint8 *last_found;
} TME_DATA;
The unsigned integers lut_entries, max_fill_state, key_len,
rehashing_value, shared_memory_blocks, filled_entries,
block_size, extra_segment_size, filled_blocks correspond to the
analogous registers of a TME data block.
lookup_code contains a pointer to the lookup_fcn function used when the LOOKUP
instruction is issued; this variable does not contain the ID of the lookup_fcn, but rather a
‘direct’ pointer to it to improve performance.
default_exec and out_lut_exec contain the ID of exec_fcns, represented as
32-bit unsigned integers.
lut_base_address, shared_memory_base_address and
extra_segment_base_address contain pointers to the starting physical
addresses of these segments. These are absolute pointers, that is to say they refer to
physical memory locations. To obtain the base address of these segments relative to the
extended_memory, it is sufficient to subtract the starting physical address of the
extended_memory, which is available to all the functions dealing with the
TME_DATA structure.
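The conversion described above amounts to simple pointer arithmetic; here is a minimal sketch, with a stand-in buffer in place of the real extended_memory and illustrative function names.

```c
#include <stdint.h>

/* Sketch of the absolute/relative address conversion described
   above; `buffer` stands in for the starting physical address of
   the extended_memory (MEM_TYPE.buffer). */
static uint32_t to_relative(const uint8_t *absolute, const uint8_t *buffer)
{
    return (uint32_t)(absolute - buffer);  /* offset inside extended memory */
}

static const uint8_t *to_absolute(uint32_t relative, const uint8_t *buffer)
{
    return buffer + relative;              /* back to a physical pointer */
}
```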
last_found contains the absolute (i.e. physical) address of the last found entry in
the lookup table, as described in chapter 6 - The Table Management Engine (TME).
7.2.3 RECORD
This structure corresponds to an entry in the lut_segment.
typedef struct __RECORD
{
uint32 block;
uint32 exec_fcn;
} RECORD;
block contains a pointer to a block in the shared_memory_segment; the pointer
is relative to the extended_memory.
exec_fcn contains the ID of the exec_fcn associated with the entry; we did not use a
function pointer here, but rather an ID, because this field is directly accessible by the
instructions of the BPF machine (being stored in the extended_memory), so the BPF
instructions could maliciously modify it. Using an ID prevents hang-ups of the processor,
since every ID is checked by the TME co-processor.
NOTE: each of the fields is represented in network-byte-order; two functions52 are
provided to obtain and set the values of these fields in the usual host-byte-order; they
are SW_ULONG_AT53 to obtain the value, SW_ULONG_ASSIGN to set the value; they
are defined in <memory_t.h>.
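A minimal, portable sketch of what these two helpers do follows; the real ones are inline functions (see footnote 52), so the C bodies below are illustrative, not the WinPcap implementation.

```c
#include <stdint.h>

/* Illustrative C equivalents of what SW_ULONG_AT and
   SW_ULONG_ASSIGN do: read and write a 32-bit value stored in
   network byte order, whatever the host byte order is. */
static uint32_t sw_ulong_at(const uint8_t *base, uint32_t offset)
{
    const uint8_t *p = base + offset;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static void sw_ulong_assign(uint8_t *dest, uint32_t value)
{
    dest[0] = (uint8_t)(value >> 24);  /* network order: MSB first */
    dest[1] = (uint8_t)(value >> 16);
    dest[2] = (uint8_t)(value >> 8);
    dest[3] = (uint8_t)value;
}
```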
7.2.4 MEM_TYPE
This general structure is the one used to contain the extended_memory.
typedef struct __MEM_TYPE
{
uint8 *buffer;
uint32 size;
} MEM_TYPE;
buffer contains a pointer to a physical location in memory; size contains the size, in bytes, of this buffer.
52 At the moment, they are inline functions. 53 SW stands for Swapped, ULONG means unsigned int.
7.3 New BPF instructions
The most important BPF instructions that were added to deal with the TME call
some back-end functions, defined in the source file <tme.c>. The most important
functions correspond to the following BPF instructions:
• INIT
• VALIDATE
• LOOKUP
• EXECUTE.
In the following paragraphs we will describe what sits behind these instructions.
7.3.1 INIT
As stated in the previous chapter, this instruction initializes a TME data block with
default values. These default values are defined in <tme.h>, through some
#defines; their name is XXX_DEFAULT, where XXX is the name of a TME data
block register.
The back-end function called is init_tme_block:
uint32 init_tme_block(TME_CORE *tme, uint32 block)
{
…
//check if the block is usable, otherwise return ERROR
…
//the TME data block is zeroed
…
//the default values are assigned
…
//return SUCCESS
}
tme is a pointer to a TME_CORE structure containing all the data relative to the TME
co-processor; block contains the index of the to-be-initialized TME data block.
This function returns ERROR only if block contains an invalid value, otherwise it
returns SUCCESS.
7.3.2 VALIDATE
This instruction validates a TME data block; the corresponding back-end function is
validate_tme_block, defined in <tme.c>.
uint32 validate_tme_block(MEM_TYPE *mem_ex, TME_CORE *tme,
uint32 block, uint32 mem_ex_offset)
{
…
//check if the block is usable, otherwise return ERROR
…
//check if the registers in the to-be-validated TME
//data block are valid, otherwise return ERROR
…
//calculate required memory; if it is not available,
//return ERROR
…
//clean memory used by the TME data block
…
//initialize XXX_segment registers
…
//set the block as validated and as ACTIVE
…
//return SUCCESS
}
mem_ex is a pointer to the structure containing the extended_memory, tme
contains the data of the TME co-processor, block is an index to the to-be-validated
TME data block, mem_ex_offset is the base address (in the extended_memory)
from which the data relative to the TME data block will be stored.
7.3.3 LOOKUP
This instruction corresponds to the function lookup_frontend. This function limits
itself to calling the lut_fcn defined in the currently ACTIVE TME data block.
uint32 lookup_frontend(MEM_TYPE *mem_ex, TME_CORE *tme,
uint32 mem_ex_offset, struct time_conv *time_ref)
{
…
//call lut_fcn of the currently ACTIVE TME data block
}
mem_ex is a pointer to the structure containing the extended_memory, tme
contains the data of the TME co-processor, mem_ex_offset is the base address (in the
extended_memory) from which the key is stored, time_ref is an opaque structure
containing the data used to generate timestamps.
7.3.4 EXECUTE
This instruction corresponds to the function execute_frontend. In practice, it calls
the exec_fcn associated with the entry pointed to by register LAST_FOUND.
uint32 execute_frontend(MEM_TYPE *mem_ex, TME_CORE *tme,
uint32 pkt_size, uint32 offset)
{
…
//if LAST_FOUND contains 0xFFFFFFFF (not found),
//execute the exec_fcn whose ID is OUT_LUT_EXEC
//and return
…
//check if entry at address LAST_FOUND contains valid
//data, otherwise return ERROR
…
//execute the exec_fcn associated with entry at address
//LAST_FOUND
}
mem_ex is a pointer to the structure containing the extended_memory, tme
contains the data of the TME co-processor, pkt_size contains the size of the packet,
offset is the base address (in the extended_memory) of some data that must be
passed to the EXECUTE callbacks by the BPF machine.
7.4 LOOKUP callbacks
As described in the previous chapter, the LOOKUP instruction is a front-end to a
callback function, generally known as a lookup_fcn. The prototype of this class of
functions is the following:
typedef uint32 (*lut_fcn)(uint8 *key, struct __TME_DATA *data,
MEM_TYPE *mem_ex, struct time_conv *time_ref);
The parameters have the following meaning:
• uint8 *key: it is a buffer containing the key to enter the lookup table. This is the
key generated by the BPF machine. This buffer resides in the extended_memory,
so any modification alters the original key.
• TME_DATA *data: it is a pointer to the TME data block the function will work
on.
• MEM_TYPE *mem_ex: it is a pointer to the MEM_TYPE structure containing the
extended_memory.
• struct time_conv *time_ref: this structure contains some values used to
retrieve the timestamp of the packet. It must be considered opaque by the function; a
function is provided to obtain the timestamp in the usual format used by WinPcap,
that is GET_TIME.
The return value is a status code; three values can be used:
• TME_TRUE: it means that the search was successful; the LOOKUP instruction will
jump to jt instruction;
• TME_FALSE: it means that the search was unsuccessful; the LOOKUP instruction
will jump to jf instruction;
• TME_ERROR: an error occurred; the program flow of the BPF virtual processor ends
immediately, by executing an implicit ‘RET 0’ instruction.
In the next paragraph we will describe two predefined lookup_fcns, to give an example
of how such functions have to be written.
7.4.1 Two examples: default_lut_w_insert and default_lut_wo_insert
default_lut_w_insert (implemented in the code below by the function
normal_lut_w_insert) manages the lookup table as an open-addressed hash and
inserts a new entry in the table if the searched key is not found.
uint32 normal_lut_w_insert(uint8 *key, TME_DATA *data, MEM_TYPE
*mem_ex, struct time_conv *time_ref)
{
uint32 i;
uint32 tocs=0;
uint32 *key32=(uint32*) key;
uint32 shrinked_key=0;
uint32 index;
/*records is the array containing the hash table */
RECORD *records=(RECORD*)data->lut_base_address;
uint8 *offset;
uint32 key_len=data->key_len;
/*the key is shrinked into a 32-bit value, prior to applying*/
/*the hashing function */
for (i=0; i<key_len;i++)
shrinked_key^=key32[i];
/*hashing function*/
/*the first index in the table is calculated*/
index=shrinked_key % data->lut_entries;
/*search loop, tocs is the number of iterations*/
/*they must be less or equal the number of filled entries*/
/*if it is greater, the searched key is certainly not present*/
while (tocs<=data->filled_entries)
{
/* if records[index].block=0, then the entry is free*/
if (records[index].block==0)
{
/*we can create a new entry only if there are free blocks*/
/*in the shared_memory_segment; this is because each */
/*entry is associated with a block in that segment */
if (data->filled_blocks==data->shared_memory_blocks)
{
/*there are no free blocks, we set LAST_FOUND as NULL*/
/*that is the analogous of 0xFFFFFFFF in the physical*/
/*address space, we update the timestamp of the first*/
/*entry, that corresponds to an unsuccessful search */
/*and we return TME_FALSE */
GET_TIME((struct timeval *)(data->
shared_memory_base_address + 4*key_len),time_ref);
data->last_found=NULL;
return TME_FALSE;
}
/*offset contains the absolute pointer (i.e. in physical*/
/*memory) to the block associated with the newly created*/
/*entry*/
offset=data->shared_memory_base_address+
data->block_size*data->filled_blocks;
/*copy the key in the block*/
/*COPY_MEMORY is a macro that permits the function to */
/*work both at kernel level and at user level */
COPY_MEMORY(offset,key32,key_len*4);
/*put a timestamp in the block; it is immediately after */
/*the key */
GET_TIME((struct timeval *)(offset+4*key_len),time_ref);
/*updates the newly created entry with the address, in */
/*the extended memory, of the new block, written in */
/*network byte order */
/*SW_ULONG_ASSIGN is a macro to perform this assignment */
SW_ULONG_ASSIGN(&records[index].block,offset-mem_ex->
buffer);
/*assign the default exec_fcn ID to the entry, in network*/
/*byte order*/
SW_ULONG_ASSIGN(&records[index].exec_fcn,data->
default_exec);
/*increment the number of filled entries*/
data->filled_entries++;
/*increment the number of filled blocks*/
data->filled_blocks++;
/*set register LAST_FOUND with the pointer to the newly */
/*created entry */
data->last_found=(PUCHAR)&records[index];
/*return TRUE */
return TME_TRUE;
}
/*the entry is not free, let’s check if it is the right one*/
offset=mem_ex->buffer+SW_ULONG_AT(&records[index].block,0);
for (i=0; (i<key_len) && (key32[i]==ULONG_AT(offset,i*4));i++);
if (i==key_len)
{
/* the key in the block matches the searched one, */
/* right entry */
/* we update the timestamp and LAST_FOUND register*/
GET_TIME((struct timeval *)(offset+4*key_len),time_ref);
data->last_found=(PUCHAR)&records[index];
/*return successful search*/
return TME_TRUE;
}
else
{
/*the entry is wrong, let’s apply rehashing */
index=(index+data->rehashing_value) % data->lut_entries;
/*and increment the number of iterations */
tocs++;
}
}
/* if we reach this point, the search loop was unsuccessful */
/* so we update the timestamp of the #1 block */
GET_TIME((struct timeval *)(data->shared_memory_base_address+4*key_len),time_ref);
/* and set the LAST_FOUND register to NULL, which is the */
/* analogue of 0xFFFFFFFF in the physical address space */
data->last_found=NULL;
/* and finally return unsuccessful search */
return TME_FALSE;
}
normal_lut_wo_insert is almost identical; the only difference is that when the
hashing function hits a free entry, no insertion is performed and TME_FALSE is
returned. In practice, the code marked with a vertical dashed line is replaced by a
single ‘return TME_FALSE;’ statement.
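The pair of behaviors can be illustrated with a toy open-addressed table in C (a simplified sketch, not TME code: single 32-bit keys, no blocks or timestamps; names like toy_lookup are made up):

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES 7
#define EMPTY   0xFFFFFFFFu   /* marks a free slot, like an unused LUT entry */

/* Toy open-addressed table: one 32-bit key per entry. */
typedef struct {
    uint32_t key[ENTRIES];
    uint32_t rehash;          /* step added to the index on each collision */
} toy_lut;

static void toy_init(toy_lut *t, uint32_t rehash)
{
    for (int i = 0; i < ENTRIES; i++)
        t->key[i] = EMPTY;
    t->rehash = rehash;
}

/* Returns 1 on a successful search, 0 otherwise. With insert != 0 a free
 * slot stores the key and counts as success (like normal_lut_w_insert);
 * with insert == 0 a free slot ends the search with failure (like
 * normal_lut_wo_insert). */
static int toy_lookup(toy_lut *t, uint32_t key, int insert)
{
    uint32_t index = key % ENTRIES;
    for (int tocs = 0; tocs < ENTRIES; tocs++) {
        if (t->key[index] == EMPTY) {
            if (!insert)
                return 0;
            t->key[index] = key;
            return 1;
        }
        if (t->key[index] == key)
            return 1;
        index = (index + t->rehash) % ENTRIES;   /* rehashing step */
    }
    return 0;   /* table full: unsuccessful search */
}
```

The only structural difference between the two callbacks is the branch taken on a free slot, exactly as described above.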
7.4.2 Some tips for writing new LOOKUP callbacks
It is not so difficult to write new lookup_fcns, to implement new search techniques.
However, it is important to follow some guidelines:
• A function can read all the registers of the TME data block, but it may modify only
three of them: filled_entries, filled_blocks and last_found;
in particular, the registers containing the base addresses of the segments,
XXX_base_address, must never be touched. When a new entry is filled in the
lut_segment, filled_entries must be incremented by one; when a new
block is used in the shared_memory_segment, filled_blocks must be
incremented by one. As described in chapter 6 - The Table Management Engine
(TME), register last_found must be used to store the physical address of the
found entry.
• To swap the byte order between little-endian and big-endian, some helper functions
are provided; they are defined in <memory_t.h> and written in assembly to
guarantee good performance.
• To set or update the timestamp of an entry, a function is provided,
GET_TIME, defined in <time_calls.h>; it receives a void pointer to
the location in which the timestamp must be saved, and a pointer to a time_ref
structure; this last structure is an input parameter of all lookup_fcns.
• When a new entry cannot be inserted because the table is full, the function must
return TME_FALSE; it must not return TME_ERROR.
• It is important to remember that register key_len contains the length of the key in
terms of 32-bit words.
After a new lookup_fcn is created, it must be inserted in the framework in this way:
1. The new function must be inserted in file <functions.c>.
2. We must assign a unique ID to this function; this is done through a #define in file
<functions.h>; we must include the prototype of the function in this file, too.
For example, if the function is called my_fcn and its ID is 0x00000055, we
modify the file as follows:
...
/* lookup functions */
#define NORMAL_LUT_W_INSERT 0x00000000
uint32 normal_lut_w_insert(uint8 *key, TME_DATA *data,
    MEM_TYPE *mem_ex, struct time_conv *time_ref);
#define NORMAL_LUT_WO_INSERT 0x00000001
uint32 normal_lut_wo_insert(uint8 *key, TME_DATA *data,
    MEM_TYPE *mem_ex, struct time_conv *time_ref);
#define MY_FCN 0x00000055
uint32 my_fcn(uint8 *key, TME_DATA *data,
    MEM_TYPE *mem_ex, struct time_conv *time_ref);
...
3. Then we have to add a new case in the ‘lookup function mapper’; this function,
used by the TME core, returns the pointer to the lookup_fcn corresponding to a
given ID. The mapper is in file <functions.c>. To add the function above, we
modify the mapper as follows:
...
lut_fcn lut_fcn_mapper(uint32 index)
{
switch (index)
{
case NORMAL_LUT_W_INSERT:
return (lut_fcn) normal_lut_w_insert;
case NORMAL_LUT_WO_INSERT:
return (lut_fcn) normal_lut_wo_insert;
case MY_FCN:
return (lut_fcn) my_fcn;
default:
return NULL;
}
}
...
7.5 EXECUTE callbacks
As explained, the EXECUTE instruction is a front-end to an exec_fcn callback. Here
is the prototype of this class of callbacks:
typedef uint32 (*exec_fcn)(uint8 *block, uint32 pkt_size,
    struct __TME_DATA *data, MEM_TYPE *mem_ex, uint8 *mem_data);
The parameters have the following meaning:
• uint8 *block: it is a pointer to the block, in the shared_memory_segment,
associated with the entry at address LAST_FOUND. It points to a physical memory
location.
• uint32 pkt_size: it contains the size of the packet, as received by the NIC
interface.
• TME_DATA *data: it is a pointer to the TME data block the function will work
on.
• MEM_TYPE *mem_ex: it is a pointer to the MEM_TYPE structure that contains the
extended_memory.
• uint8 *mem_data: it points to a physical memory location (in the
extended_memory), that is passed directly from the EXECUTE front-end.
The return value is a status code; two possible values are allowed:
• TME_SUCCESS: the function completed successfully;
• TME_ERROR: an error occurred; in this case the BPF machine will immediately
terminate the program flow by executing an implicit ‘RET 0’ instruction.
7.5.1 An example: count_packets
This is the simplest example of an exec_fcn; in practice, it stores two counters
in the block, incrementing them for each packet associated with that block.
typedef struct __c_p_data
{
struct timeval timestamp;
uint64 packets;
uint64 bytes;
} c_p_data;
uint32 count_packets(uint8 *block, uint32 pkt_size, TME_DATA *data,
MEM_TYPE *mem_ex, uint8 *mem_data)
{
/*counters is the structure containing the counters*/
/*and the timestamp; since at the beginning of the */
/*block there is the key, this structure begins */
/*immediately after the key */
c_p_data *counters=(c_p_data*)(block+data->key_len*4);
/*now we increment the counters */
counters->bytes+=pkt_size;
counters->packets++;
/*and return success*/
return TME_SUCCESS;
}
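The counting logic can be exercised outside the TME with a self-contained sketch; here key_len is passed explicitly instead of being read from a TME_DATA block, and the caller must hand in a suitably aligned buffer (demo_count and demo_counters are made-up names):

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

/* Per-entry statistics stored in the block right after the key,
 * mirroring the c_p_data layout above. */
typedef struct {
    struct timeval timestamp;
    uint64_t packets;
    uint64_t bytes;
} demo_counters;

/* Simplified counterpart of count_packets: increments the byte and
 * packet counters found after key_len_words 32-bit words of key. */
static void demo_count(uint8_t *block, uint32_t key_len_words,
                       uint32_t pkt_size)
{
    demo_counters *counters =
        (demo_counters *)(block + key_len_words * 4);
    counters->bytes += pkt_size;
    counters->packets++;
}
```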
7.5.2 Some tips for writing new EXECUTE callbacks
It is simple to write new EXECUTE callbacks, bearing in mind a few details:
• All the registers of the TME data block must generally be considered read-only; in
particular, as with the LOOKUP callbacks, it is important not to modify the
XXX_base_address registers.
• The first part of each block contains the key, which identifies the block uniquely, and
the timestamp; as a consequence, this part must never be modified.
• It is important to remember that register key_len contains the length of the key in
terms of 32-bit words.
After a new exec_fcn is created, it must be inserted in the framework in this way:
1. The new function must be inserted in file <functions.c>.
2. We must assign a unique ID to this function; this is done through a #define in file
<functions.h>; we must include the prototype of the function in this file, too.
Some exec_fcns may use a particular data structure to represent the bytes of each
block; it is convenient to define this structure in a separate header file and add
an #include near the ID definition. For example, if the function is called
my_exec_fcn, its ID is 0x00000024, and its data structure is defined in file
<my_exec_fcn.h>, we modify the file as follows:
...
/* execution functions */
#define COUNT_PACKETS 0x00000000
#include "count-packets.h"
uint32 count_packets(uint8 *block, uint32 pkt_size, TME_DATA *data,
MEM_TYPE *mem_ex, uint8 *mem_data);
#define TCP_SESSION 0x00000800
#include "tcp-session.h"
uint32 tcp_session(uint8 *block, uint32 pkt_size, TME_DATA *data,
MEM_TYPE *mem_ex,uint8 *mem_data);
#define MY_EXEC_FCN 0x00000024
#include "my_exec_fcn.h"
uint32 my_exec_fcn(uint8 *block, uint32 pkt_size, TME_DATA *data,
MEM_TYPE *mem_ex,uint8 *mem_data);
...
3. Then we have to add a new case in the ‘execute function mapper’; this function is
analogous to the lookup function mapper and is defined in file
<functions.c>. To add the function above, we modify the mapper as follows:
...
exec_fcn exec_fcn_mapper(uint32 index)
{
switch (index)
{
case COUNT_PACKETS:
return (exec_fcn) count_packets;
case TCP_SESSION:
return (exec_fcn) tcp_session;
case MY_EXEC_FCN:
return (exec_fcn) my_exec_fcn;
default:
return NULL;
}
}
...
7.6 Interaction with the user
7.6.1 Passing data to the user
An application running on top of the TME architecture needs to receive the data
contained in the shared_memory_segment of one of the validated TME data blocks,
in particular the filled blocks of this segment.
Two approaches can be followed to achieve this result:
• it is possible to use a shared buffer, that is to say we can share the buffer containing
the shared_memory_segment, which is at kernel level, with the user level
application. In practice, the application receives a pointer to the same buffer the TME
uses to store this segment;
• it is possible to copy the shared_memory_segment into a user buffer, provided
by the application.
Each of the solutions has its own pros and cons.
The former is a zero-copy system, and this is generally an excellent solution, since
copies are usually time-consuming; however, it has a big drawback: since the
application receives the same buffer on which the TME works, the data may become
incoherent. Suppose the application is reading some data when a new packet appears
on the network: the application is put in a sleep state, and the BPF virtual processor
processes the packet, modifying the TME data; when the application is awakened, the
data it read previously may be meaningless.
There is also another, less frequent, problem: if the application is able to modify the
data contained in this shared buffer, the behavior of the TME, and possibly of the
entire operating system, can become unpredictable.
The latter solution does not have this problem, because the TME data are copied into
a user buffer; nevertheless, whenever the application needs to receive some
information, a certain amount of memory must be copied into the application buffer,
and this amount can be large; for example, a TME data block initialized with default
values may need to transfer up to 1 megabyte [54] to the application.
This last solution is the one adopted by the implementation of the TME.
The implementation is not completely free from incoherence problems: the main issue
is that the copy of the shared_memory_segment into the user buffer is
performed at a lower priority [55] than the BPF virtual processor, that is to say, the
reception of a packet takes precedence over the copy process. As a consequence, the
copy process can be interrupted by the BPF processor, and incoherent data can be
present in the user buffer.
A hack was used to partially prevent this problem: the copy process and the virtual
processor are synchronized with one of the so-called spinlocks [56] in a particular
way: the copy is performed block by block (the blocks that form the
shared_memory_segment), and the lock is acquired before and released after the
copy of each block. This solution guarantees that the data stored in each block are
coherent, although coherence is not guaranteed between blocks.
The only way to gain coherence between blocks is to acquire the lock before the
whole copy and release it when the copy has ended; this solution has the big drawback
that the virtual processor is stopped during the copy, so the packets that arrive in this
interval are lost; in practice, this solution completely inverts the priorities of the
various threads (i.e. the read process gains a higher priority than the BPF virtual
processor).
Here is a snippet of the code used in the copy process:
for (cnt=0;cnt<blocks;cnt++)
{
    //we acquire the spinlock
    NdisAcquireSpinLock(&Open->machine_lock);
    //we copy the block
    RtlCopyMemory(UserPointer,tmp,block_size);
    //we release the spinlock
    NdisReleaseSpinLock(&Open->machine_lock);
    tmp+=block_size;
    UserPointer+=block_size;
}
[54] We need to transfer the shared_memory_segment, which is composed of 16,000 blocks of 64 bytes each.
[55] The copy runs at PASSIVE_LEVEL, in the application context, while the BPF virtual processor works at DISPATCH_LEVEL, in the system context.
[56] A spinlock is a simple synchronization mechanism present in the Windows operating systems.
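The block-level locking scheme can be modeled in user-space C, with a pthread mutex standing in for the NDIS spinlock (an illustrative sketch, not the driver code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Copies 'blocks' blocks of 'block_size' bytes from src to dst, taking
 * the lock around each block: every block lands in dst internally
 * coherent, even if a writer updates src between two iterations, but
 * coherence across blocks is not guaranteed, as the text explains. */
static void copy_by_blocks(uint8_t *dst, const uint8_t *src,
                           size_t blocks, size_t block_size,
                           pthread_mutex_t *lock)
{
    for (size_t cnt = 0; cnt < blocks; cnt++) {
        pthread_mutex_lock(lock);      /* NdisAcquireSpinLock analogue */
        memcpy(dst, src, block_size);  /* RtlCopyMemory analogue */
        pthread_mutex_unlock(lock);    /* NdisReleaseSpinLock analogue */
        src += block_size;
        dst += block_size;
    }
}
```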
To maintain compatibility with the existing read functions of the LIBPCAP library
(pcap_loop and pcap_dispatch), the user buffer also contains a BPF header, which
precedes the data copied from the shared_memory_segment of the TME.
7.6.2 Filter injection
The TME implementation maintains compatibility with the original API used to
inject the filter into the BPF virtual processor, namely pcap_setfilter.
The problem is that not only the filter code but also the initialization code must be
passed to the driver; the chosen approach is to link the two parts, filter and
initialization, into a ‘hybrid’ filter: the hybrid filter is the concatenation of the two
codes, separated by a particular instruction, BPF_SEPARATION, whose opcode is
0xFF.
Here is a graphical explanation of this mechanism:
filter code
BPF_SEPARATION
initialization code
If the hybrid filter does not contain the BPF_SEPARATION instruction, the
initialization phase is considered unnecessary; this solution guarantees compatibility
with the original LIBPCAP compiler, which generates the filter code from
human-readable strings (e.g. ‘tcp port 80’). At the lower level, in the driver, the
function that handles the IOCTL used to inject the filter has been modified to separate
the two parts; this function performs the initialization phase, too; as a consequence, if
this phase fails, the IOCTL returns an unsuccessful result, and the pcap_setfilter API
fails.
[57] RtlCopyMemory, used in the copy snippet above, is the kernel-level counterpart of the usual memcpy API.
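The separation step can be modeled in portable C; this is an illustrative sketch, not the driver's IOCTL handler, using the classic BPF instruction layout (toy_insn and split_hybrid are made-up names):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal BPF instruction layout, as in the classic bpf_insn. */
struct toy_insn {
    uint16_t code;
    uint8_t  jt, jf;
    uint32_t k;
};

#define BPF_SEPARATION 0xFF   /* separator opcode, as in the text */

/* Scans a hybrid program of 'len' instructions. If BPF_SEPARATION is
 * found, *init_start receives the index of the first initialization
 * instruction and the filter length is returned; otherwise the whole
 * program is filter code and *init_start is set to len (no init phase). */
static size_t split_hybrid(const struct toy_insn *prog, size_t len,
                           size_t *init_start)
{
    for (size_t i = 0; i < len; i++) {
        if (prog[i].code == BPF_SEPARATION) {
            *init_start = i + 1;
            return i;               /* filter is prog[0..i) */
        }
    }
    *init_start = len;
    return len;
}
```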
7.7 Implementation issues
Several issues arose during the development of the extension; most of them were due
to the specific operating system on which the module was developed, Microsoft
Windows.
The main problem concerns timestamps: as stated before, each block in the
shared_memory_segment has its own timestamp, represented through a timeval
structure [58]; its precision is one microsecond.
The Microsoft development tools offer a variety of APIs to obtain timestamps; the
problem is that the only one guaranteeing microsecond precision is
KeQueryPerformanceCounter; this function returns two 64-bit integers, representing
the number of ticks elapsed since boot time and the number of ticks per second.
This API has a serious problem: it is very slow to return; some informal tests showed
that it usually takes up to 3 µs, so generating 100,000 timestamps per second can keep
the CPU 10-20% busy just executing this call. A further detail makes things worse:
representing timestamps as timeval structures involves at least two 64-bit integer
divisions, and these divisions are heavily CPU consuming [59].
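The conversion in question reduces to exactly two 64-bit divisions, which can be sketched as follows (an illustrative sketch; the tv64 type and function name are made up):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t tv_sec;
    uint64_t tv_usec;
} tv64;

/* Converts a raw tick count into seconds/microseconds, given the
 * counter frequency in ticks per second: these are the two 64-bit
 * integer divisions the text says are so costly on 32-bit CPUs. */
static tv64 ticks_to_timeval(uint64_t ticks, uint64_t ticks_per_sec)
{
    tv64 t;
    t.tv_sec  = ticks / ticks_per_sec;                      /* division 1 */
    t.tv_usec = (ticks % ticks_per_sec) * 1000000ULL
                / ticks_per_sec;                            /* division 2 */
    return t;
}
```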
A partial solution to this problem is to use a particular high-performance counter
shipped with all 32-bit INTEL processors, the TimeStamp Counter (TSC). This
counter is incremented at every processor clock cycle, so its precision is the CPU
frequency. The IA-32 [60] instruction set provides an instruction to obtain this
timestamp, RDTSC, which is very fast (1 clock cycle).
However, using this counter to generate timestamps raises some problems:
• it is necessary to know the exact processor frequency to convert this counter into a
timeval structure; unfortunately, the documented Microsoft APIs do not provide a
means to obtain this frequency;
• two 64-bit integer divisions are necessary to convert this timestamp into a timeval
structure; this conversion is needed to keep coherence with the timestamps used by
the LIBPCAP library, since these timestamps are passed to the user without
modification;
• this technique of generating timestamps is highly dependent on the particular
processor, since it works only on INTEL CPUs (or compatible ones, such as AMD
ATHLON processors);
• the newer INTEL processors for laptops, the Mobile Pentium series, implement a
new technology, SpeedStep, by which the processor lowers its frequency when the
laptop runs on battery. This behavior can produce wrong timestamps if the
frequency is lowered while an instance of the driver is running. The solution is to
calculate the frequency whenever a driver instance is opened, remembering that
timestamps become wrong if the processor frequency changes while an instance is
open; that is to say, all open driver instances must be closed before changing the
CPU clock.
[58] This structure is formed by two fields, ts_sec and ts_usec, representing the time elapsed since 1/1/1970, expressed in seconds and microseconds.
[59] At least on 32-bit processors.
[60] Intel Architecture, 32-bit.
The current implementation calculates the processor frequency in a ‘dirty’ manner: we
sample the TSC and the timestamp given by KeQueryPerformanceCounter at the
same time, after having raised the IRQL to the highest possible level; then we delay
for some time and repeat the sampling; the processor frequency is then obtained
through some 64-bit integer divisions. The precision depends on the time elapsed
between the two samples: with a delay of about 300 ms, the frequency is exact to the
kilohertz.
This technique is somewhat dirty because raising the IRQL to the highest possible
level (HIGH_LEVEL) is a strongly discouraged operation.
Because of the strong dependence on the processor, two versions of the driver
containing the TME extension can be compiled: a faster one, which uses RDTSC to
generate timestamps, and a slower one, which uses the KeQueryPerformanceCounter
API.
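The calibration described above reduces to a simple proportion between the two tick deltas; here is an illustrative sketch with made-up sample values, not the actual driver code:

```c
#include <assert.h>
#include <stdint.h>

/* Estimates the CPU frequency from two simultaneous samples of the TSC
 * and of a reference counter of known frequency:
 *   cpu_hz = delta_tsc * ref_hz / delta_ref
 * The longer the delay between the two samples, the better the precision. */
static uint64_t estimate_cpu_hz(uint64_t tsc1, uint64_t tsc2,
                                uint64_t ref1, uint64_t ref2,
                                uint64_t ref_hz)
{
    uint64_t delta_tsc = tsc2 - tsc1;
    uint64_t delta_ref = ref2 - ref1;
    return delta_tsc * ref_hz / delta_ref;
}
```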
Another problem is related to the way NDIS passes packets to protocol drivers [61]:
the function called whenever a packet is received by the NIC does not always receive
the entire packet; in some cases it receives only a fragment of it (for example 256
bytes). NDIS provides an API to transfer the remaining part of the packet to a buffer;
the problem is that this API is managed asynchronously by the OS [62]. As a
consequence, in some cases it is difficult to analyze the whole packet. At the moment,
the problem is still open.
The last issue is related to the user buffer: wpcap.dll (the Windows version of
LIBPCAP) allocates a user buffer of 256 kilobytes, and this size is not modifiable,
since it is defined through a #define directive. This buffer is large enough for packet
[61] As described in the WinPcap chapter, the kernel part of the architecture is realized as a network protocol driver, and NDIS is the Windows framework on which network components sit.
[62] That is to say, the function returns immediately, even if the transfer has not finished; a particular callback function, which protocol drivers must define, is called when the transfer ends.
capture, but it is too small to contain the data passed from the kernel driver, in practice
the shared_memory_segment of a TME data block: for example, initializing a
TME data block with the default values gives a segment of about 1 Megabyte.
This issue was solved by adding a new LIBPCAP API to resize the user buffer,
pcap_setuserbuffer.
8 Interacting with the TME
In the previous chapters the BPF filters and the initialization code that used the TME
extension were generated by hand; this approach is generally not good, since the
process of creating new filters is very difficult and time-consuming. A better approach
is to have a system that generates the BPF filters in a simple way, from a high-level
description of what the BPF filter has to do; in practice we need a compiler, able to
generate these low-level filters starting from a human-readable string describing what
we want to do.
The WinPcap library (wpcap.dll) is shipped with a compiler that generates the filter
code for capturing packets from a high-level description of the wanted packets, for
example the string “ip”; in order to support the TME extension we need to add some
new syntactic statements to the high-level language describing the filters, and modify
the compiler to support these statements and generate the appropriate BPF filter code
(and the initialization code, too).
The TME architecture is quite general; in the following paragraphs, however, we will
propose some modifications to the original WinPcap library aimed only at performing
complex statistics.
8.1 The SWITCH statement
The main types of monitoring analyses highlighted in chapter 3 -
Monitoring networks are the following:
• open matrices
• closed matrices
• multiple analyses
• bucket analyses.
All these types can be declared at high level with a single new statement, SWITCH.
Open matrices: the SWITCH statement is used with this syntax
SWITCH(header_field1, header_field2,...)
Each header_field_i uniquely specifies a field in the protocol headers which will
be used to generate the key used to enter the table. For example, if we want to create
an open table in which the key is the source IP address of the packets, the syntax is
SWITCH(ip src host)
If we want to create a multidimensional matrix, the various header fields are separated
by a semicolon (“;”). For example, if we want to create a 2-dimensional matrix
formed by the source and destination IP addresses, the syntax is
SWITCH(ip src host ; ip dst host)
Closed matrices: the SWITCH syntax is the following
SWITCH(header_field=value1,value2,...)
header_field specifies a distinct field in the protocol headers; the value_i are
the values of the header_field that must be monitored. For example, if we want to
create a closed matrix on the source IP addresses 1.2.3.4, 4.5.6.7 and 3.8.9.2, the
syntax is
SWITCH(ip src host=1.2.3.4, 4.5.6.7, 3.8.9.2)
If we want to create multidimensional matrices, the syntax is similar to the one used
for open matrices: the various header fields are separated by a semicolon. This is a
small example of a 2-dimensional closed matrix
SWITCH(tcp src port= 1, 2, 3, 4;tcp dst port= 1, 2, 3, 4)
Multiple analyses: the SWITCH syntax is the following
SWITCH(condition1,condition2,...)
The condition_i are the conditions that a packet must satisfy in order to be
‘monitored’ by the TME. A condition can be a protocol (e.g. “ip”) or a generic
condition on a header field (e.g. “tcp src port=80”). For example, if we want to
monitor the IP traffic, the TCP traffic and the Web traffic, the syntax is
SWITCH(ip,tcp, tcp src port=80 or tcp dst port=80)
Bucket analyses: the SWITCH syntax is the following
SWITCH(header_field=interval1,interval2,...)
header_field specifies a distinct field in the protocol headers; each interval_i
specifies an interval of the possible values assumed by the header_field, in the
form start_value-stop_value. The intervals need not be contiguous (e.g. 1-50,
55-100), but they must not overlap, otherwise the compiler rejects the statement. Here
is a small example that deals with the packet length
SWITCH(pkt_len=1-128,129-256,257-512,513-1514)
Notes:
• it is not possible to combine SWITCH statements relative to different monitoring
analyses. For example, the following syntax is not acceptable, since it combines a
closed matrix with an open one.
SWITCH(ip dst host= 2.2.3.4, 2.5.6.7; ip src host)
The following syntax is acceptable, because it combines two open SWITCHes to
create a 2-dimensional open matrix.
SWITCH(ip dst host;ip src host)
• The SWITCH statement can be preceded by a series of pre-filters that restrict the
traffic monitored by the TME; this pre-filtering uses the syntax of the original BPF
architecture. This is a small example
ip src host 130.192.1.2 SWITCH(ip dst host)
8.2 Interpreting data received by the TME coprocessor
In the previous chapters, data received from the TME coprocessor through the
dispatcher_handler function were interpreted and printed on the screen by hand; that
is to say, these functions knew the format of the received data and were able to
interpret it correctly. The problem is that this approach requires a different
dispatcher_handler for each BPF filter that can be injected into the BPF virtual
processor, which is impractical to realize.
Moreover, if a compiler is used to generate the BPF filters, it is impossible to know
how to interpret the received data before compiling the BPF code itself. As a
consequence, it is necessary to develop an API that is able to interpret the received
data in such a
way to be independent from the particular filter that is compiled. In practice, the most
important feature of this new API is the ability to translate the keys that identify each
block transferred from the TME to the user into a human-readable format. For
example, if the BPF filter creates a table in which each entry corresponds to an IP
address, the key is an IP address (e.g. 0x01020304); this new API must be able to
translate this raw sequence of bytes into a string like “IP 1.2.3.4”.
The proposed architecture to deal with this problem is shown in Figure 34.
[Figure 34 - The proposed architecture to interpret the TME data: the extended compiler takes a SWITCH statement (e.g. SWITCH(tcp src port=21,23,25,80)) and produces both the BPF filter, injected into the BPF + TME coprocessor (BPF core, registers, basic memory, TME core with its registers and extended_memory), and a struct bpf_keyparser used by pcap_keyparser() to parse the TME data.]
Since the interpretation (parsing) process highly depends on the BPF filter generated
by the compiler, the compiler has to create a particular data structure (struct
bpf_keyparser), too, together with the BPF code. This data structure, which must
be considered opaque to the user, is used by the new API, pcap_keyparser(), to
parse the data received from the TME. This function has to be called inside the
user-defined dispatcher_handler to interpret the received data.
8.3 The pcap_keyparser() function
The syntax of this API is the following
char *pcap_keyparser(char *buffer,
struct bpf_keyparser *keyparser,
char parsed_key[1024],
int *offset)
The meaning of the parameters is the following:
• char *buffer: it is the buffer received by the dispatcher_handler, that contains
the data transferred by the TME coprocessor.
• struct bpf_keyparser *keyparser: it is a pointer to the data structure
created by the compiler to parse the data received by the TME.
• char parsed_key[1024]: the function will store the parsed version (e.g. “IP
1.2.34.5”) in this buffer; this buffer must be allocated by the application using
pcap_keyparser().
• int *offset: the function stores here the offset at which the data coupled with
the key start (in the case of monitoring, these data are the byte and packet counters).
The value returned by the function is a pointer to the next block present in the buffer
received by the dispatcher_handler; a NULL pointer is returned if the parsed block is
the last one.
The scheme of a dispatcher_handler function using pcap_keyparser() is the
following:
void dispatcher_handler(u_char *user, const struct pcap_pkthdr *h,
const u_char *sp)
{
char parsed_key[1024];
char *tmp=(char *)sp;
int offset;
//the structure generated by the compiler is passed to the
//dispatcher handler through the ‘user’ parameter
struct bpf_keyparser *keyparser=(struct bpf_keyparser*)user;
while(tmp!=NULL)
{
char *tmp2=pcap_keyparser(tmp,keyparser,parsed_key,&offset);
... //code to display the data
tmp=tmp2;
}
}
8.4 The key parsing process
The monitoring systems described above produce tables in which the keys have a
meaning that depends on the particular analysis performed. In practice we have the
following cases:
Open and closed matrices: they use a key that represents a value of a precise field in
the protocol headers. For example, the syntax “SWITCH(ip src host)” generates
a BPF filter whose keys are IP addresses. In this case the pcap_keyparser() routine
must print the key in a more readable manner by formatting it, in a way similar to the
printf() C function. In practice, the new API must perform a translation like this one
key = 0x01020304  ->  string = “IP 1.2.3.4”
This parsing can be done with something like
printf(“IP %u.%u.%u.%u”,key[0],key[1],key[2],key[3]);
The data structure must therefore contain the formatting string, in the example the string
“IP %u.%u.%u.%u”. A graphical explanation is shown in Figure 35.
[Figure 35 - Key parsing for open and closed matrices: pcap_keyparser receives the key (e.g. 0x01020124) and a struct bpf_keyparser telling it to format the key as “ip source %u.%u.%u.%u”, key[0], key[1], key[2], key[3], producing the parsed key “ip source 1.2.1.36”.]
Multiple analyses: in this case the keys do not represent the value of a particular field
in the protocol headers; rather, they are a description of what is present in the block
associated with the key (e.g. a four-character string, as described in paragraph 3.3.2),
or an identification number chosen by the compiler itself. As a consequence, it is
impossible to use the formatting string described previously. What we need is a small
table that stores the correspondence between the possible values of the keys and the
strings that must be returned by the pcap_keyparser function. This approach is shown
in Figure 36.
Bucket analyses: in this case the key represents the limits of a bucket interval, but
these numbers are values of a particular header field, for example a TCP port. For this
reason we can use the formatting-string method to translate the keys from raw bytes
into a human-readable string.
[Figure 36 - Key parsing for multiple analyses: pcap_keyparser receives the key (e.g. 0x00000004) and a struct bpf_keyparser containing a table of correspondences, returning the parsed key “NetBios”.
key          parsed key
0x00000001   “IP”
0x00000002   “TCP”
0x00000003   “WEB”
0x00000004   “NetBios”]
In order to integrate the two different methods of key parsing into a single data
structure, the bpf_keyparser structure must contain a type field that indicates
how the key must be parsed. Moreover, this structure, generated by the compiler,
must be able to store both structures used for the parsing, that is to say the formatting
string and the table of the used keys. A graphical representation is shown in
Figure 37.
A possible model of this data structure is the following
struct bpf_keyparser
{
int type;
char *formatting_string;
struct key_table *table;
};
in which the structure key_table contains the table of the key -> string
correspondences.
When the structure uses the key-formatting technique, type contains the FORMAT
value, formatting_string contains the formatting string, and table is a NULL
pointer.
When the structure uses the table technique to translate the keys, type contains the
TABLE value, formatting_string is a NULL pointer, and table points to the
structure containing the key table.
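A toy model of this two-way dispatch (hypothetical names and simplifications: the real bpf_keyparser is opaque and compiler-generated, and real keys may be longer than 4 bytes):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { FORMAT, TABLE };

struct key_entry { uint32_t key; const char *name; };

struct toy_keyparser {
    int type;                           /* FORMAT or TABLE */
    const char *formatting_string;      /* used when type == FORMAT */
    const struct key_entry *table;      /* used when type == TABLE  */
    size_t table_len;
};

/* Parses a 4-byte key either by formatting its bytes or by looking the
 * whole key up in a key->string table, as in Figures 35 and 36. */
static void toy_parse_key(const struct toy_keyparser *kp,
                          const uint8_t key[4], char out[1024])
{
    if (kp->type == FORMAT) {
        snprintf(out, 1024, kp->formatting_string,
                 key[0], key[1], key[2], key[3]);
        return;
    }
    uint32_t k = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                 ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
    for (size_t i = 0; i < kp->table_len; i++) {
        if (kp->table[i].key == k) {
            strcpy(out, kp->table[i].name);
            return;
        }
    }
    strcpy(out, "unknown");
}
```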
[Figure 37 - The bpf_keyparser structure: depending on its type field (FORMAT or TABLE), pcap_keyparser either formats the key with a formatting string such as “ip source %u.%u.%u.%u”, key[0], key[1], key[2], key[3], or translates it through a key -> string table (e.g. 0x00000001 -> “IP”, 0x00000002 -> “TCP”, 0x00000003 -> “WEB”, 0x00000004 -> “NetBios”).]
9 A case study
At the time of writing, only two LOOKUP callbacks are defined,
normal_lut_w_insert and normal_lut_wo_insert, described in the
previous chapter. Moreover, only two EXECUTE callbacks have been written:
count_packets, described in the previous chapter, and tcp_session, a function
that recognizes TCP flows, dealing with sequence and acknowledgement numbers,
and reports the amount of bytes transferred in each TCP session. As a consequence, at
the moment the architecture can realize only monitoring systems. In the following
paragraphs we will propose some monitoring problems and present solutions using
the TME extension. Finally, we will show the performance of the TME extension
through one of these examples, highlighting the contribution of the main components
of the architecture to the whole.
9.1 A system to monitor the traffic sent by each host in a network
We are interested in knowing the traffic sent by each host in a LAN; we use the MAC
source address to identify each host. Since we do not know the MAC address of each host
in advance, we need to create an open matrix, in which each cell is labeled with the
MAC source address, and the action for each cell is to 'count packets'.
We do not know the exact number of hosts on the LAN; just to make things easier,
we will use the default values to create the lookup table; these values create a lookup
table of 32.007 entries, and the number of usable blocks63 is 16.000. Since we want to
create an open matrix, the LOOKUP callback will be the default one,
normal_lut_w_insert; the EXECUTE callback, associated to each newly created
entry, is count_packets, which is the default one.
Since the length of the key must be expressed in terms of 32-bit words, and the actual
length of the key is 6 bytes, this length is padded to 2 32-bit words.
Therefore, the initialization code has to resize the extended_memory to store the TME
table, which is about 1,3 megabytes; since we use the default values, we have to issue
the INIT instruction, then set the key length to 2, and finally validate the TME data
block. The filter code will store the key in the first 8 bytes of the extended_memory, so
the TME data will start from byte 8.
63 Remember that each entry, in the lut_segment, is associated to a block, in the shared_memory_segment.
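The padding rule used above is a simple round-up of the key length to whole 32-bit words; as a sketch (the helper name is ours):

```c
/* round a key length in bytes up to 32-bit words, as the TME requires:
   a 6-byte MAC key becomes 2 words, a 2-byte port key becomes 1 word */
static unsigned key_len_words(unsigned key_len_bytes)
{
    return (key_len_bytes + 3) / 4;
}
```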
The last thing to do is to instruct the TME that the data passed to the user must be taken
from TME data block #0; this is done through the SET_ACTIVE_READ instruction.
The initialization code is the following:
1. SET_MEMORY k=1300000
2. INIT k=0
3. LD 2
4. SET_REGISTER_VALUE k=TME_KEY_LEN
5. LD 0
6. VALIDATE k=8
7. SET_ACTIVE_READ k=0
8. RET 1
The corresponding filter code has to extract the MAC source address from the packet
and save it into the first six bytes of the extended_memory, where we have decided
to store the key. This is done through a 32-bit and a 16-bit LOAD from the packet, and
through a 32-bit and a 16-bit STORE to the extended_memory.
Then we need to call the LOOKUP, to search the table for that particular MAC address,
and then EXECUTE, to increment the bytes and packets counters of the found entry.
Finally the code returns with a 0 value, through a RET instruction.
This is the filter code:
1. LD P[6]
2. EX_ST MEM_EX[0]
3. LDH P[10]
4. EX_STH MEM_EX[4]
5. LOOKUP jt=6 jf=6 k=0
6. EXECUTE k=0
7. RET 0
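In C terms, the two load/store pairs above copy the 6-byte MAC source address (bytes 6-11 of the Ethernet header) to the start of the extended memory; a sketch of the equivalent operation, ignoring the byte-order handling of the virtual machine registers (buffer names are ours):

```c
#include <string.h>

/* copy the MAC source address (packet bytes 6..11) into the key area,
   as done by the 32-bit and 16-bit load/store pairs of the filter */
static void store_mac_key(const unsigned char *pkt, unsigned char *mem_ex)
{
    memcpy(mem_ex, pkt + 6, 4);      /* LD  P[6]  / EX_ST  MEM_EX[0] */
    memcpy(mem_ex + 4, pkt + 10, 2); /* LDH P[10] / EX_STH MEM_EX[4] */
}
```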
What we described so far is the program to be injected into the kernel driver; now we
have to deal with its user-level counterpart. The following code is a very simple
example of a dispatcher_handler function64, that limits itself to printing out the
MAC address and the counters on the screen.
64 As described in chapter 5 - The original BPF architecture and WinPcap, this is the function that LIBPCAP calls when the data are transferred from the driver to user level.
void dispatcher_mac(u_char *user, const struct pcap_pkthdr *h, const u_char *sp)
{
/* h is the header, sp is the buffer */
/* we calculate the number of transferred blocks by dividing */
/* the buffer size by the block size (64 bytes) */
ULONG records=h->caplen/64;
ULONGLONG i;
c_p_data *d;
/* the first block is the 'out of counts' one: it counts the */
/* packets lost because the table is full */
/* the offset of 8 skips the key, which is 2 32-bit words long */
d=(c_p_data*)(sp+8);
printf("\n-----------------------------------------\n");
printf("Out of counts\tpackets= %-I64u\tbytes= %-I64u\n ",
d->packets,d->bytes);
/* this is the loop for the 'normal' blocks */
for (i=1;i<records;i++)
{
/*curr contains the pointer of the i-th block */
PUCHAR curr=sp+i*64;
d=(c_p_data*)(curr+8);
printf("MAC source = %.2x:%.2x:%.2x:%.2x:%.2x:%.2x\t",
curr[0],
curr[1],
curr[2],
curr[3],
curr[4],
curr[5]);
printf("packets= %-I64u\tbytes= %-I64u\n",d->packets,d->bytes);
}
printf("-----------------------------------------\n");
}
9.2 A system to monitor Web, FTP, SMTP and Telnet traffic
We are interested in obtaining statistical data relative to particular TCP services. This is
done through a close matrix, in which each entry corresponds to a TCP port, in
particular to one of the five interesting ones (80 → Web, 21 → FTP,
20 → FTP data, 25 → SMTP, 23 → Telnet). Here we can use a table of 11 entries and 6
blocks; we use a table with more entries than needed because we want to keep it quite
empty, since the table is managed as a hash.
The true key length is 2 bytes, but it will be padded to 1 32-bit word.
It is not necessary to resize the memory, since 64 kilobytes, the default size, are enough.
In the initialization phase we will initialize a TME data block with the default LOOKUP
callback, normal_lut_w_insert, which is able to insert new entries; then we will use
LOOKUP to force the insertion of the desired 5 entries, without performing a
subsequent EXECUTE; finally we will change the LOOKUP callback to
normal_lut_wo_insert, which cannot insert new entries.
Here is the initialization code:
1. INIT k=0
2. LD 1
3. SET_REGISTER_VALUE k=TME_KEY_LEN
4. LD 11
5. SET_REGISTER_VALUE k=TME_LUT_ENTRIES
6. LD 6
7. SET_REGISTER_VALUE k=TME_SHARED_MEMORY_BLOCKS
8. LD 0
9. VALIDATE k=64
10. SET_ACTIVE_READ k=0
11. LD 20
12. EX_STH MEM_EX[0]
13. LOOKUP jt=14 jf=14 k=0
14. LD 21
15. EX_STH MEM_EX[0]
16. LOOKUP jt=17 jf=17 k=0
17. LD 23
18. EX_STH MEM_EX[0]
19. LOOKUP jt=20 jf=20 k=0
20. LD 25
21. EX_STH MEM_EX[0]
22. LOOKUP jt=23 jf=23 k=0
23. LD 80
24. EX_STH MEM_EX[0]
25. LOOKUP jt=26 jf=26 k=0
26. LD NORMAL_LUT_WO_INSERT
27. SET_REGISTER_VALUE k=TME_LOOKUP_CODE
28. RET 1
The filter code is the following:
1. LDH P[12]
2. JE 0x800 jt=3 jf=15
3. LDH P[20]
4. JSET 0x1fff jt=15 jf=5
5. LDB P[23]
6. JE 0x6 jt=7 jf=15
7. LDXB 4*(P[14]&0xf)
8. LDH P[X+14]
9. EX_STH MEM_EX[0]
10. LOOKUP k=0 jt=14 jf=11
11. LDH P[X+16]
12. EX_STH MEM_EX[0]
13. LOOKUP k=0 jt=14 jf=14
14. EXECUTE k=0
15. RET 0
In practice, we check that the packet contains an IP payload (instructions 1-2), then we
verify that it is not a fragment (instructions 3-4), and finally we test that it is a TCP
packet (instructions 5-6).
Afterwards, we load the IP header length with a special BPF instruction (7); then we
load the TCP source port and perform a LOOKUP; if it is successful, we jump to
instruction 14 and increment the counters with an EXECUTE. If the search (through
the LOOKUP) was not successful, we load the TCP destination port and perform a
LOOKUP again; at this point, we always perform an EXECUTE; if even the second
search was not successful, the EXECUTE will increment the counters relative to the
'out-of-counts' block (the first block in the shared_memory_segment).
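Stripped of the BPF details, the control flow amounts to two lookups and one counter update; a sketch in C, where lookup_port and the counters array are illustrative stand-ins for the TME callbacks and the shared_memory_segment:

```c
/* illustrative model of the close matrix: entry 0 plays the role of
   the 'out of counts' block */
#define N_PORTS 5
static const unsigned short known_ports[N_PORTS] = {20, 21, 23, 25, 80};
static unsigned long counters[N_PORTS + 1]; /* [0] = out of counts */

/* stand-in for LOOKUP with normal_lut_wo_insert: returns the entry
   index, or 0 (the out-of-counts block) when the port is unknown */
static int lookup_port(unsigned short port)
{
    int i;
    for (i = 0; i < N_PORTS; i++)
        if (known_ports[i] == port)
            return i + 1;
    return 0;
}

/* stand-in for the LOOKUP/LOOKUP/EXECUTE sequence of the filter */
static void classify(unsigned short src, unsigned short dst)
{
    int e = lookup_port(src);
    if (e == 0)        /* first LOOKUP failed: try the destination */
        e = lookup_port(dst);
    /* EXECUTE: update the found entry or the out-of-counts block */
    counters[e]++;
}
```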
The flow chart of the algorithm is shown in Figure 38.
[Figure: lookup on the source port; if it is known, increment its counters and exit; otherwise lookup on the destination port; if it is known, increment its counters, else increment the 'out of counts' counters; then exit.]
Figure 38 – Flow chart of the system to monitor some known TCP ports
The following code is an example of dispatcher_handler function that prints out
the packets for each port, together with the timestamp of the block.
void dispatcher_handler(u_char *user, const struct pcap_pkthdr *h,
const u_char *sp)
{
ULONG i,records=h->caplen/64;
printf("---------------------------------\n");
for (i=0;i<records;i++)
{
PUCHAR block=sp+i*64;
/* we convert the TCP port from network byte order */
/* to host byte order */
USHORT port=block[0]*256+block[1];
/* the counter structure, c_p_data, starts 4 bytes after the key */
c_p_data *d=(c_p_data*)(block+4);
ULONG hours,mins,secs;
secs=d->timestamp.ts_sec%60;
mins=(d->timestamp.ts_sec/60)%60;
hours=(d->timestamp.ts_sec/3600)%24;
switch(port)
{
case 80: //web
printf("Web traffic at %.2u:%.2u:%.2u\tPackets=%I64u\n",
hours,mins,secs,d->packets);
break;
case 21: //ftp
printf("FTP traffic at %.2u:%.2u:%.2u\tPackets=%I64u\n",
hours,mins,secs,d->packets);
break;
case 20: //ftp data
printf("FTP data traffic at %.2u:%.2u:%.2u\tPackets=%I64u\n",
hours,mins,secs,d->packets);
break;
case 23: //telnet
printf("Telnet traffic at %.2u:%.2u:%.2u\tPackets=%I64u\n",
hours,mins,secs,d->packets);
break;
case 25: //smtp
printf("SMTP traffic at %.2u:%.2u:%.2u\tPackets=%I64u\n",
hours,mins,secs,d->packets);
break;
case 0: //out of counts block
printf("Other traffic at %.2u:%.2u:%.2u\tPackets=%I64u\n",
hours,mins,secs,d->packets);
break;
default:
break;
}
}
}
9.3 A system to monitor IP, TCP and UDP traffic
We are interested in knowing the overall traffic, in terms of bytes, the amount of IP
traffic and, in particular, the percentage of TCP and UDP traffic relative to IP traffic.
This case is different from the previous ones, since we want to acquire data relative to
heterogeneous aspects of network traffic; the problem can be solved by using a sort of
string as a key to enter the lookup table, as presented in paragraph 3.2.3 - Multiple
analyses.
The used algorithm is shown in Figure 39.
[Figure: LOOKUP with key 'ALL' and EXECUTE for every packet; if the packet is IP, LOOKUP with key 'IP' and EXECUTE; then, if it is TCP, LOOKUP with key 'TCP' and EXECUTE; if it is UDP, LOOKUP with key 'UDP' and EXECUTE; otherwise exit.]
Figure 39 – Algorithm used to realize a multiple level monitor
The initialization phase has to create a table of 9 entries (as usual, more than
needed, to achieve better performances with the hash search) and 5 blocks (4 blocks
for 'ALL', 'IP', 'TCP', 'UDP' plus one for the 'out-of-counts' block); the key
length is 4 bytes (the minimum one, 1 32-bit word).
Here is the initialization code:
1. INIT k=0
2. LD 1
3. SET_REGISTER_VALUE k=TME_KEY_LEN
4. LD 9
5. SET_REGISTER_VALUE k=TME_LUT_ENTRIES
6. LD 5
7. SET_REGISTER_VALUE k=TME_SHARED_MEMORY_BLOCKS
8. LD 0
9. VALIDATE k=4
10. SET_ACTIVE_READ k=0
11. RET 1
The filter code, that implements the algorithm shown in Figure 39, is the following:
1. LD 0x414c4c00 //string ‘ALL’
2. EX_ST MEM_EX[0]
3. LOOKUP k=0 jt=4 jf=4
4. EXECUTE k=0
5. LDH P[12]
6. JE 0x800 jt=8 jf=7
7. RET 0
8. LD 0x49500000 //string ‘IP’
9. EX_ST MEM_EX[0]
10. LOOKUP k=0 jt=11 jf=11
11. EXECUTE k=0
12. LDB P[23]
13. JE 0x6 jt=14 jf=19
14. LD 0x54435000 //string ‘TCP’
15. EX_ST MEM_EX[0]
16. LOOKUP k=0 jt=17 jf=17
17. EXECUTE k=0
18. RET 0
19. JE 0x11 jt=20 jf=24
20. LD 0x55445000 //string ‘UDP’
21. EX_ST MEM_EX[0]
22. LOOKUP k=0 jt=23 jf=23
23. EXECUTE k=0
24. RET 0
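The 32-bit constants loaded at instructions 1, 8, 14 and 20 are simply the ASCII strings packed into a word, most significant byte first and padded with zero bytes; a quick check (the helper is ours):

```c
#include <stdint.h>

/* pack up to four characters into a 32-bit key, MSB first,
   padding with zero bytes as in the filter above */
static uint32_t pack_key(char a, char b, char c, char d)
{
    return ((uint32_t)(unsigned char)a << 24) |
           ((uint32_t)(unsigned char)b << 16) |
           ((uint32_t)(unsigned char)c << 8)  |
            (uint32_t)(unsigned char)d;
}
```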
An example of a dispatcher_handler function that prints out the results is the
following:
void dispatcher_handler(u_char *user, const struct pcap_pkthdr *h,
const u_char *sp)
{
ULONG i,records=h->caplen/64;
ULONGLONG all=0,ip=0,tcp=0,udp=0;
for (i=0;i<records;i++)
{
PUCHAR block=sp+i*64;
c_p_data *d=(c_p_data*)(block+4);
if((block[0]=='A') &&
(block[1]=='L') &&
(block[2]=='L') &&
(block[3]=='\0'))
all=d->bytes;
if((block[0]=='I') &&
(block[1]=='P') &&
(block[2]=='\0') &&
(block[3]=='\0'))
ip=d->bytes;
if((block[0]=='T') &&
(block[1]=='C') &&
(block[2]=='P') &&
(block[3]=='\0'))
tcp=d->bytes;
if((block[0]=='U') &&
(block[1]=='D') &&
(block[2]=='P') &&
(block[3]=='\0'))
udp=d->bytes;
}
printf("Total bytes = %I64u\n",all);
/* the ratios are multiplied by 100 to obtain percentages */
if(all!=0)
printf("Percentage of IP %f%%\n",100.0*(float)ip/(float)all);
if(ip!=0)
{
printf("Percentage of TCP relative to IP %f%%\n",
100.0*(float)tcp/(float)ip);
printf("Percentage of UDP relative to IP %f%%\n",
100.0*(float)udp/(float)ip);
}
}
9.4 Performances
9.4.1 The monitor system under test
The monitor system used for the test uses the TCP session ID as a key to enter the
hash table.
In particular, the key is composed of the IPv4 source and destination addresses and the
TCP source and destination ports, in this order; therefore the key is 12 bytes wide, that
is to say 3 32-bit words, as shown in the following table:
IP source (32 bits) | IP destination (32 bits) | TCP src port (16 bits) | TCP dest port (16 bits)
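In C the 12-byte key can be pictured as the following packed layout (the struct and field names are ours; byte-order issues are ignored):

```c
#include <stdint.h>

/* the 12-byte TCP session key: 3 32-bit words, in the order
   produced by the filter code below */
#pragma pack(push, 1)
struct tcp_session_key {
    uint32_t ip_src;   /* IPv4 source address      */
    uint32_t ip_dst;   /* IPv4 destination address */
    uint16_t tcp_src;  /* TCP source port          */
    uint16_t tcp_dst;  /* TCP destination port     */
};
#pragma pack(pop)
```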
We use an open matrix, and we will initialize the TME with the default values; the
initialization code used for the tests is the following:
1. SET_MEMORY k=1300000
2. INIT k=0
3. LD 3
4. SET_REGISTER_VALUE k=TME_KEY_LEN
5. LD 0
6. VALIDATE k=12
7. SET_ACTIVE_READ k=0
8. RET 1
The filter code has to check that the packet contains IP and TCP and that it is not an IP
fragment; on success, it extracts the IP addresses and the TCP ports to generate the key
used to enter the lookup table. Finally we perform a LOOKUP and an EXECUTE.
1. LDH P[12]
2. JE 0x800 jt=3 jf=18
3. LDB P[23]
4. JE 0x6 jt=5 jf=18
5. LDH P[20]
6. JSET 0x1fff jt=18 jf=7
7. LD P[26]
8. EX_ST MEM_EX[0]
9. LD P[30]
10. EX_ST MEM_EX[4]
11. LDXB 4*(P[14]&0xf)
12. LDH P[X+14]
13. EX_STH MEM_EX[8]
14. LDH P[X+16]
15. EX_STH MEM_EX[10]
16. LOOKUP k=0 jt=17 jf=17
17. EXECUTE k=0
18. RET 0
The dispatcher handler function used simply calculates the total amount of packets
received by adding the packet counters of the various blocks.
It prints out this sum, and the number of received blocks. The read process is executed
every five seconds. This is the code of this function:
void dispatcher_handler(u_char *user, const struct pcap_pkthdr *h,
const u_char *sp)
{
ULONG records=h->caplen/64;
ULONGLONG i;
ULONGLONG j=0;
c_p_data *d;
printf("\n-----------------\nRecords=%u\n",records);
for (i=0;i<records;i++)
{
PUCHAR curr=sp+i*64;
d=(c_p_data*)(curr+12);
j+=d->packets;
}
printf("Total packets=%I64u\n",j);
}
9.4.2 Test bed
The tests were performed using two machines shipped with a copper Gigabit
Ethernet NIC each, connected with a UTP cross cable. The operating system was
Windows 2000 Service Pack 2. The TCP/IP stack was not bound to the NIC devices, so
the IP packets received by these NICs are processed only by the WinPcap driver. We
used the version of the driver that generates timestamps with the assembler instruction
RDTSC.
The test bed is shown in Figure 40.
[Figure: a traffic generator (sender) and a receiver connected by full-duplex 1 Gbps Ethernet; one machine is a Pentium III 500 MHz (256 MB RAM, 13 GB HD, Windows 2000 Professional eng), the other a Pentium II 400 MHz (128 MB RAM, 6.4 GB HD, Windows 2000 Server eng); both use a 3Com Gigabit Ethernet 3C996T NIC.]
Figure 40 – Test bed
The traffic was generated by a program based on a modified version of WinPcap that is
able to send packets at a specified rate, for testing purposes. It sends an IPv4 packet
with a TCP payload65 several times. Figure 41 and Figure 42 show the trend of the
maximum packet rate and the corresponding traffic (in Mbps) for different packet sizes.
[Figure: maximum packet rate (in pkts/s, up to 160.000) versus packet size (in bytes, up to 1400).]
Figure 41 – Packet rate vs. packet size
[Figure: generated traffic (in Mbps, up to 700) versus packet size (in bytes, up to 1600).]
Figure 42 – Traffic vs. packet size
The CPU load was measured through Task Manager; the CPU ticks were measured
through particular counters implemented in the Intel Pentium processors, which count
the CPU clock ticks and can be enabled and disabled through a special assembler
instruction.
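A sketch of how such a tick counter can be read from user code (GCC-style inline assembly; on other compilers or architectures the sketch falls back to a dummy counter so it stays portable):

```c
#include <stdint.h>

/* read the processor tick counter; on x86 this is the RDTSC
   instruction used by the thesis driver, elsewhere a dummy counter */
static uint64_t read_ticks(void)
{
#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    static uint64_t fake;
    return ++fake;
#endif
}
/* the cost of a code fragment is then the difference between two
   readings taken before and after it */
```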
9.4.3 Test 1
The first test is used to show the overall performances of the TME architecture with
the BPF filter described in paragraph 9.4.1. It points out the dependency of the CPU
load on the packet rate.
Several million packets were sent on the link, at different packet rates; the packet size
was always 100 bytes. The BPF filter is the one described in paragraph 9.4.1.
65 In particular, MAC source is 2:2:2:2:2:2, MAC destination is 1:1:1:1:1:1, IP source is 1.2.3.4, IP destination is 5.6.7.8, TCP source port is 4500 and TCP destination port is 4444.
9.4 Performances
126
The result of this test is shown in Figure 43.
[Figure: CPU load (0-100%) versus packet rate, theoretical and experimental curves; the annotated rates on the chart are 149.300 and 225.000 pkts/s.]
Figure 43 – CPU vs. Packet rate
It is clear that the CPU load grows linearly with the packet rate; the maximum packet
rate we reached during the tests was 145.000 packets/s, but the CPU load of the sender
machine was never at 100%. We do not know why we could not reach a higher packet
rate; we suspect that the limit is due either to the WinPcap send module, or to a physical
limit of the Gigabit NICs. In theory, if the trend of the curve in Figure 43 is linear, the
receive machine under test could manage a packet rate of up to 225.000 packets/s
without losses; we expect that a much faster processor, for example a Pentium 4
machine, would be able to deal with a larger amount of packets per second.
9.4.4 Test 2
The second test points out the contributions of each component to the overall
performance of the system under test. We sent one thousand packets, and we measured
the CPU ticks used to process these packets, isolating the various parts that form the
monitoring system under test.
The test was performed in several steps:
• at first, the filter code was set to a single 'RET 0' instruction. In practice, this shows
the CPU ticks used by the underlying network drivers to receive a packet; this is the
most variable information, since it depends on a variety of parameters, among which
the specific NIC and its drivers, the packet rate, and the specific timings of the PCI bus;
• then we added the filter code that generates the key for the lookup instruction
(instructions 1-13 of the filter code); this test points out the contribution of the key
generation, that is to say the part in which the packet headers are parsed and the key,
formed by some fields of the headers, is created;
• afterwards we added the LOOKUP instruction; in this first case, the driver was
modified not to insert timestamps; this test gives the contribution of the LOOKUP
instruction, without the timestamping issues;
• then we re-enabled the timestamp generation and performed the last test again; in
this way we can obtain the CPU ticks needed to add the timestamps; the tests were
made first using the RDTSC instruction to generate the timestamp, and then the
Windows API KeQueryPerformanceCounter;
• finally we added the EXECUTE instruction too; this test uses the normal filter
code previously described.
1. LDH P[12]
2. JE 0x800 jt=3 jf=18
3. LDB P[23]
4. JE 0x6 jt=5 jf=18
5. LDH P[20]
6. JSET 0x1fff jt=18 jf=7
7. LD P[26]
8. EX_ST MEM_EX[0]
9. LD P[30]
10. EX_ST MEM_EX[4]
11. LDXB 4*(P[14]&0xf)
12. LDH P[X+14]
13. EX_STH MEM_EX[8]
14. LDH P[X+16]
15. EX_STH MEM_EX[10]
16. LOOKUP k=0 jt=17 jf=17
17. EXECUTE k=0
18. RET 0
(instructions 1-15 precede the LOOKUP and form the key generation; instruction 16 is the LOOKUP; instruction 17 is the EXECUTE; instruction 18 alone corresponds to the underlying drivers)
The result of these tests is shown in Figure 44 and Figure 45.
[Figure: breakdown of the CPU ticks per packet using RDTSC timestamps – underlying drivers: 1350; key generation: 418; LOOKUP without timestamp: 146; timestamp: 296; EXECUTE: 124.]
Total CPU ticks = 2334
Figure 44 – CPU ticks of each component using RDTSC to generate timestamps
9.4 Performances
128
[Figure: breakdown of the CPU ticks per packet using KeQueryPerformanceCounter timestamps – underlying drivers: 1350; key generation: 418; LOOKUP without timestamp: 146; timestamp: 2156; EXECUTE: 124.]
Total CPU ticks = 4194
Figure 45 – CPU ticks of each component using KeQueryPerformanceCounter to generate timestamps
Several conclusions can be inferred from these tests:
• It is clear that generating timestamps with KeQueryPerformanceCounter is almost 10
times slower than using the RDTSC instruction. This is why it is so important to use
this assembler instruction, although it has the previously mentioned problems
related to the frequency calculation.
• Even using the RDTSC instruction, the ticks needed for timestamping are more than
10% of the total CPU ticks: this is due to the fact that the process of timestamping
involves several 64-bit integer divisions, which are very slow on 32-bit processors.
• Using RDTSC, less than 50% of the total CPU ticks is used by the WinPcap
monitoring system; the remaining part depends on the underlying NIC drivers.
• Using a processor with a clock of 500 MHz (like the one used for the tests), the
interval between the arrival of two packets must be at least 4.7 µs using the RDTSC
timestamps, and at least 8.4 µs using the KeQueryPerformanceCounter timestamps.
This means that the maximum traffic rate is about 215.000 pkts/s for the former
solution, and 119.000 pkts/s for the latter.
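The last two figures follow directly from the measured tick totals and the 500 MHz clock; the arithmetic, as a small helper pair (the names are ours):

```c
/* minimum inter-arrival time and maximum rate from a tick budget:
   ticks / clock = seconds per packet; clock / ticks = packets/s */
static double min_interval_us(unsigned ticks, double clock_hz)
{
    return ticks / clock_hz * 1e6;
}

static double max_rate_pps(unsigned ticks, double clock_hz)
{
    return clock_hz / ticks;
}
```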
9.4.5 Test 3
This test points out the performances of the used hash algorithm when one or
more hash collisions happen.
The test was performed in this way:
• at first we generated N different packets, to generate N-1 collisions;
• then we sent some packet bursts that involved one or more collisions with the
previously sent packets, and we measured the CPU load.
Each packet burst was formed of ten million identical packets of 100 bytes, with a packet
rate of 100.000 packets/s; each burst caused a fixed number of collisions for
each packet. The test was repeated with bursts that caused an increasing number of
collisions.
The results of the test are shown in Figure 46.
[Figure: CPU ticks per packet versus number of hash collisions – 0 collisions: 2334 ticks; 1: 2405; 2: 2482; 3: 2545; 4: 2618; 5: 2731.]
Figure 46 – CPU ticks vs. Hash collisions
The outcome of the test shows clearly that performances are not lowered much by
collisions; moreover, we must point out a detail: usually two or more collisions happen
when the hash table is quite full; this can be avoided by using large hash tables, and
limiting the fill rate by using a number of blocks (in the
shared_memory_segment) which is about half the number of entries (in the
lut_segment).
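The measured points suggest a roughly linear cost: about 79 extra ticks per collision on top of the 2334-tick base. A sketch of this model (the slope is our rough fit to the measured points, not a TME parameter):

```c
/* approximate cost model suggested by the collision measurements:
   base cost plus a near-constant per-collision overhead */
static unsigned estimated_ticks(unsigned collisions)
{
    return 2334 + 79 * collisions; /* 79 ≈ (2731 - 2334) / 5 */
}
```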
9.4.6 Test 4
This last test points out that performances do not depend on the packet size.
To demonstrate this fact, we sent some packet bursts with the same packet rate (50.000
packets/s) and different packet sizes. For convenience of measurement, this test is
expressed in terms of CPU load rather than CPU ticks.
The result of this test is shown in Figure 47.
[Figure: CPU load versus packet size at a fixed rate of 50.000 pkts/s – 60 bytes: 24%; 100: 25%; 150: 25%; 200: 25%; 500: 25%; 1000: 26%; 1514: 27%.]
Figure 47 – CPU vs. Packet size
In practice, the CPU load grows slightly; this is because the underlying network
drivers have to process a number of bytes per second that grows as the packet size
rises. Apart from this, the CPU load can be considered almost constant: the packet
is never copied, so the performances do not depend on the packet length.
10 Future work
At the moment of writing this paper, the architecture is somewhat immature: the
proposed extension, TME, is completely written, but many features that are necessary to
build a truly universal packet filtering architecture have still to be realized. Here is a
list of the new features whose development is still in progress:
• New LOOKUP and EXECUTE callbacks. At the moment there are only two
LOOKUP callbacks and two EXECUTE callbacks; in particular, it is necessary to
add a new lookup_fcn to deal with ACLs. As regards EXECUTE callbacks, the
existing function to deal with TCP sessions (tcp_session) is quite rough, as it
cannot deal with the latest TCP extensions (e.g. TCP SACK). Another interesting
feature, to be added through a LOOKUP callback, is the chance to perform pattern
matching, maybe in a way similar to Snort.
• Opportunity to bridge packets from one interface to another after having filtered
them at kernel level. At the moment it is possible to perform this task at user
level, that is to say transferring packets received by an interface up to user level and
then sending them out of another interface; this solution does not scale when the
packet rate grows, since the operation of receiving and sending each packet
involves at least two context switches and three or four copies of the packet.
• Test a kernel-buffered version of the driver. The current version of the driver
performs the packet filtering on the fly, by executing the BPF virtual processor,
that is to say without buffering them. Although a packet copy can be time
consuming, the buffering technique can make things better during network bursts.
• The problems related to timestamps were solved by using the RDTSC assembler
instruction. The open issues are how to obtain the processor frequency in an
elegant manner (the current technique is too complex and ‘dirty’), and how to
behave on particular processors (in particular the Intel processors implementing
the SpeedStep technology).
• Implement logging capabilities. This feature is quite simple to realize, since we
can exploit the circular buffer, usually used to store packets at kernel level, by
inserting a new working mode, MODE_LOG.
• On-the-fly filter modification. It would be very interesting to have the chance to
modify the packet filtering system while it is running; for example, if we are
monitoring some precise TCP ports (a close matrix), we may want to add a new TCP
port to the matrix. In practice, we want to inject new snippets of initialization
code that modify the behavior of the currently working filter program.
• Cascading filters. At the moment each instance of the driver uses only one filter;
we have to open several driver instances in parallel to manage multiple filters.
An optimization could be merging n filters into one big one, either in series or in
parallel, composed of modules that correspond to the original filters, each one
with a different working mode, for example one in capture mode, one in
statistical mode, one in logging mode.
11 Related Work
11.1 The statistical mode of WinPcap drivers
WinPcap offers the chance to work in a particular mode, the statistical mode
(MODE_STAT), in which packets are not captured and passed to a user-level
application, but rather counted.
From the BPF filters' point of view, WinPcap works in the same way as in capture
mode: each packet is filtered by analyzing its protocol headers. The value returned by
the filter (which is a sort of Boolean function applied to the packet) tells whether the
packet must be accepted or discarded. The difference between the original capture
mode of WinPcap (and of LIBPCAP) and the statistical mode comes after this phase: in
capture mode the accepted packet is copied into a kernel buffer and then passed to the
user-level application, while in statistical mode two counters, representing the number of
bytes and of packets, are incremented when the packet is accepted.
This mode performs a simple kernel level monitoring, that is to say packets are never
transferred up to user level, so the performances are generally excellent.
Another advantage of this working mode is that the filters are the same ones used in
capture mode; as a consequence, the compiler shipped with LIBPCAP (and therefore
with WinPcap) can be used seamlessly. The last advantage is that WinPcap is actually a
library, so users can write their own tools.
This working mode has a big drawback: it can take into account only one single
aspect of traffic at a time. If we want to analyze several aspects of network traffic (e.g.
three different TCP ports), we have to open several instances of the WinPcap driver,
each one considering a particular traffic feature (e.g. each instance filters a particular
TCP port). This approach does not scale when the number of driver instances grows, as
the overhead becomes non-negligible; moreover, the process of retrieving data from the
various driver instances to the application becomes quite time consuming, since a
context switch is involved for every single instance.
Another drawback (described in paragraph 3.3.1, too) is that it does not fit open
matrix statistics, in which the matrix dimension is not fixed.
11.2 NeTraMet
NeTraMet is free software developed by The University of Auckland's Information
Technology Systems & Services group, and it is the first implementation of the
Realtime Traffic Flow Measurement (RTFM) Working Group's Measurement
Architecture, outlined in RFC 1272, "Internet Accounting: Background".
It provides a means to obtain statistics about so-called flows, that is to say the
traffic, in terms of bytes and packets, from a source host to a destination host. The
sources and destinations of the flows can be identified at different OSI layers, in
particular at layer 2 (data-link), 3 (network) and 4 (transport).
NeTraMet is able to work over Ethernet at the data-link layer, over IP, DECnet phase IV,
IPX, Ethertalk and CLNS NSAP at the network layer, and over various transport
protocols, depending on the underlying network protocols.
It consists of two main parts, the meters, that interact directly with the network,
analyzing the packets flow, and the monitors, that provide an interface between the user
and the meters.
The meters are programmed through so-called rulesets; they are sequences of
instructions to be executed by the meters' Pattern Matching Engine.
NeTraMet is shipped with a compiler that is able to translate rulesets written in
SRL, the Simple Ruleset Language, a high-level procedural language which enables
users to create effective rulesets without having to understand the details of the meters'
architecture. The SRL syntax is documented in RFC 2723.
A big advantage of this architecture is that rulesets can be written using a high level
language; the second advantage is that the monitors are able to generate detailed reports
on the data retrieved by meters. The last advantage is that in theory meters and monitors
can work on different machines, and the communication is obtained through the SNMP
protocol.
One of the main disadvantages of the NeTraMet architecture is that it is able to deal
with flows only, that is to say it gives information relative to packets going from one
host to another; we cannot obtain statistics like "web traffic". Another issue is that, on
most of the supported platforms, it is based on LIBPCAP; as a consequence, it
captures all the packets, bringing them to user level to perform the needed processing.
This approach generally reduces performance greatly, since packets are copied at least
two times before reaching the so-called meters.
11.3 NTOP
NTOP is a free network tool developed by Luca Deri, that is based on LIBPCAP.
This tool is able to show several aspects of network usage, either at the command line or
through a Web interface.
The main information it can provide is:
• Data received and transmitted by each host of a network, divided into various
protocols, among which IP, TCP, UDP, IPX, ARP, ICMP, NetBios, OSPF, IGMP.
• Data transmitted over TCP and UDP, divided into application protocols, among
which FTP, Web, Telnet, Mail, NetBios, NFS.
• The average and peak throughput of each host.
• The hosts’ activity in the various moments of the day.
• Global statistics, for example the amount of broadcast relative to the overall traffic,
the average packet size.
• Matrices relative to data exchanged between the various hosts of the LAN.
• Some information about TCP sessions.
It is clear from the features described above that the strength of NTOP is that it
offers a variety of statistics about network traffic within a single tool; moreover, the
statistics are accessible with a standard Web browser, even remotely.
However, this tool has a big drawback, too: to obtain these statistics it captures the
first 350 bytes of each packet and brings them to user level; this approach has the
same consequences seen for NeTraMet: packets are copied several times before they are
processed into statistical data. In fact, the tool has proved to perform poorly on
heavily loaded networks, where the most frequent issue is packet loss. Moreover,
the user cannot define new statistics.
12 TME Extension Programming Manual
Some new instructions have been introduced into the original BPF architecture to deal
with the TME; basically, they can be divided into three main categories:
• extended memory management instructions
• TME operative instructions
• initialization instructions.
Each instruction is composed of four elements:
• code: it contains the opcode of the instruction;
• k: it contains a value passed to the instruction, for example in LOAD instructions it
can contain a constant or an address;
• jt: it contains the number of instructions to be skipped forward; it is used only in
JUMP on condition instructions; the program counter is incremented by this value
when the condition coupled with the instruction is true;
• jf: it is almost the same as jt, but the program counter is incremented when the
condition is false.
In the following chapters we will deal with each one, describing each opcode.
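The four-field layout above can be sketched as a C structure; the field names and sizes follow the classic BPF instruction format, which this extension inherits.

```c
#include <stdint.h>

/* Sketch of the four-field instruction layout described above; field
   sizes follow the classic BPF instruction format that the TME
   extension inherits. */
struct bpf_insn {
    uint16_t code; /* opcode of the instruction                              */
    uint8_t  jt;   /* added to the program counter if the condition is true  */
    uint8_t  jf;   /* added to the program counter if the condition is false */
    uint32_t k;    /* generic operand: a constant, an address or an offset   */
};
```

Each instruction therefore fits in 8 bytes, and conditional jumps always go forward, since jt and jf are unsigned offsets.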
12.1 Extended memory management instructions
The extended architecture is equipped with an additional memory area, the extended
memory, whose default size is 64 kB.
New instructions permit 8-, 16- and 32-bit loads and stores, from/to registers A and
X, in absolute mode (i.e. parameter k in the instruction contains the absolute address in
memory), and relative mode (i.e. k contains the offset in memory, base address is stored
in X); there is also an instruction to dynamically resize the extended memory.
Data is represented in big-endian mode, to preserve the order in which bytes traverse
the wire. A more detailed discussion on this problem can be found in paragraph 6.1.1.
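As a sketch of the big-endian convention just described, a 32-bit extended-memory load can be modeled as follows; the function name is illustrative, not part of the TME API.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative model of a 32-bit big-endian load from the extended
   memory: the byte at address k (the first one seen on the wire)
   becomes the most significant byte of the loaded word. */
static uint32_t ex_ld_be32(const uint8_t *mem, size_t k) {
    return ((uint32_t)mem[k]     << 24) |
           ((uint32_t)mem[k + 1] << 16) |
           ((uint32_t)mem[k + 2] <<  8) |
            (uint32_t)mem[k + 3];
}
```

This way the in-memory byte order of a captured packet field is preserved regardless of the endianness of the host CPU.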
The new instructions are
• SETMEMORY
• EX_LD
• EX_LDH
• EX_LDB
• EX_LD_REL
• EX_LDH_REL
• EX_LDB_REL
• EX_LDX
• EX_LDXH
• EX_LDXB
• EX_ST
• EX_STH
• EX_STB
• EX_ST_REL
• EX_STH_REL
• EX_STB_REL
• EX_STX
• EX_STXH
• EX_STXB
12.1.1 SETMEMORY
This instruction resizes the extended memory; k contains the new size, in bytes.
On failure (e.g. memory not available), the machine flow is interrupted and the virtual
processor returns 0 (see footnote 66).
NOTES:
• it is not possible to set a size of 0 bytes;
• when memory is resized, the content of memory, previous to the operation, is lost;
• SETMEMORY is not allowed in filter programs.
SETMEMORY (opcode BPF_MISC|BPF_TME|BPF_SETMEMORY): registers involved: none; allowed in filter programs: no; on failure: flow interrupted, return 0.
12.1.2 EX_LDx
These instructions load a value from memory into register A; k contains the absolute
address to load from; x identifies how many bytes to load: nothing means a 32-bit
word, h a 16-bit word stored in the least significant part of A, b an 8-bit word
stored in the least significant part of A.
EX_LD (opcode BPF_LD|BPF_MEM_EX_IMM|BPF_W): A ← M[k]..M[k+3]; registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0 (66).
EX_LDH (opcode BPF_LD|BPF_MEM_EX_IMM|BPF_H): A ← M[k]..M[k+1], upper 16 bits cleared; registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_LDB (opcode BPF_LD|BPF_MEM_EX_IMM|BPF_B): A ← M[k], upper 24 bits cleared; registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0.
66 The BPF virtual processor that executes the instructions returns 0 immediately if an error occurs, for example if the opcode is unknown or if an instruction tries to read or write a non-existent memory location.
12.1.3 EX_LDx_REL
These instructions load a value from memory into register A; here k contains the
relative address to load from, while register X contains the base address. x identifies
how many bytes to load: nothing means a 32-bit word, h a 16-bit word stored in the
least significant part of A, b an 8-bit word stored in the least significant part of A.
EX_LD_REL (opcode BPF_LD|BPF_MEM_EX_REL|BPF_W): A ← M[X+k]..M[X+k+3]; registers involved: A, X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_LDH_REL (opcode BPF_LD|BPF_MEM_EX_REL|BPF_H): A ← M[X+k]..M[X+k+1], upper 16 bits cleared; registers involved: A, X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_LDB_REL (opcode BPF_LD|BPF_MEM_EX_REL|BPF_B): A ← M[X+k], upper 24 bits cleared; registers involved: A, X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.1.4 EX_LDXx
These instructions are the same as EX_LDx, except that the value is stored in
register X.
EX_LDX (opcode BPF_LDX|BPF_MEM_EX_IMM|BPF_W): X ← M[k]..M[k+3]; registers involved: X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_LDXH (opcode BPF_LDX|BPF_MEM_EX_IMM|BPF_H): X ← M[k]..M[k+1], upper 16 bits cleared; registers involved: X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_LDXB (opcode BPF_LDX|BPF_MEM_EX_IMM|BPF_B): X ← M[k], upper 24 bits cleared; registers involved: X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.1.5 EX_STx
These instructions store a value from register A to memory; k contains the absolute
address where to store; x identifies how many bytes to store: nothing means a 32-bit
word, h means that the least significant 16 bits of A are stored, b means that the least
significant 8 bits of A are stored.
EX_ST (opcode BPF_ST|BPF_MEM_EX_IMM|BPF_W): M[k]..M[k+3] ← A; registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_STH (opcode BPF_ST|BPF_MEM_EX_IMM|BPF_H): M[k]..M[k+1] ← least significant 16 bits of A; registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_STB (opcode BPF_ST|BPF_MEM_EX_IMM|BPF_B): M[k] ← least significant 8 bits of A; registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.1.6 EX_STx_REL
These instructions store a value from register A to memory; k contains the relative
address where to store, and register X contains the base address; x identifies how many
bytes to store: nothing means a 32-bit word, h means that the least significant 16 bits of
A are stored, b means that the least significant 8 bits of A are stored.
EX_ST_REL (opcode BPF_ST|BPF_MEM_EX_REL|BPF_W): M[X+k]..M[X+k+3] ← A; registers involved: A, X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_STH_REL (opcode BPF_ST|BPF_MEM_EX_REL|BPF_H): M[X+k]..M[X+k+1] ← least significant 16 bits of A; registers involved: A, X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_STB_REL (opcode BPF_ST|BPF_MEM_EX_REL|BPF_B): M[X+k] ← least significant 8 bits of A; registers involved: A, X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.1.7 EX_STXx
These instructions are the same as EX_STx, except that they store the value in
register X.
EX_STX (opcode BPF_STX|BPF_MEM_EX_IMM|BPF_W): M[k]..M[k+3] ← X; registers involved: X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_STXH (opcode BPF_STX|BPF_MEM_EX_IMM|BPF_H): M[k]..M[k+1] ← least significant 16 bits of X; registers involved: X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_STXB (opcode BPF_STX|BPF_MEM_EX_IMM|BPF_B): M[k] ← least significant 8 bits of X; registers involved: X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.2 TME operative instructions
This class of instructions permits interaction with the TME core during the
filtering process; they are:
• LOOKUP
• EXECUTE
• SET_ACTIVE.
12.2.1 LOOKUP
This instruction performs a lookup in the table associated with the currently ACTIVE
TME data block; it is actually a front-end to a lookup-fcn, specifically the one whose
ID is stored in register TME_LOOKUP_CODE of the currently active TME data block.
The instruction is implemented as a jump: if the lookup-fcn returns TME_TRUE, the
instruction condition is true (so the program counter is incremented by the jt value of
the instruction); if it returns TME_FALSE, the condition is false (so the program
counter is incremented by the jf value). LOOKUP needs a key to perform its task; the
key must be stored in the extended memory, and k must contain the base address of the key.
If no TME data block is currently active, this instruction fails.
lookup-fcns usually save the address of the entry satisfying the key in the
TME_LAST_FOUND register of the currently ACTIVE TME data block, or
0xFFFFFFFF if the search was not successful. Nevertheless, it is up to the particular
lookup-fcn in use to behave this way.
LOOKUP (opcode BPF_MISC|BPF_TME|BPF_LOOKUP): registers involved: none; allowed in filter programs: yes; on failure: flow interrupted, return 0.
NOTES:
• at the moment, the normal_lut_w_insert lookup-fcn returns TME_FALSE if the
table is full (i.e. it is not possible to insert new entries), while the
normal_lut_wo_insert lookup-fcn returns TME_FALSE if the provided key was not found in the table;
• this instruction always refers to the ACTIVE TME data block;
• this instruction can be used during initialization phase to force the creation of some
entries in the lookup table.
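To make the LOOKUP contract concrete, here is a hedged sketch, in C, of a normal_lut_w_insert-style lookup-fcn. The types, the table size and the probing scheme are illustrative assumptions, not the actual TME implementation; only the contract (return TME_TRUE/TME_FALSE, update TME_LAST_FOUND, fail with 0xFFFFFFFF when full) follows the text above.

```c
#include <stdint.h>
#include <string.h>

#define TME_TRUE    1
#define TME_FALSE   0
#define LUT_ENTRIES 8   /* tiny table, for illustration only */

/* Illustrative stand-in for a TME data block: just the lookup table
   and a TME_LAST_FOUND-style register. */
typedef struct {
    uint32_t key[LUT_ENTRIES];  /* stored keys           */
    uint8_t  used[LUT_ENTRIES]; /* slot occupancy flags  */
    uint32_t last_found;        /* mimics TME_LAST_FOUND */
} tme_block_t;

/* normal_lut_w_insert-style lookup: find the key, inserting it into a
   free slot on a miss; TME_FALSE is returned only when the table is
   full, matching the note above. Linear probing stands in for the
   real rehashing scheme. */
static int lookup_w_insert(tme_block_t *b, uint32_t key) {
    uint32_t h = key % LUT_ENTRIES;
    for (uint32_t i = 0; i < LUT_ENTRIES; i++) {
        uint32_t slot = (h + i) % LUT_ENTRIES;
        if (b->used[slot] && b->key[slot] == key) {
            b->last_found = slot;            /* entry found       */
            return TME_TRUE;
        }
        if (!b->used[slot]) {                /* free slot: insert */
            b->used[slot] = 1;
            b->key[slot]  = key;
            b->last_found = slot;
            return TME_TRUE;
        }
    }
    b->last_found = 0xFFFFFFFF;              /* table full        */
    return TME_FALSE;
}
```

Note how a miss with a free slot still returns TME_TRUE: with this family of lookup-fcns, failure means "no room left", not "key absent".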
12.2.2 EXECUTE
This instruction executes the code associated with the entry stored at the address
pointed to by register TME_LAST_FOUND of the currently ACTIVE TME data block
(usually set by a previous call to LOOKUP); it is actually a front-end to an exec-fcn.
The exact operations it performs are:
1. if no block is currently active, fail;
2. look at the value stored in register TME_LAST_FOUND: it contains the address of
the entry found by the last call to LOOKUP, or 0xFFFFFFFF if no entry was found:
• if it contains 0xFFFFFFFF, execute exec-fcn whose ID is stored in register
TME_OUT_LUT_EXEC, on the first block of the shared_memory_segment
(the one reserved for this special case);
• otherwise, first check if it’s a valid address (it must be in the lut_segment),
then execute exec-fcn whose ID is stored at that location+4, on block whose
address is stored at that location (see Figure 48).
NOTE: a check is made on the validity of the block addresses (they must be in
the shared_memory_segment), prior to the call of the specific exec-fcn.
Figure 48 – Layout of a LUT entry: the location pointed to by TME_LAST_FOUND
(0x00004F30 in the example) holds the block address, and the next 32-bit word
(location + 4) holds the exec-fcn ID.
The exec-fcns may also need to receive some data, so k must contain the base
address of these data in the extended memory.
EXECUTE (opcode BPF_MISC|BPF_TME|BPF_EXECUTE): registers involved: none; allowed in filter programs: yes; on failure: flow interrupted, return 0.
NOTE: this instruction refers always to the ACTIVE TME data block.
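The dispatch steps above can be sketched as follows. The state structure, the bounds check and the count_packets exec-fcn are illustrative assumptions; only the two-word entry layout (block address, then exec-fcn ID) and the miss handling on shared memory block #0 follow the text and Figure 48.

```c
#include <stdint.h>

#define TME_NOT_FOUND 0xFFFFFFFFu

/* An exec-fcn receives the shared memory block to work on and the
   address k of its extra data in the extended memory. */
typedef void (*exec_fcn_t)(uint8_t *block, uint32_t k);

/* Illustrative subset of the TME state used by EXECUTE. */
typedef struct {
    uint32_t    last_found;   /* TME_LAST_FOUND: word index into lut[] */
    uint32_t    out_lut_exec; /* TME_OUT_LUT_EXEC: fcn for misses      */
    uint32_t   *lut;          /* lut_segment, pairs of 32-bit words    */
    uint32_t    lut_words;    /* size of the LUT segment, in words     */
    uint8_t    *shared_mem;   /* shared_memory_segment                 */
    exec_fcn_t *fcns;         /* table of registered exec-fcns         */
} tme_state_t;

/* count_packets-style exec-fcn used for illustration: bumps a counter
   at the start of the block. */
static void count_packets(uint8_t *block, uint32_t k) {
    (void)k;
    block[0]++;
}

static int tme_execute(tme_state_t *s, uint32_t k) {
    if (s->last_found == TME_NOT_FOUND) {
        /* Miss: run TME_OUT_LUT_EXEC on shared memory block #0,
           the block reserved for 'out-of-LUT' packets. */
        s->fcns[s->out_lut_exec](s->shared_mem, k);
        return 1;
    }
    /* Validity check: the entry must lie inside the lut_segment. */
    if (s->last_found + 1 >= s->lut_words)
        return 0;                                /* fail: flow interrupted */
    uint32_t block_addr = s->lut[s->last_found];     /* block address */
    uint32_t fcn_id     = s->lut[s->last_found + 1]; /* exec-fcn ID   */
    s->fcns[fcn_id](s->shared_mem + block_addr, k);
    return 1;
}
```

The real TME additionally checks that the block address falls inside the shared_memory_segment before calling the exec-fcn, as noted above.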
12.2.3 SET_ACTIVE
This instruction activates TME data block k. It fails if that block cannot be
activated, because either it was not validated during the initialization phase (i.e. bit k
in the VALIDATED_BLOCKS register is not set) or k is out of range (i.e. k is
greater than or equal to MAX_TME_DATA_BLOCKS).
This instruction is used in filter programs to switch between several active blocks.
Upon activation, register WORKING is set to k, too.
NOTE: when the machine is turned on, there is no block active.
SET_ACTIVE (opcode BPF_MISC|BPF_TME|BPF_SET_ACTIVE): registers involved: none; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.3 TME initialization instructions
The new co-processor, the TME, needs some instructions to be properly initialized; these
instructions have to be issued before the injection of the 'filter' program, during the
initialization phase; moreover, most of them can be used only in this phase (i.e. they are
not legal in filter code).
Here is the full list; the next paragraphs describe each of them in detail:
• SET_WORKING
• INIT
• VALIDATE
• GET_REGISTER_VALUE
• SET_REGISTER_VALUE
• SET_ACTIVE_READ.
12.3.1 SET_WORKING
This instruction sets register WORKING to k. It fails if k is out of range (k must be in
the range 0 to MAX_TME_DATA_BLOCKS-1). It can be issued only during the initialization
phase, since it must be usable to switch to a block that has not yet been validated (see
12.3.3).
SET_WORKING (opcode BPF_MISC|BPF_TME|BPF_SET_WORKING): registers involved: none; allowed in filter programs: no; on failure: flow interrupted, return 0.
12.3.2 INIT
This instruction initializes TME data block k with default values and makes it the
working block (i.e. sets register WORKING to k).
Here is a list of default values for each register:
REGISTER: DEFAULT VALUE
TME_LUT_ENTRIES: 32,007
TME_MAX_FILL_STATE: 15,000
TME_REHASHING_VALUE: 1
TME_KEY_LEN: 0 (1)
TME_SHARED_MEMORY_BLOCKS: 16,000
TME_FILLED_BLOCKS: 0
TME_BLOCK_SIZE: 64 bytes
TME_EXTRA_SEGMENT_SIZE: 0 bytes
TME_LOOKUP_CODE: default_lut_w_insert
TME_DEFAULT_EXEC: count_packets
TME_OUT_LUT_EXEC: count_packets
TME_LUT_BASE_ADDRESS: not defined (2)
TME_SHARED_MEMORY_BASE_ADDRESS: not defined (2)
TME_EXTRA_SEGMENT_BASE_ADDRESS: not defined (2)
TME_LAST_FOUND: 0
(1) The key length must be set upon initialization.
(2) These values are set when the block is validated (through instruction VALIDATE).
INIT (opcode BPF_MISC|BPF_TME|BPF_INIT): registers involved: none; allowed in filter programs: no; on failure: flow interrupted, return 0.
12.3.3 VALIDATE
This instruction validates the TME data block whose index is in register A, reserving
memory starting from address k; if the block is valid, it also makes it active (i.e. sets
register ACTIVE to A). This is the only instruction that can put a block into the pool of
validated ones (i.e. the only one that can set bit A in register VALIDATED_BLOCKS).
The exact operations it performs are:
1. Check that the registers contain consistent values, in particular
• TME_KEY_LEN > 0
• TME_LUT_ENTRIES > 0
• TME_SHARED_MEMORY_BLOCKS > 0
• TME_BLOCK_SIZE > 0
• TME_LOOKUP_CODE must exist
• TME_OUT_LUT_EXEC must exist
• TME_DEFAULT_EXEC must exist.
2. Check if there is enough memory; the amount of required memory is calculated
as follows:

required_memory = TME_LUT_ENTRIES * 4 +
                  TME_BLOCK_SIZE * TME_SHARED_MEMORY_BLOCKS +
                  TME_EXTRA_SEGMENT_SIZE

For example, if the default values are used (by issuing instruction INIT),
required_memory will be

required_memory = 32,007 * 4 + 64 * 16,000 + 0 = 1,152,028 bytes

Since these data are stored in the extended memory, the validation succeeds only if k
+ required_memory is less than or equal to the extended memory size. A small
example is shown in Figure 49.
3. If the previous steps succeed, the block is validated:
• bit A of register VALIDATED_BLOCKS is set to 1
• register ACTIVE is set to A
• register WORKING is set to A
• register TME_LUT_BASE_ADDRESS is set to k
• register TME_SHARED_MEMORY_BASE_ADDRESS is set to
k + 4*TME_LUT_ENTRIES
• register TME_EXTRA_SEGMENT_BASE_ADDRESS is set to
k + 4*TME_LUT_ENTRIES + TME_BLOCK_SIZE*TME_SHARED_MEMORY_BLOCKS
• register TME_FILLED_BLOCKS is set to 1 (this is because shared memory
block #0 is used for ‘out-of-counts’ packets).
Figure 49 – Validate memory check: starting from k = 0x00000040, the required
32,768 bytes (LUT_SEGMENT, SHARED_MEMORY_SEGMENT and EXTRA_SEGMENT) fit within
the 65,536-byte extended memory, so the validation succeeds.
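The memory check of step 2 can be reproduced with a few lines of C; the helper name is illustrative.

```c
#include <stdint.h>

/* Computes the amount of extended memory that VALIDATE reserves,
   following the formula given in step 2. */
static uint32_t required_memory(uint32_t lut_entries, uint32_t block_size,
                                uint32_t shm_blocks, uint32_t extra_size) {
    return lut_entries * 4u          /* lut_segment           */
         + block_size * shm_blocks   /* shared_memory_segment */
         + extra_size;               /* extra_segment         */
}
```

With the INIT defaults this yields 1,152,028 bytes, far more than the default 64 kB of extended memory, so such a block can only be validated after enlarging the extended memory with SETMEMORY.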
VALIDATE (opcode BPF_MISC|BPF_TME|BPF_VALIDATE): registers involved: A; allowed in filter programs: no; on failure: flow interrupted, return 0.
12.3.4 GET_REGISTER_VALUE
This instruction is used to retrieve the value of a register in the currently WORKING
TME data block (i.e. the one whose index is in the WORKING register).
It is possible to retrieve the value of any register except TME_LOOKUP_CODE.
This is due to the specific implementation of the TME and will be solved in future
releases. k contains the code of the register, and the retrieved value is saved in
register A. The instruction fails if k contains an unknown register code.
Valid register codes are
• TME_LUT_ENTRIES
• TME_MAX_FILL_STATE
• TME_REHASHING_VALUE
• TME_KEY_LEN
• TME_SHARED_MEMORY_BLOCKS
• TME_FILLED_ENTRIES
• TME_BLOCK_SIZE
• TME_EXTRA_SEGMENT_SIZE
• TME_FILLED_BLOCKS
• TME_DEFAULT_EXEC
• TME_OUT_LUT_EXEC
• TME_SHARED_MEMORY_BASE_ADDRESS
• TME_LUT_BASE_ADDRESS
• TME_EXTRA_SEGMENT_BASE_ADDRESS
• TME_LAST_FOUND_BLOCK.
NOTE: the exact values of these IDs are stored in a header file (*.h) which is included
by default in projects using this extension. Using the numeric IDs directly is strongly
discouraged, since they can change in future releases of the TME.
GET_REGISTER_VALUE (opcode BPF_MISC|BPF_TME|BPF_GET_REGISTER_VALUE): registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.3.5 SET_REGISTER_VALUE
This instruction is used to set a value in a specific register of the currently working
TME data block (i.e. the one whose index is in the WORKING register).
The value to be set must be stored in register A, and k contains the register code. Only
some registers can be set, and some of them can be set only during the initialization
phase, as explained in the following table.
For each register, the table below states whether it can be set in filter programs, whether it can be set during the initialization phase, and any additional constraints:
TME_LUT_ENTRIES: filter programs yes; initialization yes.
TME_MAX_FILL_STATE: filter programs yes; initialization yes.
TME_REHASHING_VALUE: filter programs yes; initialization yes.
TME_KEY_LEN: filter programs no; initialization yes. This register must be set before calling VALIDATE.
TME_SHARED_MEMORY_BLOCKS: filter programs no; initialization yes.
TME_FILLED_ENTRIES: filter programs yes; initialization yes.
TME_BLOCK_SIZE: filter programs no; initialization yes.
TME_EXTRA_SEGMENT_SIZE: filter programs no; initialization yes.
TME_FILLED_BLOCKS: filter programs yes; initialization yes. The register is set only if the new value is greater than the old one.
TME_DEFAULT_EXEC: filter programs yes; initialization yes. Only valid exec_fcn IDs are accepted.
TME_OUT_LUT_EXEC: filter programs yes; initialization yes. Only valid exec_fcn IDs are accepted.
TME_SHARED_MEMORY_BASE_ADDRESS, TME_LUT_BASE_ADDRESS, TME_EXTRA_SEGMENT_BASE_ADDRESS: the content of these registers cannot be changed; they are set by the VALIDATE instruction.
TME_LAST_FOUND_BLOCK: filter programs yes; initialization yes.
TME_LOOKUP_CODE: filter programs yes; initialization yes. Only valid lookup_fcn IDs are accepted.
Special care must be taken when changing register values after the TME data block has
been validated; although the TME core performs internal checks before every operative
instruction, putting malicious values in the registers can hang the entire processor.
SET_REGISTER_VALUE (opcode BPF_MISC|BPF_TME|BPF_SET_REGISTER_VALUE): registers involved: A; allowed in filter programs: yes, with the limitations listed in the table above; on failure: flow interrupted, return 0.
12.3.6 SET_ACTIVE_READ
This instruction sets which TME data block will be passed to the user. It sets register
ACTIVE_READ to k, but only if bit k in register VALIDATED_BLOCKS is set (i.e.
only if block k has been previously validated).
SET_ACTIVE_READ (opcode BPF_MISC|BPF_TME|BPF_SET_ACTIVE_READ): registers involved: none; allowed in filter programs: no; on failure: flow interrupted, return 0.
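Putting the initialization instructions together, a typical initialization program might look like the following sketch. The struct layout mirrors the classic BPF instruction format, but the numeric opcode and register-code values are placeholders (the real ones live in the extension's header file), so the sequence should be read for its order, not its constants.

```c
#include <stdint.h>

/* Placeholder opcode and register-code values: illustrative only, the
   real definitions come from the TME extension's header file. */
#define BPF_LD   0x00
#define BPF_IMM  0x00
#define BPF_MISC 0x07
#define BPF_TME  0x08
#define BPF_INIT               0x10
#define BPF_SET_REGISTER_VALUE 0x20
#define BPF_VALIDATE           0x30
#define BPF_SET_ACTIVE_READ    0x40
#define TME_KEY_LEN 0x03       /* register code, illustrative */

struct bpf_insn { uint16_t code; uint8_t jt, jf; uint32_t k; };

/* Initialization sequence for TME data block 0:
   1. INIT block 0 with default values;
   2. load the key length (8 bytes) into A and store it in TME_KEY_LEN,
      which must be set before VALIDATE;
   3. load the block index (0) into A and VALIDATE, reserving extended
      memory starting from address k = 0;
   4. mark block 0 as the one passed to the user (SET_ACTIVE_READ). */
static const struct bpf_insn tme_init_prog[] = {
    { BPF_MISC|BPF_TME|BPF_INIT,               0, 0, 0 },
    { BPF_LD|BPF_IMM,                          0, 0, 8 },
    { BPF_MISC|BPF_TME|BPF_SET_REGISTER_VALUE, 0, 0, TME_KEY_LEN },
    { BPF_LD|BPF_IMM,                          0, 0, 0 },
    { BPF_MISC|BPF_TME|BPF_VALIDATE,           0, 0, 0 },
    { BPF_MISC|BPF_TME|BPF_SET_ACTIVE_READ,    0, 0, 0 },
};
```

Note the two plain loads: SET_REGISTER_VALUE takes its value from A, and VALIDATE takes the block index from A, so A must be reloaded between the two.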
Summary of the new instructions (opcode; description; registers involved; whether the instruction is allowed in filter programs):
SETMEMORY (BPF_MISC|BPF_TME|BPF_SETMEMORY): resizes extended memory; registers: none; filter programs: no.
EX_LD (BPF_LD|BPF_MEM_EX_IMM|BPF_W): loads word from ext_mem to A; registers: A; filter programs: yes.
EX_LDH (BPF_LD|BPF_MEM_EX_IMM|BPF_H): loads half-word from ext_mem to A; registers: A; filter programs: yes.
EX_LDB (BPF_LD|BPF_MEM_EX_IMM|BPF_B): loads byte from ext_mem to A; registers: A; filter programs: yes.
EX_LD_REL (BPF_LD|BPF_MEM_EX_REL|BPF_W): loads word from ext_mem to A; registers: A, X; filter programs: yes.
EX_LDH_REL (BPF_LD|BPF_MEM_EX_REL|BPF_H): loads half-word from ext_mem to A; registers: A, X; filter programs: yes.
EX_LDB_REL (BPF_LD|BPF_MEM_EX_REL|BPF_B): loads byte from ext_mem to A; registers: A, X; filter programs: yes.
EX_LDX (BPF_LDX|BPF_MEM_EX_IMM|BPF_W): loads word from ext_mem to X; registers: X; filter programs: yes.
EX_LDXH (BPF_LDX|BPF_MEM_EX_IMM|BPF_H): loads half-word from ext_mem to X; registers: X; filter programs: yes.
EX_LDXB (BPF_LDX|BPF_MEM_EX_IMM|BPF_B): loads byte from ext_mem to X; registers: X; filter programs: yes.
EX_ST (BPF_ST|BPF_MEM_EX_IMM|BPF_W): stores word from A to ext_mem; registers: A; filter programs: yes.
EX_STH (BPF_ST|BPF_MEM_EX_IMM|BPF_H): stores half-word from A to ext_mem; registers: A; filter programs: yes.
EX_STB (BPF_ST|BPF_MEM_EX_IMM|BPF_B): stores byte from A to ext_mem; registers: A; filter programs: yes.
EX_ST_REL (BPF_ST|BPF_MEM_EX_REL|BPF_W): stores word from A to ext_mem; registers: A, X; filter programs: yes.
EX_STH_REL (BPF_ST|BPF_MEM_EX_REL|BPF_H): stores half-word from A to ext_mem; registers: A, X; filter programs: yes.
EX_STB_REL (BPF_ST|BPF_MEM_EX_REL|BPF_B): stores byte from A to ext_mem; registers: A, X; filter programs: yes.
EX_STX (BPF_STX|BPF_MEM_EX_IMM|BPF_W): stores word from X to ext_mem; registers: X; filter programs: yes.
EX_STXH (BPF_STX|BPF_MEM_EX_IMM|BPF_H): stores half-word from X to ext_mem; registers: X; filter programs: yes.
EX_STXB (BPF_STX|BPF_MEM_EX_IMM|BPF_B): stores byte from X to ext_mem; registers: X; filter programs: yes.
LOOKUP (BPF_MISC|BPF_TME|BPF_LOOKUP): performs a lookup in the ACTIVE block; registers: none; filter programs: yes.
EXECUTE (BPF_MISC|BPF_TME|BPF_EXECUTE): performs an execute in the ACTIVE block; registers: none; filter programs: yes.
SET_ACTIVE (BPF_MISC|BPF_TME|BPF_SET_ACTIVE): sets block k to ACTIVE; registers: none; filter programs: yes.
SET_WORKING (BPF_MISC|BPF_TME|BPF_SET_WORKING): sets block k to WORKING; registers: none; filter programs: no.
INIT (BPF_MISC|BPF_TME|BPF_INIT): initializes block k to default values; registers: none; filter programs: no.
VALIDATE (BPF_MISC|BPF_TME|BPF_VALIDATE): validates block A, reserving memory from address k; registers: A; filter programs: no.
GET_REGISTER_VALUE (BPF_MISC|BPF_TME|BPF_GET_REGISTER_VALUE): retrieves a register value from a TME block; registers: A; filter programs: yes.
SET_REGISTER_VALUE (BPF_MISC|BPF_TME|BPF_SET_REGISTER_VALUE): sets a register in a TME block; registers: A; filter programs: yes (67).
SET_ACTIVE_READ (BPF_MISC|BPF_TME|BPF_SET_ACTIVE_READ): sets block k to be read; registers: none; filter programs: no.
67 With some limitations, see paragraph 12.3.5.
13 Bibliography
[1] S. McCanne and V. Jacobson, The BSD Packet Filter: A New Architecture for User-
level Packet Capture. Proceedings of the 1993 USENIX Technical Conference (San
Diego, CA, Jan. 1993), USENIX.
[2] Gary R. Wright, W. Richard Stevens, TCP/IP Illustrated, Volume 2. Addison-
Wesley Professional Computing Series.
[3] Microsoft Software Development Kit and Driver Development Kit Examples,
Microsoft Corporation.
[4] Microsoft Windows NT and Windows 2000 Driver Development Kit
documentation, Microsoft Corporation.
[5] Peter G. Viscarola, W. Anthony Mason, Windows NT Device Driver Development,
Macmillan Technical Publishing.
[6] Microsoft MSDN Library, Microsoft Corporation, May 2001.
[7] A. Begel, S. McCanne, S. L. Graham, BPF+: Exploiting Global Data-flow
Optimization in a Generalized Packet Filter Architecture, 1999.
[8] T. H. Cormen, C. E. Leiserson, R. L. Rivest, Introduzione agli algoritmi Volume 1,
Jackson Libri, 1994.
[9] Donald E. Knuth, The Art of Computer Programming, Volume 3: Sorting and
Searching, Addison-Wesley, 1981.
[10] M. Roesch, Snort – Lightweight Intrusion Detection for Networks, presented at
the USENIX LISA conference in November 1999.
[11] N. Brownlee, C. Mills, G. Ruth, RFC 2063 - Traffic Flow Measurement:
Architecture, January 1997.
[12] WinPcap web site, available at http://www.netgroup.polito.it/winpcap
[13] IA-32 Intel Architecture Software Developer’s Manual, volume 2 Instruction Set
Reference, Intel Corporation, 2001.
[14] NeTraMet web site, available at
http://www2.auckland.ac.nz/net/Accounting/ntm.Release.note.html
[15] NTOP web site, available at http://www.ntop.org