Politecnico di Torino Facoltà di Ingegneria dell’Informazione
Corso di Laurea in Ingegneria Elettronica
An Architecture
for Unified Packet Filtering
Advisors: Prof. Patricia Lago, Prof. Fulvio Risso
Candidate: Gianluca Varenni
November 2001
Table of contents
1 INTRODUCTION
1.1 THE CONCEPT OF PACKET FILTERING
1.2 APPLICATIONS THAT EXPLOIT PACKET FILTERING
1.2.1 Capture systems
1.2.2 Monitoring systems and statistical engines
1.2.3 Firewalls
1.2.4 Intrusion Detection Systems (IDS)
1.3 A COMMON FEATURE: TABLE MANAGEMENT
1.4 THE WINPCAP CAPTURE SYSTEM
1.5 THE TABLE MANAGEMENT ENGINE (TME)
1.5.1 Interaction with the user
1.6 INTERACTION BETWEEN THE TME AND THE LIBPCAP LIBRARY
1.7 IMPLEMENTATION
1.8 PERFORMANCE
1.9 FUTURE WORK
2 INTRODUCTION
2.1 GOALS OF THIS WORK
2.2 CHAPTERS' ORGANIZATION
3 MONITORING NETWORKS
3.1 GENERAL FEATURES
3.2 TYPES OF MONITORING
3.2.1 Single aspects of traffic
3.2.2 Traffic matrices
3.2.3 Multiple analyses
3.2.4 Logging capabilities
3.2.5 Flow analyses
3.3 REALIZATION APPROACHES
3.3.1 N-dimensional matrices
3.3.2 Multiple analyses
3.3.3 General logging capabilities
3.3.4 Analyses on flows
3.4 CONCLUSIONS
4 FIREWALLS
4.1 INTRODUCTION
4.2 PACKET FILTERING APPROACH
4.3 STATEFUL INSPECTION APPROACH
4.4 ACCESS CONTROL LISTS
4.5 REQUIREMENTS
4.6 NETWORK INTRUSION DETECTION SYSTEMS
4.7 NETWORK ADDRESS TRANSLATORS
4.8 ACLS AND ROUTERS
5 THE ORIGINAL BPF ARCHITECTURE AND WINPCAP
5.1 COMPONENTS OF THE BPF ARCHITECTURE
5.1.1 Network tap and BPF virtual machine
5.1.2 Kernel packet buffer
5.1.3 LIBPCAP library
5.2 INTERACTION BETWEEN LIBPCAP AND THE BPF VIRTUAL MACHINE
5.3 IMPLEMENTATION UNDER MICROSOFT OPERATING SYSTEMS: WINPCAP
5.3.1 Differences from the original BPF architecture
6 THE TABLE MANAGEMENT ENGINE (TME)
6.1 LIMITS OF THE ORIGINAL BPF ARCHITECTURE
6.1.1 Small and limited memory
6.1.2 Impossibility to modify the packet
6.1.3 Simple instruction set
6.2 EXTENSIONS TO THE ORIGINAL BPF ARCHITECTURE
6.3 EXTENSION REQUIREMENTS
6.3.1 Proposed extensions
6.4 EXTENDED MEMORY
6.5 TME
6.5.1 A general overview of the LOOKUP operation
6.5.2 A general overview of the EXECUTE instruction
6.5.3 Lookup tables
6.5.4 LOOKUP functions
6.5.5 EXECUTE functions
6.5.6 TME data blocks
6.5.7 What happens when a LOOKUP is performed
6.5.8 What happens when an EXECUTE is performed
6.6 INTERACTION OF THE EXTENSIONS WITH THE ORIGINAL BPF ARCHITECTURE
6.6.1 Interaction with the extended memory
6.6.2 Interaction with the TME
6.7 INTERACTION WITH APPLICATIONS
6.8 INITIALIZATION OF THE TME
6.9 THIRD-PARTY EXTENSIONS
6.10 KNOWN LIMITATIONS
7 TME – THE IMPLEMENTATION
7.1 THE MODE_MON WORKING MODE
7.2 DATA STRUCTURES
7.2.1 TME_CORE
7.2.2 TME_DATA
7.2.3 RECORD
7.2.4 MEM_TYPE
7.3 NEW BPF INSTRUCTIONS
7.3.1 INIT
7.3.2 VALIDATE
7.3.3 LOOKUP
7.3.4 EXECUTE
7.4 LOOKUP CALLBACKS
7.4.1 Two examples: default_lut_w_insert and default_lut_wo_insert
7.4.2 Some tips for writing new LOOKUP callbacks
7.5 EXECUTE CALLBACKS
7.5.1 An example: count_packets
7.5.2 Some tips for writing new EXECUTE callbacks
7.6 INTERACTION WITH THE USER
7.6.1 Passing data to the user
7.6.2 Filter injection
7.7 IMPLEMENTATION ISSUES
8 INTERACTING WITH THE TME
8.1 THE SWITCH STATEMENT
8.2 INTERPRETING DATA RECEIVED BY THE TME COPROCESSOR
8.3 THE PCAP_KEYPARSER() FUNCTION
8.4 THE KEY PARSING PROCESS
9 A CASE STUDY
9.1 A SYSTEM TO MONITOR THE TRAFFIC SENT BY EACH HOST IN A NETWORK
9.2 A SYSTEM TO MONITOR WEB, FTP, SMTP AND TELNET TRAFFIC
9.3 A SYSTEM TO MONITOR IP, TCP AND UDP TRAFFIC
9.4 PERFORMANCE
9.4.1 The monitor system under test
9.4.2 Test bed
9.4.3 Test 1
9.4.4 Test 2
9.4.5 Test 3
9.4.6 Test 4
10 FUTURE WORK
11 RELATED WORK
11.1 THE STATISTICAL MODE OF WINPCAP DRIVERS
11.2 NETRAMET
11.3 NTOP
12 TME EXTENSION PROGRAMMING MANUAL
12.1 EXTENDED MEMORY MANAGEMENT INSTRUCTIONS
12.1.1 SETMEMORY
12.1.2 EX_LDx
12.1.3 EX_LDx_REL
12.1.4 EX_LDXx
12.1.5 EX_STx
12.1.6 EX_STx_REL
12.1.7 EX_STXx
12.2 TME OPERATIVE INSTRUCTIONS
12.2.1 LOOKUP
12.2.2 EXECUTE
12.2.3 SET_ACTIVE
12.3 TME INITIALIZATION INSTRUCTIONS
12.3.1 SET_WORKING
12.3.2 INIT
12.3.3 VALIDATE
12.3.4 GET_REGISTER_VALUE
12.3.5 SET_REGISTER_VALUE
12.3.6 SET_ACTIVE_READ
13 BIBLIOGRAPHY
1 Introduction

In recent years computer networks, and the Internet in particular, have spread worldwide; as a consequence, the traffic carried by computer networks has grown dramatically, and with it the demand for systems to build and maintain those networks, such as packet capture tools, statistical engines, firewalls and intrusion detection systems.

The market offers two kinds of solutions: systems based on dedicated hardware platforms, and software systems, i.e. tools that run on ordinary PCs or workstations and use their standard network connection to carry out their tasks. Hardware solutions generally deliver better performance, but they are usually limited in expandability and upgradability, and their cost is often too high for the actual needs of users. Software solutions, although they generally do not reach the performance of hardware ones, have great advantages, among them easy development, low cost, and very often an extreme versatility in adapting to the various network types on the market.

Most systems for building and analyzing computer networks share a common set of functions that goes under the name of packet filtering.

The goals of this thesis are:
• to identify the functions shared by network devices such as capture systems, monitoring and statistical tools, firewalls and intrusion detection systems;
• to propose a software architecture, integrated into an existing packet capture system – WinPcap – that allows the development of new network tools through an interface common to the various kinds of applications;
• to implement this architecture inside the packet capture driver at the core of the WinPcap architecture.
1.1 The concept of packet filtering

Packet filtering is the process by which the data traveling on a computer network, grouped into packets (i.e. blocks), are captured and analyzed, generally by checking the value of particular fields in the headers, the initial part of each packet; the headers carry the important information about the packet's content, for example who sent it and where it is going. The process is called filtering because the packets that do not pass the checks on the headers are discarded, while the others undergo further processing that is specific to each particular tool.
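The header checks just described can be sketched as a small C routine. This is a minimal sketch, not part of the architecture: the offsets assume Ethernet II framing and an IPv4 packet, and the function name and accepted address are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Return 1 (accept) if the frame carries IPv4 and its source address
 * matches `wanted`; return 0 (discard) otherwise. Offsets assume
 * Ethernet II framing: EtherType at byte 12, IP source at byte 26. */
static int filter_ipv4_src(const uint8_t *pkt, size_t len, uint32_t wanted)
{
    if (len < 30)                        /* too short to hold the fields */
        return 0;
    uint16_t ethertype = (uint16_t)(pkt[12] << 8 | pkt[13]);
    if (ethertype != 0x0800)             /* not an IPv4 packet */
        return 0;
    uint32_t src = (uint32_t)pkt[26] << 24 | (uint32_t)pkt[27] << 16 |
                   (uint32_t)pkt[28] << 8  | (uint32_t)pkt[29];
    return src == wanted;
}
```

A filter of this shape (check EtherType, then check an IP address) is exactly what the BPF programs shown later in this chapter express in virtual machine code.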
1.2 Applications that exploit packet filtering

1.2.1 Capture systems

These are usually the simplest systems: tools able to capture the traffic flowing on the network and hand it to the user for further processing or display. Software capture solutions usually consist of a low-level module, inserted into the operating system kernel, that sits close to the network adapter and obtains the packets from it, plus a module that acts as the interface between the first module and the applications.

To obtain good performance, the number of packets transferred to user level is usually kept as low as possible: packet filtering is used here to skim the packets crossing the network and pass to the user-level application only the packets of real interest. This approach works because an application rarely needs every packet for its processing.
1.2.2 Monitoring systems and statistical engines

These tools provide information about network traffic from several points of view, normally in terms of bytes and packets per second. We can identify four broad categories of statistics:
• simple statistics: those involving a single aspect of the network traffic, for example the number of IP packets or the bandwidth used by the Web;
• traffic matrices: for instance a matrix of the traffic exchanged among the hosts of a LAN, or the distribution of the protocols carried over TCP (e.g. Web, Telnet and FTP);
• multiple analyses: for example obtaining the total traffic, the IP traffic, and in particular the UDP traffic within the same chart;
• analyses on flows (e.g. TCP sessions).

It is fairly clear that, apart from the first, all these kinds of analysis need tables to represent the data, in which each cell represents a particular aspect of the traffic:
• for traffic matrices and multiple analyses, each cell of the table represents a single aspect of the traffic, as shown in Figure 1;
• for flow analyses, each cell represents a different flow, for example a particular TCP session.
Web (80):    packets=134  bytes=296485
Mail (25):   packets=93   bytes=59642
Telnet (23): packets=24   bytes=19215
FTP (21):    packets=16   bytes=4898

Figure 1 - A simple table for TCP traffic
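As a hedged sketch of what lies behind a table like Figure 1, each cell can be modeled as a record pairing a key (the TCP port) with its counters, updated by a classification routine called once per packet. All names here are illustrative, not taken from the actual tools.

```c
#include <stdint.h>

/* One table cell: a traffic aspect identified by its TCP port (the key)
 * plus the per-cell data (packet and byte counters). */
struct cell {
    uint16_t port;      /* key: TCP server port      */
    uint64_t packets;   /* data: packets seen so far */
    uint64_t bytes;     /* data: bytes seen so far   */
};

/* The four services of Figure 1, with counters starting at zero. */
static struct cell table[] = {
    {80, 0, 0}, {25, 0, 0}, {23, 0, 0}, {21, 0, 0},
};

/* Classify one packet: find the cell whose key matches the port and
 * update its counters. Returns 1 if a cell matched, 0 otherwise. */
static int account(uint16_t port, uint64_t len)
{
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (table[i].port == port) {
            table[i].packets += 1;
            table[i].bytes   += len;
            return 1;
        }
    }
    return 0;   /* traffic we are not monitoring */
}
```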
Existing tools are usually built on top of a packet capture system, from which they receive all the network traffic, and then perform the statistical processing entirely at user level. This solution, although simpler to implement, usually does not perform satisfactorily, because transferring all the network packets to user level is always very expensive: all capture systems are so-called 2-copy architectures, i.e. packets are copied twice before reaching the user, and this is generally a heavy cost.
1.2.3 Firewalls

These are systems placed between two computer networks, acting as the only point of contact between the two LANs. They are designed to isolate and protect each network from the other; not by chance, the term firewall means a wall against fire.

In general there are two kinds of firewalls: packet filtering firewalls, which work on single packets and are usually transparent to applications, and application gateways, which work at a higher level, essentially at session level, and are not transparent to applications (i.e. the applications must be modified to work with these devices).

Packet filtering firewalls analyze every packet arriving on a network interface and decide, according to user-defined rules, whether the packet must be retransmitted on another network interface or discarded. The user-defined rules are essentially conditions on the content of the packet headers, so the process just described is nothing but packet filtering.

The user-defined rules form the so-called ACLs, Access Control Lists: a list of simple, more or less homogeneous filters, each followed by an action to take if the packet satisfies that filter. An example of a simple ACL is the following:

1. ip source 130.192.3.0/24 accept
2. ip dest 130.192.44.25 accept
3. ip source 130.192.2.44 and dest 130.192.11.43 accept
4. all deny

This ACL accepts the packets coming from particular source IP addresses or directed to particular destination addresses, and discards all the others.
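The first-match evaluation of such an ACL can be sketched in C. This is a minimal model, assuming rules encoded as address/mask pairs (a mask of zero meaning "any"); the structure and function names are illustrative, and the sample list encodes the four rules above.

```c
#include <stdint.h>

/* One ACL entry: optional conditions on IP source/destination (a mask
 * of zero means "match any") followed by the action on a match. */
struct acl_rule {
    uint32_t src, src_mask;   /* source condition      */
    uint32_t dst, dst_mask;   /* destination condition */
    int      accept;          /* 1 = accept, 0 = deny  */
};

/* The four-rule example ACL from the text, addresses as 32-bit values. */
static const struct acl_rule example_acl[] = {
    {0x82C00300u, 0xFFFFFF00u, 0,           0,           1}, /* 130.192.3.0/24       */
    {0,           0,           0x82C02C19u, 0xFFFFFFFFu, 1}, /* dest 130.192.44.25   */
    {0x82C0022Cu, 0xFFFFFFFFu, 0x82C00B2Bu, 0xFFFFFFFFu, 1}, /* src+dest pair        */
    {0,           0,           0,           0,           0}, /* all deny             */
};

/* Evaluate the ACL top-down: the first matching rule decides.
 * If no rule matches at all, deny by default. */
static int acl_check(const struct acl_rule *acl, int n,
                     uint32_t src, uint32_t dst)
{
    for (int i = 0; i < n; i++) {
        if ((src & acl[i].src_mask) == acl[i].src &&
            (dst & acl[i].dst_mask) == acl[i].dst)
            return acl[i].accept;
    }
    return 0;
}
```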
1.2.4 Intrusion Detection Systems (IDS)

These systems are halfway between statistical engines and firewalls: they are used to protect computer networks, in particular by reporting attempted attacks against the hosts, like firewalls do, but they are not able to discard packets, and from the network's point of view they look more like very advanced monitoring systems. They can usually spot attack attempts through so-called signatures, i.e. predefined byte sequences that appear in 'malicious' packets. For example, an IDS can report the presence of a virus in a file attached to an e-mail message by searching for the virus signatures inside the payloads of e-mail packets (i.e. the TCP packets on the ports of the POP, SMTP and IMAP services).

These systems maintain a table of 'dangerous' signatures and check every packet for one of these sequences through pattern matching, i.e. a mechanism able to report whether a particular string is present inside a buffer of bytes.

From the user's point of view, an IDS can report the presence of 'suspicious' traffic in various ways, for example by sending a WinPopup message to a Windows station, or by keeping a log file of these attack attempts.
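The pattern-matching primitive mentioned above can be sketched as a naive byte-by-byte scan; this is an illustrative minimum, not what production IDSs do (they use faster multi-pattern algorithms such as Aho-Corasick), but the contract is the same: report whether a signature occurs inside a payload buffer.

```c
#include <stddef.h>

/* Report whether `sig` (length m) occurs anywhere inside the payload
 * buffer `buf` (length n). Naive O(n*m) scan; an IDS applies this
 * check for each signature in its table. */
static int sig_present(const unsigned char *buf, size_t n,
                       const unsigned char *sig, size_t m)
{
    if (m == 0 || m > n)
        return 0;
    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        while (j < m && buf[i + j] == sig[j])
            j++;
        if (j == m)
            return 1;   /* signature found */
    }
    return 0;
}
```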
1.3 A common feature: table management

One of the features shared by the tools mentioned above (apart from pure capture systems) is table management: monitoring systems need matrices to store the statistics about network traffic; firewalls need a way to keep the ACLs in memory, and these can be represented as a table; intrusion detection systems need a database in which to store the various attack signatures. Another important point is that each tool performs a different action after having classified the packet in the table: statistical systems must increment the counters for that kind of packet (e.g. the byte counter for Web packets), firewalls retransmit the packet on another interface or discard it, and IDSs must report the presence of 'suspicious' packets to the user.
1.4 The WinPcap capture system

WinPcap is a packet capture architecture for 32-bit Windows operating systems developed at the NetGroup of the Politecnico di Torino. It is based on the BPF architecture – the Berkeley Packet Filter – and on the LIBPCAP library, both developed at the University of California, Berkeley and included in the BSD operating systems.

It is a system that allows selective packet capture, i.e. a packet filtering mechanism transfers only the useful packets to the user-level applications.

The WinPcap architecture consists of three fundamental parts: a kernel driver and two dynamic-link libraries (DLLs).

At the lowest level, the system consists of a kernel driver that takes care of filtering the packets and storing them in a kernel buffer; the filtering system works through a virtual processor, the BPF processor, which runs a program, written in assembler-like code, on every packet received; this program analyzes the packet through its headers and decides whether the packet is accepted, and therefore must be transferred to the application, or must be discarded. The program run by the BPF machine, the so-called filter, is defined by the user.

At user level, WinPcap provides a dynamic-link library (wpcap.dll) that exports a set of APIs used for the interaction with the kernel driver; in particular, there are functions to open and close capture instances, to program the BPF machine with the appropriate filter, and of course to obtain the packets captured by the driver.

The library also includes a very powerful compiler that can generate the filters for the BPF machine, which are written in machine language, starting from text strings describing the traffic of interest; Figure 2 shows an example of a filter in 'human-readable' language and the corresponding filter for the BPF virtual machine.
"ip and src host 10.10.10.10"

(000) ldh [12]
(001) jeq #0x800 jt 2 jf 5
(002) ld [26]
(003) jeq #0x0A0A0A0A jt 4 jf 5
(004) ret #1514
(005) ret #0

[Figure: pcap_compile translates the text string into the BPF program above; pcap_setfilter loads it into the BPF virtual processor, which produces the packet output.]

Figure 2 - An example of a filter.
The second user-level library is packet.dll: it is an interface between the driver and wpcap.dll, used to make the latter independent of the particular version of Windows.

Figure 3 shows the architecture as a whole.
[Figure: the WinPcap architecture. At kernel level, the network tap feeds the packet capture driver, which runs the filter and stores accepted packets in the kernel buffer; at user level, Packet.dll and Wpcap.dll sit between the driver (accessed through read/write and IOCTL calls) and the application.]

Figure 3 - The WinPcap architecture
The BPF virtual machine has:
• two 32-bit registers, A and X;
• a small 32-bit RAM of 16 words;
• an implicit program counter.

Its instruction set allows:
• loads (LOAD) from the packet and from the RAM;
• stores (STORE) into the RAM;
• conditional and unconditional jumps (JUMP), forward only;
• arithmetic operations on integers;
• return instructions (RET); these instructions stop the execution of the BPF filter, and the returned value indicates whether the packet is accepted or not.
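To make this execution model concrete, here is a hedged miniature of a BPF-style interpreter, together with the Figure 2 filter encoded for it. The opcode names and encoding are illustrative, not the real BPF ones; as in BPF, jump offsets are relative and only point forward, and RET returns the snapshot length to accept (0 = discard).

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative opcodes, not the real BPF encoding. */
enum { OP_LDH, OP_LD, OP_JEQ, OP_RET };

struct insn { int op; uint32_t k; int jt, jf; };

/* Run a filter over a packet: A is the accumulator; pc only moves
 * forward, so every program terminates at a RET. */
static uint32_t run_filter(const struct insn *prog,
                           const uint8_t *pkt, size_t len)
{
    uint32_t A = 0;
    for (size_t pc = 0; ; pc++) {
        const struct insn *i = &prog[pc];
        switch (i->op) {
        case OP_LDH:  /* load 16-bit word at offset k (network order) */
            if (i->k + 2 > len) return 0;
            A = (uint32_t)pkt[i->k] << 8 | pkt[i->k + 1];
            break;
        case OP_LD:   /* load 32-bit word at offset k (network order) */
            if (i->k + 4 > len) return 0;
            A = (uint32_t)pkt[i->k] << 24 | (uint32_t)pkt[i->k + 1] << 16 |
                (uint32_t)pkt[i->k + 2] << 8 | pkt[i->k + 3];
            break;
        case OP_JEQ:  /* conditional forward jump, relative offsets */
            pc += (A == i->k) ? (size_t)i->jt : (size_t)i->jf;
            break;
        case OP_RET:
            return i->k;
        }
    }
}

/* The Figure 2 filter ("ip and src host 10.10.10.10"), with the
 * absolute jump targets of the listing converted to relative offsets. */
static const struct insn example_prog[] = {
    {OP_LDH, 12,         0, 0},  /* (000) ldh [12]                  */
    {OP_JEQ, 0x800,      0, 3},  /* (001) jeq #0x800 jt 2 jf 5      */
    {OP_LD,  26,         0, 0},  /* (002) ld [26]                   */
    {OP_JEQ, 0x0A0A0A0A, 0, 1},  /* (003) jeq #0x0A0A0A0A jt 4 jf 5 */
    {OP_RET, 1514,       0, 0},  /* (004) ret #1514 (accept)        */
    {OP_RET, 0,          0, 0},  /* (005) ret #0    (discard)       */
};
```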
1.5 The Table Management Engine (TME)

The BPF architecture has a rather limited instruction set when complex operations are needed; for example, the lack of backward jumps puts serious constraints on the implementation of tables. Another limitation concerns the memory: it is very small, and moreover it is addressed in 32-bit words, so it is not suitable for storing tables.

As a consequence, the original architecture of the virtual machine has been extended with two blocks: the extended_memory and the TME.

The extended_memory is a memory of user-definable size (64 kilobytes by default, but it can be changed at run time), addressable in 8-bit words. To manage this memory, new LOAD and STORE instructions have been added; in particular, there are instructions working at 8, 16 and 32 bits; the 16- and 32-bit values are represented in network byte order, to preserve the order in which the bytes appear in the packets.
The Table Management Engine (TME) is a coprocessor of the original BPF processor; its main function is to manage multidimensional tables in a completely generic, user-programmable way.

The two extensions are shown graphically in Figure 4.
[Figure: the BPF core, with its registers and basic memory, connected to the packet, to the extended_memory, and through a control interface to the TME coprocessor (the TME core and its registers).]

Figure 4 - The extensions to the BPF virtual machine
The TME coprocessor uses the extended memory to store lookup tables; each cell of a table is identified by its key, i.e. an identifier that distinguishes it uniquely. Each cell is associated with a block of memory, of the same size for every cell, in which data characteristic of that cell can be stored: for example, a statistical system stores in this block the packet and byte counters for the traffic identified by that key.

For example, suppose we want to build a system to obtain the traffic of the Web, Telnet, FTP and Mail services. The table we want to build is shown in Figure 5.

In this case the key is the TCP port of the desired service, and each block associated with a cell contains the two counters packets and bytes. To handle this situation, the BPF core builds the key by extracting the TCP port from the packet, then invokes the coprocessor through the LOOKUP instruction.
Web (80):    packets=134  bytes=296485
Mail (25):   packets=93   bytes=59642
Telnet (23): packets=24   bytes=19215
FTP (21):    packets=16   bytes=4898

Figure 5 - A simple table
This instruction takes the key and searches the table, returning to the original BPF machine the index of the table cell whose key equals the one supplied to the LOOKUP.

At this stage the TME coprocessor does not increment the counters; that is done through the EXECUTE instruction: when this instruction is executed, the BPF core hands control to the TME, which increments the counters of the cell found during the last execution of the LOOKUP instruction.
The LOOKUP instruction: the way the TME handles the LOOKUP instruction is actually quite particular: the table is managed through callbacks, i.e. user-defined functions. These are simply functions written in ANSI C that are transparently inserted into the sources of the TME extension and compiled together with the kernel driver (WinPcap) that hosts the coprocessor. Predefined functions can be used, or new functions can be created and added to the architecture in a completely transparent way. The choice of the specific function the TME will use when the LOOKUP instruction is executed is made in a preliminary phase, before the system is put into operation. If no specific callback is chosen during the initialization phase, the architecture uses a predefined function that manages the table as an open-addressing hash.
During the initialization phase it is also possible to define other table-management parameters, such as the table size and the key length.
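The default open-addressing behavior just described can be sketched in ANSI C as follows. This is a minimal illustration assuming linear probing and a trivial hash: the names tme_entry and tme_lookup, the hash function and the probing scheme are invented for the example, not taken from the actual TME sources.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 32007u   /* an odd table size improves the hash distribution */
#define KEY_WORDS  2        /* key length: two 32-bit words (8 bytes) */

typedef struct {
    uint32_t key[KEY_WORDS];
    int      used;          /* 0 = void entry */
    uint64_t packets;       /* data block: default counters */
    uint64_t bytes;
} tme_entry;

/* Returns the index of the entry whose key matches, claiming the first
 * void slot met while probing if the key is new; -1 if the table is full. */
static int tme_lookup(tme_entry *table, const uint32_t *key)
{
    uint32_t h = (key[0] ^ key[1]) % TABLE_SIZE;   /* trivial hash, for illustration */
    for (uint32_t i = 0; i < TABLE_SIZE; i++) {
        uint32_t idx = (h + i) % TABLE_SIZE;       /* linear probing */
        if (!table[idx].used) {                    /* void entry: insert the key */
            memcpy(table[idx].key, key, sizeof(table[idx].key));
            table[idx].used = 1;
            return (int)idx;
        }
        if (memcmp(table[idx].key, key, sizeof(table[idx].key)) == 0)
            return (int)idx;                       /* keys match */
    }
    return -1;                                     /* table full */
}
```

A second lookup with the same key returns the same index, which is what allows the subsequent EXECUTE to act on the right cell.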
The EXECUTE instruction: like LOOKUP, this instruction is also handled through callbacks; in this way different actions can be associated with the table, depending on its specific use.
In practice, after invoking the LOOKUP instruction to search the table managed by the TME, the EXECUTE instruction is invoked to perform a specific operation on the cell found by the LOOKUP.
The simplest EXECUTE, which is the predefined one, uses the memory block associated with the key as two 64-bit counters representing the bytes and the packets, corresponding to that key, that have crossed the network.
As with the LOOKUP instruction, the callback to use is chosen during the initialization phase; otherwise the predefined function is used, which maintains the packet and byte counters (as in the example of Figure 5).
The process carried out by the LOOKUP and EXECUTE instructions is shown in Figure 6 and Figure 7.
Figure 6 – The LOOKUP instruction. (1) The BPF core copies the key from the packet into the extended_memory; (2) it issues the LOOKUP call; (3) the TME core searches the lookup table, comparing the key against each entry ('right' means the keys match, 'wrong' that they do not; VOID marks a void entry); (4) the TME core saves the address of the found entry in a register of the TME; (5) the TME core gives control back to the BPF core.
Figure 7 – The EXECUTE instruction. (1) The BPF core issues the EXECUTE call; (2) the TME core reads the lookup-table entry address from a register of the TME; (3) it executes the exec_fcn on the data present in that entry (in the example, Packets goes from 175 to 176 and Bytes from 55434 to 55758); (4) the TME core gives control back to the BPF core.
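The default EXECUTE action shown in Figure 7 reduces to a couple of counter updates. A minimal sketch in ANSI C follows; tme_counters, tme_execute_default and last_found are illustrative names standing in for the TME's internal state, not the actual driver API.

```c
#include <stdint.h>

typedef struct {
    uint64_t packets;   /* 64-bit packet counter */
    uint64_t bytes;     /* 64-bit byte counter   */
} tme_counters;

/* last_found mimics the TME register holding the index returned by the
 * most recent LOOKUP; a negative value means the LOOKUP did not succeed. */
static void tme_execute_default(tme_counters *table, int last_found,
                                uint32_t packet_len)
{
    if (last_found < 0)
        return;                        /* nothing to update */
    table[last_found].packets += 1;
    table[last_found].bytes   += packet_len;
}
```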
The choice of separating the table search, i.e. the task performed by the LOOKUP, from the action performed on the single cell, i.e. what happens when the EXECUTE is invoked, is deliberate: in this way the same table-management algorithm can be reused with different EXECUTE functions, and vice versa: if the table management changes (for example, if the hash algorithm is modified), the EXECUTE functions do not need to change.
The system can handle multidimensional tables by collapsing them into one-dimensional tables: in this case the labels identifying each dimension of the table are concatenated to form a single label. For example, a two-dimensional table representing the traffic between the hosts of a computer network can be collapsed into a one-dimensional table in which each cell has a key representing the ordered sequence of the source and destination addresses. This technique also makes it easier to handle sparse tables (i.e. tables with many empty cells), since only the cells actually in use need to be stored.
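The collapsing of a two-dimensional label into a single key amounts to a simple concatenation; a minimal illustration follows, where flat_key and make_key are invented names for the example.

```c
#include <stdint.h>

/* One 8-byte key obtained by concatenating the two dimension labels. */
typedef struct {
    uint32_t words[2];
} flat_key;

static flat_key make_key(uint32_t src_addr, uint32_t dst_addr)
{
    flat_key k;
    k.words[0] = src_addr;   /* label of the first dimension  */
    k.words[1] = dst_addr;   /* label of the second dimension */
    return k;
}
```

The resulting key is what the LOOKUP hashes, so only the (src, dst) pairs actually seen on the network occupy a cell.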
Another peculiarity of the architecture is that the TME can manage several tables at the same time; the only limit is the size of the extended_memory, since the tables are stored in this memory.
1.5.1 Interaction with the user
The interaction between the TME and the user can be divided into two phases: on one side the TME must be properly initialized and a suitable BPF filter must be used; on the other side the data contained in the table must be transferred to the user for subsequent post-processing.
As far as initialization is concerned, the TME is pre-programmed through a set of new BPF instructions that tell the TME which table it has to manage; this initialization code is passed to the BPF virtual machine in the same way as traditional BPF filters. A very simple example of initialization code is the following:
1. SET_MEMORY 1500000
2. INIT 0
3. LD 2
4. SET_REGISTER_VALUE TME_KEY_LEN
5. LD 8
6. VALIDATE 0
7. SET_ACTIVE_READ 0
8. RET 1
This initialization code creates a table with 32007¹ cells, managed with the default LOOKUP and EXECUTE functions; the key is 2 words of 32 bits (= 8 bytes) long.
The BPF filter, instead, must build the key and then invoke the LOOKUP and EXECUTE instructions. The key is built by extracting some fields from the packet and saving them in a particular memory location in the extended memory. An example is the following:
1. LD P[6]
2. EX_ST MEM_EX[0]
3. LDH P[10]
4. EX_STH MEM_EX[4]
5. LOOKUP
6. EXECUTE
7. RET 1
This filtering code uses the source MAC address as the key.
As far as the transfer of data to user level is concerned, the TME passes the whole table to the applications, without modifications; if several tables are used at the same time, during the initialization phase it is necessary to choose which one must be transferred to the user (it is not possible to transfer more than one table to the user).
1.6 Interaction between the TME and the LIBPCAP library
The LIBPCAP library, and the corresponding DLL in the Windows environment, wpcap.dll, includes a compiler that can generate BPF packet-capture filters starting from text strings, e.g. 'ip'; to generate BPF filters, and the corresponding initialization code, that exploit the TME extension, the compiler must be modified.
Since the TME architecture is generic, we limited ourselves to proposing an extension to the compiler that allows complex monitoring systems to be created.
In particular, the extended compiler can translate into BPF instructions a new construct, switch, which is used to declare a monitoring system.
1 The value is not random: it is an odd number, which guarantees better performance when the table is managed as a hash.
An example of the switch construct, declaring the statistical system presented in Figure 5, is the following:
switch(tcp src port or tcp dst port= 80, 21, 23, 25)
This construct, fed to the extended compiler, creates a table of 4 cells, in which each cell represents the traffic of the TCP ports of the Web, FTP, Telnet and SMTP services. Upon receiving this string, the extended compiler generates the BPF filter (which builds the key and invokes the LOOKUP and the EXECUTE) and the TME initialization code.
As far as the transfer of the table to user level is concerned, each cell is uniquely identified by its key, which is just a sequence of bytes; an API is therefore needed that can interpret this sequence of bytes, in practice translating it into a readable format: pcap_parsekey(). For this reason the extended compiler, besides generating BPF code, also provides a data structure that pcap_parsekey uses to translate the keys into a readable format.
Such a solution is necessary because the keys are generated directly by the BPF code produced by the compiler, which is consequently the only one that knows how to translate the keys back into a readable format.
A graphical description of the key-interpretation system is shown in Figure 8.
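Conceptually, the translation performed by pcap_parsekey() can be sketched like this. The actual layout of struct bpf_keyparser is not shown in this chapter, so the descriptor below (key_descriptor, carrying a printf-style format) is purely an assumption for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative stand-in for the structure emitted by the extended compiler. */
struct key_descriptor {
    const char *format;   /* e.g. "tcp src port or tcp dst port %u" */
};

/* Turn the raw key bytes into readable text using the descriptor. */
static void parse_key(const struct key_descriptor *d,
                      uint32_t key, char *out, size_t outlen)
{
    snprintf(out, outlen, d->format, key);
}
```

Under this assumption, the key 0x00000050 combined with the descriptor produced for the switch example above would yield the readable string "tcp src port or tcp dst port 80".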
Figure 8 – Decoding the keys. The switch construct (e.g. SWITCH(tcp src port=21,23,25,80)) is fed to the extended compiler, which produces both the BPF filter executed by the BPF + TME machine and a struct bpf_keyparser; pcap_parsekey() uses this structure to parse the keys of the TME data returned to the user.
For example, in the case above, pcap_parsekey() can perform the following translation (the key is the number 80 in hexadecimal on 32 bits):
key=0x00000050 → “tcp src port or tcp dst port 80”
1.7 Implementation
The TME code was written entirely in ANSI C, since this is in practice the only programming language usable to write code that runs at kernel level; specifically, the code was integrated into the WinPcap protocol driver, which hosts the BPF virtual machine.
A new working mode – MODE_MON – was added to allow the TME tables to be transferred to user level; from the point of view of the BPF machine, in fact, the TME coprocessor can always be used, but its data is brought to user level only in MODE_MON mode.
The transfer of data from the TME coprocessor to the applications uses a 1-copy approach: the kernel-level data contained in the TME table is copied into a buffer allocated by the application. Another possible implementation was to use a buffer shared between the TME and the application; this solution was discarded because its performance advantages were far smaller than the implementation and concurrency problems it caused.
The main implementation problems were caused by the particular family of operating systems, Windows; the main problem we had to face concerns timestamps: each cell of the TME tables contains a timestamp of the last time the cell was updated. The only API provided by the Microsoft development tools that guarantees microsecond-level precision is KeQueryPerformanceCounter, which however has a very heavy drawback: it takes about 3 µs to obtain the result. As a consequence, this function is a very tight bottleneck for the performance of the architecture. An alternative solution is to use an Assembler instruction available on all 32-bit Intel processors, RDTSC; this instruction returns the number of clock cycles elapsed since the machine was switched on, so it is extremely precise, and moreover it executes in a single clock cycle, so it is also extremely fast. The use of this Assembler instruction, however, poses some problems:
• The processor frequency must be known with good precision, and Windows systems do not provide this information.
• This instruction is available only on Intel processors (and on AMD 'clones' such as the Athlon processor).
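The conversion from a raw RDTSC tick count to the usual microsecond timestamps can be sketched as follows. This is a simplified illustration: cpu_hz is exactly the value that is hard to obtain precisely under Windows, and the 64-bit integer division below is the operation that makes timestamp generation costly on 32-bit processors.

```c
#include <stdint.h>

/* Convert a raw tick count into microseconds, given the CPU frequency.
 * The 64-bit division is correct but slow on 32-bit CPUs. */
static uint64_t ticks_to_us(uint64_t ticks, uint64_t cpu_hz)
{
    return ticks / (cpu_hz / 1000000u);
}
```

For example, 500 ticks on a 500 MHz processor correspond to 1 µs.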
1.8 Performance
To measure the performance of the proposed architecture, a monitoring system based on the TME extension was built; it manages a table whose key is the ordered sequence of the source and destination IP addresses and the source and destination TCP ports.
The table was initialized with the default values: in practice it is a table of 32007 cells managed as a hash, and the EXECUTE function is the one that counts the packets and bytes of each single cell.
The test platform is based on two PCs with Intel processors, running the English version of Windows 2000 Service Pack 2. Both PCs mount a copper Gigabit Ethernet network adapter, connected through a crossover UTP cable. The TCP/IP protocol was disabled on both network adapters.
This configuration is shown in Figure 9.
Figure 9 – Test platform. Two PCs connected by a full-duplex 1 Gbps Ethernet link, one acting as traffic generator (sender) and one as receiver: a Pentium III 500 MHz (256 MB RAM, 13 GB HD, 3Com Gigabit Ethernet 3C996T, Windows 2000 Professional, Eng) and a Pentium II 400 MHz (128 MB RAM, 6.4 GB HD, 3Com Gigabit Ethernet 3C996T, Windows 2000 Server, Eng).
The machine used as packet generator can transmit a stream of identical IPv4 packets carrying a TCP payload, at a constant, predefined packet rate, through a modified version of the WinPcap driver itself. The packets have a fixed size of 100 bytes.
The measurements were made using the Windows 2000 Task Manager for the CPU-load figures, and some special counters available on Intel Pentium processors for the processor-cycle figures.
Test 1: this first test aims to give an overall picture of the performance of the system, in terms of the maximum packet rate it can handle.
In this case bursts of several million identical packets were sent at different packet rates, and the CPU load was measured. The result is shown in Figure 10. The test was run using the version of the drivers that uses the RDTSC instruction to generate the timestamps.
First of all, it must be noted that the packet generation system could not generate more than about 150,000 packets per second, even though the CPU load never reached 100%; this may be due to some limit of the network adapters, or to some problem in the adapter drivers or in WinPcap.
It is interesting to note that the CPU load grows linearly with the packet rate; through a simple theoretical calculation it can be deduced that the test machine would be able to handle a maximum packet rate of about 225,000 packets per second.
Figure 10 – CPU load vs. packet rate. The experimental curve stops at about 149,300 packets/s, the maximum rate the generator could sustain; the theoretical linear extrapolation reaches 100% CPU load at about 225,000 packets/s.
Test 2: the second test aims to highlight the contribution of each component to the performance of the TME coprocessor. Using the counters of the Intel Pentium processors, the clock cycles required by each packet-processing phase were measured. The tests were run both with the driver version that uses RDTSC to generate the timestamps and with the version that uses the KeQueryPerformanceCounter API.
The results are shown in Figure 11 and Figure 12.
Figure 11 – Clock cycles of each packet-processing phase, with timestamps generated through the RDTSC instruction. Phases: underlying drivers, key generation, LOOKUP without timestamp, timestamp, EXECUTE; the measured slices are 418, 146, 296, 124 and 1350 cycles. Total CPU ticks = 2334.
Figure 12 – Clock cycles of each packet-processing phase, with timestamps generated through the KeQueryPerformanceCounter API. Phases: underlying drivers, key generation, LOOKUP without timestamp, timestamp, EXECUTE; the measured slices are 418, 146, 2156, 124 and 1350 cycles. Total CPU ticks = 4194.
Some considerations can be drawn from these charts:
• Computing the timestamps with KeQueryPerformanceCounter is almost ten times slower than computing them with the RDTSC instruction; the use of the Windows API is therefore definitely a bottleneck for the performance of the system.
• Even when the RDTSC instruction is used, timestamp generation takes about 10% of the total; this is because some 64-bit integer divisions are needed anyway to obtain the timestamps in the usual format used by WinPcap/LIBPCAP, and such operations are in general very slow on 32-bit processors.
• It is interesting to note that, in the case of Figure 11, more than half of the clock cycles needed to process the packet are used by the drivers underlying WinPcap.
• With a 500 MHz processor (like the one of the test machine), the minimum packet inter-arrival time must be at least 4.7 µs (using the 'RDTSC' version), so the maximum packet rate the system can process is about 215,000 packets per second, a result comparable to the one obtained theoretically in Test 1.
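As a sanity check of those figures, a minimal calculation with the Test 2 numbers (2334 cycles per packet on a 500 MHz CPU) reproduces both the minimum inter-arrival time and the maximum packet rate quoted above:

```c
#include <stdint.h>

/* Minimum packet inter-arrival time, in microseconds:
 * 2334 cycles / 500 MHz = 4.668 us, i.e. the ~4.7 us quoted above. */
static double min_interarrival_us(uint64_t cycles_per_packet, double cpu_mhz)
{
    return (double)cycles_per_packet / cpu_mhz;
}

/* Maximum sustainable packet rate, in packets per second:
 * 500e6 Hz / 2334 cycles ~= 214,000 packets/s, i.e. the ~215,000 above. */
static double max_packet_rate(uint64_t cycles_per_packet, double cpu_hz)
{
    return cpu_hz / (double)cycles_per_packet;
}
```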
1.9 Future developments
At the time of writing this thesis, the TME architecture is in some respects still immature: the TME core has been written in full, but other features still need to be refined.
In particular, the following changes and additions are being studied:
• New LOOKUP and EXECUTE callbacks; in particular, it would be interesting to implement a new LOOKUP that handles the ACLs of a firewall efficiently. As for the EXECUTE callbacks, a function currently exists that can track TCP sessions, recognizing the bytes transmitted and received in the TCP payload and the connection state through the sequence numbers and the flags. This function is however rather rough (it cannot handle all the possible cases of session management, and it does not comply with the latest versions of TCP, such as TCP SACK). Moreover, a LOOKUP function, and possibly an EXECUTE, still have to be added to perform pattern matching, which is necessary to implement IDS systems.
• Building a version of the driver that buffers packets at kernel level. This solution could guarantee better performance during network bursts.
• Solving the timestamp problems once and for all: at the moment the best solution is certainly the one using the RDTSC Assembler instruction; the only remaining problem is computing the processor frequency in a precise and elegant way.
• Implementing logging functions inside the extended WinPcap architecture (the one using the TME).
• Allowing 'on-the-fly' modification of the tables managed by the TME extension: at the moment the tables created by the TME cannot be modified; the system must be stopped and the TME coprocessor re-initialized.
2 Introduction
In recent years we have witnessed an enormous spread of computer networks, in particular of the Internet, which has entered every environment, from industry to schools.
In this scenario, the complexity of networks grows every day to meet the needs of end users; as a consequence, there is an increasing need for network tools and components to monitor, protect and enhance the functionality and security of networks.
The main tools required can be summarized in three categories:
• statistical and monitoring tools: systems able to give information about network traffic at all levels, from the overall throughput to the number of TCP sessions opened per second.
• components to protect the network: tools able to filter the network traffic directed to a LAN in order to prevent attacks against its hosts. The main tools in this category are firewalls and Intrusion Detection Systems (IDS).
• general capturing tools: components able to capture all the traffic flowing through the network, transferring the data to user applications for subsequent processing.
Two approaches are usually available for these network tools: the former is based on special-purpose hardware, while the latter uses standard PCs or workstations to perform its tasks, so the processing is mainly done in software.
The hardware solution usually has better performance, since it is a dedicated architecture, but has the disadvantage of being hard to upgrade and modify.
The software solution performs worse, particularly on slow machines, but it is cheaper, faster to develop, and easy to upgrade. For this reason most network tools use a software approach to perform these tasks.
2.1 Goals of this work
This work originates from the observation that the above-mentioned network tools share a common part, so-called packet filtering, that is to say the process by which each network packet is analyzed by looking at the values contained in the various protocol headers, processing the packets that satisfy some criteria, that is to say that contain some particular values in their headers.
The goals of this thesis are to:
• point out which features are needed by each network tool
• propose an architecture to perform all these tasks in a coherent and unified way
• develop a high-performance architecture able to perform these tasks.
The first goal aims to point out that many network tools share common features, among which the most outstanding are packet filtering and lookup-table management.
The second goal proposes an extension to the original BPF/LIBPCAP architecture (in particular its Windows porting, WinPcap), so that these features can be obtained in an easy way, without caring about the low-level details (for example the specific algorithms used to handle the lookup tables). Moreover, the proposed architecture aims to be fully integrated with the existing WinPcap API, so that programmers can integrate these new features into their tools seamlessly.
Last but not least, the implementation of such an architecture must be realized in such a way that performance is good even over high-throughput networks, for example up to Gigabit Ethernet LANs.
2.2 Chapters’ organization
This work is organized into several chapters plus one appendix. The following list gives a concise description of each of them:
• Chapter 2 (this chapter): gives a brief overview of the subject of this thesis and its goals.
• Chapter 3: gives a detailed description of all the issues related to monitoring and statistical tools; it also gives some hints on how to realize such systems efficiently under WinPcap.
• Chapter 4: gives an overview of the problems that arise when dealing with this sort of network component; it also describes what intrusion detection systems are.
• Chapter 5: describes the original BPF architecture, developed by S. McCanne and V. Jacobson, and the related high-level library, LIBPCAP. There is also a fairly detailed description of the Windows porting of this architecture, particularly highlighting the differences from the original system.
• Chapter 6: presents the proposed extension to the BPF, the Table Management Engine (TME), a virtual co-processor that classifies packets using lookup tables.
• Chapter 7: describes the actual implementation of the proposed extension in the existing WinPcap architecture, dealing with data structures, main functions, and implementation issues.
• Chapter 8: describes the way user-level applications interact with the TME, in particular how they create the BPF filters and how they interpret the data received from the TME.
• Chapter 9: gives some examples of the use of the proposed extension to perform so-called kernel monitoring, that is, some examples of monitoring/statistical tools that work mainly at kernel level using the TME. This chapter also discusses the performance of the extension when used to monitor network traffic.
• Chapter 10: describes the possible future developments of the proposed architecture.
• Chapter 11: describes some existing tools that perform network traffic monitoring.
• Chapter 12: this appendix is the programming manual of the extension; it describes the BPF instruction set used to make the TME extension work.
3 Monitoring networks
In recent years the enormous spread of the Internet has produced an increasing demand for systems devoted to network statistics; by this term we mean every sort of information regarding what happens on the network cable, from the number of collisions on an Ethernet LAN to the throughput of a Web session. In this context we will deal with the statistics that can be inferred from the data-link layer up to the application layer, in particular so-called 'traffic monitoring', that is, statistics regarding the bytes transferred; we will not analyze information regarding the physical layer of the OSI model.
These systems are needed to perform various tasks, among which:
• improving network topology and services, on the basis of the bandwidth occupation of each network protocol;
• pointing out abnormal situations, for example a large amount of broadcast traffic;
• finding suspicious traffic generated by some hosts; for example, a host that generates a large amount of outbound traffic may be the symptom of a 'hacked' station trying to perform attacks on other machines²;
• generating traffic logs for billing purposes;
• pointing out the bottlenecks of the network.
In the following paragraphs we will deal with the requirements of such systems and how they can be met.
3.1 General features
Monitoring systems must be able to perform a set of ordinary tasks to achieve the functionalities previously highlighted; they all share a common trait: they need to classify packets depending on some fields of the protocol headers. For example, if we want to create a table of network usage divided by application protocol (e.g. Web, Telnet, FTP and Mail traffic), we need to classify packets on the basis of the TCP port and then maintain a byte counter for each protocol.
Therefore we have a common part, in which packets must be interpreted, parsed, in order to decide which protocols they are carrying.
After being parsed, packets must be further classified and assigned to a precise category, through the value contained in a particular header field (in the example above, the TCP
2 Usually, clients receive more data than they transmit, while servers show the opposite behavior.
port). After this, an action is taken: for example, a counter relative to that particular value of the field is incremented (in the example above, we increment the byte counter relative to the particular TCP port).
The former part (parsing) is the one in common with usual packet-filtering systems; the latter is specific to monitoring systems.
As a consequence, these systems usually need to manage matrices, in which each cell corresponds to a particular value of a field in the packet; in the example above, a 1-dimensional matrix is needed, and each cell in the matrix is described (labeled) by the TCP port number. This is shown in Figure 1.
Figure 1 – A monitoring system using matrices. The packet goes through two checks ('Is IP?', 'Is TCP?') and is discarded on a NO answer; if both succeed, the TCP port is extracted and used to select the matrix cell: 80 (Web), 21 (FTP), 23 (Telnet), 25 (Mail).
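The parse-classify-act pipeline of Figure 1 can be sketched in C. The fixed header offsets below assume plain Ethernet + IPv4 with no IP options, purely for illustration; in the actual architecture these steps are performed inside the BPF/TME machine, not in user C code.

```c
#include <stdint.h>
#include <string.h>

struct proto_counter { uint16_t port; uint64_t bytes; };

static struct proto_counter table[] = {
    {80, 0}, {21, 0}, {23, 0}, {25, 0},   /* Web, FTP, Telnet, Mail */
};

static void classify(const uint8_t *pkt, uint32_t len)
{
    if (len < 38) return;                            /* too short to parse */
    if (pkt[12] != 0x08 || pkt[13] != 0x00) return;  /* parse: is it IP?   */
    if (pkt[23] != 6) return;                        /* parse: is it TCP?  */
    uint16_t dport = (uint16_t)(pkt[36] << 8 | pkt[37]);  /* get port */
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].port == dport)
            table[i].bytes += len;                   /* act: count bytes   */
}
```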
Other features depend on what the monitoring system has to do: for example, after classifying a packet, the action to be taken can vary. We may be interested in counting the packets with a particular feature (in the example above, the feature is the specific application protocol); in other cases we need to perform some additional processing. For example, if we want to know the number of bytes transferred in a TCP session, first we need to classify packets on the basis of the session ID³, and then we have to properly manage other fields of the TCP header, like the sequence number and the flags.
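The session ID mentioned here is simply a four-field key; a minimal C illustration (the type name session_id is invented for the example):

```c
#include <stdint.h>

/* The TCP session ID: IP source and destination addresses plus
 * TCP source and destination ports, packed as one 12-byte lookup key. */
typedef struct {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
} session_id;
```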
Another requirement of such systems regards their performance-versatility 'ratio': usually systems characterized by great flexibility show lower performance than special-purpose ones.
The optimal system must be very fast, even on ordinary machines (intended as the physical computers running these applications), and yet versatile.
3.2 Types of monitoring
In the previous paragraphs we dealt with the basics of monitoring systems, briefly pointing out what this class of systems is used for and what characteristics they have in common. Now it is time to list what a monitoring system must be able to offer, and then describe how to realize it.
The most valuable features these systems must be able to deal with are:
• single aspects of traffic, for example the traffic directed to a precise host;
3 The session ID of a TCP connection is formed by the IP source and destination addresses and the TCP source and destination ports.
• n-dimensional traffic matrices, for example the traffic matrix between each pair of hosts in a LAN, or, more simply, the protocol distribution in a network;
• multiple analyses of traffic load, for example when we want to know the amount of IP traffic, of TCP traffic and of TCP port 21 traffic at the same time;
• general logging capabilities;
• analyses of flows, rather than packets.
3.2.1 Single aspects of traffic
This type of analysis involves only one sort of packet at a time; in this category we can put all those statistics like 'IP traffic' or 'Web traffic' or 'packets directed to host BART'.
This kind is the simplest one, and can be realized with any packet-capture system; for example, BPF⁴ usually captures packets on the net after having filtered⁵ them, and the filter processes each packet looking at its headers. This process works correctly but has very poor performance. If we want to monitor the packets efficiently, we have to filter them and increment some counters (e.g. a byte counter) when the filter accepts the packet, instead of passing it to the user. This is shown in Figure 2.
Figure 2 – A simple monitor. In plain packet capture, each packet goes through the filter and, if accepted, is passed to the user (otherwise it is dropped); in the simple monitor, an accepted packet only increments the counters (packets++) instead of being passed to the user.
3.2.2 Traffic matrices
This is an extension of the previous type of statistics: we want to create an n-dimensional matrix, where each dimension represents a particular field in the packet header, and each 'position' in a dimension stands for a value of that field. A good example can be found in the next table, which shows a 2-dimensional matrix where each cell represents a different IP source/destination address couple.
4 Berkeley Packet Filter.
5 Filtering is the process by which a packet is analyzed through its headers and accepted or rejected depending on a user-defined set of conditions.
Source \ Destination   1.4.5.2   2.8.7.10   1.8.5.22   9.8.7.6   1.9.8.5
1.4.5.2                      0         75          5       865         0
1.8.5.22                    12          5          0         3        99
8.57.22.1                  142         78         67        44        35
9.8.7.6                      1         56        445         0        45
1.9.8.5                     47       3778          4         0         0
The numbers in the cells represent the packets sent from host Source to host Destination.
One of the main characteristics of this type of monitoring is that the label identifying each cell is a particular combination of values of the same header fields; in the example above, all the cells represent IP source and destination addresses. This feature will become important when we deal with the realization of this class of monitoring.
Moreover, this type of analysis can be divided into three groups:
• the so-called closed matrices, in which the size, intended as the number of elements in each dimension, is fixed (e.g. sixteen rows and thirteen columns); this is the case of statistics relative to particular values of the fields, for example the particular TCP ports on which a precise application communicates (e.g. “monitor traffic on ports 10, 11 and 12”).
• the so-called open matrices, in which the size is not defined. This kind of matrix can be used to create statistical data relating to traffic that is either not pre-definable or simply too expensive to define. An example is a system monitoring the traffic generated, or received, by each station on a LAN: it would be too expensive to determine all the hosts on the net (especially for large ones) and then create a closed matrix. It is more useful to have a system that adds entries to the matrix whenever a new host is discovered (e.g. when a packet from/to that host is seen).
• the so-called bucket matrices. This type of matrix differs from the previous ones since each cell corresponds to an interval, rather than a single value. An example of this type is a matrix in which each cell corresponds to a different packet size range, as shown in the following table:
PACKET SIZE PACKETS
1 ÷ 64 243
65 ÷ 128 56
129 ÷ 256 3
257 ÷ 512 66
513 ÷ 1024 44
1025 ÷ 1514 7755
3.2.3 Multiple analyses
In this class we can put those statistics that consider dissimilar aspects of network
traffic, which cannot be put in a matrix like the ones described previously. A good example
is a monitoring system that gives the amount of IP, TCP and web traffic; it is clear that
these three types of traffic are totally different: they even belong to different protocol
layers. It is quite usual to have analyses of traffic at different protocol layers, all at the
same time, since this can give the bandwidth occupation of a protocol in relationship with the
underlying protocols.
3.2.4 Logging capabilities
In this class we can put all those statistics that simply need to log some network
activity, that is to say those that do not need any processing such as
counting packets. For example, suppose we are interested in receiving a log message whenever a
TCP packet on port 4995 is seen on the network; the message could be something like
“Alert! TCP packet on port 4995”.
Another example of this type of monitoring, totally different from the previous one,
is an application that wants to receive only some excerpts of the packets (without any
further processing), for example a particular field in the packet, or a definite header (e.g.
the IP header).
3.2.5 Flow analyses
This class of analysis involves the reconstruction of traffic flows before analyzing
the data and obtaining statistical information. The previous types of monitoring involved
simple mechanisms to create statistics because each packet is managed independently
from the others; in fact, the usual operations of such monitors are to classify packets (e.g.
IP packets, identified by their source address) and perform an action (e.g. increment
some counters). Here we have one more step: classify packets (e.g. by their TCP session
identification), reassemble packets to create a flow (e.g. by looking at the sequence
numbers in the TCP headers of the packets), and then perform an action (e.g. look whether the
stream contains an FTP file request, and then increment some counters).
3.3 Realization approaches
Several techniques can be used to perform these classes of analyses, all of them with
strengths and drawbacks. First of all we must distinguish between off-line and real-time
systems:
Off-line systems. They compute statistics on network traffic after having captured the
packets to a file. This type of system is not time-critical, since there are no timing issues
(we process one packet after another). As a consequence, they are
always made up of three components: a packet capture program, that just dumps the
packets to a file, a statistical engine, that reads packets from the dump files and
generates statistical information, and a GUI that shows the results produced by the
statistical engine; we will not deal with this type of system.
Real-time systems. They compute statistics while packets are flowing through the net;
these systems process packets on-the-fly, and consequently their timing
constraints are critical; several approaches can be followed here, and
they will be the topic of the next paragraphs.
Real-time systems with packet capture. One possible approach to realize real-time
monitoring systems is to simply put together a packet capture program and a statistical
engine, like the ones described for off-line systems, as shown in Figure 3; in practice,
we do not dump packets to a file, but rather feed the statistical engine with them.
Although very simple, this technique proves to be very inefficient: every packet is
copied several times before it reaches the statistical engine; in fact, usual packet capture
systems copy packets at least twice (2-copy systems) before delivering them to the
application. Moreover, statistical engines usually need only the first bytes of the packets
(the protocol headers), so there is a waste of memory and CPU resources if we give the
whole packet to the statistical engine.
Figure 3 – A statistical system running at user level
We can solve the last problem by capturing only the first bytes of each packet (if the
capture system is able to do so); this trick greatly decreases the number of bytes copied.
This solution is, however, only a patch over the real problem: multiple copies are
useless and expensive; we should compute statistics without copying the packet, that is to
say while the packet is still in the network card.
Real-time systems at kernel level without buffering. The above mentioned result can
be achieved by moving the statistical engine down to the kernel, combining it with the low
part of the capture system, the one that sits at kernel level and is closely coupled with
the network card. At user level, where we used to have the statistical engine, we will
have only a GUI that displays monitoring data received from the statistical engine, as shown
in Figure 4. It is quite convenient to move the statistical engine down to the kernel for
several reasons:
• tasks running at kernel level have a higher priority than user level ones, so packet
processing takes precedence over user level applications;
• a statistical GUI needs to be refreshed quite seldom, so the context switches6 involved
whenever the GUI retrieves data from the engine happen rarely;
• it is easier to interface directly with the packet capture system, since this last block
works mostly at kernel level (having to work tightly coupled with network adapters).
If we are able to combine the statistical engine with the capture system so that packets
are never copied, performance will surely increase.
This approach has its drawbacks: avoiding copies before feeding the statistical
engine means that each packet must be processed before the next one arrives, and this
requirement becomes critical when network bursts occur; in other words, the packet
processing phase must be very fast. The first mentioned technique (statistical engine at
user level), although time and CPU consuming because of the copies and context
switches, can show a better behavior during bursts, since the copy itself guarantees a sort
of buffering when several packets arrive in a short amount of time; however, this holds
only if the buffer is able to contain several packets at the same time.
Figure 4 – A statistical engine working at kernel level
Real-time systems at kernel level with buffering. There is also a third possible
approach, which mediates between the two previously described: it is possible to put the statistical
engine at kernel level, chaining it to the packet capture system through a kernel buffer.
This solution has the drawback that the packet is copied once, but the buffer itself
‘softens’ network bursts, a situation in which buffer-less solutions usually fail. This
technique is shown in Figure 5.
6 A context switch is involved whenever some data is passed across the user/kernel boundary.
Figure 5 – A statistical engine running at kernel level with packet buffering
Now that we have discussed where to put the statistical engine, it is time to look inside
this block: what does it have to do?
3.3.1 N-dimensional matrices
One of the needs of such a system is to handle matrices of statistical data; this could be
achieved by putting several simple blocks in parallel, each block monitoring only one
particular feature of the net, like the ones described in paragraph 3.2.1. For example,
let’s suppose we want to monitor the traffic of some specific applications, like Web,
FTP, Telnet and mail. In practice we are interested in creating a matrix like the
following:
Traffic   Bytes
Web       345,433
FTP       256,788
Telnet    12,777
Mail      8,908
We can put up four simple monitors, one for each service, as shown in
Figure 6; each of them must perform the following actions for every packet:
• parse protocols up to level 4 (TCP);
• extract the field containing the TCP port, and check whether it corresponds to the
service of interest, for example 80 for the Web.
Figure 6 – Statistical engine realized with several simple monitors in parallel
This approach has a lot of drawbacks: the parsing phase is repeated four times for
each packet, even though its result is always the same, so this solution does not scale when the
number of elements grows; moreover, this technique does not fit open matrices,
because we do not know in advance how many monitoring blocks are needed.
A better approach is to merge the blocks into a macro-block, in which the first phase,
the so called parsing, is performed only once per packet; afterwards, we can increment the
right counters by switching (i.e. choosing the right cell) on the field that discriminates
the various cells of the matrix.
With this technique, the example previously described could be realized in this way: in
the common (parsing) phase we check whether the packet is a TCP one, then we extract the
TCP port number and increment the counters in the cell of the matrix corresponding
to that port, as shown in Figure 7.
Figure 7 – A statistical engine that uses tables
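The single-parse, table-driven scheme just described can be sketched as follows. This is only a simplified user-level model, not the actual WinPcap/TME code: the fixed header offsets (Ethernet II framing, no IP options) and the `update_stats` name are assumptions made for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* One cell of the statistical table: a TCP port and a byte counter. */
struct cell {
    uint16_t port;      /* label of the cell (e.g. 80 for Web) */
    uint64_t bytes;     /* traffic accounted to this port      */
};

/* Fixed "closed matrix": one cell per monitored service. */
static struct cell table[] = {
    { 80, 0 }, { 21, 0 }, { 23, 0 }, { 25, 0 },
};

/* Common parsing phase, performed once per packet: check that the
 * packet is IPv4/TCP, extract the destination port, then switch on
 * it (i.e. pick the right cell) and increment the counter. */
void update_stats(const uint8_t *pkt, size_t len)
{
    if (len < 14 + 20 + 20) return;              /* too short  */
    uint16_t ethertype = (uint16_t)((pkt[12] << 8) | pkt[13]);
    if (ethertype != 0x0800) return;             /* not IPv4   */
    if (pkt[14 + 9] != 6) return;                /* not TCP    */
    uint16_t dport = (uint16_t)((pkt[14 + 20 + 2] << 8) | pkt[14 + 20 + 3]);

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].port == dport) {            /* cell found */
            table[i].bytes += len;
            return;
        }
    /* ports outside the matrix are simply ignored */
}
```

Note that the parsing is done exactly once per packet, regardless of how many services are monitored; adding a fifth service only adds one cell to the table.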
There is another problem related to matrices: when their dimension is bigger than
one (two-, three-dimensional, and even more), matrices tend to become sparse. If we
implement such structures in the ‘normal’ way, memory occupation tends to grow
quadratically. Here is a small example of why this happens: let’s suppose we want to
know the amount of traffic between each pair of hosts on a network; usually a
machine does not communicate with all the others, but rather with a subset of them, for
example servers or routers. Nevertheless, if we create a ‘standard’ matrix, we will have
all the N² possible combinations, where N is the number of hosts. If we have 100 hosts,
we will create 10,000 cells in the table.
A feasible solution to the problem can be the use of open matrices: in the example
above a row or a column is created only when a new source or destination is seen (in a
packet); however this technique proves to be inefficient, as shown in the following
table.
SOURCE
DEST. A B C D E F G H
A 0 3 0 14 0 17 0 0
D 0 0 58 0 0 2 0 0
E 57 0 0 0 0 0 4 0
F 0 0 87 0 863 0 0 0
G 90 877 24 7 24 0 0 12
H 0 0 0 0 0 0 87 0
I 0 0 0 0 0 0 9 0
A packet from host G to host I arrives; since destination host I did not exist, a row
is added. The problem is that to add data relating to traffic from G to I (which
corresponds to only one cell in the matrix) we added eight cells.
A better behavior can be obtained by modifying the matrix in two steps:
• first we transform the matrix into a 1-dimensional one, i.e. a single column; this can
be done by labeling each cell of the new matrix with the ordered concatenation of the
coordinate values of the corresponding cell in the old matrix. The tables
below show a small example of this technique:
      A    B
A1    1    2
B1    6    7
C1    4    3

becomes

A1-A  1
A1-B  2
B1-A  6
B1-B  7
C1-A  4
C1-B  3

• the second step to optimize memory occupation is to store only those cells that
contain valid data.
When a new value, for example a source address, must be inserted, only one cell is
added to the matrix, instead of an entire row, as shown in the previous example.
The drawback of such an approach is that we need an algorithm to find a cell in this
‘compressed’ table; since the labels that identify the cells are byte strings, we can store this
column in a hash table; this method assures a fast search, in theory like a direct lookup in
a table (provided that some conditions regarding the hash function are satisfied).
The above described technique is not able to deal with bucket matrices; for this type
of statistics it is necessary to create an ad-hoc search algorithm, for example using the
dichotomic (binary search) method; in practice we need to establish a relationship between each packet
size and the corresponding interval (the ‘bucket’); this bucket is the label that identifies
each cell of the table. This is shown in Figure 8. At this point we can use the bucket
label, together with other header fields, to create another N-dimensional matrix, and use
the normal search algorithms, as shown in Figure 9.
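A minimal sketch of such a ‘compressed’ open matrix follows, using the concatenated label as the key of a chained hash table. The hash function, the key format and the sizes are all illustrative choices, not the thesis’ actual data structures; error handling is omitted for brevity.

```c
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

#define NBUCKETS 1024

/* One cell of the compressed column: the concatenated label
 * (e.g. "G-I" for source G, destination I) and its counter,
 * chained on hash collisions. */
struct entry {
    char key[32];
    uint64_t packets;
    struct entry *next;
};

static struct entry *buckets[NBUCKETS];

/* Toy string hash (djb2); a real engine would hash the raw
 * header bytes directly. */
static unsigned hash(const char *s)
{
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Find the cell labeled `key`, creating it on the fly if it does
 * not exist yet: this is exactly the behavior of an open matrix,
 * where inserting a new value costs one cell, not a whole row. */
struct entry *lookup(const char *key)
{
    unsigned h = hash(key);
    for (struct entry *p = buckets[h]; p; p = p->next)
        if (strcmp(p->key, key) == 0)
            return p;
    struct entry *e = calloc(1, sizeof *e);   /* allocation check omitted */
    strncpy(e->key, key, sizeof e->key - 1);
    e->next = buckets[h];
    buckets[h] = e;
    return e;
}
```

With this structure, accounting the packet from G to I of the example reduces to `lookup("G-I")->packets++;` and adds exactly one cell.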
Figure 8 – How buckets can be managed: each packet size (1–1514 bytes) is mapped to one of six buckets (1–64, 65–128, 129–256, 257–512, 513–1024, 1025–1514)
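The dichotomic search that maps a packet size to its bucket label can be sketched as follows; the interval bounds are the ones of the table above, and the function name is an example, not part of any real API.

```c
/* Upper bound (inclusive) of each bucket, in ascending order. */
static const int upper[] = { 64, 128, 256, 512, 1024, 1514 };

/* Dichotomic (binary) search: returns the index of the bucket
 * whose interval contains `size`, or -1 if out of range. */
int bucket_of(int size)
{
    if (size < 1 || size > 1514) return -1;
    int lo = 0, hi = 5;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (size <= upper[mid])
            hi = mid;        /* the bucket is at mid or below */
        else
            lo = mid + 1;    /* the bucket is above mid */
    }
    return lo;
}
```

The returned index is the label used, possibly together with other header fields, to enter the N-dimensional matrix.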
Figure 9 – The buckets system: the bucket label obtained from the packet size is used to enter the table
3.3.2 Multiple analyses
Another requirement of the statistical engine is the ability to manage statistics on
different aspects of network traffic at the same time.
This feature could be implemented by putting several simple monitor blocks in parallel, in a
way similar to the one depicted in Figure 6; this approach has a big drawback: if the
data of the different blocks are not retrieved by the GUI at the same time, the resulting
information can be incoherent. Here is a small example of this problem: let’s suppose
we want to monitor IP and TCP traffic by counting packets of these types; at t=t0
packetsIP and packetsTCP both equal 59; the GUI begins to transfer statistical data, and
receives packetsIP=59; then a packet arrives, the counters corresponding to IP and TCP
are updated8, and they both contain 60. The GUI continues to transfer data, retrieves
8 The GUI is interrupted since the blocks run at kernel level, hence with a higher priority.
packetsTCP=60, which is clearly incoherent, since TCP is transported over IP (and
consequently packetsTCP cannot be greater than packetsIP).
The solution is to put the simple blocks into a bigger one, like we did for traffic
matrices, and then retrieve data from it ‘atomically’, that is to say while the
monitoring engine is blocked; this solution cannot be applied if we keep several simple
monitor blocks in parallel, since they are completely independent. The difference is that
here the various blocks are heterogeneous, while before they shared the same protocol header
field to select the right cell (in the example shown in Figure 7 it was the TCP port).
A method to solve this problem is to generate the label that identifies each cell ‘by
hand’: let’s suppose we want to perform an analysis on IP, TCP and WEB traffic:
• first we check whether it is IP; if so, we increment the counters of the cell labeled with the
string “IP”;
• then we check whether it is TCP; if so, we increment the counters of the cell labeled with
the string “TCP”;
• finally we check whether it is also part of WEB traffic, and if so increment the counters of the
entry labeled “WEB”.
The flow diagram of this technique is shown in Figure 10.
Figure 10 – How to perform multiple analyses using a lookup table
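This flow could be sketched as follows. The `is_ip`/`is_tcp`/`is_web` classifiers are reduced to stubs working on a toy 3-byte ‘packet’ (an assumption made purely for illustration; a real engine would parse the actual headers), while the hand-made labels “IP”, “TCP” and “WEB” are taken from the text.

```c
#include <stdint.h>

/* Counters for the three heterogeneous analyses; the cell labels
 * 'IP', 'TCP', 'WEB' are generated by hand, not read from the packet. */
struct { const char *key; uint64_t packets; } cells[] = {
    { "IP", 0 }, { "TCP", 0 }, { "WEB", 0 },
};

/* Placeholder classifiers on a toy 3-flag packet. */
static int is_ip(const uint8_t *p)  { return p[0] == 1; }
static int is_tcp(const uint8_t *p) { return p[0] == 1 && p[1] == 1; }
static int is_web(const uint8_t *p) { return p[0] == 1 && p[1] == 1 && p[2] == 1; }

/* The flow of Figure 10: each successful test increments the
 * counters of the corresponding hand-made label; a failed test
 * ends the classification (WEB implies TCP implies IP). */
void classify(const uint8_t *p)
{
    if (!is_ip(p))  return;
    cells[0].packets++;
    if (!is_tcp(p)) return;
    cells[1].packets++;
    if (is_web(p))
        cells[2].packets++;
}
```

Because all three counters live in the same table and are updated in one pass, the GUI can retrieve a coherent snapshot by blocking the engine during the copy.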
The difference between this example and the one shown in Figure 7 is that there the
label identifying a cell is taken directly from the packet (it is the TCP port), while here
we must generate it manually.
With this technique, we can entirely reuse the same system proposed for traffic
matrices.
3.3.3 General logging capabilities
It is quite simple to perform logging, especially if we want messages like “a
packet of type XYZ was seen”. In practice, we need a mechanism to signal and deliver
these messages to the user; this can be done with a buffer in which the messages are
stored, and a system to alert the user application that something new was put in it; this
alert could be a select() on a file under Unix platforms, or a named event under
Windows platforms.
Moreover, the buffer we need can be the one usually shipped with the packet capture
system9, the one used to store the packets to be passed to the user; in practice the user
application, which usually receives packets in a buffer provided by the capture engine,
now receives log messages in the same buffer.
Another feature is the possibility of passing some excerpts of packets to the user: again
we can use the packet capture buffer to store these excerpts, as if they were log messages;
we only have to modify the capture system, so that it is able to save not only the whole
packet, but also a fragment of it (e.g. from byte 13 to byte 17).
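A toy model of reusing a single capture buffer for log records is sketched below. The record format (a 2-byte length followed by the message text) and the buffer size are invented for illustration and do not reflect the actual WinPcap buffer layout.

```c
#include <string.h>
#include <stddef.h>

/* The shared buffer normally used for captured packets; here it
 * receives variable-length log records instead: a 2-byte length
 * followed by the message text. */
static unsigned char buf[4096];
static size_t head;               /* next free byte */

/* Append a log record; returns 0 on success, -1 if the buffer is
 * full. A real system would also signal the user application here
 * (select() on Unix, a named event on Windows). */
int log_msg(const char *msg)
{
    size_t n = strlen(msg);
    if (head + 2 + n > sizeof buf) return -1;
    buf[head]     = (unsigned char)(n >> 8);
    buf[head + 1] = (unsigned char)(n & 0xff);
    memcpy(buf + head + 2, msg, n);
    head += 2 + n;
    return 0;
}
```

The user application drains the buffer exactly as it would drain captured packets, reading the length prefix and then the record body.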
3.3.4 Analyses on flows
Here the problem is only partially related to monitoring; in practice, we need a sort
of preprocessor that reconstructs flows in some manner, and then feeds the statistical
engine with these data. A good starting point is to reassemble the packets of a flow to fill a
buffer of a certain fixed size, for example 16 kilobytes, and then pass it to the statistical
engine10; this is not a true reconstruction (we subdivide the flow into big chunks), but this
approach has proved to be very useful in lots of situations. In fact, this is the technique
used by Snort, one of the most used non-intrusive Intrusion Detection Systems for
small-medium networks.
In practice, the method used to reconstruct flows depends on two considerations:
1. what we want to do with the flows: if we want to catch the requests sent to a server, for
example a “GET / HTTP/1.0” request to a Web server, the above technique
proves to be good, since these requests are usually at the beginning of the flow;
2. the fact that it is impossible to realize a system that analyzes continuous flows of
bytes; the only possible manner is to split the stream into overlapping11 blocks of
data, and then feed the statistical engine with them.
After the reconstruction has been done in some manner, the previously mentioned
analyses (matrices, logging…) can be used.
9 Actually, there are some capture systems without buffering capabilities, whose performances are, however, very poor.
10 These 16 kB blocks must be overlapping, to avoid that a searched string is split into two blocks.
11 Overlapping is needed to maintain some sort of continuity between the various blocks.
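The overlapping-block splitting mentioned above can be sketched like this; the chunk and overlap sizes are arbitrary parameters (a real system would use something like 16 kB blocks), and the callback-based interface is just one possible design.

```c
#include <stddef.h>

/* Split a reconstructed stream into chunks of `chunk` bytes that
 * overlap by `overlap` bytes, calling `emit` on each one; the
 * overlap guarantees that a searched string shorter than the
 * overlap can never be split across two chunks. */
void split_stream(const char *data, size_t len,
                  size_t chunk, size_t overlap,
                  void (*emit)(const char *, size_t))
{
    if (chunk <= overlap) return;    /* would never advance */
    size_t pos = 0;
    for (;;) {
        size_t n = len - pos < chunk ? len - pos : chunk;
        emit(data + pos, n);
        if (pos + n >= len) break;
        pos += chunk - overlap;      /* advance, keeping the tail */
    }
}
```

Each emitted chunk can then be fed to any of the previously mentioned analyses (matrices, logging, pattern matching).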
The real problem in dealing with flows rather than packets is that things become a lot
more complicated: in a packet we can identify a header, and therefore some fixed
fields on which the analyses can act; in a flow, it is almost impossible to
identify anything fixed: it can be a file transfer, and therefore raw data, or an
application protocol, which is very difficult to recognize. In fact, it is no accident that
the usual analysis made on flows is pattern matching, that is to say scanning the
flow to find whether a particular string is present.
3.4 Conclusions
After this discussion on the requirements of monitoring systems, it is clear that three
main elements are needed to build a powerful statistical system:
• something to handle matrices, from now on called tables; with the proposed
approach, they must offer a fast search algorithm, and they must be dynamic, in the
sense that new elements can be added at any time (to realize the so called open
matrices);
• a mechanism to offer logging capabilities; as stated before, the proposed approach
tends to reuse the buffering system of the packet capture system to perform this
task;
• a mechanism to reconstruct flows; actually, as stated before, this task is quite
independent from statistics, in the sense that reconstruction alone does not give any
statistical information; rather, it is a basis on which to perform a more specific monitoring.
4 Firewalls
4.1 Introduction
A firewall is a network component that sits between two LANs, with the aim of
selectively bridging packets from one LAN to the other and vice versa.
Literally, a firewall is a brick wall that isolates and protects something from fire; in
the network context, a firewall is something that separates two or more LANs with
different security levels; in practice a firewall protects a more secured LAN from
malicious traffic originated in the less secured LAN.
In order to obtain this result, a firewall must be the only means by which the two
LANs communicate, as shown in Figure 11; being a single point of failure12, much care
must be taken in designing these systems.
Figure 11 - Placement of a firewall
In general, firewalls can be divided into two categories: packet filtering firewalls,
which operate on packets and therefore are transparent to applications, and application
gateways, which are not transparent to hosts.
Packet filtering firewalls analyze each packet that has to go from one LAN to the other,
deciding whether it is ‘permitted’ or not; application gateways are systems that work at
application level: a host (A) that needs to open a connection with another one (B) must
open a connection to the gateway instead; the gateway, in turn, opens a connection
with host B, thus enabling the communication between A and B.
There is also an intermediate class of systems, the so called Level 4-7 Switches13: in fact
they are transparent to host applications, like packet filtering firewalls, but they do
not work on packets; rather they work in a manner which is similar to the application
gateways (i.e. they do not limit themselves to analyzing packets, they can respond to hosts).
In this chapter we will deal only with packet filtering firewalls.
From the network point of view, for every packet received on a NIC interface, a
packet filtering firewall checks whether the packet is ‘good’ or ‘bad’, that is to say whether it can
pass or not; if the packet is ‘good’, it is sent out on an interface different from the one
it was received on, otherwise it is discarded.
12 A network component is a single point of failure when all the traffic must pass through it, so if it fails, the two parts are isolated.
13 Level refers to the layers of the OSI model; in particular these switches work at the transport, session, presentation and application levels.
From the user level point of view, a firewall can perform a different action for every
‘good’ or ‘bad’ packet; for example it can capture, or simply log, ‘bad’ packets,
maintain a counter of ‘good’ packets, or respond to a host with a control message (e.g. a
TCP reset, or an ICMP xxx unreachable message).
We can subdivide packet filtering firewalls into two more categories:
• systems that analyze only the packet headers, usually up to the transport layer of the
OSI model; since we will mainly deal with TCP/IP protocols, they evaluate up to the
TCP or UDP header;
• systems that also analyze the packet payload, and are usually able to deal with
flows too, thus working with all the layers of the OSI stack; this technique is usually
known as stateful inspection.
Before discussing these two kinds of firewalls, let’s have a quick look at what we
have to protect. In general a firewall can be used to perform different tasks, such as preventing
attacks and blocking some services available on a LAN from the outside world.
Attack prevention: in general an attack on a machine happens when some entity
tries to gain access to the resources of the machine, for several possible reasons, ranging from
theft of information up to forcing the machine to behave in a wrong manner (for
example generating network traffic until the bandwidth is saturated).
Attacks on machines connected to a network can be divided into two big categories:
• attacks that exploit the TCP/IP protocols themselves; in this category we can put
attacks like port scanning, SYN-flooding, hijacking… In practice they usually abuse
the lack of controls and security in the protocols to hack systems.
• attacks that exploit bugs in network applications. A common bug is an
application that is not able to deal with all the possible messages it can receive from the
network, and thus behaves strangely; the consequences are unpredictable: they range
from a crash of the application to the complete bandwidth saturation caused by an
application gone ‘crazy’. A good example of this type of attack is the one exploited
by the Nimda virus: this virus, hidden in an HTTP request to an IIS14 web server, is able
to make the web server itself generate an enormous amount of HTTP requests to
randomly selected hosts, thus saturating the entire network. This type of attack is
usually recognizable through the so called signatures: every attack is characterized
by a precise message that the attacker sends to its victim, and the signature is the part of
this message that uniquely identifies the attack.
14 Internet Information Server, the Web Server developed by Microsoft.
Service blocking: in this case we want a service, available in one of the LANs, not to
be used from the other LAN. For example, let’s suppose we have a network formed by
several hosts, connected to the Internet, and we want something that prevents
some hosts from accessing web servers outside the network.
Network fingerprinting: we want to identify a system (in terms of operating system and
specific release) on the basis of the packets it sends. It is an important feature for
preventing future attacks.
4.2 Packet filtering approach
As said before, these types of firewalls analyze traffic by inspecting the packet
headers up to level 4 of the OSI model, that is to say up to the TCP or UDP headers.
The usual approach followed by these components is to extract some fields of the
packets and decide if they are legal or not; in practice they perform a sort of packet
classification, depending on these fields, as shown in Figure 12
.
Figure 12 - Packet Filtering approach
For example, let’s suppose we have a LAN, connected to the Internet, and we want
to prevent some hosts of the LAN from accessing the Internet; we can place a firewall
between the LAN and the Internet, and then create a rule like “drop packets from hosts
X, Y, Z going outside”. This rule can be realized by classifying packets through their
source IP address, thus deciding whether a packet can pass or has to be dropped.
In practice, every packet coming in from a physical interface is classified through some
fields of its headers, which are usually:
• IP source and destination addresses,
• transport protocol source and destination ports, in practice the TCP or UDP ports,
• connection state information, through the flags of the TCP header,
• sometimes the packet length.
This kind of firewall is usually sufficient to perform service blocking and to block
simple attacks, but it is not able to deal with ‘application level’ attacks, which are recognizable
only by inspecting the whole content of the packets.
4.3 Stateful inspection approach
These types of firewalls inspect whole packets, not only the headers. The usual
analysis they perform is pattern matching, that is to say they search the packets for a particular
given string, the above mentioned signature.
In practice, a firewall implementing this method is formed by a simple packet
filtering ‘engine’, that pre-filters packets, and then a pattern matching engine that
works on those packets, as shown in Figure 13. It is important to ‘pre-filter’ packets for a
simple reason, that is speed: in fact, pattern matching is a quite complex task, so its use
must be limited to restricted groups of packets.
For example, let’s suppose we want to block packets trying to attack web servers with
virus XYZ; moreover, we know that the signature of the attack is the string “I am
virus XYZ”.
One approach is to inspect all the packets searching for that string; this is too expensive
when we are working on a heavily loaded network. Instead, we can pre-filter all the
packets going to a web server, and then perform pattern matching only on those.
Figure 13 - Stateful inspection approach
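The two-stage scheme can be sketched as follows. `SIG` is the example signature from the text; the pre-filter is reduced to a destination-port check and the payload scan to a plain substring search, which are simplifying assumptions (a real engine would use a length-bounded search on raw bytes rather than `strstr`).

```c
#include <string.h>

static const char SIG[] = "I am virus XYZ";

/* Stage 1: cheap pre-filter; only packets directed to a web
 * server (TCP destination port 80) are worth the expensive scan. */
static int prefilter(int dst_port)
{
    return dst_port == 80;
}

/* Stage 2: pattern matching on the payload, run only on the
 * pre-filtered packets; returns 1 if the signature is found.
 * (strstr requires a NUL-terminated payload.) */
int is_attack(int dst_port, const char *payload)
{
    if (!prefilter(dst_port)) return 0;
    return strstr(payload, SIG) != NULL;
}
```

The pre-filter cuts the expensive scan down to a small fraction of the traffic, which is exactly the point made above.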
This type of firewall is used whenever one wants to protect a network from attacks on
applications; in fact, in this case the only method to recognize attacks is to search
the packets for the signature, and then drop those packets, possibly logging them or
raising an alarm to the user.
Another detail must be pointed out about these firewalls: usually application attacks are
conducted through TCP connections, so the signature has to be searched for in the flows,
rather than in the packets. Moreover, attackers can force the signature to be split into
two or more packets (for example by forcing the MTU15 down to very small values); as a
consequence, it is sometimes not enough to look at the payloads of the single packets, and it
would be very useful to have a system that reconstructs flows before feeding the pattern
matching engine.
15 Maximum Transmission Unit: the maximum number of bytes that can be carried in a single link-layer frame.
4.4 Access Control Lists
In the above paragraphs, we said that packets must satisfy certain user-defined rules in
order to pass the firewall. In the firewall world, rules are often called Access Control
Entries (ACEs), and a group of ACEs is called an Access Control List (ACL).
Every ACE corresponds to a particular type of packet, and it is composed of two main
parts: a key part and an action; the key part uniquely identifies a class of packets, and
the action is what we have to do if a given packet matches that key.
For example, an ACE can be
ip source 130.192.3.0 netmask=255.255.255.0 accept
The first part identifies all the packets originated from the subnet 130.192.3; if the
packet arrives from one of the hosts of that subnet, it must be accepted, that is to say it can
pass through the firewall.
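Matching the key part of such an ACE comes down to a masked comparison, as sketched below; addresses are held as 32-bit integers, and the structure and function names are illustrative.

```c
#include <stdint.h>

/* An ACE key with a mask: the packet matches when its address,
 * restricted to the mask, equals the rule's network. */
struct ace_key {
    uint32_t net;    /* e.g. 130.192.3.0   = 0x82C00300 */
    uint32_t mask;   /* e.g. 255.255.255.0 = 0xFFFFFF00 */
};

int key_matches(const struct ace_key *k, uint32_t src_ip)
{
    return (src_ip & k->mask) == (k->net & k->mask);
}
```

With net 130.192.3.0 and a /24 mask, every source address in 130.192.3.x matches, which is exactly the interval semantics of masked ACE keys.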
As said before, every ACL is formed by several ACEs; two approaches can be followed
in creating ACLs: the order-dependent mode and the order-independent one.
In an order-dependent ACL, ACEs must be evaluated in the order they are listed; when a packet
does not match an ACE, the evaluation skips to the next ACE; the process ends when
the given packet matches an ACE. Here is a small example:
1. ip source 130.192.3.0/24 accept
2. ip dest 130.192.44.25 accept
3. ip source 130.192.2.44 and dest 130.192.11.43 accept
4. all deny
Let’s suppose we have a packet to IP 130.192.44.25; the packet does not match ACE 1,
so ACE 2 is evaluated, and the packet is accepted.
The main feature of this type of ACL is that ACEs are not mutually exclusive, that is to
say a packet can match more than one ACE, but the action taken is always the one of
the first matching ACE. As a consequence, it is not possible to change the order of the
ACEs without modifying the behavior of the ACL.
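Order-dependent evaluation is then a linear scan that returns the action of the first matching ACE. The sketch below is a simplified model with source/destination keys only; the ACCEPT/DENY constants and the zero-mask convention for “any” are assumptions made for the example.

```c
#include <stdint.h>

enum action { DENY = 0, ACCEPT = 1 };

struct ace {
    uint32_t src, src_mask;   /* mask 0 means wildcard ("any") */
    uint32_t dst, dst_mask;
    enum action act;
};

/* First-match semantics: ACEs are tried in order and the action of
 * the first one matched is taken; reordering the list may change
 * the result, which is why these ACLs are order-dependent. */
enum action acl_eval(const struct ace *acl, int n,
                     uint32_t src, uint32_t dst)
{
    for (int i = 0; i < n; i++)
        if ((src & acl[i].src_mask) == (acl[i].src & acl[i].src_mask) &&
            (dst & acl[i].dst_mask) == (acl[i].dst & acl[i].dst_mask))
            return acl[i].act;
    return DENY;               /* implicit final "all deny" */
}
```

Replaying the example ACL of the text: a packet to 130.192.44.25 fails ACE 1 but matches ACE 2, so it is accepted without ever evaluating ACE 3.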
On the contrary, order-independent ACLs are characterized by the fact that ACEs are
mutually exclusive, that is to say a packet always matches one and only one ACE. As a
consequence, ACEs can be scrambled without modifying the behavior of the ACL.
In practice, existing firewalls always use order-dependent ACLs.
A typical feature of ACEs is that they usually work with masks: the key does not
correspond to a fixed value, but rather to an interval; since each ACE usually has a
different mask, this can cause big problems for the implementation of such systems.
4.5 Requirements
As stated before, there are two key elements to realize firewalls: a system to handle
ACLs, if we want to build a packet filtering engine, and something for pattern matching,
if we want to deal with stateful inspection.
First of all, we have to analyze each packet to see if it carries a protocol of interest,
and then we must extract those header fields that are used to match the ACEs of an
ACL. Usually, we extract source and destination addresses at data-link, network and
transport layer of the OSI model, in practice MAC and IP addresses and TCP or UDP
ports. It is not unusual to extract some flags of the headers, too, for example the
connection state flags of the TCP header. We can call these extracted fields the key of
the packet.
This task is quite simple to perform, since packet headers are usually in well known
positions, and it is the usual job of selective packet capture systems. An example, using
the ACL described in the previous paragraph, is shown in Figure 14.
[Figure 14 – Working principle of a packet filtering firewall: the firewall extracts the IP source (130.192.2.44 = 0x82C0022C) and the IP destination (130.192.11.43 = 0x82C00B2B) from the packet and evaluates the four-entry ACL of the previous paragraph; the packet matches ACE n. 3, so it is accepted.]
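The key-extraction step of Figure 14 can be sketched as follows. This is an illustrative sketch with hypothetical helper names; the offsets assume a plain Ethernet II frame (EtherType at offset 12, IPv4 source and destination addresses at offsets 26 and 30) with no VLAN tags.

```python
# Illustrative sketch: extracting the "key" (IP source and destination)
# from a raw Ethernet + IPv4 frame, as a packet filtering firewall would.
import struct

def extract_key(frame: bytes):
    ethertype = struct.unpack_from("!H", frame, 12)[0]
    if ethertype != 0x0800:                           # not an IPv4 packet
        return None
    ip_src = struct.unpack_from("!I", frame, 26)[0]   # 14 (Ethernet) + 12
    ip_dst = struct.unpack_from("!I", frame, 30)[0]   # 14 (Ethernet) + 16
    return ip_src, ip_dst

# Build a minimal frame carrying 130.192.2.44 -> 130.192.11.43,
# i.e. the hex values 0x82C0022C and 0x82C00B2B of Figure 14.
frame = bytearray(34)
struct.pack_into("!H", frame, 12, 0x0800)
struct.pack_into("!I", frame, 26, 0x82C0022C)
struct.pack_into("!I", frame, 30, 0x82C00B2B)
src, dst = extract_key(bytes(frame))
print(hex(src), hex(dst))   # 0x82c0022c 0x82c00b2b
```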
The problem is to realize a fast system to deal with ACLs, that is to say we need
something that is fast to match the ACEs.
The simplest possible approach is to keep ACLs as they are defined by the user, and
check each ACE in order, stopping at the first matched ACE. This approach can be used
with ordered and unordered ACLs, since unordered ones are just a special case of the
ordered ones. An approach of this type has the big advantage that user defined ACLs
can be used without any modification, but it has also two big disadvantages:
4.5 Requirements
47
• it does not scale when the number of ACEs grows, since in the worst case each ACE
must be checked;
• performance highly depends on the order of the ACEs: if an ACE matching 90% of
the keys16 is put at the end of the ACL, 9 keys out of 10 have to be checked against all
the previous ACEs before a successful match. It is no coincidence that it is usually
advised to put ACEs that match a large percentage of packets at the beginning of the
ACL.
Another idea could be to manage ACLs as lookup tables, considering each entry as
an ACE. This approach has two big problems:
• if the ACL is order-dependent, several ACEs can match a key; in this case we have
to consider only the first matching ACE, in order; in a lookup table, instead, there is a
1-to-1 correspondence between each key (and so each packet) and each entry in the
table;
• each ACE is formed by two parts, a value and a mask, which vary among ACEs: the
matching algorithm must apply the mask to the given key, and then check whether the
result is equal to the value associated with the mask.
As a consequence, because of the lack of a 1-to-1 correspondence between keys
(extracted from the packet headers) and ACEs, it is impossible to manage a lookup table
made of ACEs with the usual search algorithms.
The first problem can be solved by transforming ordered ACLs into unordered
ACLs; let’s suppose we have an ordered ACL with three ACEs,
ACE1 → A
ACE2 → B
ACE3 → C
where A, B and C represent a type of packets.
We can transform this ACL into the following one
ACE'1 → A
ACE'2 → (A')B
ACE'3 → (A')(B')C
where (A') means packets not of type A, and the product means the logical AND.
In practice, each new ACE is the logical AND of the complements of all the previous
ACEs with the original ACE. The new ACL is equivalent to the original one for this
reason: in the
16 Remember that there is an almost 1-to-1 correspondence between a packet and a key.
original ACL, ACEi was evaluated only if the i−1 preceding ones were not satisfied,
that is to say if their complements were satisfied.
The three new ACEs are mutually exclusive, so the new ACL is order independent.
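The transformation ACE'i = (A'1)(A'2)…(A'i−1)Ai can be sketched as follows. This is an illustrative sketch: ACEs are modeled as plain boolean predicates, and the helper name `make_order_independent` is our own.

```python
# Illustrative sketch: turning an order-dependent ACL into an equivalent
# order-independent one. Each new ACE matches only the packets that its
# original matched AND that none of the previous ACEs matched, so the new
# ACEs are mutually exclusive.
def make_order_independent(aces):
    new_aces = []
    for i, (pred, action) in enumerate(aces):
        previous = [p for p, _ in aces[:i]]
        def new_pred(x, pred=pred, previous=previous):  # bind loop variables
            return pred(x) and not any(p(x) for p in previous)
        new_aces.append((new_pred, action))
    return new_aces

A = lambda x: x < 10          # "type A" packets (toy predicates over integers)
B = lambda x: x % 2 == 0      # "type B" packets
ordered = [(A, "a"), (B, "b"), (lambda x: True, "c")]   # catch-all last ACE
unordered = make_order_independent(ordered)

# Exactly one new ACE matches any given key, whatever the evaluation order:
for x in range(50):
    assert sum(1 for p, _ in unordered if p(x)) == 1
```

Since the last original ACE is a catch-all, every key matches exactly one transformed ACE, so the transformed ACL can be evaluated in any order.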
The second problem is much more complicated to solve; a good starting point is to
group the ACEs that have the same mask, and then apply a normal search algorithm on
each group. Let’s suppose we have the following ACEs (they are IPv4 subnets), in an
order independent ACL.
ACE1 → 130.192.28.0 mask 255.255.255.0
ACE2 → 130.192.134.24 mask 255.255.255.248
ACE3 → 130.193.233.0 mask 255.255.255.0
ACE4 → 10.11.128.0 mask 255.255.255.248
ACE5 → 130.192.16.128 mask 255.255.255.248
ACE6 → 84.112.11.0 mask 255.255.255.0
ACE7 → 130.192.65.0 mask 255.255.255.0
Since ACE1, ACE3, ACE6 and ACE7 (and also ACE2, ACE4 and ACE5) have the same
mask, we can group them into a ‘super’ ACE, as shown below
ACE'1 → mask=255.255.255.0 value1= 130.192.28.0
value2= 130.193.233.0
value3= 84.112.11.0
value4= 130.192.65.0
ACE'2 → mask=255.255.255.248 value1= 130.192.134.24
value2= 10.11.128.0
value3= 130.192.16.128
Then, we analyze each super ACE by first masking the given key, and then
performing an ordinary search among the values associated with that super ACE. The
process is repeated for each super ACE, until a match happens.
This approach gives good results only if an ACL contains few different masks repeated
several times in the ACEs; unfortunately, usual ACLs do not have this property.
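The super-ACE technique can be sketched as follows; it is an illustrative sketch in which IPv4 addresses and masks are plain 32-bit integers and the helper names (`ip`, `super_aces`, `lookup`) are our own.

```python
# Illustrative sketch: grouping ACEs that share the same mask into a "super
# ACE", so that each group can be searched with an ordinary hash lookup on
# the masked key.
def ip(s):
    a, b, c, d = map(int, s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

aces = [                                   # (value, mask), from the example above
    (ip("130.192.28.0"),   ip("255.255.255.0")),
    (ip("130.192.134.24"), ip("255.255.255.248")),
    (ip("130.193.233.0"),  ip("255.255.255.0")),
    (ip("10.11.128.0"),    ip("255.255.255.248")),
    (ip("130.192.16.128"), ip("255.255.255.248")),
    (ip("84.112.11.0"),    ip("255.255.255.0")),
    (ip("130.192.65.0"),   ip("255.255.255.0")),
]

# Build one set of values per distinct mask: two "super ACEs" in this example.
super_aces = {}
for value, mask in aces:
    super_aces.setdefault(mask, set()).add(value)

def lookup(key):
    """Mask the key with each super ACE's mask, then do an ordinary search."""
    for mask, values in super_aces.items():
        if (key & mask) in values:
            return True
    return False

print(lookup(ip("130.192.28.77")))    # True: falls in ACE1's /24 subnet
print(lookup(ip("130.192.134.30")))   # True: falls in ACE2's /29 subnet
print(lookup(ip("8.8.8.8")))          # False
```

With few distinct masks, each lookup costs one masked hash probe per super ACE; with one mask per ACE the method degenerates to the linear scan it was meant to replace, which is exactly the limitation noted above.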
The second requirement of a firewall is the ability to perform pattern matching:
the task is quite simple, but it is necessary to implement some sort of flow
reconstruction, due to the problems described in paragraph 4.3. It is not necessary to
have a true flow reconstruction; rather, we need a part of the flow large enough
to permit the pattern matching operation, that is to say at least the length of the
signature. The flow reconstruction must also satisfy another requirement: we have said
that it is enough to receive a block of the flow, but it is also necessary that the various
blocks of a flow overlap, otherwise the pattern matching system cannot recognize the
signature when it is split into two blocks, as shown in Figure 15 and Figure 16.
[Figure 15 - Flow reconstruction without overlapping: each block of the flow is compared with the signature, but the signature is split across two non-overlapping blocks, so it is not recognized by the pattern matching system.]
[Figure 16 - Flow reconstruction with overlapping blocks: the signature falls entirely inside one of the overlapping blocks, so it is recognized by the pattern matching system.]
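The situation of Figures 15 and 16 can be sketched as follows. This is an illustrative sketch with hypothetical names: the flow, the block size and the ATTACK signature are made up, and an overlap of at least len(signature) − 1 bytes is what guarantees the signature falls entirely inside some block.

```python
# Illustrative sketch: scanning a flow in fixed-size blocks. Without overlap,
# a signature straddling a block boundary is missed; with enough overlap it
# is always found.
def scan(flow: bytes, signature: bytes, block: int, overlap: int):
    step = block - overlap                 # distance between block starts
    for start in range(0, len(flow), step):
        if signature in flow[start:start + block]:
            return True
        if start + block >= len(flow):     # last block already examined
            break
    return False

flow = b"x" * 29 + b"ATTACK" + b"x" * 29   # signature straddles offset 32
assert scan(flow, b"ATTACK", block=32, overlap=0) is False          # Figure 15
assert scan(flow, b"ATTACK", block=32,
            overlap=len(b"ATTACK") - 1) is True                     # Figure 16
```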
4.6 Network Intrusion Detection Systems
Network intrusion detection systems, usually known as IDS or NIDS, are network
components halfway between firewalls and monitoring systems; they are used to detect
network attacks, like firewalls, but they are not able to drop packets; in fact they work
like a monitoring system: they analyze the traffic present on a LAN, and warn the user
about malicious packets seen on the wire.
The most famous one is snort, a freeware program based on LibPcap; it is able to
detect a variety of network attacks by analyzing packets seen on the wire.
First of all, it is able to discover attacks through signatures; in practice, it is able to
reconstruct flows into blocks, and then perform pattern matching on them;
basically, it behaves as a stateful inspection firewall, deprived of the ability to drop
packets.
It is also able to simply filter packets on the basis of their headers; in this case it behaves
as a packet filtering system, or a monitoring system.
Moreover it is capable of signaling the user of malicious traffic in several ways:
• by logging a message to the system logger17;
• by sending a WinPopup message to a Windows station;
• by writing a message to a log file.
The main difference between a monitoring system and an IDS is that the former
gives the user statistical data relative to the network traffic, while the latter has to signal
the user of malicious packets seen on the wire.
As a consequence, the requirements of such systems are almost the same as the ones
needed by monitoring systems and firewalls:
• classify packets on the basis of some fields of the headers;
• perform a sort of flow reconstruction to do pattern matching;
• a logging system to give the user some messages relating to malicious traffic;
• a means to signal the user of malicious traffic seen on the network.
4.7 Network Address Translators
A network address translator, usually known as NAT, is a network component that is
used to masquerade the hosts of a LAN to the outside world, usually Internet, by
changing the source and destination addresses of the packets. It always works at level 3,
that is to say it deals with IP addresses. As a consequence it is not a method for
protecting a network from external attacks, like firewalls do; nevertheless, it acts in a
way that is similar to firewalls: it captures network packets from an interface, analyzes
(and modifies) them and then it sends them out of another network interface. Moreover,
it is not unusual to have commercial firewalls that implement this feature.
In practice, each machine of the LAN is assigned an IP address, which is usually
taken from the private ranges18, for example 192.168.1.5; when a packet originating
from this host has to go outside the LAN, the NAT replaces the source IP address with
another one, which has to be public.
The NAT can use two techniques to map the internal (private) addresses to the public ones:
• a 1-to-1 correspondence; in this case a different public IP address corresponds to
each private IP address;
17 Snort works under *nix operating systems, and it is therefore able to log messages in the syslog.
18 See RFC xxxx for more details on private IP addresses.
• a 1-to-n correspondence; all the private IP addresses correspond to a single public IP
address.
The first approach is quite simple to realize, since it is enough to keep a table of the
corresponding private and public addresses, like the one in the table below
Public IP Private IP
130.192.12.55 192.168.10.1
130.192.12.56 192.168.10.2
130.192.12.57 192.168.10.4
130.192.12.58 192.168.10.5
The second approach is a lot more complicated; this is because a packet arriving from
outside the LAN always has the same destination IP address, so there is no means to
know to which particular host of the LAN the packet is directed. To solve the problem,
the NAT also modifies the level 4 header, by changing the TCP source port of outgoing
packets, and the destination port of incoming packets; in practice it is the TCP port that
identifies the host originating the packet. It is clear that such an approach works only if
the transport layer is TCP (or UDP, with some limitations).
This second technique has a particular feature that can be very useful: it is impossible
to open a TCP connection from outside the LAN to the inside; this characteristic
implicitly guarantees that internal servers cannot be accessed, and consequently
attacked, from the outside.
It is quite simple to realize a NAT: the requirements are a lookup table to store the
private–public address pairs, and the ability to modify packets rather than only
analyzing them. The former constraint is the same that was needed by firewalls and
monitoring systems; the latter, indeed, is unique to NATs.
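The two mapping techniques can be sketched as follows. This is an illustrative sketch, not a real NAT implementation: the class name, the port range starting at 40000 and the table layout are our own assumptions.

```python
# Illustrative sketch of the two NAT mapping techniques described above:
# a 1-to-1 address table, and a 1-to-n table that tells internal hosts apart
# by the TCP source port rewritten on the way out.
one_to_one = {                       # private -> public, as in the table above
    "192.168.10.1": "130.192.12.55",
    "192.168.10.2": "130.192.12.56",
}

class Masquerade:
    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 40000       # hypothetical port range for rewriting
        self.by_port = {}            # public port -> (private ip, private port)

    def outgoing(self, src_ip, src_port):
        """Rewrite (private ip, port) -> (public ip, fresh public port)."""
        port = self.next_port
        self.next_port += 1
        self.by_port[port] = (src_ip, src_port)
        return self.public_ip, port

    def incoming(self, dst_port):
        """Map the public destination port back to the internal host."""
        return self.by_port.get(dst_port)

nat = Masquerade("130.192.12.55")
pub = nat.outgoing("192.168.1.5", 1025)
print(pub)                  # ('130.192.12.55', 40000)
print(nat.incoming(40000))  # ('192.168.1.5', 1025)
print(nat.incoming(50000))  # None: no connection was opened from the inside
```

The final lookup failing for an unknown port illustrates the implicit protection noted above: an unsolicited connection from the outside maps to no internal host.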
4.8 ACLs and routers
There are also other network components that make use of the ACLs, in a slightly
different way: routers. In this case the ACLs can be used to route packets out of a
particular interface on the basis of their source and destination addresses.
This is done for several reasons, one of them is Quality of Service: for example we
want to use a faster outbound interface for packets coming from a particular subnet.
5 The original BPF architecture and WinPcap
The Berkeley Packet Filter architecture was presented about ten years ago by Van
Jacobson and Steven McCanne of the Lawrence Berkeley Laboratory. The rationale of
the project was to realize an efficient system for user-level packet filtering under BSD
Unix, that is to say a system to acquire packets from the network and deliver them to
user-level applications.
The project pursued several requirements, among which:
• performance;
• protocol and data-link independence;
• high programmability.
The adopted philosophy is to minimize the number of packet copies by means of a
kernel agent, the kernel filter, which discards uninteresting packets without copying
them. This approach also minimizes the number of context switches19, since the number
of packets passed to user level is limited to those of interest.
5.1 Components of the BPF architecture
The so-called kernel filter is a system that analyzes each packet traversing the net,
deciding through a series of user-defined rules whether it must be copied to the user or
discarded; an example rule can be ‘accept IP packets’. It is realized through the so-
called network tap, which collects packets from the net, and a virtual register-based
processor, the BPF virtual machine, fed by the network tap, which executes user-defined
code, the BPF filter (the processed version of the user-defined rules), for each packet
seen on the wire; an example of a BPF filter is shown in Figure 17.
[Figure 17 – How the BPF compiler works: pcap_compile translates the filter string "ip and src host 10.10.10.10" into BPF code, which pcap_setfilter loads into the BPF virtual processor:
(000) ldh [12]
(001) jeq #0x800 jt 2 jf 5
(002) ld [26]
(003) jeq #0x0A0A0A0A jt 4 jf 5
(004) ret #1514
(005) ret #0]
19 Context switch is the process by which data traverse the user/kernel boundary, and it is performed whenever a packet must be transferred from the network to a user application.
The second component is a kernel buffer in which packets, accepted by the kernel
filter, are stored before being passed to the user.
The third component is a user library, exporting a set of APIs20, that hides from the
user the low-level details of the architecture. For example, this library is shipped
with a compiler that creates the code used by the BPF virtual machine from a human-
readable string describing the wanted packets (e.g. “ip”).
A graphical representation of such components is shown in Figure 18.
In the next paragraphs we will give a brief description of each of the components.
[Figure 18 - The original BPF architecture: packets from the network traverse the NIC driver and the network tap inside the BPF device driver, alongside the other protocol stacks; at kernel level, each capture instance has its own filter and its own pair of kernel buffers (a store buffer and a hold buffer); at user level, applications access the architecture either through calls to the libpcap library (usually included during compilation), each with its own user buffer, or through user code with direct access to the BPF device.]
5.1.1 Network tap and BPF virtual machine
The network tap is closely linked with the network cards (NICs), in particular with the
data-link layer; it collects all the packets from the net and feeds the BPF virtual
machine. In practice, when a packet is seen by the network card, the NIC triggers the
tap21, which in turn feeds the BPF machine.
The BPF virtual machine is the core of the whole filtering mechanism. It is
composed of two 32-bit registers, the accumulator A and the index register X, plus an
implicit program counter; it is equipped with a memory of 16 32-bit words. The packet
to be processed is seen by the machine as a ROM.
Each instruction is composed of four elements:
code: it contains the opcode of the instruction;
k: it contains a value passed to the instruction, for example in LOAD instructions it can
contain a constant or an address;
20 Application Programming Interface.
21 The NIC triggers the usual protocol stacks too, for example the TCP/IP stack.
jt: it contains the number of instructions to be skipped forward; it is used only in
conditional JUMP instructions; the program counter is incremented by this value
when the condition coupled with the instruction is true;
jf: it is almost the same as jt, but the program counter is incremented when the
condition is false.
jt and jf are positive values, so, since they are relative skips of the program
counter, backward jumps are not realizable.
The instruction set is composed of six classes of instructions:
• LOAD instructions
- from the packet: it loads a byte, a half-word22 or a word to register A, in direct
mode (i.e. k contains the offset to load from in the packet) or indirect mode
(i.e. register X contains the base address, and k the offset); a special LOAD
instruction is also available to load the HLEN23 field of the IP header into register X
with a single instruction;
- immediate: it loads k value directly to register A or X;
- immediate packet size: there is a special load instruction to put the packet length in
either register A or X;
- from memory: it loads a 32-bit word from memory to either register A or X, only in
direct mode.
• STORE instructions permit storing a value from register A or X to a location in
memory. They work only in direct mode.
• JUMP instructions: BPF makes available a non-conditional jump and a series of
conditional jumps; the conditions compare register A with either k or register X; as
stated before, it is not possible to perform backward jumps.
• Arithmetic instructions: it is possible to perform a series of arithmetic (“+” “-”
“×” “/”) and logical (“AND” “OR” “XOR” “NOT”) operations on the
accumulator; the other operand can be either k or register X, the result is always
stored in register A. The processor supports only integer values.
• RETURN instructions: it is possible to return either k or A. This class of instructions
interrupts the program flow immediately; the returned value has a special meaning:
0 means that the packet must be dropped;
-1 means that the entire packet must be passed to the user (by copying it to the
kernel buffer);
n>0 means that the first n bytes of the packet must be passed to the user.
22 A half-word is 16 bits wide.
23 Header Length. This instruction was added because otherwise calculating the header length required a lot of BPF instructions.
• Other instructions: two other instructions are defined, TXA and TAX, that transfer
the value of a register (A or X) to the other.
If the machine fails to execute an instruction, for example because a division by zero
is attempted, it implicitly executes a “RET 0” instruction; if no code is
provided to the virtual machine, it implicitly executes a “RET -1” instruction.
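The semantics described above can be sketched with a tiny interpreter running the filter of Figure 17 ("ip and src host 10.10.10.10"). This is an illustrative sketch, not BPF's real binary encoding: the kernel encodes each instruction as a struct with numeric code, jt, jf and k fields, while here we use string opcodes; jt and jf are forward offsets added to the program counter, as the text explains.

```python
# Illustrative sketch: a minimal interpreter for a subset of the BPF virtual
# machine (accumulator A, implicit pc; X, memory and the other instruction
# classes are omitted for brevity).
def run_filter(program, packet):
    A = 0
    pc = 0
    while True:
        op, k, jt, jf = program[pc]
        pc += 1
        if op == "ldh":                  # load a half-word from the packet
            A = int.from_bytes(packet[k:k + 2], "big")
        elif op == "ld":                 # load a word from the packet
            A = int.from_bytes(packet[k:k + 4], "big")
        elif op == "jeq":                # conditional forward jump
            pc += jt if A == k else jf
        elif op == "ret":                # end of program: 0 = drop
            return k

program = [                              # the filter of Figure 17;
    ("ldh", 12, 0, 0),                   # jt/jf here are relative skips
    ("jeq", 0x800, 0, 3),                # true -> instr 2, false -> instr 5
    ("ld", 26, 0, 0),
    ("jeq", 0x0A0A0A0A, 0, 1),           # true -> instr 4, false -> instr 5
    ("ret", 1514, 0, 0),
    ("ret", 0, 0, 0),
]

ip_packet = bytearray(34)
ip_packet[12:14] = (0x0800).to_bytes(2, "big")      # EtherType = IPv4
ip_packet[26:30] = (0x0A0A0A0A).to_bytes(4, "big")  # src host 10.10.10.10
print(run_filter(program, bytes(ip_packet)))        # 1514: packet accepted

other = bytearray(ip_packet)
other[26:30] = (0x0A0A0A0B).to_bytes(4, "big")      # different source host
print(run_filter(program, bytes(other)))            # 0: packet dropped
```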
5.1.2 Kernel packet buffer
This is the buffer in which packets are stored after being accepted by the virtual
machine, and before being passed to the user. This is one of the components that make
the whole system efficient: during network bursts, the application is usually put in a
wait state, since the tap has to process packets and runs at a higher priority, so
packets must be stored somewhere at kernel level.
The BSD realization of this buffer is based on two buffers, the store and the hold
buffer; packets accepted by the virtual machine are put in the store buffer, while user
applications receive the hold buffer; when the store buffer is full, and the hold one is
empty, they are exchanged, so that the application can receive new packets and the
virtual machine is able to store new accepted packets. The default size of each of the
two buffers is 4 kilobytes.
Buffer size is one of the most critical details of the whole BSD architecture:
• a bigger buffer generally guarantees fewer lost packets and fewer context switches;
nevertheless, applications will receive packets less promptly, since packets tend to
remain a long time in the store buffer;
• a smaller buffer solves the above-mentioned problem of prompt receiving, but
packet losses and context switches generally increase, since the buffer is able to store
a smaller number of packets at a time, and consequently it fills up earlier during
network bursts.
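The store/hold exchange can be sketched as follows. This is an illustrative sketch with hypothetical names, storing whole packets in Python lists rather than bytes in fixed buffers, and rotating on read as the BSD implementation does when its read timeout expires.

```python
# Illustrative sketch of the BSD double-buffer scheme: the tap fills the
# store buffer, the application drains the hold buffer, and the two are
# exchanged when the store buffer is full and the hold buffer is empty.
class DoubleBuffer:
    def __init__(self, size):
        self.size = size
        self.store = []      # filled by the kernel filter
        self.hold = []       # drained by the application

    def tap_deliver(self, packet):
        if len(self.store) == self.size:
            if self.hold:                # both buffers full: packet lost
                return False
            self.store, self.hold = self.hold, self.store   # exchange
        self.store.append(packet)
        return True

    def app_read(self):
        """Drain the hold buffer, rotating first if it is empty."""
        if not self.hold:
            self.store, self.hold = self.hold, self.store
        drained, self.hold = self.hold, []
        return drained

buf = DoubleBuffer(size=2)
for p in ("p1", "p2", "p3"):
    buf.tap_deliver(p)       # delivering p3 triggers the exchange
first = buf.app_read()
second = buf.app_read()
print(first, second)         # ['p1', 'p2'] ['p3']
```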
As described in a following paragraph, the Win32 porting of the architecture uses a
totally different realization approach of this buffer, to obtain better performances.
5.1.3 LIBPCAP library
This is the user-level component of the architecture; this ANSI-C library provides a
series of APIs to interact with the kernel part of BPF. It hides from the user details like
the system calls used to set and retrieve data to and from the kernel part of the architecture,
offering functions and data structures that are totally independent from the operating
system24 and from the various data-link layers the library is able to work on.
These APIs can be divided into three main groups:
• control APIs: there is a series of functions to open a capture instance, close it and set
various parameters;
• compiling APIs: the library is shipped with a special compiler that translates rules to
filter packets from a semi-human readable format to the code for the BPF machine.
Examples of these user rules are “ip”, “arp”, “tcp src port (21 or
80)”, “ether 1:2:3:4:5:6”; the generated code, the so called BPF filter, is
injected in the kernel through a control API;
• APIs to receive packets: the process of receiving packets is built around the so-
called dispatcher_handler. This function is a callback, that is to say a function
defined by the user, whose prototype is fixed by the library itself; it is called
whenever a packet accepted by the BPF virtual machine is transferred to user level.
The dispatcher_handler receives as parameters a buffer containing the packet (or a
fragment of it, whose size is defined by the user) and its timestamp. For example, a
dispatcher_handler can print out the whole packet on screen, or perform some
special processing. An application that wants to receive packets must define this
function and then call one of the two library functions devoted to packet receiving,
pcap_loop() and pcap_dispatch(), passing them a pointer to the defined callback. In
turn, the receive function continuously transfers packets from kernel to user level and
calls the callback.
A detail must be pointed out: the library allocates a user-level buffer to store packets
at user level; in practice, packets are first copied from the kernel buffer to the user
buffer, and then the dispatcher_handler works on the packets stored in the user-level
buffer.
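The callback pattern just described can be sketched as follows. This is not the real libpcap C API (pcap_loop() takes a pcap_t handle, a packet count, the callback and a user argument); it is a toy model of the same pattern, with hypothetical names, where each batch stands for one kernel-to-user buffer copy.

```python
# Illustrative sketch of the pcap_loop() callback pattern: packets arrive in
# batches (one kernel -> user copy each), and the user-defined
# dispatcher_handler is invoked once per packet with its timestamp and data.
def pcap_loop_sketch(batches, dispatcher_handler):
    """batches: iterable of lists of (timestamp, packet) tuples."""
    processed = 0
    for user_buffer in batches:                 # one READ per batch
        for timestamp, packet in user_buffer:
            dispatcher_handler(timestamp, packet)   # user callback
            processed += 1
    return processed

seen = []
n = pcap_loop_sketch(
    [[(0.001, b"pkt1"), (0.002, b"pkt2")], [(0.010, b"pkt3")]],
    lambda ts, pkt: seen.append(pkt),
)
print(n, seen)    # 3 [b'pkt1', b'pkt2', b'pkt3']
```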
5.2 Interaction between LIBPCAP and the BPF virtual machine
The typical method by which a kernel component communicates with a user
application is a device, on which it is possible to perform OPEN, CLOSE, READ,
WRITE and IOCTL system calls. The virtual machine, running at kernel level, exports a
device (/dev/bpf) on which it is possible to perform OPENs to create a capture instance,
READs to retrieve packets from the kernel buffer, and IOCTLs, to set and query some
parameters, for example to inject the BPF code in the kernel.
24 During the years the architecture was gradually ported to a variety of operating systems, but the APIs exported by LIBPCAP remained the same, so that all applications written upon this library could be used without modifications.
Therefore, LIBPCAP control APIs are wrapper functions to IOCTLs and OPEN
system calls.
The above-mentioned pcap_loop() and pcap_dispatch(), used to receive packets,
perform two tasks: at first, they call READ to transfer data from the kernel buffer to the
user buffer, then they call the dispatcher_handler as long as the user buffer contains
some packets.
5.3 Implementation under Microsoft Operating Systems: WinPcap
The porting of the BPF architecture and LIBPCAP to Microsoft 32-bit operating
systems was started about three years ago by the NetGroup at Politecnico di Torino,
under the project name WinPcap.
Microsoft Windows provides an architecture, called NDIS25, for developing new
network components that interact with existing ones, as shown in Figure 19. NDIS
gives the possibility to create three kinds of components (more precisely, drivers):
miniport, intermediate and protocol drivers.
Miniport drivers are those that communicate directly with the NICs, thus they depend
on the particular network card; a driver of this type is, for example, the one
for the 3COM 3C509 NICs.
[Figure 19 - The NDIS framework: packets enter from the network through the Network Interface Card; inside kernel space, NDIS stacks the NIC driver (miniport driver) at the NDIS lower edge, an optional intermediate driver at the NDIS intermediate interface, and the protocol driver at the NDIS upper edge, just below user space.]
Protocol drivers are those that implement a particular protocol stack, for example
the TCP/IP stack.
Intermediate drivers are a sort of mixture of the previous ones, and are used for
special purposes, for example to build personal firewalls26.
25 Network Driver Interface Specification.
26 Personal firewalls are agents that filter packets ingoing and outgoing from a host on the basis of several user-defined ACLs. They are used to protect PCs from attacks coming from the net.
WinPcap is composed of three components: a protocol driver (packet.sys) and two
user level libraries, implemented as DLLs27, packet.dll and wpcap.dll.
The protocol driver contains the tap, the BPF virtual machine and the kernel buffer;
the first DLL, packet.dll, is OS-dependent, and exports a series of OS-independent APIs
used by the other DLL, wpcap.dll. It is actually a wrapper around the various OS-specific
system calls used to communicate with the protocol driver.
wpcap.dll is the implementation of the LIBPCAP library, and it is OS-independent
since it relies on the APIs exported by packet.dll.
A graphical description of WinPcap is shown in Figure 20.
[Figure 20 - WinPcap architecture: packets flow from the network into the protocol driver containing the BPF machine, then up through Packet.dll and Wpcap.dll to the application.]
The architecture is similar to that of the TCP/IP stack; in fact, that stack is
implemented through two components: a protocol driver (tcpip.sys) and a user-level
DLL (winsock.dll) that communicates with the protocol driver and exports a series of
APIs, the so-called sockets, used by applications based on TCP/IP.
5.3.1 Differences from the original BPF architecture
WinPcap has some differences from the original architecture; however, all of them
are either implementation modifications, thus transparent to the user level APIs, or new
features.
The main implementation difference regards the kernel buffer: WinPcap does not
implement two kernel buffers, store and hold, but a circular buffer; the user application
(through the LIBPCAP APIs) retrieves data from the head of the buffer, and the BPF
virtual machine inserts packets at the tail of the buffer, as shown in Figure 21. The
default size of the buffer is 1 Megabyte.
27 Dynamic link libraries.
[Figure 21 – Interaction between the driver and the overlying DLLs: the network tap feeds the filter inside the packet capture driver, which inserts accepted packets into the kernel buffer; at user level, Packet.dll retrieves them through read/write/IOCTL system calls and hands them to Wpcap.dll and the application.]
A feature that is closely linked to this buffer is the delayed write: packets are
transferred to user level only when the kernel buffer contains enough data (usually
some kilobytes).
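The circular buffer with its delivery threshold can be sketched as follows. This is an illustrative sketch with hypothetical names and toy sizes; the real WinPcap buffer is 1 megabyte, holds variable-length packets with headers, and is filled and drained concurrently.

```python
# Illustrative sketch of WinPcap's circular kernel buffer with "delayed
# write": the tap inserts bytes at the tail, and a read delivers data to
# user level only once the buffer holds at least `threshold` bytes.
class CircularBuffer:
    def __init__(self, size, threshold):
        self.buf = bytearray(size)
        self.size = size
        self.threshold = threshold
        self.head = 0            # next byte the application will read
        self.tail = 0            # next free byte for the tap
        self.used = 0

    def insert(self, data: bytes):
        if self.used + len(data) > self.size:
            return False                       # buffer full: data lost
        for b in data:                         # wrap around at the end
            self.buf[self.tail] = b
            self.tail = (self.tail + 1) % self.size
        self.used += len(data)
        return True

    def read(self):
        if self.used < self.threshold:
            return b""                         # delayed: not enough data yet
        out = bytearray()
        while self.used:
            out.append(self.buf[self.head])
            self.head = (self.head + 1) % self.size
            self.used -= 1
        return bytes(out)

cb = CircularBuffer(size=16, threshold=8)
cb.insert(b"abcd")
empty = cb.read()
cb.insert(b"efghij")
data = cb.read()
print(empty, data)    # b'' b'abcdefghij'
```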
WinPcap offers a series of enhancements over the original BPF/LIBPCAP
architecture:
• the ability to send packets; it is also possible to instruct the driver to send multiple
copies of the same packet, for testing purposes;
• statistical mode; it is possible not to retrieve packets to user level, but rather to count
packets and bytes accepted by the BPF machine, and then deliver only these counters
to applications. This operating mode can be used to perform statistics at kernel level,
i.e. without copying packets to user level and then processing them, thus obtaining
better results28.
The last difference regards the APIs themselves: as stated before, WinPcap is formed by
two user-level DLLs, each one exporting a set of APIs. As a consequence, it is possible
to use the usual functions of the LIBPCAP library, exported by wpcap.dll, or a set of
lower-level APIs, exported by packet.dll. Both sets of APIs are independent
of the various Windows flavors: at the moment of writing they support Windows 95,
98, ME, NT4, 2000 and XP.
28 Packets are not copied in statistical mode (0-copy system).
6 The Table Management Engine (TME)
The goal of this chapter is to point out why the original BPF architecture is
insufficient to perform new tasks, like high performance monitoring, IDS29 or firewalls
and how it can be extended in order to offer these new functionalities, maintaining
backward compatibility with the original BPF system. Then this chapter will present a
detailed description of the proposed architecture.
6.1 Limits of the original BPF architecture
The BPF architecture has proved to be very efficient, and yet quite simple to
use (this is also due to the excellent compiler, which translates filters from ‘human’
language, e.g. tcp port 80, to BPF assembly code); however, it was created with
the main purpose of selective packet capture; although it is possible to use it for other
tasks, it has some limits, among which:
• a small and limited amount of memory in which to store intermediate results;
• the impossibility of modifying the packet;
• a simple instruction set, not tailored for complex tasks like packet classification.
In the following paragraphs we will explain why these limits represent a bottleneck
for the extended use of the architecture, and why an extension is needed.
6.1.1 Small and limited memory
Small RAM: the machine has 16 32-bit words of RAM, which is not enough for
any interesting task; this limit could be removed by increasing the number of words of
memory.
Limited RAM management: the machine has big limitations in the way memory is
managed, that restrict its use:
• it’s not possible to address it per byte (and consequently only 32 bit LOADs and
STOREs exist30);
• data is stored in host-byte-order, which is little-endian31 on INTEL processors;
unfortunately network-byte-order (the order in which bytes are transmitted on the
wire) is big-endian32; so, if we save parts of the original packet to RAM, it is always
necessary to swap bytes in order to obtain the original flow. The problem is
29 Intrusion Detection Systems.
30 However, it is possible to perform 8- and 16-bit loads from the packet, which is contained in an ad-hoc ROM.
31 Bytes are saved in consecutive memory locations from the least to the most significant byte.
32 It is the opposite of little-endian: the MSB comes first.
important because the user expects to receive data extracted from the packet in the
order in which bytes were stored in the packet.
Another issue is that, whenever a field extracted from a packet that is not 32-bit
aligned is saved to memory, bytes are stored in an unusual way. For example, this is
how the MAC address 01:02:03:04:05:06 is (usually) saved from the packet to
memory:
packet (network byte order): 01 02 03 04 05 06
RAM (INTEL host byte order): 04 03 02 01 06 05 00 00
(the first four bytes pass through accumulator A with a 32-bit LOAD and STORE,
the last two with a 16-bit LOAD and STORE)
• STORE instructions work only in immediate mode (i.e. it’s not possible to use a register, e.g. X, as base address);
• whenever the filter code is executed (i.e. for every packet received), this memory is recreated (i.e. physically reallocated), so it is stateless (its content is not maintained across filter runs).
These limits can be removed only by adding some more powerful memory.
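To make the byte-order problem concrete, here is a minimal C sketch (the helper names read_be32 and swap32 are ours, not part of the BPF code) showing how a 32-bit word is reassembled from network byte order and the byte swap needed to convert between the two orders:

```c
#include <stdint.h>

/* Reassemble a 32-bit value from 4 bytes as they appear on the wire
 * (network byte order: most significant byte first). */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Swap the four bytes of a word: converts between little-endian
 * host order and big-endian network order (in both directions). */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}
```

On a little-endian host, storing a word and reading its bytes back in address order yields exactly the reversed sequence that swap32 undoes.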
6.1.2 Impossibility to modify the packet
There are some tasks that may need to modify the packet: for example, a NAT needs to rewrite the source or destination IP address; the original instruction set does not allow such operations.
6.1.3 Simple instruction set
Actually all the limitations of the architecture for complex tasks is its simplicity: the
instruction set was thought for ‘load and compare’ operations (the basic modus-
operandi of a selective packet capture system), but it fails when one wants to use it for
different tasks33. 33 However, this simplicity of the instruction set favors the code-check.
Here are just some limitations of the instruction set:
• it’s not possible to create loops, since backward JUMP instructions are not available;
• there is no flag register, so it is not possible, for instance, to add two 64-bit integers using the 32-bit registers;
• complex operations usually require more than two registers (A and X);
• there is no stack (no PUSH and POP instructions).
As a consequence, it is very difficult to handle, in a simple way, the tasks required for an extended use of the BPF architecture; for example, one of the requirements highlighted in the previous chapters is the ability to manage lookup tables. In theory, it is possible to realize some types of tables in a direct manner, that is to say, implementing the lookup as a series of predetermined compare operations, as shown in this pseudo-code:
if (x == a0) increment counter0
else if (x == a1) increment counter1
else if (x == a2) increment counter2
else ...
where x is the value to be searched in the table, and the ai are the various values contained in the table.
Such an approach can be used only in some cases, for example with order-dependent ACLs, but cannot be applied in others, for instance if we have to deal with the open matrices described in chapter 3 - Monitoring networks.
The problem is that without backward jumps, or special instructions for this particular
purpose, there is no way to manage tables.
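The ‘series of predetermined compare operations’ above can be sketched in C as follows (table values and counter layout are illustrative only):

```c
#include <stdint.h>

enum { TABLE_SIZE = 3 };
static const uint32_t table_keys[TABLE_SIZE] = { 80, 23, 21 }; /* a0, a1, a2 */
static uint32_t counters[TABLE_SIZE];

/* Unrolled series of compares: each value in the table becomes one
 * fixed branch, exactly as in the pseudo-code above; since the
 * branches are fixed in advance, no loop (and hence no backward
 * jump) is needed. */
static void classify(uint32_t x)
{
    if (x == table_keys[0])      counters[0]++;
    else if (x == table_keys[1]) counters[1]++;
    else if (x == table_keys[2]) counters[2]++;
    /* ... one branch per table value ... */
}
```

The drawback is evident: the table contents are frozen into the code, which is exactly why this approach cannot cover dynamic structures such as the open matrices.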
6.2 Extensions to the original BPF architecture
As a consequence of the above-mentioned limitations of the original BPF architecture, we need to extend the system in some manner.
The proposed technique is to ship the original BPF processor with some extensions, that is to say, new parts that make it possible to perform the tasks dealt with in the previous chapters, realizing the so-called ‘architecture for unified packet filtering’.
In the following paragraphs we will first highlight the requirements of such extensions, and then present the proposed architecture.
6.3 Extension requirements
Since it must be used for various different tasks, the extension of the BPF machine needs to satisfy, and preserve, some requirements, which are basically:
• some stateful memory;
• the ability to manage search tables;
• high programmability.
Other requirements, which make the architecture practical to use, are:
• ease of use;
• extensibility by third parties.
Stateful memory: we need to add some more powerful memory, shared between the original machine and the new extensions.
This memory will be used to store lookup tables and to exchange data between the two components (BPF machine and extension).
Search tables: since almost all packet filtering tasks require classifying packets, it is very useful to have a lookup table coupled with a fast search algorithm.
Here are just some examples of how this structure can be used efficiently:
• multi-dimensional statistics: packets must be counted on the basis of some header fields (e.g. the IP source address, or the TCP port). If there are many different categories (e.g. each host on the network is a different entry), the table becomes large, so the search has to be implemented efficiently, for example using a hash;
• firewalls: basically, these systems determine whether a packet can go forward or has to be dropped by looking at the headers, through ACLs. Moreover, a good firewall must be able to execute a different operation for each different packet (e.g. the user must be warned that a particular type of packet was seen, or another type of packet must be analyzed deeply in order to modify the table itself), so it is very useful to couple each entry (which corresponds to an ACE) with some code that is executed whenever a packet ‘satisfies’ that entry.
Highly programmable: one of the goals of this extension is that it must be open, in the sense that it must be possible to program the machine to perform the widest variety of tasks. As a consequence, it must not be a special-purpose system, but rather an open architecture, a framework.
Ease of use: although it has to be highly programmable, it must still be simple to use, and also safe: if a mistake is made in using it, the extension must either signal the error or, at worst, behave incorrectly, but the machine (including the workstation running the system) must never hang.
Extensibility by third parties: last but not least, third parties must have the chance to extend the architecture with new features without modifying the original framework, i.e. they must be able to add new functions to process and/or classify data without modifying the basic structure of the architecture.
6.3.1 Proposed extensions
The proposed architecture of the extensions is shown in Figure 22; it is composed of
two parts:
• the extended_memory and
• the TME (Table Management Engine).
In the following paragraphs we will deal with the two blocks, describing how they are realized, how they work, and how they interact with each other and with the original BPF engine.
[Diagram: the BPF core (registers, basic memory, control logic) processes the packet; beside it sit the extended_memory and the TME co-processor, composed of the TME core and its registers.]

Figure 22 - The BPF machine with the new extensions
6.4 Extended memory
This new block consists of an 8-bit-word addressable memory with 32-bit-wide addresses; it is stateful (i.e. its content is maintained throughout the life of the processor), and it is directly accessible by all the cores (BPF core and TME core).
We added some new instructions to the original set of the BPF machine: basically, 8-, 16- and 32-bit LOADs and STOREs, in immediate mode (to/from registers A and X) and in relative mode (using the X register as base address and A as destination/source). All the data in this memory is saved in network byte order (big-endian), to preserve the order in which bytes are represented on the wire. When 8- and 16-bit LOADs are used, data is saved in the least significant part of register A (or X); when 8- or 16-bit STOREs are involved, only the one or two least significant bytes of A (or X) are saved to memory.
A more detailed description of the new instructions can be found in 12 - TME
Extension Programming Manual.
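The semantics just described can be sketched in C as follows (names and layout are ours; the real implementation differs): a 16-bit STORE writes only the two least significant bytes of the register, in big-endian order, and a 16-bit LOAD fills only the least significant part of the register.

```c
#include <stdint.h>

static uint8_t ext_mem[65536];  /* 64 KB, byte addressable */

/* STORE of the 16 least significant bits of register A, big-endian:
 * the more significant byte of the half-word goes to the lower address. */
static void st16(uint32_t addr, uint32_t a)
{
    ext_mem[addr]     = (uint8_t)(a >> 8);
    ext_mem[addr + 1] = (uint8_t)a;
}

/* LOAD of a 16-bit half-word into the least significant part of the
 * register; the upper 16 bits are cleared. */
static uint32_t ld16(uint32_t addr)
{
    return ((uint32_t)ext_mem[addr] << 8) | ext_mem[addr + 1];
}
```

Keeping the memory in network byte order means a field copied from the packet can be handed to the user without any further swapping.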
6.5 TME
This is the main part of the extension; it’s implemented as a co-processor that
communicates with the BPF core, the main processor.
It’s composed of a core, some general registers, and some ‘blocks’ of special registers,
as shown in Figure 23.
[Diagram: the TME core, its global registers (validated_blocks, active, working, active_read) and the data blocks data_block0 ... data_blockN.]

Figure 23 – The TME co-processor
The main purpose of the TME is the handling of lookup tables, with a special key feature: each entry in the table is associated with a different action (intended as a processing task), executed after the search (the LOOKUP) is performed; this action is the so-called exec_fcn. Moreover, each entry is coupled with some memory (usually 64 bytes) that can be used to store data useful to the processing.
Let us suppose we want to acquire some statistical data on particular classes of traffic, for example Web, Telnet, FTP and Mail traffic. To achieve this, we create a table in which each entry corresponds to a particular TCP port.
For every packet received, we first check whether it is TCP, then we search the table for the entry corresponding to the TCP port of the packet34 (the TCP port is the key to enter the table); when the LOOKUP returns, we run the code associated with that entry by calling the EXECUTE35 instruction: if we want to count packets, it will be something like “increment counters”, and these counters will be stored in the block of memory associated with that entry.
Web (80)     ->  packets=134, bytes=296485
Mail (25)    ->  packets=93,  bytes=59642
Telnet (23)  ->  packets=24,  bytes=19215
FTP (21)     ->  packets=16,  bytes=4898

Figure 24 – A simple lookup table
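The per-port example can be simulated in plain C as follows; this is only an illustration of the LOOKUP/EXECUTE pair (the table here is a linear array, whereas the TME can use a hash, and all names are ours):

```c
#include <stdint.h>

struct entry { uint16_t port; uint32_t packets; uint64_t bytes; };

/* One entry per traffic class, keyed by TCP port. */
static struct entry table[] = {
    { 80, 0, 0 }, { 25, 0, 0 }, { 23, 0, 0 }, { 21, 0, 0 },
};
enum { N_ENTRIES = sizeof table / sizeof table[0] };

/* LOOKUP: search for the entry whose key matches the port; returns
 * the entry index, or -1 on a miss (the role of LAST_FOUND = 0xFFFFFFFF). */
static int lookup(uint16_t port)
{
    for (int i = 0; i < N_ENTRIES; i++)
        if (table[i].port == port)
            return i;
    return -1;
}

/* EXECUTE: the per-entry action; here it is "increment counters". */
static void execute(int idx, uint32_t pkt_len)
{
    if (idx < 0) return;          /* unsuccessful LOOKUP: do nothing */
    table[idx].packets++;
    table[idx].bytes += pkt_len;
}
```

Running `execute(lookup(port), len)` for every packet fills the counters exactly as in Figure 24.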
The advantage of the architecture is that the code associated with each entry can perform, in theory, any sort of processing, even modifying the original packet (e.g. a NAT36 needs to modify some fields, like the IP source address) or signaling to the user that a special packet (or a sequence of packets) was received. Moreover, EXECUTEs, and even LOOKUPs, can be chosen from a set of predefined ones, or they can be defined by the user, since they are ANSI-C callback functions. This gives great flexibility to the TME extension.
As seen, the process is always split into two parts, LOOKUP and EXECUTE; although they could have been merged, the approach we followed has some benefits:
• the algorithm used to perform LOOKUPs is independent of the EXECUTEs, so it is possible to change it (for example, to increase performance) without modifying the latter;
• usually a LOOKUP is quite general and can be used with a variety of EXECUTEs, and this approach avoids code duplication;
34 Actually, the packet contains two ports, source and destination; we suppose we choose the one whose number is below 1024 (i.e. a reserved port).
35 This instruction executes the exec_fcn.
36 Network Address Translator.
• there are some cases in which a LOOKUP is enough, particularly when we only want to know whether an entry is present in the table or not (e.g. a bridge only needs to check whether a MAC address is present in the lookup table).
6.5.1 A general overview of the LOOKUP operation
Figure 25 gives a graphical explanation of what happens when a BPF program calls the LOOKUP instruction.
[Diagram: 1. the BPF core copies the key from the packet to the extended memory; 2. LOOKUP call; 3. the TME core searches the lookup table; 4. the TME core saves the address of the found entry in a register of the TME; 5. the TME core gives control back to the BPF core. In the table, ‘right’ marks an entry whose key matches, ‘wrong’ an entry whose key does not match, and VOID a void entry.]

Figure 25 - How a LOOKUP is performed
Here is a brief description of each step:
1. prior to the call to LOOKUP, the BPF core has to generate the key to enter the lookup table, so it parses the packet (i.e. it recognizes the type of the packet) and copies some header fields from it to the extended memory;
2. the BPF core executes LOOKUP: this triggers the TME core, which is given control;
3. the TME core, in turn, performs the search, and
4. stores the pointer to the matching entry in a register of the TME (as discussed later, this is recommended, but not compulsory);
5. the TME core gives control back to the BPF core.
6.5.2 A general overview of the EXECUTE instruction
Figure 26 illustrates what happens when an EXECUTE instruction is performed.
[Diagram: 1. EXECUTE call; 2. the TME core reads the LUT entry pointed to by a register of the TME; 3. it executes the exec_fcn on the data present in that entry (in the example, Packets goes from 175 to 176 and Bytes from 55434 to 55758); 4. the TME core gives control back to the BPF core.]

Figure 26 - How the EXECUTE instruction works
1. The BPF core executes the EXECUTE instruction, giving control to the TME core;
2. the TME core, in turn, reads a register of the TME that contains a pointer to an entry in the lookup table; this register is the one used by the LOOKUP instruction;
3. the TME core executes the code associated with the EXECUTE (the exec_fcn); this function works on the data of that entry; an ID of the function to be executed is stored in the entry itself;
4. when the TME core has finished executing the code associated with the EXECUTE instruction, it gives control back to the BPF core.
6.5.3 Lookup tables
In the preceding paragraphs we gave a brief overview of the whole architecture, pointing out what LOOKUP and EXECUTE are and, in short, how they work.
So far we have assumed that the data associated with each entry is stored in the entry itself; the proposed architecture is slightly different, because the lookup table is actually split into two parts: the lut_segment38 and the shared_memory_segment39. The lut_segment is divided into entries (which can be considered the true entries of the lookup table); the shared_memory_segment is divided into blocks.
38 LUT is an abbreviation for lookup table.
39 Its name derives from the fact that its content is shared between the TME and user applications.
[Diagram: each entry of the lut_segment holds an EXECUTE ID and a block offset; the offset points to a block of the shared_memory_segment, which holds the key of the entry and its data.]

Figure 27 - Relationship between an entry and a block
The lut_segment contains the entries of the lookup table, but each entry contains only a pointer to a block in the shared_memory_segment and the ID of the EXECUTE associated with that entry.
A shared_memory_segment block contains the key of the entry that points to it, a timestamp and the data associated with that entry (in the example above, it contains the counters bytes and packets).
User applications are interested in receiving the data contained in the blocks, so they will receive the shared_memory_segment.
We followed this approach for a particular reason: an efficient lookup table requires a good search algorithm, and an excellent choice is to manage the table as an open-addressed hash; this technique proves to be very fast if the table is sparsely filled (i.e. if the number of free entries is at least half the size of the table). Having a single table (like the one described in the example) has two drawbacks:
1. in a hash, the filled entries are not consecutive, that is to say, they are interleaved with empty entries. Passing the table to the user involves copying the entries into a user-level buffer, and there are two possible approaches: copy all the entries of the table, disregarding their fill state (i.e. whether they are filled or not), or check each entry before copying it to the user buffer. Neither approach performs well: the former copies useless entries, which are usually more than 50% of the table; the latter requires a fill-state check for each entry, which is quite CPU consuming;
2. as stated before, good performance is achieved if the table is sparsely filled; if we have only one table, the waste of memory is not negligible. For example, let us suppose we have a table of about 32,000 entries, each entry occupying 64 bytes; if we keep the table at least half free, the table occupies about 2 megabytes of memory, but the amount of memory ‘used’ by free entries is at least 1 megabyte. Moreover, the problem gets more serious if several tables are managed at the same time.
Splitting the lookup table in two solves the problem in this manner: if we create fewer blocks (in the shared_memory_segment) than entries (in the lut_segment), we implicitly guarantee that the lookup table is never fully occupied, since the number of occupied entries is always less than or equal to the number of blocks. Two consequences follow from this approach: we obtain better performance with hashes, since the table is never totally occupied, and we use less memory; for example, a table of about 32,000 entries, but with only 16,000 blocks of 64 bytes, occupies about 1 megabyte, which is half the memory occupied by the same entries in a single table. Moreover, if we store the filled blocks at the beginning of the shared_memory_segment, passing data to the user is much more straightforward, since we only have to copy consecutive bytes.
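The memory figures above can be checked with a few lines of C, assuming (the thesis does not fix this size here) that a lut_segment entry occupies 8 bytes, a 32-bit block offset plus a 32-bit exec_fcn ID:

```c
#include <stddef.h>

/* Assumed sizes: an 8-byte lut_segment entry (block offset plus
 * exec_fcn ID) and a 64-byte shared_memory_segment block. */
enum { ENTRY_SIZE = 8, BLOCK_SIZE = 64 };

/* Memory for a single monolithic table of `entries` 64-byte entries. */
static size_t single_table_bytes(size_t entries)
{
    return entries * BLOCK_SIZE;
}

/* Memory for the split layout: small entries plus fewer data blocks. */
static size_t split_table_bytes(size_t entries, size_t blocks)
{
    return entries * ENTRY_SIZE + blocks * BLOCK_SIZE;
}
```

With 32,000 entries and 16,000 blocks, the split layout takes 1,280,000 bytes against 2,048,000 for the monolithic table, in line with the estimate above.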
Since EXECUTEs may need to store some specific, yet general-purpose, data, there is also another segment, the extra_segment, which is totally general.
Another key feature of the architecture is that the segments used to store the table (lut_segment, shared_memory_segment and extra_segment) are located in the extended memory, thus permitting the BPF core to modify data in the table (although this behaviour is strongly discouraged).
6.5.4 LOOKUP functions
The LOOKUP instruction is a front-end to a callback function, generally called lookup_fcn, which is, as discussed above, either written by the user or chosen from a series of standard ones. The callbacks are called by the architecture itself when a LOOKUP instruction is processed. Several lookup_fcns are needed, since each of them can serve a different purpose: for example, one to deal with masks, one to deal with buckets, and a general one, useful for common lookups.
These lookup_fcns receive some information on the table they are working on, for example the key length and the table size (they receive the currently active TME data block, described in a following paragraph), and a pointer to the location in the extended memory at which the key (to be searched in the table) is stored. Only this last parameter is passed to the function by the LOOKUP (as a parameter of the LOOKUP instruction call); the others are passed by the TME architecture itself.
Usually, a lookup_fcn stores a pointer to the found entry in the LAST_FOUND register of the TME, or 0xFFFFFFFF if the search was not successful; this is the recommended modus operandi of these callbacks, but it is not compulsory (i.e. one can create a lookup_fcn that does not use the LAST_FOUND register, bearing in mind that the EXECUTE instruction cannot be used afterwards, since the latter always refers to this register).
6.5.5 EXECUTE functions
The EXECUTE instruction is a front-end to a callback function, generally called exec_fcn, which is, as discussed above, either written by the user or chosen from a set of standard ones, and called by the architecture itself when an EXECUTE instruction is processed. These exec_fcns receive some information on the table they are working on (the same information received by lookup_fcns), a pointer to the block in the shared_memory_segment to deal with (the block associated with the entry found by the last call to LOOKUP), some information on the packet the machine is currently processing, and a pointer to a location in the extended memory, used to pass additional information to the function. While this last parameter is passed through the parameters of the front-end EXECUTE (i.e. the EXECUTE instruction takes as parameter a pointer to a location in the extended memory), the other parameters are passed to the exec_fcn by the TME architecture.
The interaction between LOOKUPs and EXECUTEs is achieved via the LAST_FOUND register, which contains the pointer (in the extended memory) to the entry that matched the key during the last call to LOOKUP41. In practice, the lookup_fcn stores the pointer to the found entry in this register, and the exec_fcn retrieves it to know which block it has to deal with.
When the EXECUTE instruction is processed, the TME reads the LAST_FOUND register and runs, on the block (in the shared_memory_segment) pointed to by the entry at that address, the exec_fcn whose ID is stored in that entry42.
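The hand-off through LAST_FOUND can be sketched in C as follows (signatures are simplified and all names except LAST_FOUND are ours; real exec_fcns receive more parameters, as described above):

```c
#include <stdint.h>

#define TME_NOT_FOUND 0xFFFFFFFFu

typedef void (*exec_fcn)(uint8_t *block);

/* Register shared between the two front-ends: the lookup side writes
 * it, the execute side reads it. Here it holds a block index rather
 * than a pointer, for simplicity. */
static uint32_t last_found = TME_NOT_FOUND;

/* A trivial "function mapper": given an ID, return the callback. */
static void count_packets(uint8_t *block) { block[0]++; }
static exec_fcn map_exec(uint32_t id)
{
    return id == 0 ? count_packets : 0;
}

/* The lookup_fcn stores the index of the found entry in LAST_FOUND... */
static void do_lookup(uint32_t found_index)
{
    last_found = found_index;
}

/* ...and EXECUTE retrieves it to know which block to work on. */
static void do_execute(uint32_t exec_id, uint8_t *blocks, uint32_t blk_size)
{
    if (last_found == TME_NOT_FOUND)
        return;                      /* the OUT_LUT_EXEC case is omitted */
    exec_fcn f = map_exec(exec_id);
    if (f)
        f(blocks + last_found * blk_size);
}
```

This also shows why decoupling pays off: a new exec_fcn only needs an entry in the mapper, without touching the lookup side.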
6.5.6 TME data blocks
In the above paragraphs we dealt with the working principle of the architecture, limiting ourselves to one lookup table. Actually, the architecture is able to deal with several tables or, to be more precise, several TME data blocks. Each TME data block contains one entity of the type described previously, and blocks are totally independent (i.e. they do not share anything), as shown in Figure 28. Each block can be activated and used independently of the others; since only one block can be used at a time, there is a special instruction to switch among a pool of usable ones (the validated blocks).
41 If the search was not successful, this register contains a special value, 0xFFFFFFFF.
42 If the last search was not successful (i.e. the LAST_FOUND register contains 0xFFFFFFFF), the ID of the exec_fcn to be executed is stored in another register, OUT_LUT_EXEC, and this function always works on the first block in the shared_memory_segment.
[Diagram: several TME data blocks, each composed of a LUT segment, a shared segment and an extra segment located in the extended_memory; the TME core holds the global registers validated_blocks, active, working and active_read, plus the per-block register sets data_block0 ... data_blockN.]

Figure 28 - TME data blocks
Some global TME registers are used to manage the blocks:
• active: it contains the index of the currently active TME data block (i.e. the block used when a LOOKUP or an EXECUTE is performed);
• active_read: it contains the index of the TME data block to be passed to the user; as stated before, only the shared_memory_segment is passed to the user;
• working: it contains the index of the TME data block on which the initialization instructions work, before the block is validated;
• validated_blocks: it is a flag register; if bit n is set, then TME data block n is validated, and can be used to perform LOOKUPs and EXECUTEs.
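The validated_blocks flag register lends itself to the usual bit-mask manipulation; here is a sketch (function names are ours) of how validation could gate a SET_ACTIVE-style operation:

```c
#include <stdint.h>

static uint32_t validated_blocks; /* bit n set => TME data block n validated */

static void validate_block(unsigned n) { validated_blocks |= 1u << n; }
static int  block_is_valid(unsigned n) { return (validated_blocks >> n) & 1u; }

/* Switching the active block should succeed only on a validated one. */
static int set_active(unsigned n, uint32_t *active)
{
    if (!block_is_valid(n))
        return -1;   /* refuse non-validated blocks */
    *active = n;
    return 0;
}
```

Using a single 32-bit flag register keeps the validity test to one shift and one mask per instruction.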
Each TME data block is associated with a collection of registers containing data specific to that block, for example the number of entries or blocks in the tables coupled with it.
[Diagram: the registers of a data_blockX, listed and described below.]

Figure 29 - TME data block registers
Figure 29 shows a detailed view of all these registers; here is a description of each one:
• LUT_ENTRIES: it contains the total number of entries in the lookup table (which is stored in the lut_segment); it can be set only during the initialization phase of the TME;
• MAX_FILL_STATE: it is an upper bound on the fill rate of the lookup table; for example, if this value is set to 1500, the number of filled entries must remain under 1500. It could be used by the LOOKUP instruction, but at the moment it is not;
• REHASHING_VALUE: it contains a value that can be used by the LOOKUP instruction in the search algorithm; if the lookup table is managed as a hash, this value is used when a collision happens;
• KEY_LEN: it contains the length of the key used to identify the entries in the lookup table; it is expressed in 32-bit words43 (e.g. KEY_LEN=3 means 3*4=12 bytes);
• SHARED_MEMORY_BLOCKS: it contains the total number of blocks in the shared_memory_segment; the first block is always reserved, and contains data relative to unsuccessful LOOKUPs;
• FILLED_ENTRIES: it contains the number of filled entries in the lookup table;
43 The key length is expressed in 32-bit words to allow 32-bit comparisons during access to the table. This is an optimization for 32-bit processors, which can compare words up to 32 bits long with a single assembly instruction.
• BLOCK_SIZE: it contains the size of each block in the shared_memory_segment; it takes into account the memory occupied by the key, too;
• EXTRA_SEGMENT_SIZE: it contains the size, in bytes, of the extra_segment;
• FILLED_BLOCKS: it contains the number of occupied blocks in the shared_memory_segment; it is initialized to 1, since the first block, as stated before, is reserved;
• LOOKUP_CODE: it contains the ID of the lookup function (the so-called lookup_fcn) called by the front-end instruction LOOKUP in that TME data block;
• DEFAULT_EXEC: it contains the ID of the exec_fcn that is coupled to an entry when the entry is created by a LOOKUP;
• OUT_LUT_EXEC: it contains the ID of the exec_fcn that is executed when an EXECUTE is processed and the LAST_FOUND register contains the special value 0xFFFFFFFF (i.e. when the last LOOKUP was not successful);
• LUT_BASE_ADDRESS: it contains the base address of the lut_segment; it is a read-only register;
• SHARED_MEMORY_BASE_ADDRESS: it contains the base address of the shared_memory_segment; it is a read-only register;
• EXTRA_SEGMENT_BASE_ADDRESS: it contains the base address of the extra_segment; it is a read-only register;
• LAST_FOUND: it contains a pointer to an entry in the lookup table, or 0xFFFFFFFF if the last LOOKUP was not successful.
6.5.7 What happens when a LOOKUP is performed
Figure 30 gives a detailed explanation of what sits behind the LOOKUP front-end.
[Diagram: 1. the BPF core copies the key from the packet to the extended memory; 2. LOOKUP call; 3. the TME core searches the LUT segment of the TME data block; 4. it saves the address of the found entry in LAST_FOUND; 5. it gives control back to the BPF core with return value TME_TRUE (entry found). In the table, ‘right’ marks an entry whose key matches, ‘wrong’ an entry whose key does not match, and VOID a void entry.]

Figure 30 - How a LOOKUP is performed
Here is a brief description of each step:
1. prior to the call to LOOKUP, the BPF core has to generate the key to enter the lookup table, so it parses the packet (i.e. it recognizes the type of the packet) and copies some header fields from it to the extended memory, in a location of its choosing (i.e. the key need not be stored in any particular memory location, since a parameter of the LOOKUP call contains the base address of the key); the starting location is decided when the running filter code is generated (either by hand or by a compiler);
2. the BPF core executes LOOKUP: this triggers the TME core, which gains control;
3. the TME core, in turn, executes the lookup_fcn whose ID44 is stored in the LOOKUP_CODE register; this lookup_fcn performs the search, and
4. stores45 the pointer to the matching entry in the LAST_FOUND register, and returns a status code, which can be TME_TRUE, if the search was successful, TME_FALSE, if the search was unsuccessful, or TME_ERROR, if an error occurred;
5. the TME core gives control back to the BPF core, together with the return value of the lookup_fcn; since the LOOKUP instruction is implemented as a conditional JUMP, the BPF core jumps to a program instruction depending on this value.
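Since LOOKUP behaves as a conditional JUMP, the filter effectively performs a multi-way branch on the status code; a C sketch of this dispatch (target addresses and the function name are illustrative):

```c
enum tme_status { TME_FALSE = 0, TME_TRUE = 1, TME_ERROR = 2 };

/* Pick the next "program address" depending on the lookup outcome,
 * the way the BPF core picks the jump target after a LOOKUP. */
static int next_instruction(enum tme_status s, int on_hit, int on_miss,
                            int on_error)
{
    switch (s) {
    case TME_TRUE:  return on_hit;    /* entry found: e.g. go EXECUTE  */
    case TME_FALSE: return on_miss;   /* not found: e.g. add the entry */
    default:        return on_error;  /* abort the filter run          */
    }
}
```

Keeping the branch in the BPF program, rather than inside the callback, lets the same lookup_fcn serve filters that react differently to a miss.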
6.5.8 What happens when an EXECUTE is performed
Figure 31 gives a graphical explanation of what sits behind the EXECUTE front-end.
44 The conversion between IDs and functions is realized through a “function mapper”, that is to say, a function that, given an ID, returns the corresponding function pointer.
45 As stated before, this is recommended, but not compulsory.
[Diagram: 1. EXECUTE call; 2. the TME core reads the LUT entry at the address pointed to by LAST_FOUND, obtaining an exec_fcn ID and a block address; 3. it executes the exec_fcn with that ID on the data at that block address (in the example, Packets goes from 175 to 176 and Bytes from 55434 to 55758); 4. it gives control back to the BPF core.]

Figure 31 - How the EXECUTE instruction works
1. The BPF core executes the EXECUTE instruction, giving control to the TME core;
2. the TME core, in turn, reads the LAST_FOUND register, which contains a pointer to an entry in the lookup table;
3. the TME core executes, on the block in the shared_memory_segment whose pointer is stored in that entry, the exec_fcn whose ID46 is stored in the same entry; in the example in Figure 31, the exec_fcn increments two counters representing the packets and bytes associated with that entry;
4. when the exec_fcn completes, the TME core gives control back to the BPF core.
6.6 Interaction of the extensions with the original BPF architecture
The interaction between the original BPF core and the extensions (extended memory and TME) is achieved through some new instructions added to the original instruction set. Moreover, since the TME architecture needs some initialization before it can be used, a new phase, initialization, was added; this phase is performed when the filter is injected into the BPF processor, but before the BPF machine is put into the state of receiving packets.
46 Again a function mapper is used for the conversion between IDs and functions.
6.6.1 Interaction with the extended memory
The new instructions that deal with this extension permit resizing47 this memory (which is 64 kilobytes by default), plus LOADs and STOREs.
In particular:
• LOAD instructions can transfer data from the extended memory to registers A and X of the BPF core, in chunks of 8, 16 and 32 bits (usually referred to as bytes, half-words and words), in immediate mode (the instruction contains the absolute address from which to load) and in relative mode (the instruction contains an offset to the location from which to load, and X contains the base address);
• STORE instructions can transfer data from registers A and X of the BPF core to the extended memory, in chunks of bytes, half-words and words, in immediate and relative mode.
6.6.2 Interaction with the TME
The interaction with the TME is quite complex, mainly because the architecture is very powerful. The instructions that were added can be divided into two main groups: operative and initialization instructions.
Operative instructions: there are three of them, LOOKUP, EXECUTE and SET_ACTIVE. LOOKUP and EXECUTE are the front-ends described in the previous paragraphs; SET_ACTIVE is used to switch between validated TME data blocks.
Initialization instructions: these instructions basically allow the BPF core to initialize a TME data block with default values, to get and set the values of the registers of a block, and to validate a block. This last operation is the most important: it checks whether the registers in a block are coherent, initializes other registers (e.g. the XXX_BASE_ADDRESS registers), and finally marks the block as validated (by setting the appropriate bit in the global register VALIDATED_BLOCKS). A more detailed description of the initialization phase can be found in paragraph 6.8, while details on the instructions can be found in 12 - TME Extension Programming Manual.
6.7 Interaction with applications
The interaction with applications can be divided into two parts: users need to instruct the BPF core (the so-called filter injection) and to receive data from it.
Filter injection is almost the same as in the original BPF architecture, with only one difference: the initialization code must be injected into the system, too.
Data, instead, is passed to the user in a ‘raw’ mode: the application receives a copy of the blocks in the shared_memory_segment without any further formatting (a 1-to-1 copy is performed). As stated before, it is possible to handle several TME data blocks
47 This operation is allowed only during the initialization phase.
at the same time; however, the application must receive the data corresponding to only
one TME data block; the TME data block that must be passed to the user is set in the
global TME register ACTIVE_READ. Therefore, during the initialization phase it is
necessary to issue a special instruction, SET_ACTIVE_READ, to instruct the TME core
on which block has to be passed to the user; if this instruction is not issued, the TME
core passes block #0 to the user.
Furthermore, the TME copies only those blocks in the shared_memory_segment
that contain valid data; in particular, it copies the first n blocks, where n is the value
contained in the FILLED_BLOCKS register of the ACTIVE_READ TME data block.
Since, as pointed out before, the data passed to the application highly depends on the
filter injected in the filtering machine, the only way to interpret it correctly is through
knowledge of the filter itself. The usual process to parse the information received is
shown in Figure 32:
• the application receives a buffer of raw bytes from the TME,
• since it knows the dimension of the blocks, it can divide the buffer into blocks,
• then, through the knowledge of the key length, it can interpret the key associated
with each block,
• and finally the data associated with each block.
It must be pointed out that the application knows the size of the blocks and the length of
the key, since the initialization code, generated by the application itself, contained that
data.
(Figure 32 depicts this pipeline: the raw buffer is split into blocks by the block size,
the key of each block is isolated by the key length, and the remaining bytes are the
data associated with each block, filled in by the exec_fcns.)
Figure 32 - How applications parse data received by the filtering machine
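The parsing steps above can be sketched in C; the layout (key first, then the rest of the block) follows the description in the text, while the structure and function names are illustrative, and the application is assumed to know block_size and key_len from the initialization code it generated itself.

```c
#include <stdint.h>

/* Sketch of the application-side parsing of the raw buffer
   received from the TME. The application knows block_size and
   key_len because it generated the initialization code itself;
   structure and function names here are illustrative. */
typedef struct {
    const uint8_t *key;   /* key_len 32-bit words, network byte order */
    const uint8_t *data;  /* timestamp and exec_fcn data follow the key */
} parsed_block;

static uint32_t parse_blocks(const uint8_t *raw, uint32_t raw_len,
                             uint32_t block_size, uint32_t key_len,
                             parsed_block *out, uint32_t max_out)
{
    uint32_t i;
    uint32_t n = raw_len / block_size;   /* split the buffer into blocks */
    if (n > max_out)
        n = max_out;
    for (i = 0; i < n; i++) {
        const uint8_t *block = raw + i * block_size;
        out[i].key  = block;               /* the key opens each block */
        out[i].data = block + key_len * 4; /* the rest is the block data */
    }
    return n;
}
```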
6.8 Initialization of the TME
This is the most important and critical phase in using the TME architecture. It can be
divided into three main parts:
1. a TME data block must be set as working and initialized with proper values in the
registers (e.g. number of entries and blocks in the table);
2. then it must be validated, that is to say the TME core has to perform a sanity check
on the values inserted in the TME data block registers, and, on success, it marks the
block as validated (i.e. usable by LOOKUPs and EXECUTEs);
3. an optional part, in which it is possible to force the insertion of entries in the lookup
table.
Before dealing with these parts, it is necessary to point out what a WORKING TME data
block is and what an ACTIVE TME data block is. When the BPF machine is ‘turned on’,
that is to say when it starts to process the initialization code, none of the TME data
blocks is ready to be used: they are not ‘validated’. The initialization phase works on the
WORKING TME data block; a TME data block can always be set as WORKING, even if
it is not ready. When the block has been initialized and validated (that is to say, it is
ready to be used), it can be set as ACTIVE, which means operative instructions can
work on it.
A final note: if a block is ACTIVE, it is also WORKING (i.e. if global TME register
active contains value n, register working contains value n, too).
TME data block initialization: first, we need to set a block as WORKING; this is
done by issuing instruction SET_WORKING. Then we need to put some values in the
registers of the TME data block we chose with SET_WORKING; this can be done in
two ways: either by initializing each register manually, with the instruction
SET_REGISTER_VALUE, or by using the special instruction INIT. This last
instruction inserts some default values in the TME data block registers; details on the
default values can be found in 12 - TME Extension Programming Manual. Two other
facts must be taken into account:
• it is always possible to issue INIT, and then change a register value (that was
initialized by INIT) through SET_REGISTER_VALUE;
• INIT instruction never initializes register KEY_LEN, so it is always necessary to set
this register manually (this is due to the fact that we cannot decide what is a default
value for the key length).
INIT accepts as argument the index of the TME data block to be initialized; it also sets
this block as WORKING. This command is the preferred method to initialize a block. It
is important to note that INIT sets a default lookup_fcn and a default exec_fcn. The
complete list of default values set by INIT is shown in Figure 33.
SET_REGISTER_VALUE accepts as argument the ID of the register to be written;
register A of the BPF core contains the value to be written.
Default values written by INIT into data_blockX:
FILLED_BLOCKS = 0
FILLED_ENTRIES = 0
OUT_LUT_EXEC = count_packets
KEY_LEN (never set by INIT)
MAX_FILL_STATE = 15000
REHASHING_VALUE = 1
SHARED_MEMORY_BLOCKS = 16000
LOOKUP_CODE = normal_lut_w_insert
BLOCK_SIZE = 64
EXTRA_SEGMENT_SIZE = 0
LUT_ENTRIES = 32007
DEFAULT_EXEC = count_packets
LAST_FOUND, SHARED_MEMORY_BASE_ADDRESS, EXTRA_SEGMENT_BASE_ADDRESS,
LUT_BASE_ADDRESS (set later, not by INIT)
Figure 33 – Default values of a TME data block
TME data block validation: in this part a TME data block, previously initialized, is
submitted to the TME core; this is done by issuing the instruction VALIDATE. In turn,
this command checks whether the registers contain proper values. In particular, the most
important controls are:
• number of entries in the lut_segment, blocks in the
shared_memory_segment and key length must be greater than zero;
• IDs of lookup_fcns, and of exec_fcns must be valid;
• there must be enough memory to store the TME segments in the extended memory.
If the registers ‘pass’ the checks, the block is marked as validated (i.e. the appropriate
bit in the global TME register validated_blocks is set), and some read-only
registers are set to the appropriate values (the XXX_BASE_ADDRESS registers).
VALIDATE also sets the validated TME data block as ACTIVE.
VALIDATE accepts as argument the index of the TME data block to be validated;
register A of the BPF core contains the base address, in the extended memory, from
which the segments will be stored.
It is the only instruction that is able to set a TME data block as validated.
Optional part: in this part it is possible to perform additional operations to properly
initialize a TME data block; for example, this is where some LOOKUPs can be issued to
force the insertion of entries in the lookup table. In paragraph 6.5 we described a
system to monitor TCP traffic on some specific ports; this behavior can be achieved by
creating an entry for each service (FTP, Web, Telnet…) in this last phase of
initialization, having set a lookup_fcn that is able to insert new entries. When all the
entries corresponding to the interesting TCP ports have been inserted, we change the
lookup_fcn to one that does not permit the insertion of new entries; this is done through
a SET_REGISTER_VALUE instruction.
The filter code will then generate the key with the TCP port and perform a LOOKUP.
If the TCP port is not one of the interesting ones, the lookup_fcn will return
TME_FALSE (since it cannot insert new entries).
In theory it is possible to assign a different execute_fcn to every entry of the lookup
table; in practice, this leads to a lookup table that is almost impossible to interpret at
user level: how could the application distinguish the various execute_fcns used for each
block in the shared_memory_segment?
A detailed overview of the techniques to properly use lookup_fcns can be found in 7.4 -
LOOKUP callbacks.
Example 1: a system to monitor the traffic outgoing from each host in a LAN
This type of analysis can be done by creating a table in which the key is represented by
the Ethernet MAC source address of each packet, the EXECUTE must increment two
counters representing the packets and bytes for each entry.
1. SET_MEMORY 1500000
It resizes extended memory to 1.5 Megabytes, enough to store a TME
data block initialized with default values.
2. INIT 0
It initializes TME data block 0 with default values; the default
lookup_fcn allows the insertion of new entries, and the default exec_fcn48 is
the one that manages the packet and byte counters. We didn’t issue
SET_WORKING since this is done by the INIT command.
3. LD 2
It loads 2 in register A; this is the key length (2 32-bit words).
4. SET_REGISTER_VALUE TME_KEY_LEN
It sets register KEY_LEN to the value stored in register A, which is 2.
5. LD 8
It loads 8 in register A; this is the base address from which the TME
data block will be stored. It starts from byte 8, since the first 8 bytes are
used to store the key passed to the LOOKUP.
6. VALIDATE 0
It validates TME data block 0.
7. SET_ACTIVE_READ 0
48 The one that is assigned to an entry when this last one is created, stored in register DEFAULT_EXEC.
It instructs the TME that the application has to receive data from TME
data block 0.
8. RET 1
It returns success.
Some notes on this initialization code:
• the lookup_fcn and exec_fcn are set by the INIT, with default values;
• if some initialization instruction fails, the machine implicitly executes a ‘RET 0’
instruction;
• it is always necessary to resize memory to about 1.5 Megabytes if we use the
default values set by INIT instruction49.
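The 1.5-Megabyte figure can be cross-checked with a few lines of C, reading the figures of Figure 33 as 32007 lookup-table entries (RECORDs of 8 bytes, per paragraph 7.2.3) and 16000 shared-memory blocks of 64 bytes; this reading of the thousands separators is an assumption.

```c
#include <stdint.h>

/* Memory budget of a TME data block initialized with the default
   values of Figure 33: 32007 lookup-table entries of 8 bytes each
   (the size of a RECORD) plus 16000 shared-memory blocks of
   64 bytes; EXTRA_SEGMENT_SIZE defaults to 0. */
static uint32_t default_block_budget(void)
{
    uint32_t lut_bytes    = 32007u * 8u;   /* lut_segment:   256,056 B   */
    uint32_t shared_bytes = 16000u * 64u;  /* shared memory: 1,024,000 B */
    uint32_t extra_bytes  = 0u;            /* extra_segment: none        */
    return lut_bytes + shared_bytes + extra_bytes; /* about 1.28 MB */
}
```

The total, 1,280,056 bytes, is below but close to 1.5 Megabytes, which is why the SET_MEMORY 1500000 of Example 1 is sufficient for a default-initialized block.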
The corresponding filter program is
1. LD P[6]
It loads the first 4 bytes of the source MAC address from the packet
into register A.
2. EX_ST MEM_EX[0]
It stores register A to extended memory, starting at location 0.
3. LDH P[10]
It loads the last 2 bytes of the source MAC address from the packet
into register A.
4. EX_STH MEM_EX[4]
It stores the 2 least significant bytes of register A to extended memory,
starting at location 4.
5. LOOKUP jt=6 jf=6 k=0
It performs the lookup in the table, k contains the base address of the
key in the extended_memory. jf and jt have the same value,
since we are not interested in performing a different action depending
on the result of the search.
6. EXECUTE k=0
It calls the exec_fcn; k contains the base address, in the
extended_memory, of some data that the exec_fcn may need to
receive.
7. RET 1
It returns success.
49 This is because the extended_memory is always allocated at its default size (64 kilobytes) when the architecture is created, even if a BPF program does not use the TME co-processor; allocating an initial amount of 1.5 Megabytes by default would waste memory.
Example 2: a system to monitor the traffic directed to three specific hosts
This type of analysis can be done by creating a table in which the key is represented by
the destination IP address of each packet, again the EXECUTE must increment two
counters representing the packets and bytes for each entry.
In this example two lookup_fcns are used: the default one,
default_lut_w_insert, and default_lut_wo_insert; these two functions
implement a hash and are almost identical; the only difference is that the former can
insert new elements in the table, while the latter cannot.
1. INIT 0
It initializes TME data block 0 with default values; the default
lookup_fcn allows the insertion of new entries, and the default exec_fcn is
the one that manages the packet and byte counters. We didn’t issue
SET_WORKING since this is done by the INIT command.
2. LD 1
It loads 1 in register A; this is the key length (1 32-bit word).
3. SET_REGISTER_VALUE TME_KEY_LEN
It sets register KEY_LEN to the value stored in register A, which is 1.
4. LD 7
It loads 7 in register A50; this is the number of entries in the lookup table.
5. SET_REGISTER_VALUE TME_LUT_ENTRIES
It sets register LUT_ENTRIES to the value stored in register A, which is 7.
6. LD 4
It loads 4 in register A; this is the number of blocks in the
shared_memory_segment.
7. SET_REGISTER_VALUE TME_SHARED_MEMORY_BLOCKS
It sets register SHARED_MEMORY_BLOCKS to the value stored in
register A, which is 4.
8. LD 4
It loads 4 in register A; this is the base address from which the TME
data block will be stored. It starts from byte 4 since the first 4 bytes are
used to store the key passed to the LOOKUP.
9. VALIDATE 0
It validates TME data block 0.
10. SET_ACTIVE_READ 0
50 Actually we need only 3 entries, but using 7 guarantees that the table always stays rather empty.
It instructs the TME that the application has to receive data from TME
data block 0.
11. LD 0x82C01405
It loads an IPv4 address (130.192.20.5) in register A.
12. EX_ST MEM_EX[0]
It stores that IP address to extended memory, at address 0.
13. LOOKUP jt=14 jf=14 k=0
We perform a LOOKUP to force the insertion of that IP address in the
lookup table.
14. LD 0x82C01719
It loads an IPv4 address (130.192.23.25) in register A.
15. EX_ST MEM_EX[0]
It stores that IP address to extended memory, at address 0.
16. LOOKUP jt=17 jf=17 k=0
We perform a LOOKUP to force the insertion of that IP address in the
lookup table.
17. LD 0x82C028a3
It loads an IPv4 address (130.192.40.163) in register A.
18. EX_ST MEM_EX[0]
It stores that IP address to extended memory, at address 0.
19. LOOKUP jt=20 jf=20 k=0
We perform a LOOKUP to force the insertion of that IP address in the
lookup table.
20. LD default_lut_wo_insert
It loads a lookup_fcn ID in register A; this function is almost the same as
the one used by default, except for the fact that it cannot insert new
entries in the table, and therefore returns TME_FALSE if an insertion
would be needed.
21. SET_REGISTER_VALUE TME_LOOKUP_CODE
It sets register LOOKUP_CODE to the ID stored in register A.
22. RET 1
It returns success.
The corresponding filter program is
1. LDH P[12]
It loads bytes 12 and 13 of the packet, in which the level-3 protocol type
is stored.
2. JE 0x800 jt=3 jf=7
It checks if protocol type (stored in register A) is IP (0x800). If it is IP, then we
can perform LOOKUP and EXECUTE, otherwise we skip the packet.
3. LD P[30]
It loads the destination IP address from the packet to register A.
4. EX_ST MEM_EX[0]
It stores register A to extended memory, starting at location 0.
5. LOOKUP jt=6 jf=6 k=0
It performs the lookup in the table.
6. EXECUTE k=0
It calls the exec_fcn.
7. RET 1
It returns success.
6.9 Third-party extensions
The TME architecture has been created in such a way that third parties can create
add-ons to exploit special features; in particular, developers can create their own
lookup_fcns and exec_fcns, and then include them in the original TME architecture.
6.10 Known limitations
At the moment each TME function checks every single parameter passed by the BPF
virtual processor, to avoid erroneous behavior. What is lacking is a system that
validates the code executed by the BPF virtual processor itself; for example, we should
check that STORE instructions always use valid addresses. The original BPF
architecture ships with a function that validates the original BPF filters,
bpf_validate, but this module is not able to check the newly added instructions (it
simply skips them).
Another limitation of the architecture is due to the fact that only two lookup_fcns and
two execute_fcns are defined at the moment. It would be very interesting to add new
callbacks to deal with more situations.
The initialization phase is quite complex; it would be interesting to simplify it.
When we create open-addressed tables, one of the main issues is that the number of
occupied entries always grows, in theory up to the maximum size of the table. Indeed,
there are a lot of cases in which it is important to keep only the newer entries, while the
older ones can be deleted; at the moment, however, there is no such feature in the
architecture, for a particular reason: the technique for deleting the older entries depends
strictly on the algorithm used to manage the lookup table. In particular, if the table is
handled as an open-addressed hash, the entries cannot simply be deleted; the possible
ways to clean the table are either to recreate a new table out of the existing one, or to
overwrite the older entries when a collision occurs (that is to say, if a key collides with
an older entry, that entry is overwritten rather than performing a rehashing).
Each of the solutions has its pros and cons:
• the former has the advantage that the table is cleaned of all the older entries, but it
has the big disadvantage that while the table is being recreated, the TME architecture
cannot access the existing one, either to update an entry or to create a new one. This
problem has the practical consequence that when the recreation occurs, the entire
BPF virtual processor cannot process packets;
• the latter has the advantage that the table is automatically cleaned when a new entry
must be added, so the BPF virtual processor is never blocked; the drawback is that
an older entry is deleted only if a to-be-inserted entry collides with it; as a
consequence, if a to-be-inserted entry collides with a non-erasable entry and there
are no free entries, the entry will not be inserted.
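The second strategy can be sketched in a few lines of C; this is purely illustrative code (the slot layout and names are invented for the example) and is not part of the TME.

```c
#include <stdint.h>

/* Purely illustrative sketch of the "overwrite the older entry on
   collision" strategy discussed above; the slot layout and names
   are invented for the example and are not part of the TME. */
typedef struct {
    uint32_t key;
    uint32_t value;
} slot;

static void insert_overwrite(slot *table, uint32_t entries,
                             uint32_t key, uint32_t value)
{
    uint32_t index = key % entries;  /* same modulo hashing as the TME */
    /* Whether the slot is free, holds the same key, or holds an
       older colliding entry, the new entry simply takes its place:
       no rehashing is performed, so insertion can never block. */
    table[index].key   = key;
    table[index].value = value;
}
```

The price, as noted above, is that an older entry survives until some new key happens to collide with it.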
7 TME – The implementation
The Table Management Engine architecture was developed using the C language, for
two reasons: first, the BPF architecture was written in this programming language;
second, C is one of the most powerful and performant languages for developing code
that works at kernel level.
Moreover, the TME extension was inserted in the Windows porting of the BPF
architecture, WinPcap.
The modifications involved several parts of WinPcap, particularly the BPF machine,
the modules dealing with the READ and OPEN calls requested by the user, and some
IOCTLs.
7.1 The MODE_MON working mode
A new working mode, MODE_MON, was added to the original set51, in order to deal
with the TME. This mode can be set through the usual APIs of wpcap.dll and packet.dll,
namely pcap_setmode for wpcap.dll and PacketSetMode for packet.dll.
When WinPcap works in this mode, the user receives the data processed by the TME
core; in practice, as stated in the chapter on the TME, the user receives the
shared_memory_segment of the ACTIVE_READ TME data block in a user buffer.
However, it must be noted that the TME architecture can also be used in the other
working modes without problems: for example, the TME extension can be used to
classify packets prior to capturing them.
7.2 Data structures
Before dealing with the structures used by the TME, an important detail must be
pointed out. As described in the previous chapter, the extended_memory is managed
in network byte order, and the data associated with each TME data block (the
XXX_segments) are stored in this memory. A particular choice was made to balance
performance and coherence: the lut_segment is represented in network byte order,
while the shared_memory_segment is represented in a hybrid way, in the sense
that the key is represented in network byte order, to preserve the order in which bytes
are stored in the packets, but the other parts of each block (for example timestamps and
counters) are represented in host byte order. The extra_segment has no
51 The original working modes were MODE_CAPT, to capture packets, and MODE_STAT, corresponding to the statistical mode.
particular byte order. As a consequence, the data contained in the
shared_memory_segment should be considered opaque by the BPF virtual
processor.
In particular, each entry in the lut_segment contains a block offset and an
exec_fcn ID, while each block in the shared_memory_segment contains the key,
a timestamp and the data.
The main data structures used by the TME extensions are defined in the header file
<tme.h>; the most important ones are:
• TME_CORE
• TME_DATA
• RECORD
Another structure, that contains the extended_memory, is MEM_TYPE, defined in
the header file <mem_type.h>.
7.2.1 TME_CORE
This structure contains all the information of the TME extensions; it is used by all
the kernel driver functions of WinPcap that deal with the TME.
typedef struct __TME_CORE
{
uint32 working;
uint32 active;
uint32 validated_blocks;
uint32 active_read;
TME_DATA block_data[MAX_TME_DATA_BLOCKS];
} TME_CORE;
The four unsigned integers working, active, validated_blocks and
active_read correspond to the analogous registers of the TME extensions;
block_data is an array of structures, each containing the registers of one TME data
block.
7.2.2 TME_DATA
This structure contains the registers of a TME data block.
typedef struct __TME_DATA
{
uint32 lut_entries;
uint32 max_fill_state;
uint32 rehashing_value;
uint32 key_len;
uint32 shared_memory_blocks;
uint32 filled_entries;
uint32 block_size;
uint32 extra_segment_size;
uint32 filled_blocks;
lut_fcn lookup_code;
uint32 default_exec;
uint32 out_lut_exec;
uint8 *lut_base_address;
uint8 *shared_memory_base_address;
uint8 *extra_segment_base_address;
uint8 *last_found;
} TME_DATA;
The unsigned integers lut_entries, max_fill_state, key_len,
rehashing_value, shared_memory_blocks, filled_entries,
block_size, extra_segment_size, filled_blocks correspond to the
analogous registers of a TME data block.
lookup_code contains a pointer to the lookup_fcn function used when the LOOKUP
instruction is issued; this variable does not contain the ID of the lookup_fcn, but rather a
‘direct’ pointer to it to improve performance.
default_exec and out_lut_exec contain the ID of exec_fcns, represented as
32-bit unsigned integers.
lut_base_address, shared_memory_base_address and
extra_segment_base_address contain pointers to the starting physical
addresses of these segments. These are absolute pointers, that is to say they refer to
physical memory locations. To obtain the base address of these segments relative to the
extended_memory, it is sufficient to subtract the starting physical address of the
extended_memory, which is available to all the functions dealing with the
TME_DATA structure.
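The conversion described above amounts to simple pointer arithmetic; here is a minimal sketch, with a stand-in buffer in place of the real extended_memory and illustrative function names.

```c
#include <stdint.h>

/* Sketch of the absolute/relative address conversion described
   above; `buffer` stands in for the starting physical address of
   the extended_memory (MEM_TYPE.buffer). */
static uint32_t to_relative(const uint8_t *absolute, const uint8_t *buffer)
{
    return (uint32_t)(absolute - buffer);  /* offset inside extended memory */
}

static const uint8_t *to_absolute(uint32_t relative, const uint8_t *buffer)
{
    return buffer + relative;              /* back to a physical pointer */
}
```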
last_found contains the absolute (i.e. physical) address of the last found entry in
the lookup table, as described in chapter 6 - The Table Management Engine (TME).
7.2.3 RECORD
This structure corresponds to an entry in the lut_segment.
typedef struct __RECORD
{
uint32 block;
uint32 exec_fcn;
} RECORD;
block contains a pointer to a block in the shared_memory_segment; the pointer
is relative to the extended_memory.
exec_fcn contains the ID of the exec_fcn associated with the entry; we did not use a
function pointer here, but rather an ID, because this field is directly accessible by the
instructions of the BPF machine (being stored in the extended_memory), so the BPF
instructions could maliciously modify it. Using an ID prevents hang-ups of the processor,
since every ID is checked by the TME co-processor.
NOTE: each of the fields is represented in network-byte-order; two functions52 are
provided to obtain and set the values of these fields in the usual host-byte-order; they
are SW_ULONG_AT53 to obtain the value, SW_ULONG_ASSIGN to set the value; they
are defined in <memory_t.h>.
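A minimal, portable sketch of what these two helpers do follows; the real ones are inline functions (see footnote 52), so the C bodies below are illustrative, not the WinPcap implementation.

```c
#include <stdint.h>

/* Illustrative C equivalents of what SW_ULONG_AT and
   SW_ULONG_ASSIGN do: read and write a 32-bit value stored in
   network byte order, whatever the host byte order is. */
static uint32_t sw_ulong_at(const uint8_t *base, uint32_t offset)
{
    const uint8_t *p = base + offset;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static void sw_ulong_assign(uint8_t *dest, uint32_t value)
{
    dest[0] = (uint8_t)(value >> 24);  /* network order: MSB first */
    dest[1] = (uint8_t)(value >> 16);
    dest[2] = (uint8_t)(value >> 8);
    dest[3] = (uint8_t)value;
}
```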
7.2.4 MEM_TYPE
This general structure is the one used to contain the extended_memory.
typedef struct __MEM_TYPE
{
uint8 *buffer;
uint32 size;
} MEM_TYPE;
buffer contains a pointer to a physical location in memory; size contains the size, in bytes, of this buffer.
52 At the moment, they are inline functions. 53 SW stands for Swapped, ULONG means unsigned int.
7.3 New BPF instructions
The most important BPF instructions that were added to deal with the TME call
some back-end functions, defined in the source file <tme.c>. The most important
functions correspond to the following BPF instructions:
• INIT
• VALIDATE
• LOOKUP
• EXECUTE.
In the following paragraphs we will describe what sits behind these instructions.
7.3.1 INIT
As stated in the previous chapter, this instruction initializes a TME data block with
default values. These default values are defined in <tme.h>, through some
#defines; their name is XXX_DEFAULT, where XXX is the name of a TME data
block register.
The back-end function called is init_tme_block:
uint32 init_tme_block(TME_CORE *tme, uint32 block)
{
…
//check if the block is usable, otherwise return ERROR
…
//the TME data block is zeroed
…
//the default values are assigned
…
//return SUCCESS
}
tme is a pointer to a TME_CORE structure containing all the data relative to the TME
co-processor; block contains the index of the to-be-initialized TME data block.
This function returns ERROR only if block contains an invalid value, otherwise it
returns SUCCESS.
7.3.2 VALIDATE
This instruction validates a TME data block; the corresponding back-end function is
validate_tme_block, defined in <tme.c>.
uint32 validate_tme_block(MEM_TYPE *mem_ex, TME_CORE *tme,
uint32 block, uint32 mem_ex_offset)
{
…
//check if the block is usable, otherwise return ERROR
…
//check if the registers in the to-be-validated TME
//data block are valid, otherwise return ERROR
…
//calculate required memory; if it is not available,
//return ERROR
…
//clean memory used by the TME data block
…
//initialize XXX_segment registers
…
//set the block as validated and as ACTIVE
…
//return SUCCESS
}
mem_ex is a pointer to the structure containing the extended_memory, tme
contains the data of the TME co-processor, block is an index to the to-be-validated
TME data block, mem_ex_offset is the base address (in the extended_memory)
from which the data relative to the TME data block will be stored.
7.3.3 LOOKUP
This instruction corresponds to the function lookup_frontend. This function limits
itself to calling the lut_fcn defined in the currently ACTIVE TME data block.
uint32 lookup_frontend(MEM_TYPE *mem_ex, TME_CORE *tme,
uint32 mem_ex_offset, struct time_conv *time_ref)
{
…
//call lut_fcn of the currently ACTIVE TME data block
}
mem_ex is a pointer to the structure containing the extended_memory, tme
contains the data of the TME co-processor, mem_ex_offset is the base address (in the
extended_memory) from which the key is stored, time_ref is an opaque structure
containing the data used to generate timestamps.
7.3.4 EXECUTE
This instruction corresponds to the function execute_frontend. In practice, it calls
the exec_fcn associated with the entry pointed to by register LAST_FOUND.
uint32 execute_frontend(MEM_TYPE *mem_ex, TME_CORE *tme,
uint32 pkt_size, uint32 offset)
{
…
//if LAST_FOUND contains 0xFFFFFFFF (not found),
//execute the exec_fcn whose ID is OUT_LUT_EXEC
//and return
…
//check if entry at address LAST_FOUND contains valid
//data, otherwise return ERROR
…
//execute the exec_fcn associated with entry at address
//LAST_FOUND
}
mem_ex is a pointer to the structure containing the extended_memory, tme
contains the data of the TME co-processor, pkt_size contains the size of the packet,
offset is the base address (in the extended_memory) of some data that must be
passed to the EXECUTE callbacks by the BPF machine.
7.4 LOOKUP callbacks
As described in the previous chapter, the LOOKUP instruction is a front-end to a
callback function, generally known as a lookup_fcn. The prototype of this class of
functions is the following:
typedef uint32 (*lut_fcn)(uint8 *key, struct __TME_DATA *data,
MEM_TYPE *mem_ex, struct time_conv *time_ref);
The parameters have the following meaning:
• uint8 *key: it is a buffer containing the key to enter the lookup table. This is the
key generated by the BPF machine. This buffer resides in the extended_memory,
so any modification alters the original key.
• TME_DATA *data: it is a pointer to the TME data block the function will work
on.
• MEM_TYPE *mem_ex: it is a pointer to the MEM_TYPE structure containing the
extended_memory.
• struct time_conv *time_ref: this structure contains some values used to
retrieve the timestamp of the packet. It must be considered opaque by the function; a
function is provided to obtain the timestamp in the usual format used by WinPcap,
that is GET_TIME.
The return value is a status code; three values can be used:
• TME_TRUE: it means that the search was successful; the LOOKUP instruction will
jump to jt instruction;
• TME_FALSE: it means that the search was unsuccessful; the LOOKUP instruction
will jump to jf instruction;
• TME_ERROR: an error occurred; the program flow of the BPF virtual processor ends
immediately, by executing an implicit ‘RET 0’ instruction.
In the next paragraph we will describe two predefined lookup_fcns, to give an example
of how such functions have to be written.
7.4.1 Two examples: default_lut_w_insert and default_lut_wo_insert
default_lut_w_insert (implemented in the code below by the function
normal_lut_w_insert) manages the lookup table as an open-addressed hash and
inserts a new entry in the table if the searched key is not found.
uint32 normal_lut_w_insert(uint8 *key, TME_DATA *data, MEM_TYPE
*mem_ex, struct time_conv *time_ref)
{
uint32 i;
uint32 tocs=0;
uint32 *key32=(uint32*) key;
uint32 shrinked_key=0;
uint32 index;
/*records is the array containing the hash table */
RECORD *records=(RECORD*)data->lut_base_address;
uint8 *offset;
uint32 key_len=data->key_len;
/*the key is shrinked into a 32-bit value, prior to applying*/
/*the hashing function */
for (i=0; i<key_len;i++)
shrinked_key^=key32[i];
/*hashing function*/
/*the first index in the table is calculated*/
index=shrinked_key % data->lut_entries;
/*search loop, tocs is the number of iterations*/
/*they must be less or equal the number of filled entries*/
/*if it is greater, the searched key is certainly not present*/
while (tocs<=data->filled_entries)
{
/* if records[index].block=0, then the entry is free*/
if (records[index].block==0)
{
/*we can create a new entry only if there are free blocks*/
/*in the shared_memory_segment; this is because each */
/*entry is associated with a block in that segment */
if (data->filled_blocks==data->shared_memory_blocks)
{
/*there are no free blocks, we set LAST_FOUND as NULL*/
/*that is the analogous of 0xFFFFFFFF in the physical*/
/*address space, we update the timestamp of the first*/
/*entry, that corresponds to an unsuccessful search */
/*and we return TME_FALSE */
GET_TIME((struct timeval *)(data->
shared_memory_base_address + 4*key_len),time_ref);
data->last_found=NULL;
return TME_FALSE;
}
/*offset contains the absolute pointer (i.e. in physical*/
/*memory) to the block associated with the newly created*/
/*entry*/
offset=data->shared_memory_base_address+
data->block_size*data->filled_blocks;
/*copy the key in the block*/
/*COPY_MEMORY is a macro that permits the function to */
/*work both at kernel level and at user level */
COPY_MEMORY(offset,key32,key_len*4);
/*put a timestamp in the block; it is immediately after */
/*the key */
GET_TIME((struct timeval *)(offset+4*key_len),time_ref);
/*updates the newly created entry with the address, in */
/*the extended memory, of the new block, written in */
/*network byte order */
/*SW_ULONG_ASSIGN is a macro to perform this assignment */
SW_ULONG_ASSIGN(&records[index].block,offset-mem_ex->
buffer);
/*assign the default exec_fcn ID to the entry, in network*/
/*byte order*/
SW_ULONG_ASSIGN(&records[index].exec_fcn,data->
default_exec);
/*increment the number of filled entries*/
data->filled_entries++;
/*increment the number of filled blocks*/
data->filled_blocks++;
/*set register LAST_FOUND with the pointer to the newly */
/*created entry */
data->last_found=(PUCHAR)&records[index];
/*return TRUE */
return TME_TRUE;
}
/*the entry is not free, let’s check if it is the right one*/
offset=mem_ex->buffer+SW_ULONG_AT(&records[index].block,0);
for (i=0; (i<key_len) && (key32[i]==ULONG_AT(offset,i*4));i++);
if (i==key_len)
{
/* the key in the block matches the searched one, */
/* right entry */
/* we update the timestamp and LAST_FOUND register*/
GET_TIME((struct timeval *)(offset+4*key_len),time_ref);
data->last_found=(PUCHAR)&records[index];
/*return successful search*/
return TME_TRUE;
}
else
{
/*the entry is wrong, let’s apply rehashing */
index=(index+data->rehashing_value) % data->lut_entries;
/*and increment the number of iterations */
tocs++;
}
}
/* if we reach this point, the search loop was unsuccessful */
/* so we update the timestamp of the #1 block */
GET_TIME((struct timeval *)(data->shared_memory_base_address+4*key_len),time_ref);
/* and set the LAST_FOUND register to NULL, which is the */
/* analogue of 0xFFFFFFFF in the physical address space */
data->last_found=NULL;
/* and finally return unsuccessful search */
return TME_FALSE;
}
normal_lut_wo_insert is almost identical; the only difference is that when the
hashing function hits a free entry, no insertion is performed and TME_FALSE is
returned. In practice, the code marked with a vertical dashed line is replaced by a
single ‘return TME_FALSE;’ statement.
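The pair of behaviors can be illustrated with a toy open-addressed table in C (a simplified sketch, not TME code: single 32-bit keys, no blocks or timestamps; names like toy_lookup are made up):

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES 7
#define EMPTY   0xFFFFFFFFu   /* marks a free slot, like an unused LUT entry */

/* Toy open-addressed table: one 32-bit key per entry. */
typedef struct {
    uint32_t key[ENTRIES];
    uint32_t rehash;          /* step added to the index on each collision */
} toy_lut;

static void toy_init(toy_lut *t, uint32_t rehash)
{
    for (int i = 0; i < ENTRIES; i++)
        t->key[i] = EMPTY;
    t->rehash = rehash;
}

/* Returns 1 on a successful search, 0 otherwise. With insert != 0 a free
 * slot stores the key and counts as success (like normal_lut_w_insert);
 * with insert == 0 a free slot ends the search with failure (like
 * normal_lut_wo_insert). */
static int toy_lookup(toy_lut *t, uint32_t key, int insert)
{
    uint32_t index = key % ENTRIES;
    for (int tocs = 0; tocs < ENTRIES; tocs++) {
        if (t->key[index] == EMPTY) {
            if (!insert)
                return 0;
            t->key[index] = key;
            return 1;
        }
        if (t->key[index] == key)
            return 1;
        index = (index + t->rehash) % ENTRIES;   /* rehashing step */
    }
    return 0;   /* table full: unsuccessful search */
}
```

The only structural difference between the two callbacks is the branch taken on a free slot, exactly as described above.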
7.4.2 Some tips for writing new LOOKUP callbacks
It is not so difficult to write new lookup_fcns, to implement new search techniques.
However, it is important to follow some guidelines:
• A function can read all the registers of the TME data block, but it may modify only
three of them: filled_entries, filled_blocks and last_found;
in particular, the registers containing the base addresses of the segments,
XXX_base_address, must never be touched. When a new entry is filled in the
lut_segment, filled_entries must be incremented by one; when a new
block is used in the shared_memory_segment, filled_blocks must be
incremented by one. As described in chapter 6 - The Table Management Engine
(TME), register last_found must be used to store the physical address of the
found entry.
• To swap the byte order between little-endian and big-endian, some helper functions
are provided; they are defined in <memory_t.h> and written in assembly to
guarantee good performance.
• To set or update the timestamp of an entry, a function is provided,
GET_TIME, defined in <time_calls.h>; it receives a void pointer to
the location in which the timestamp must be saved, and a pointer to a time_ref
structure; this last structure is an input parameter of all lookup_fcns.
• When a new entry cannot be inserted because the table is full, the function must
return TME_FALSE; it must not return TME_ERROR.
• It is important to remember that register key_len contains the length of the key in
terms of 32-bit words.
After a new lookup_fcn is created, it must be inserted in the framework in this way:
1. The new function must be inserted in file <functions.c>.
2. We must assign a unique ID to this function; this is done through a #define in file
<functions.h>; we must include the prototype of the function in this file, too.
For example, if the function is called my_fcn and its ID is 0x00000055, we
modify the file as follows:
...
/* lookup functions */
#define NORMAL_LUT_W_INSERT 0x00000000
uint32 normal_lut_w_insert(uint8 *key, TME_DATA *data,
    MEM_TYPE *mem_ex, struct time_conv *time_ref);
#define NORMAL_LUT_WO_INSERT 0x00000001
uint32 normal_lut_wo_insert(uint8 *key, TME_DATA *data,
    MEM_TYPE *mem_ex, struct time_conv *time_ref);
#define MY_FCN 0x00000055
uint32 my_fcn(uint8 *key, TME_DATA *data,
    MEM_TYPE *mem_ex, struct time_conv *time_ref);
...
3. Then we have to add a new case in the ‘lookup function mapper’; this function,
used by the TME core, returns the pointer to the lookup_fcn corresponding to a
given ID. The mapper is in file <functions.c>. To add the function above, we
modify the mapper as follows:
...
lut_fcn lut_fcn_mapper(uint32 index)
{
switch (index)
{
case NORMAL_LUT_W_INSERT:
return (lut_fcn) normal_lut_w_insert;
case NORMAL_LUT_WO_INSERT:
return (lut_fcn) normal_lut_wo_insert;
case MY_FCN:
return (lut_fcn) my_fcn;
default:
return NULL;
}
}
...
7.5 EXECUTE callbacks
As explained, the EXECUTE instruction is a front-end to an exec_fcn callback. Here
is the prototype of this class of callbacks:
typedef uint32 (*exec_fcn)(uint8 *block, uint32 pkt_size,
    struct __TME_DATA *data, MEM_TYPE *mem_ex, uint8 *mem_data);
The parameters have the following meaning:
• uint8 *block: it is a pointer to the block, in the shared_memory_segment,
associated with the entry at address LAST_FOUND. It points to a physical memory
location.
• uint32 pkt_size: it contains the size of the packet, as received by the NIC
interface.
• TME_DATA *data: it is a pointer to the TME data block the function will work
on.
• MEM_TYPE *mem_ex: it is a pointer to the MEM_TYPE structure that contains the
extended_memory.
• uint8 *mem_data: it points to a physical memory location (in the
extended_memory), that is passed directly from the EXECUTE front-end.
The return value is a status code; two possible values are allowed:
• TME_SUCCESS: the function completed successfully;
• TME_ERROR: an error occurred; in this case the BPF machine will immediately
terminate the program flow by executing an implicit ‘RET 0’ instruction.
7.5.1 An example: count_packets
This is the simplest example of an exec_fcn; in practice, it stores two counters
in the block, incrementing them for each packet associated with that block.
typedef struct __c_p_data
{
struct timeval timestamp;
uint64 packets;
uint64 bytes;
} c_p_data;
uint32 count_packets(uint8 *block, uint32 pkt_size, TME_DATA *data,
MEM_TYPE *mem_ex, uint8 *mem_data)
{
/*counters is the structure containing the counters*/
/*and the timestamp; since at the beginning of the */
/*block there is the key, this structure begins */
/*immediately after the key */
c_p_data *counters=(c_p_data*)(block+data->key_len*4);
/*now we increment the counters */
counters->bytes+=pkt_size;
counters->packets++;
/*and return success*/
return TME_SUCCESS;
}
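The counting logic can be exercised outside the TME with a self-contained sketch; here key_len is passed explicitly instead of being read from a TME_DATA block, and the caller must hand in a suitably aligned buffer (demo_count and demo_counters are made-up names):

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

/* Per-entry statistics stored in the block right after the key,
 * mirroring the c_p_data layout above. */
typedef struct {
    struct timeval timestamp;
    uint64_t packets;
    uint64_t bytes;
} demo_counters;

/* Simplified counterpart of count_packets: increments the byte and
 * packet counters found after key_len_words 32-bit words of key. */
static void demo_count(uint8_t *block, uint32_t key_len_words,
                       uint32_t pkt_size)
{
    demo_counters *counters =
        (demo_counters *)(block + key_len_words * 4);
    counters->bytes += pkt_size;
    counters->packets++;
}
```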
7.5.2 Some tips for writing new EXECUTE callbacks
It is simple to write new EXECUTE callbacks, bearing in mind a few details:
• All the registers of the TME data block must generally be considered read-only; in
particular, as with the LOOKUP callbacks, it is important not to modify the
XXX_base_address registers.
• The first part of each block contains the key, which identifies the block uniquely, and
the timestamp; as a consequence, this part must never be modified.
• It is important to remember that register key_len contains the length of the key in
terms of 32-bit words.
After a new exec_fcn is created, it must be inserted in the framework in this way:
1. The new function must be inserted in file <functions.c>.
2. We must assign a unique ID to this function; this is done through a #define in file
<functions.h>; we must include the prototype of the function in this file, too.
Some exec_fcns may use a particular data structure to represent the bytes of each
block; it is convenient to define this structure in a separate header file and add
an #include near the ID definition. For example, if the function is called
my_exec_fcn, its ID is 0x00000024, and its data structure is defined in file
<my_exec_fcn.h>, we modify the file as follows:
...
/* execution functions */
#define COUNT_PACKETS 0x00000000
#include "count-packets.h"
uint32 count_packets(uint8 *block, uint32 pkt_size, TME_DATA *data,
MEM_TYPE *mem_ex, uint8 *mem_data);
#define TCP_SESSION 0x00000800
#include "tcp-session.h"
uint32 tcp_session(uint8 *block, uint32 pkt_size, TME_DATA *data,
MEM_TYPE *mem_ex,uint8 *mem_data);
#define MY_EXEC_FCN 0x00000024
#include "my_exec_fcn.h"
uint32 my_exec_fcn(uint8 *block, uint32 pkt_size, TME_DATA *data,
MEM_TYPE *mem_ex,uint8 *mem_data);
...
3. Then we have to add a new case in the ‘execute function mapper’; this function is
analogous to the lookup function mapper and is defined in file
<functions.c>. To add the function above, we modify the mapper as follows:
...
exec_fcn exec_fcn_mapper(uint32 index)
{
switch (index)
{
case COUNT_PACKETS:
return (exec_fcn) count_packets;
case TCP_SESSION:
return (exec_fcn) tcp_session;
case MY_EXEC_FCN:
return (exec_fcn) my_exec_fcn;
default:
return NULL;
}
}
...
7.6 Interaction with the user
7.6.1 Passing data to the user
An application running on top of the TME architecture needs to receive the data
contained in the shared_memory_segment of one of the validated TME data blocks,
in particular the filled blocks of this segment.
Two approaches can be followed to achieve this result:
• it is possible to use a shared buffer, that is to say we can share the buffer containing
the shared_memory_segment, which is at kernel level, with the user level
application. In practice, the application receives a pointer to the same buffer the TME
uses to store this segment;
• it is possible to copy the shared_memory_segment into a user buffer, provided
by the application.
Each of the solutions has its own pros and cons.
The former is a zero-copy system, and this is generally an excellent solution, since
copies are usually time-consuming; however, it has a big drawback: since the
application receives the same buffer on which the TME works, the data may become
incoherent. Suppose the application is reading some data when a new packet appears
on the network: the application is put in a sleep state, and the BPF virtual processor
processes the packet, modifying the TME data; when the application is awakened, the
data it read previously may be meaningless.
There is also another, less frequent, problem: if the application is able to modify the
data contained in this shared buffer, the behavior of the TME, and possibly of the
entire operating system, can become unpredictable.
The latter solution does not have this problem, because the TME data are copied into
a user buffer; nevertheless, whenever the application needs to receive some
information, a certain amount of memory must be copied into the application buffer,
and this amount can be large; for example, a TME data block initialized with default
values may need to transfer up to 1 megabyte [54] to the application.
This last solution is the one adopted by the implementation of the TME.
The implementation is not completely free from incoherence problems: the main issue
is that the copy of the shared_memory_segment into the user buffer is
performed at a lower priority [55] than the BPF virtual processor, that is to say, the
reception of a packet takes precedence over the copy process. As a consequence, the
copy process can be interrupted by the BPF processor, and incoherent data can be
present in the user buffer.
A hack was used to partially prevent this problem: the copy process and the virtual
processor are synchronized with one of the so-called spinlocks [56] in a particular
way: the copy is performed block by block (the blocks that form the
shared_memory_segment), and the lock is acquired before and released after the
copy of each block. This solution guarantees that the data stored in each block are
coherent, although coherence is not guaranteed between blocks.
The only way to gain coherence between blocks is to acquire the lock before the
whole copy and release it when the copy has ended; this solution has the big drawback
that the virtual processor is stopped during the copy, so the packets that arrive in this
interval are lost; in practice, this solution completely inverts the priorities of the
various threads (i.e. the read process gains a higher priority than the BPF virtual
processor).
Here is a snippet of the code used in the copy process:
for (cnt=0;cnt<blocks;cnt++)
{
    //we acquire the spinlock
    NdisAcquireSpinLock(&Open->machine_lock);
    //we copy the block
    RtlCopyMemory(UserPointer,tmp,block_size);
    //we release the spinlock
    NdisReleaseSpinLock(&Open->machine_lock);
    tmp+=block_size;
    UserPointer+=block_size;
}
[54] We need to transfer the shared_memory_segment, which is composed of 16,000 blocks of 64 bytes each.
[55] The copy runs at PASSIVE_LEVEL, in the application context, while the BPF virtual processor works at DISPATCH_LEVEL, in the system context.
[56] A spinlock is a simple synchronization mechanism present in the Windows operating systems.
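The block-level locking scheme can be modeled in user-space C, with a pthread mutex standing in for the NDIS spinlock (an illustrative sketch, not the driver code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Copies 'blocks' blocks of 'block_size' bytes from src to dst, taking
 * the lock around each block: every block lands in dst internally
 * coherent, even if a writer updates src between two iterations, but
 * coherence across blocks is not guaranteed, as the text explains. */
static void copy_by_blocks(uint8_t *dst, const uint8_t *src,
                           size_t blocks, size_t block_size,
                           pthread_mutex_t *lock)
{
    for (size_t cnt = 0; cnt < blocks; cnt++) {
        pthread_mutex_lock(lock);      /* NdisAcquireSpinLock analogue */
        memcpy(dst, src, block_size);  /* RtlCopyMemory analogue */
        pthread_mutex_unlock(lock);    /* NdisReleaseSpinLock analogue */
        src += block_size;
        dst += block_size;
    }
}
```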
To maintain compatibility with the existing read functions of the LIBPCAP library
(pcap_loop and pcap_dispatch), the user buffer also contains a BPF header, which
precedes the data copied from the shared_memory_segment of the TME.
7.6.2 Filter injection
The TME implementation maintains compatibility with the original API used to
inject the filter into the BPF virtual processor, namely pcap_setfilter.
The problem is that not only the filter code but also the initialization code must be
passed to the driver; the chosen approach is to link the two parts, filter and
initialization, into a ‘hybrid’ filter: the hybrid filter is the concatenation of the two
codes, separated by a particular instruction, BPF_SEPARATION, whose opcode is
0xFF.
Here is a graphical explanation of this mechanism:
filter code
BPF_SEPARATION
initialization code
If the hybrid filter does not contain the BPF_SEPARATION instruction, the
initialization phase is considered unnecessary; this solution guarantees compatibility
with the original LIBPCAP compiler, which generates the filter code from
human-readable strings (e.g. ‘tcp port 80’). At the lower level, in the driver, the
function that handles the IOCTL used to inject the filter has been modified to separate
the two parts; this function performs the initialization phase, too; as a consequence, if
this phase fails, the IOCTL returns an unsuccessful result, and the pcap_setfilter API
fails.
[57] RtlCopyMemory, used in the copy snippet above, is the kernel-level counterpart of the usual memcpy API.
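The separation step can be modeled in portable C; this is an illustrative sketch, not the driver's IOCTL handler, using the classic BPF instruction layout (toy_insn and split_hybrid are made-up names):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal BPF instruction layout, as in the classic bpf_insn. */
struct toy_insn {
    uint16_t code;
    uint8_t  jt, jf;
    uint32_t k;
};

#define BPF_SEPARATION 0xFF   /* separator opcode, as in the text */

/* Scans a hybrid program of 'len' instructions. If BPF_SEPARATION is
 * found, *init_start receives the index of the first initialization
 * instruction and the filter length is returned; otherwise the whole
 * program is filter code and *init_start is set to len (no init phase). */
static size_t split_hybrid(const struct toy_insn *prog, size_t len,
                           size_t *init_start)
{
    for (size_t i = 0; i < len; i++) {
        if (prog[i].code == BPF_SEPARATION) {
            *init_start = i + 1;
            return i;               /* filter is prog[0..i) */
        }
    }
    *init_start = len;
    return len;
}
```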
7.7 Implementation issues
Several issues arose during the development of the extension; most of them were due
to the specific operating system on which the module was developed, Microsoft
Windows.
The main problem concerns timestamps: as stated before, each block in the
shared_memory_segment has its own timestamp, represented through a timeval
structure [58]; its precision is one microsecond.
The Microsoft development tools offer a variety of APIs to obtain timestamps; the
problem is that the only one guaranteeing microsecond precision is
KeQueryPerformanceCounter; this function returns two 64-bit integers, representing
the number of ticks elapsed since boot time and the number of ticks per second.
This API has a serious problem: it is very slow to return; some informal tests showed
that it usually takes up to 3 µs, so generating 100,000 timestamps per second can keep
the CPU 10-20% busy just executing this call. A further detail makes things worse:
representing timestamps as timeval structures involves at least two 64-bit integer
divisions, and these divisions are heavily CPU consuming [59].
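The conversion in question reduces to exactly two 64-bit divisions, which can be sketched as follows (an illustrative sketch; the tv64 type and function name are made up):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t tv_sec;
    uint64_t tv_usec;
} tv64;

/* Converts a raw tick count into seconds/microseconds, given the
 * counter frequency in ticks per second: these are the two 64-bit
 * integer divisions the text says are so costly on 32-bit CPUs. */
static tv64 ticks_to_timeval(uint64_t ticks, uint64_t ticks_per_sec)
{
    tv64 t;
    t.tv_sec  = ticks / ticks_per_sec;                      /* division 1 */
    t.tv_usec = (ticks % ticks_per_sec) * 1000000ULL
                / ticks_per_sec;                            /* division 2 */
    return t;
}
```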
A partial solution to this problem is to use a particular high-performance counter
shipped with all 32-bit INTEL processors, the TimeStamp Counter (TSC). This
counter is incremented at every processor clock cycle, so its precision is the CPU
frequency. The IA-32 [60] instruction set provides an instruction to obtain this
timestamp, RDTSC, which is very fast (1 clock cycle).
However, using this counter to generate timestamps raises some problems:
• it is necessary to know the exact processor frequency to convert this counter into a
timeval structure; unfortunately, the documented Microsoft APIs do not provide a
means to obtain this frequency;
• two 64-bit integer divisions are necessary to convert this timestamp into a timeval
structure; this conversion is needed to keep coherence with the timestamps used by
the LIBPCAP library, since these timestamps are passed to the user without
modification;
• this technique of generating timestamps is highly dependent on the particular
processor, since it works only on INTEL CPUs (or compatible ones, such as AMD
ATHLON processors);
• the newer INTEL processors for laptops, the Mobile Pentium series, implement a
new technology, SpeedStep, by which the processor lowers its frequency when the
laptop runs on battery. This behavior can produce wrong timestamps if the
frequency is lowered while an instance of the driver is running. The solution is to
calculate the frequency whenever a driver instance is opened, remembering that
timestamps become wrong if the processor frequency changes while an instance is
open; that is to say, all open driver instances must be closed before changing the
CPU clock.
[58] This structure is formed by two fields, ts_sec and ts_usec, representing the time elapsed since 1/1/1970, expressed in seconds and microseconds.
[59] At least on 32-bit processors.
[60] Intel Architecture, 32-bit.
The current implementation calculates the processor frequency in a ‘dirty’ manner: we
sample the TSC and the timestamp given by KeQueryPerformanceCounter at the
same time, after having raised the IRQL to the highest possible level; then we delay
for some time and repeat the sampling; the processor frequency is then obtained
through some 64-bit integer divisions. The precision depends on the time elapsed
between the two samples: with a delay of about 300 ms, the frequency is exact to the
kilohertz.
This technique is somewhat dirty because raising the IRQL to the highest possible
level (HIGH_LEVEL) is a strongly discouraged operation.
Because of the strong dependence on the processor, two versions of the driver
containing the TME extension can be compiled: a faster one, which uses RDTSC to
generate timestamps, and a slower one, which uses the KeQueryPerformanceCounter
API.
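The calibration described above reduces to a simple proportion between the two tick deltas; here is an illustrative sketch with made-up sample values, not the actual driver code:

```c
#include <assert.h>
#include <stdint.h>

/* Estimates the CPU frequency from two simultaneous samples of the TSC
 * and of a reference counter of known frequency:
 *   cpu_hz = delta_tsc * ref_hz / delta_ref
 * The longer the delay between the two samples, the better the precision. */
static uint64_t estimate_cpu_hz(uint64_t tsc1, uint64_t tsc2,
                                uint64_t ref1, uint64_t ref2,
                                uint64_t ref_hz)
{
    uint64_t delta_tsc = tsc2 - tsc1;
    uint64_t delta_ref = ref2 - ref1;
    return delta_tsc * ref_hz / delta_ref;
}
```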
Another problem is related to the way NDIS passes packets to protocol drivers [61]:
the function called whenever a packet is received by the NIC does not always receive
the entire packet; in some cases it receives only a fragment of it (for example 256
bytes). NDIS provides an API to transfer the remaining part of the packet to a buffer;
the problem is that this API is managed asynchronously by the OS [62]. As a
consequence, in some cases it is difficult to analyze the whole packet. At the moment,
the problem is still open.
The last issue is related to the user buffer: wpcap.dll (the Windows version of
LIBPCAP) allocates a user buffer of 256 kilobytes, and this size is not modifiable,
since it is defined through a #define directive. This buffer is large enough for packet
[61] As described in the WinPcap chapter, the kernel part of the architecture is realized as a network protocol driver, and NDIS is the Windows framework on which network components sit.
[62] That is to say, the function returns immediately, even if the transfer has not finished; a particular callback function, which protocol drivers must define, is called when the transfer ends.
capture, but it is too small to contain the data passed from the kernel driver, in practice
the shared_memory_segment of a TME data block: for example, initializing a
TME data block with the default values gives a segment of about 1 Megabyte.
This issue was solved by adding a new LIBPCAP API to resize the user buffer,
pcap_setuserbuffer.
8 Interacting with the TME
In the previous chapters the BPF filters and the initialization code that used the TME
extension were generated by hand; this approach is generally not good, since the
process of creating new filters is very difficult and time-consuming. A better approach
is to have a system that generates the BPF filters in a simple way, from a high-level
description of what the BPF filter has to do; in practice we need a compiler, able to
generate these low-level filters starting from a human-readable string describing what
we want to do.
The WinPcap library (wpcap.dll) is shipped with a compiler that generates the filter
code for capturing packets from a high-level description of the wanted packets, for
example the string “ip”; in order to support the TME extension we need to add some
new syntactic statements to the high-level language describing the filters, and modify
the compiler to support these statements and generate the appropriate BPF filter code
(and the initialization code, too).
The TME architecture is quite general; in the following paragraphs, however, we will
propose some modifications to the original WinPcap library aimed only at performing
complex statistics.
8.1 The SWITCH statement
The main types of monitoring analyses highlighted in chapter 3 -
Monitoring networks are the following:
• open matrices
• closed matrices
• multiple analyses
• bucket analyses.
All these types can be declared at high level with a single new statement, SWITCH.
Open matrices: the SWITCH statement is used with this syntax
SWITCH(header_field1, header_field2,...)
Each header_field_i uniquely specifies a field in the protocol headers which will
be used to generate the key used to enter the table. For example, if we want to create
an open table in which the key is the source IP address of the packets, the syntax is
SWITCH(ip src host)
If we want to create a multidimensional matrix, the various header fields are separated
by a semicolon (“;”). For example, if we want to create a 2-dimensional matrix
formed by the source and destination IP addresses, the syntax is
SWITCH(ip src host ; ip dst host)
Closed matrices: the SWITCH syntax is the following
SWITCH(header_field=value1,value2,...)
header_field specifies a distinct field in the protocol headers; the value_i are
the values of the header_field that must be monitored. For example, if we want to
create a closed matrix on the source IP addresses 1.2.3.4, 4.5.6.7 and 3.8.9.2, the
syntax is
SWITCH(ip src host=1.2.3.4, 4.5.6.7, 3.8.9.2)
If we want to create multidimensional matrices, the syntax is similar to the one used
for open matrices: the various header fields are separated by a semicolon. This is a
small example of a 2-dimensional closed matrix
SWITCH(tcp src port= 1, 2, 3, 4;tcp dst port= 1, 2, 3, 4)
Multiple analyses: the SWITCH syntax is the following
SWITCH(condition1,condition2,...)
The condition_i are the conditions that a packet must satisfy in order to be
‘monitored’ by the TME. A condition can be a protocol (e.g. “ip”) or a generic
condition on a header field (e.g. “tcp src port=80”). For example, if we want to
monitor the IP traffic, the TCP traffic and the Web traffic, the syntax is
SWITCH(ip,tcp, tcp src port=80 or tcp dst port=80)
Bucket analyses: the SWITCH syntax is the following
SWITCH(header_field=interval1,interval2,...)
header_field specifies a distinct field in the protocol headers; each interval_i
specifies an interval of the possible values assumed by the header_field, in the
form start_value-stop_value. The intervals need not be contiguous (e.g. 1-50,
55-100), but they must not overlap, otherwise the compiler rejects the statement. Here
is a small example that deals with the packet length
SWITCH(pkt_len=1-128,129-256,257-512,513-1514)
Notes:
• it is not possible to combine SWITCH statements relative to different monitoring
analyses. For example, the following syntax is not acceptable, since it combines a
closed matrix with an open one.
SWITCH(ip dst host= 2.2.3.4, 2.5.6.7; ip src host)
The following syntax is acceptable, because it combines two open SWITCHes to
create a 2-dimensional open matrix.
SWITCH(ip dst host;ip src host)
• The SWITCH statement can be preceded by a series of pre-filters that restrict the
traffic monitored by the TME; this pre-filtering uses the syntax of the original BPF
architecture. This is a small example
ip src host 130.192.1.2 SWITCH(ip dst host)
8.2 Interpreting data received by the TME coprocessor
In the previous chapters, data received from the TME coprocessor through the
dispatcher_handler function were interpreted and printed on the screen by hand; that
is to say, these functions knew the format of the received data and were able to
interpret it correctly. The problem is that this approach requires a different
dispatcher_handler for each BPF filter that can be injected into the BPF virtual
processor, which is impractical to realize.
Moreover, if a compiler is used to generate the BPF filters, it is impossible to know
how to interpret the received data before compiling the BPF code itself. As a
consequence, it is necessary to develop an API that is able to interpret the received
data in such a
way to be independent from the particular filter that is compiled. In practice, the most
important feature of this new API is the ability to translate the keys that identify each
block transferred from the TME to the user into a human-readable format. For
example, if the BPF filter creates a table in which each entry corresponds to an IP
address, the key is an IP address (e.g. 0x01020304); this new API must be able to
translate this raw sequence of bytes into a string like “IP 1.2.3.4”.
The proposed architecture to deal with this problem is shown in Figure 34.
[Figure 34 - The proposed architecture to interpret the TME data: the extended compiler takes a SWITCH statement (e.g. SWITCH(tcp src port=21,23,25,80)) and produces both the BPF filter, injected into the BPF + TME coprocessor (BPF core, registers, basic memory, TME core with its registers and extended_memory), and a struct bpf_keyparser used by pcap_keyparser() to parse the TME data.]
Since the interpretation (parsing) process highly depends on the BPF filter generated
by the compiler, the compiler has to create a particular data structure (struct
bpf_keyparser), too, together with the BPF code. This data structure, which must
be considered opaque to the user, is used by the new API, pcap_keyparser(), to
parse the data received from the TME. This function has to be called inside the
user-defined dispatcher_handler to interpret the received data.
8.3 The pcap_keyparser() function
The syntax of this API is the following
char *pcap_keyparser(char *buffer,
struct bpf_keyparser *keyparser,
char parsed_key[1024],
int *offset)
The meaning of the parameters is the following:
• char *buffer: it is the buffer received by the dispatcher_handler, that contains
the data transferred by the TME coprocessor.
• struct bpf_keyparser *keyparser: it is a pointer to the data structure
created by the compiler to parse the data received by the TME.
• char parsed_key[1024]: the function will store the parsed version (e.g. “IP
1.2.34.5”) in this buffer; this buffer must be allocated by the application using
pcap_keyparser().
• int *offset: the function stores here the offset at which the data coupled with
the key start (in the case of monitoring, these data are the byte and packet counters).
The value returned by the function is a pointer to the next block present in the buffer
received by the dispatcher_handler; a NULL pointer is returned if the parsed block is
the last one.
The scheme of a dispatcher_handler function using pcap_keyparser() is the
following:
void dispatcher_handler(u_char *user, const struct pcap_pkthdr *h,
const u_char *sp)
{
char parsed_key[1024];
char *tmp=(char *)sp;
int offset;
//the structure generated by the compiler is passed to the
//dispatcher handler through the ‘user’ parameter
struct bpf_keyparser *keyparser=(struct bpf_keyparser*)user;
while(tmp!=NULL)
{
char *tmp2=pcap_keyparser(tmp,keyparser,parsed_key,&offset);
... //code to display the data
tmp=tmp2;
}
}
8.4 The key parsing process
The monitoring systems described above produce tables in which the keys have a
meaning that depends on the particular analysis performed. In practice we have the
following cases:
Open and closed matrices: they use a key that represents a value of a precise field in
the protocol headers. For example, the syntax “SWITCH(ip src host)” generates
a BPF filter whose keys are IP addresses. In this case the pcap_keyparser() routine
must print the key in a more readable manner by formatting it, in a way similar to the
printf() C function. In practice, the new API must perform a translation like this one
key = 0x01020304  ->  string = “IP 1.2.3.4”
This parsing can be done with something like
printf(“IP %u.%u.%u.%u”,key[0],key[1],key[2],key[3]);
The data structure must therefore contain the formatting string, in the example the string
“IP %u.%u.%u.%u”. A graphical explanation is shown in Figure 35.
[Figure 35 - Key parsing for open and closed matrices: pcap_keyparser receives the key (e.g. 0x01020124) and a struct bpf_keyparser telling it to format the key as “ip source %u.%u.%u.%u”, key[0], key[1], key[2], key[3], producing the parsed key “ip source 1.2.1.36”.]
Multiple analyses: in this case the keys do not represent the value of a particular field
in the protocol headers; rather, they are a description of what is present in the block
associated with the key (e.g. a four-character string, as described in paragraph 3.3.2),
or an identification number chosen by the compiler itself. As a consequence, it is
impossible to use the formatting string described previously. What we need is a small
table that stores the correspondence between the possible values of the keys and the
strings that must be returned by the pcap_keyparser function. This approach is shown
in Figure 36.
Bucket analyses: in this case the key represents the limits of a bucket interval, but
these numbers are values of a particular header field, for example a TCP port. For this
reason we can use the formatting-string method to translate the keys from raw bytes
into a human-readable string.
[Figure 36 - Key parsing for multiple analyses: pcap_keyparser receives the key (e.g. 0x00000004) and a struct bpf_keyparser containing a table of correspondences, returning the parsed key “NetBios”.
key          parsed key
0x00000001   “IP”
0x00000002   “TCP”
0x00000003   “WEB”
0x00000004   “NetBios”]
In order to integrate the two different methods of key parsing into a single data
structure, the bpf_keyparser structure must contain a type field that indicates
how the key must be parsed. Moreover, this structure, generated by the compiler,
must be able to store both structures used for the parsing, that is to say the formatting
string and the table of the used keys. A graphical representation is shown in
Figure 37.
A possible model of this data structure is the following
struct bpf_keyparser
{
int type;
char *formatting_string;
struct key_table *table;
};
in which the structure key_table contains the table of the key -> string
correspondences.
When the structure uses the key-formatting technique, type contains the FORMAT
value, formatting_string contains the formatting string, and table is a NULL
pointer.
When the structure uses the table technique to translate the keys, type contains the
TABLE value, formatting_string is a NULL pointer, and table points to the
structure containing the key table.
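A toy model of this two-way dispatch (hypothetical names and simplifications: the real bpf_keyparser is opaque and compiler-generated, and real keys may be longer than 4 bytes):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum { FORMAT, TABLE };

struct key_entry { uint32_t key; const char *name; };

struct toy_keyparser {
    int type;                           /* FORMAT or TABLE */
    const char *formatting_string;      /* used when type == FORMAT */
    const struct key_entry *table;      /* used when type == TABLE  */
    size_t table_len;
};

/* Parses a 4-byte key either by formatting its bytes or by looking the
 * whole key up in a key->string table, as in Figures 35 and 36. */
static void toy_parse_key(const struct toy_keyparser *kp,
                          const uint8_t key[4], char out[1024])
{
    if (kp->type == FORMAT) {
        snprintf(out, 1024, kp->formatting_string,
                 key[0], key[1], key[2], key[3]);
        return;
    }
    uint32_t k = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                 ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
    for (size_t i = 0; i < kp->table_len; i++) {
        if (kp->table[i].key == k) {
            strcpy(out, kp->table[i].name);
            return;
        }
    }
    strcpy(out, "unknown");
}
```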
[Figure 37 - The bpf_keyparser structure: depending on its type field (FORMAT or TABLE), pcap_keyparser either formats the key with a formatting string such as “ip source %u.%u.%u.%u”, key[0], key[1], key[2], key[3], or translates it through a key -> string table (e.g. 0x00000001 -> “IP”, 0x00000002 -> “TCP”, 0x00000003 -> “WEB”, 0x00000004 -> “NetBios”).]
9 A case study
At the time of writing, only two LOOKUP callbacks are defined,
normal_lut_w_insert and normal_lut_wo_insert, described in the
previous chapter. Moreover, only two EXECUTE callbacks have been written:
count_packets, described in the previous chapter, and tcp_session, a function
that recognizes TCP flows, dealing with sequence and acknowledgement numbers,
and reports the amount of bytes transferred in each TCP session. As a consequence, at
the moment the architecture can realize only monitoring systems. In the following
paragraphs we will propose some monitoring problems and present solutions using
the TME extension. Finally, we will show the performance of the TME extension
through one of these examples, highlighting the contribution of the main components
of the architecture to the whole.
9.1 A system to monitor the traffic sent by each host in a network
We are interested in knowing the traffic sent by each host in a LAN; we use the MAC
source address to identify each host. Since we do not know the MAC address of each host
in advance, we need to create an open matrix, in which each cell is labeled with the
MAC source address, and the action for each cell is to 'count packets'.
We do not know the exact number of hosts on the LAN; just to make things easier,
we will use the default values to create the lookup table; these values create a lookup
table of 32.007 entries, and the number of usable blocks63 is 16.000. Since we want to
create an open matrix, the LOOKUP callback will be the default one,
normal_lut_w_insert; the EXECUTE callback, associated to each newly created
entry, is count_packets, which is the default one.
Since the length of the key must be expressed in terms of 32-bit words, and the actual
length of the key is 6 bytes, this length is padded to 2 32-bit words.
Therefore, the initialization code has to resize the extended_memory to store the TME
table, which is about 1,3 megabytes; since we use the default values, we have to issue
the INIT instruction, then set the key length to 2, and finally validate the TME data
block. The filter code will store the key in the first 8 bytes of the extended_memory, so
the TME data will start from byte 8.
63 Remember that each entry, in the lut_segment, is associated to a block, in the shared_memory_segment.
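The padding rule used above is a simple round-up of the key length to whole 32-bit words; as a sketch (the helper name is ours):

```c
/* round a key length in bytes up to 32-bit words, as the TME requires:
   a 6-byte MAC key becomes 2 words, a 2-byte port key becomes 1 word */
static unsigned key_len_words(unsigned key_len_bytes)
{
    return (key_len_bytes + 3) / 4;
}
```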
The last thing to do is to instruct the TME that the data passed to the user must be taken
from TME data block #0; this is done through the SET_ACTIVE_READ instruction.
The initialization code is the following:
1. SET_MEMORY k=1300000
2. INIT k=0
3. LD 2
4. SET_REGISTER_VALUE k=TME_KEY_LEN
5. LD 0
6. VALIDATE k=8
7. SET_ACTIVE_READ k=0
8. RET 1
The corresponding filter code has to extract the MAC source address from the packet
and save it into the first six bytes of the extended_memory, where we have decided
to store the key. This is done through a 32-bit and a 16-bit LOAD from the packet, and
through a 32-bit and a 16-bit STORE to the extended_memory.
Then we need to call the LOOKUP, to search the table for that particular MAC address,
and then EXECUTE, to increment the bytes and packets counters of the found entry.
Finally the code returns with a 0 value, through a RET instruction.
This is the filter code:
1. LD P[6]
2. EX_ST MEM_EX[0]
3. LDH P[10]
4. EX_STH MEM_EX[4]
5. LOOKUP jt=6 jf=6 k=0
6. EXECUTE k=0
7. RET 0
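In C terms, the two load/store pairs above copy the 6-byte MAC source address (bytes 6-11 of the Ethernet header) to the start of the extended memory; a sketch of the equivalent operation, ignoring the byte-order handling of the virtual machine registers (buffer names are ours):

```c
#include <string.h>

/* copy the MAC source address (packet bytes 6..11) into the key area,
   as done by the 32-bit and 16-bit load/store pairs of the filter */
static void store_mac_key(const unsigned char *pkt, unsigned char *mem_ex)
{
    memcpy(mem_ex, pkt + 6, 4);      /* LD  P[6]  / EX_ST  MEM_EX[0] */
    memcpy(mem_ex + 4, pkt + 10, 2); /* LDH P[10] / EX_STH MEM_EX[4] */
}
```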
What we described so far is the program to be injected into the kernel driver; now we
have to deal with its user-level counterpart. The following code is a very simple
example of a dispatcher_handler function64, that limits itself to printing out the
MAC address and the counters on the screen.
64 As described in chapter 5 - The original BPF architecture and WinPcap, this is the function that LIBPCAP calls when the data are transferred from the driver to user level.
void dispatcher_mac(u_char *user, const struct pcap_pkthdr *h, const u_char *sp)
{
/* h is the header, sp is the buffer */
/* we calculate the number of transferred blocks by dividing */
/* the buffer size by the block size (64 bytes) */
ULONG records=h->caplen/64;
ULONGLONG i;
c_p_data *d;
/* the first block is the 'out of counts' one: it counts the */
/* packets lost because the table is full */
/* the offset of 8 skips the key, which is 2 32-bit words long */
d=(c_p_data*)(sp+8);
printf("\n-----------------------------------------\n");
printf("Out of counts\tpackets= %-I64u\tbytes= %-I64u\n ",
d->packets,d->bytes);
/* this is the loop for the 'normal' blocks */
for (i=1;i<records;i++)
{
/*curr contains the pointer of the i-th block */
PUCHAR curr=sp+i*64;
d=(c_p_data*)(curr+8);
printf("MAC source = %.2x:%.2x:%.2x:%.2x:%.2x:%.2x\t",
curr[0],
curr[1],
curr[2],
curr[3],
curr[4],
curr[5]);
printf("packets= %-I64u\tbytes= %-I64u\n",d->packets,d->bytes);
}
printf("-----------------------------------------\n");
}
9.2 A system to monitor Web, FTP, SMTP and Telnet traffic
We are interested in obtaining statistical data relative to particular TCP services. This is
done through a close matrix, in which each entry corresponds to a TCP port, in
particular to one of the five interesting ones (80 → Web, 21 → FTP,
20 → FTP data, 25 → SMTP, 23 → Telnet). Here we can use a table of 11 entries and 6
blocks; we use a table with more entries than needed because we want to keep it quite
empty, since the table is managed as a hash.
The true key length is 2 bytes, but it will be padded to 1 32-bit word.
It is not necessary to resize the memory, since 64 kilobytes, the default size, are enough.
In the initialization phase we will initialize a TME data block with the default LOOKUP
callback, normal_lut_w_insert, which is able to insert new entries; then we will use
LOOKUP to force the insertion of the desired 5 entries, without performing a
subsequent EXECUTE; finally we will change the LOOKUP callback to
normal_lut_wo_insert, which cannot insert new entries.
Here is the initialization code:
1. INIT k=0
2. LD 1
3. SET_REGISTER_VALUE k=TME_KEY_LEN
4. LD 11
5. SET_REGISTER_VALUE k=TME_LUT_ENTRIES
6. LD 6
7. SET_REGISTER_VALUE k=TME_SHARED_MEMORY_BLOCKS
8. LD 0
9. VALIDATE k=64
10. SET_ACTIVE_READ k=0
11. LD 20
12. EX_STH MEM_EX[0]
13. LOOKUP jt=14 jf=14 k=0
14. LD 21
15. EX_STH MEM_EX[0]
16. LOOKUP jt=17 jf=17 k=0
17. LD 23
18. EX_STH MEM_EX[0]
19. LOOKUP jt=20 jf=20 k=0
20. LD 25
21. EX_STH MEM_EX[0]
22. LOOKUP jt=23 jf=23 k=0
23. LD 80
24. EX_STH MEM_EX[0]
25. LOOKUP jt=26 jf=26 k=0
26. LD NORMAL_LUT_WO_INSERT
27. SET_REGISTER_VALUE k=TME_LOOKUP_CODE
28. RET 1
The filter code is the following:
1. LDH P[12]
2. JE 0x800 jt=3 jf=15
3. LDH P[20]
4. JSET 0x1fff jt=15 jf=5
5. LDB P[23]
6. JE 0x6 jt=7 jf=15
7. LDXB 4*(P[14]&0xf)
8. LDH P[X+14]
9. EX_STH MEM_EX[0]
10. LOOKUP k=0 jt=14 jf=11
11. LDH P[X+16]
12. EX_STH MEM_EX[0]
13. LOOKUP k=0 jt=14 jf=14
14. EXECUTE k=0
15. RET 0
In practice, we check that the packet contains an IP payload (instructions 1-2), then we
verify that it is not a fragment (instructions 3-4), and finally we test that it is a TCP
packet (instructions 5-6).
Afterwards, we load the IP header length with a special BPF instruction (7); then we
load the TCP source port and perform a LOOKUP; if it is successful, we jump to
instruction 14 and increment the counters with an EXECUTE. If the search (through
the LOOKUP) was not successful, we load the TCP destination port and perform a
LOOKUP again; at this point, we always perform an EXECUTE; if even the second
search was not successful, the EXECUTE will increment the counters relative to the
'out-of-counts' block (the first block in the shared_memory_segment).
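Stripped of the BPF details, the control flow amounts to two lookups and one counter update; a sketch in C, where lookup_port and the counters array are illustrative stand-ins for the TME callbacks and the shared_memory_segment:

```c
/* illustrative model of the close matrix: entry 0 plays the role of
   the 'out of counts' block */
#define N_PORTS 5
static const unsigned short known_ports[N_PORTS] = {20, 21, 23, 25, 80};
static unsigned long counters[N_PORTS + 1]; /* [0] = out of counts */

/* stand-in for LOOKUP with normal_lut_wo_insert: returns the entry
   index, or 0 (the out-of-counts block) when the port is unknown */
static int lookup_port(unsigned short port)
{
    int i;
    for (i = 0; i < N_PORTS; i++)
        if (known_ports[i] == port)
            return i + 1;
    return 0;
}

/* stand-in for the LOOKUP/LOOKUP/EXECUTE sequence of the filter */
static void classify(unsigned short src, unsigned short dst)
{
    int e = lookup_port(src);
    if (e == 0)        /* first LOOKUP failed: try the destination */
        e = lookup_port(dst);
    /* EXECUTE: update the found entry or the out-of-counts block */
    counters[e]++;
}
```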
The flow chart of the algorithm is shown in Figure 38.
[Figure: lookup on the source port; if it is known, increment its counters and exit; otherwise lookup on the destination port; if it is known, increment its counters, else increment the 'out of counts' counters; then exit.]
Figure 38 – Flow chart of the system to monitor some known TCP ports
The following code is an example of dispatcher_handler function that prints out
the packets for each port, together with the timestamp of the block.
void dispatcher_handler(u_char *user, const struct pcap_pkthdr *h,
const u_char *sp)
{
ULONG i,records=h->caplen/64;
printf("---------------------------------\n");
for (i=0;i<records;i++)
{
PUCHAR block=sp+i*64;
/* we convert the TCP port from network byte order */
/* to host byte order */
USHORT port=block[0]*256+block[1];
/* the counter structure, c_p_data, starts 4 bytes after the key */
c_p_data *d=(c_p_data*)(block+4);
ULONG hours,mins,secs;
secs=d->timestamp.ts_sec%60;
mins=(d->timestamp.ts_sec/60)%60;
hours=(d->timestamp.ts_sec/3600)%24;
switch(port)
{
case 80: //web
printf("Web traffic at %.2u:%.2u:%.2u\tPackets=%I64u\n",
hours,mins,secs,d->packets);
break;
case 21: //ftp
printf("FTP traffic at %.2u:%.2u:%.2u\tPackets=%I64u\n",
hours,mins,secs,d->packets);
break;
case 20: //ftp data
printf("FTP data traffic at %.2u:%.2u:%.2u\tPackets=%I64u\n",
hours,mins,secs,d->packets);
break;
case 23: //telnet
printf("Telnet traffic at %.2u:%.2u:%.2u\tPackets=%I64u\n",
hours,mins,secs,d->packets);
break;
case 25: //smtp
printf("SMTP traffic at %.2u:%.2u:%.2u\tPackets=%I64u\n",
hours,mins,secs,d->packets);
break;
case 0: //out of counts block
printf("Other traffic at %.2u:%.2u:%.2u\tPackets=%I64u\n",
hours,mins,secs,d->packets);
break;
default:
break;
}
}
}
9.3 A system to monitor IP, TCP and UDP traffic
We are interested in knowing the overall traffic, in terms of bytes, the amount of IP
traffic and, in particular, the percentage of TCP and UDP traffic relative to IP traffic.
This case is different from the previous ones, since we want to acquire data relative to
heterogeneous aspects of network traffic; the problem can be solved by using a sort of
string as a key to enter the lookup table, as presented in paragraph 3.2.3 - Multiple
analyses.
The used algorithm is shown in Figure 39.
[Figure: LOOKUP with key 'ALL' and EXECUTE for every packet; if the packet is IP, LOOKUP with key 'IP' and EXECUTE; then, if it is TCP, LOOKUP with key 'TCP' and EXECUTE; if it is UDP, LOOKUP with key 'UDP' and EXECUTE; otherwise exit.]
Figure 39 – Algorithm used to realize a multiple level monitor
The initialization phase has to create a table of 9 entries (as usual, more than
needed, to achieve better performances with the hash search) and 5 blocks (4 blocks
for 'ALL', 'IP', 'TCP', 'UDP' plus one for the 'out-of-counts' block); the key
length is 4 bytes (the minimum one, 1 32-bit word).
Here is the initialization code:
1. INIT k=0
2. LD 1
3. SET_REGISTER_VALUE k=TME_KEY_LEN
4. LD 9
5. SET_REGISTER_VALUE k=TME_LUT_ENTRIES
6. LD 5
7. SET_REGISTER_VALUE k=TME_SHARED_MEMORY_BLOCKS
8. LD 0
9. VALIDATE k=4
10. SET_ACTIVE_READ k=0
11. RET 1
The filter code, that implements the algorithm shown in Figure 39, is the following:
1. LD 0x414c4c00 //string ‘ALL’
2. EX_ST MEM_EX[0]
3. LOOKUP k=0 jt=4 jf=4
4. EXECUTE k=0
5. LDH P[12]
6. JE 0x800 jt=8 jf=7
7. RET 0
8. LD 0x49500000 //string ‘IP’
9. EX_ST MEM_EX[0]
10. LOOKUP k=0 jt=11 jf=11
11. EXECUTE k=0
12. LDB P[23]
13. JE 0x6 jt=14 jf=19
14. LD 0x54435000 //string ‘TCP’
15. EX_ST MEM_EX[0]
16. LOOKUP k=0 jt=17 jf=17
17. EXECUTE k=0
18. RET 0
19. JE 0x11 jt=20 jf=24
20. LD 0x55445000 //string ‘UDP’
21. EX_ST MEM_EX[0]
22. LOOKUP k=0 jt=23 jf=23
23. EXECUTE k=0
24. RET 0
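The 32-bit constants loaded at instructions 1, 8, 14 and 20 are simply the ASCII strings packed into a word, most significant byte first and padded with zero bytes; a quick check (the helper is ours):

```c
#include <stdint.h>

/* pack up to four characters into a 32-bit key, MSB first,
   padding with zero bytes as in the filter above */
static uint32_t pack_key(char a, char b, char c, char d)
{
    return ((uint32_t)(unsigned char)a << 24) |
           ((uint32_t)(unsigned char)b << 16) |
           ((uint32_t)(unsigned char)c << 8)  |
            (uint32_t)(unsigned char)d;
}
```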
An example of a dispatcher_handler function that prints out the results is the
following:
void dispatcher_handler(u_char *user, const struct pcap_pkthdr *h,
const u_char *sp)
{
ULONG i,records=h->caplen/64;
ULONGLONG all=0,ip=0,tcp=0,udp=0;
for (i=0;i<records;i++)
{
PUCHAR block=sp+i*64;
c_p_data *d=(c_p_data*)(block+4);
if((block[0]=='A') &&
(block[1]=='L') &&
(block[2]=='L') &&
(block[3]=='\0'))
all=d->bytes;
if((block[0]=='I') &&
(block[1]=='P') &&
(block[2]=='\0') &&
(block[3]=='\0'))
ip=d->bytes;
if((block[0]=='T') &&
(block[1]=='C') &&
(block[2]=='P') &&
(block[3]=='\0'))
tcp=d->bytes;
if((block[0]=='U') &&
(block[1]=='D') &&
(block[2]=='P') &&
(block[3]=='\0'))
udp=d->bytes;
}
printf("Total bytes = %I64u\n",all);
/* the ratios are multiplied by 100 to obtain percentages */
if(all!=0)
printf("Percentage of IP %f%%\n",100.0*(float)ip/(float)all);
if(ip!=0)
{
printf("Percentage of TCP relative to IP %f%%\n",
100.0*(float)tcp/(float)ip);
printf("Percentage of UDP relative to IP %f%%\n",
100.0*(float)udp/(float)ip);
}
}
9.4 Performances
9.4.1 The monitor system under test
The monitor system used for the test uses the TCP session ID as a key to enter the
hash table.
In particular, the key is composed of the IPv4 source and destination addresses and the
TCP source and destination ports, in this order; therefore the key is 12 bytes wide, that
is to say 3 32-bit words, as shown in the following table:
IP source (32 bits) | IP destination (32 bits) | TCP src port (16 bits) | TCP dest port (16 bits)
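In C the 12-byte key can be pictured as the following packed layout (the struct and field names are ours; byte-order issues are ignored):

```c
#include <stdint.h>

/* the 12-byte TCP session key: 3 32-bit words, in the order
   produced by the filter code below */
#pragma pack(push, 1)
struct tcp_session_key {
    uint32_t ip_src;   /* IPv4 source address      */
    uint32_t ip_dst;   /* IPv4 destination address */
    uint16_t tcp_src;  /* TCP source port          */
    uint16_t tcp_dst;  /* TCP destination port     */
};
#pragma pack(pop)
```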
We use an open matrix, and we will initialize the TME with the default values; the
initialization code used for the tests is the following:
1. SET_MEMORY k=1300000
2. INIT k=0
3. LD 3
4. SET_REGISTER_VALUE k=TME_KEY_LEN
5. LD 0
6. VALIDATE k=12
7. SET_ACTIVE_READ k=0
8. RET 1
The filter code has to check that the packet contains IP and TCP and that it is not an IP
fragment; on success, it extracts the IP addresses and the TCP ports to generate the key
used to enter the lookup table. Finally we perform a LOOKUP and an EXECUTE.
1. LDH P[12]
2. JE 0x800 jt=3 jf=18
3. LDB P[23]
4. JE 0x6 jt=5 jf=18
5. LDH P[20]
6. JSET 0x1fff jt=18 jf=7
7. LD P[26]
8. EX_ST MEM_EX[0]
9. LD P[30]
10. EX_ST MEM_EX[4]
11. LDXB 4*(P[14]&0xf)
12. LDH P[X+14]
13. EX_STH MEM_EX[8]
14. LDH P[X+16]
15. EX_STH MEM_EX[10]
16. LOOKUP k=0 jt=17 jf=17
17. EXECUTE k=0
18. RET 0
The dispatcher handler function used simply calculates the total amount of packets
received by adding the packet counters of the various blocks.
It prints out this sum, and the number of received blocks. The read process is executed
every five seconds. This is the code of this function:
void dispatcher_handler(u_char *user, const struct pcap_pkthdr *h,
const u_char *sp)
{
ULONG records=h->caplen/64;
ULONGLONG i;
ULONGLONG j=0;
c_p_data *d;
printf("\n-----------------\nRecords=%u\n",records);
for (i=0;i<records;i++)
{
PUCHAR curr=sp+i*64;
d=(c_p_data*)(curr+12);
j+=d->packets;
}
printf("Total packets=%I64u\n",j);
}
9.4.2 Test bed
The tests were performed using two machines shipped with a copper Gigabit
Ethernet NIC each, connected with a UTP cross cable. The operating system was
Windows 2000 Service Pack 2. The TCP/IP stack was not bound to the NIC devices, so
the IP packets received by these NICs are processed only by the WinPcap driver. We
used the version of the driver that generates timestamps with the assembler instruction
RDTSC.
The test bed is shown in Figure 40.
[Figure: a traffic generator (sender) and a receiver connected by full-duplex 1 Gbps Ethernet; one machine is a Pentium III 500 MHz (256 MB RAM, 13 GB HD, Windows 2000 Professional eng), the other a Pentium II 400 MHz (128 MB RAM, 6.4 GB HD, Windows 2000 Server eng); both use a 3Com Gigabit Ethernet 3C996T NIC.]
Figure 40 – Test bed
The traffic was generated by a program based on a modified version of WinPcap that is
able to send packets at a specified rate, for testing purposes. It sends an IPv4 packet
with a TCP payload65 several times. Figure 41 and Figure 42 show the trend of the
maximum packet rate and the corresponding traffic (in Mbps) for different packet sizes.
[Figure: maximum packet rate (in pkts/s, up to 160.000) versus packet size (in bytes, up to 1400).]
Figure 41 – Packet rate vs. packet size
[Figure: generated traffic (in Mbps, up to 700) versus packet size (in bytes, up to 1600).]
Figure 42 – Traffic vs. packet size
The CPU load was measured through Task Manager; the CPU ticks were measured
through particular counters implemented in the Intel Pentium processors, which count
the CPU clock ticks and can be enabled and disabled through a special assembler
instruction.
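A sketch of how such a tick counter can be read from user code (GCC-style inline assembly; on other compilers or architectures the sketch falls back to a dummy counter so it stays portable):

```c
#include <stdint.h>

/* read the processor tick counter; on x86 this is the RDTSC
   instruction used by the thesis driver, elsewhere a dummy counter */
static uint64_t read_ticks(void)
{
#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    static uint64_t fake;
    return ++fake;
#endif
}
/* the cost of a code fragment is then the difference between two
   readings taken before and after it */
```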
9.4.3 Test 1
The first test is used to show the overall performances of the TME architecture with
the BPF filter described in paragraph 9.4.1. It points out the dependency of the CPU
load on the packet rate.
Several million packets were sent on the link, at different packet rates; the packet size
was always 100 bytes. The BPF filter is the one described in paragraph 9.4.1.
65 In particular, MAC source is 2:2:2:2:2:2, MAC destination is 1:1:1:1:1:1, IP source is 1.2.3.4, IP destination is 5.6.7.8, TCP source port is 4500 and TCP destination port is 4444.
9.4 Performances
126
The result of this test is shown in Figure 43.
[Figure: CPU load (0-100%) versus packet rate, theoretical and experimental curves; the annotated rates on the chart are 149.300 and 225.000 pkts/s.]
Figure 43 – CPU vs. Packet rate
It is clear that the CPU load grows linearly with the packet rate; the maximum packet
rate we reached during the tests was 145.000 packets/s, but the CPU load of the sender
machine was never at 100%. We do not know why we could not reach a higher packet
rate; we suspect that the limit is due either to the WinPcap send module, or to a physical
limit of the Gigabit NICs. In theory, if the trend of the curve in Figure 43 is linear, the
receive machine under test could manage a packet rate of up to 225.000 packets/s
without losses; we expect that a much faster processor, for example a Pentium 4
machine, would be able to deal with a larger amount of packets per second.
9.4.4 Test 2
The second test points out the contributions of each component to the overall
performance of the system under test. We sent one thousand packets, and we measured
the CPU ticks used to process these packets, isolating the various parts that form the
monitoring system under test.
The test was performed in several steps:
• at first, the filter code was set to a single 'RET 0' instruction. In practice, this shows
the CPU ticks used by the underlying network drivers to receive a packet; this is the
most variable information, since it depends on a variety of parameters, among which
the specific NIC and its drivers, the packet rate, and the specific timings of the PCI bus;
• then we added the filter code that generates the key for the lookup instruction
(instructions 1-13 of the filter code); this test points out the contribution of the key
generation, that is to say the part in which the packet headers are parsed and the key,
formed by some fields of the headers, is created;
• afterwards we added the LOOKUP instruction; in this first case, the driver was
modified not to insert timestamps; this test gives the contribution of the LOOKUP
instruction, without the timestamping issues;
• then we re-enabled the timestamp generation and performed the last test again; in
this way we can obtain the CPU ticks needed to add the timestamps; the tests were
made first using the RDTSC instruction to generate the timestamp, and then the
Windows API KeQueryPerformanceCounter;
• finally we added the EXECUTE instruction too; this test uses the normal filter
code previously described.
1. LDH P[12]
2. JE 0x800 jt=3 jf=18
3. LDB P[23]
4. JE 0x6 jt=5 jf=18
5. LDH P[20]
6. JSET 0x1fff jt=18 jf=7
7. LD P[26]
8. EX_ST MEM_EX[0]
9. LD P[30]
10. EX_ST MEM_EX[4]
11. LDXB 4*(P[14]&0xf)
12. LDH P[X+14]
13. EX_STH MEM_EX[8]
14. LDH P[X+16]
15. EX_STH MEM_EX[10]
16. LOOKUP k=0 jt=17 jf=17
17. EXECUTE k=0
18. RET 0
(instructions 1-15 precede the LOOKUP and form the key generation; instruction 16 is the LOOKUP; instruction 17 is the EXECUTE; instruction 18 alone corresponds to the underlying drivers)
The result of these tests is shown in Figure 44 and Figure 45.
[Figure: breakdown of the CPU ticks per packet using RDTSC timestamps – underlying drivers: 1350; key generation: 418; LOOKUP without timestamp: 146; timestamp: 296; EXECUTE: 124.]
Total CPU ticks = 2334
Figure 44 – CPU ticks of each component using RDTSC to generate timestamps
9.4 Performances
128
[Figure: breakdown of the CPU ticks per packet using KeQueryPerformanceCounter timestamps – underlying drivers: 1350; key generation: 418; LOOKUP without timestamp: 146; timestamp: 2156; EXECUTE: 124.]
Total CPU ticks = 4194
Figure 45 – CPU ticks of each component using KeQueryPerformanceCounter to generate timestamps
Several conclusions can be inferred from these tests:
• It is clear that generating timestamps with KeQueryPerformanceCounter is almost 10
times slower than using the RDTSC instruction. This is why it is so important to use
this assembler instruction, although it has the previously mentioned problems
related to the frequency calculation.
• Even using the RDTSC instruction, the ticks needed for timestamping are more than
10% of the total CPU ticks: this is due to the fact that the process of timestamping
involves several 64-bit integer divisions, which are very slow on 32-bit processors.
• Using RDTSC, less than 50% of the total CPU ticks is used by the WinPcap
monitoring system; the remaining part depends on the underlying NIC drivers.
• Using a processor with a clock of 500 MHz (like the one used for the tests), the
interval between the arrival of two packets must be at least 4.7 µs using the RDTSC
timestamps, and at least 8.4 µs using the KeQueryPerformanceCounter timestamps.
This means that the maximum traffic rate is about 215.000 pkts/s for the former
solution, and 119.000 pkts/s for the latter.
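The last two figures follow directly from the measured tick totals and the 500 MHz clock; the arithmetic, as a small helper pair (the names are ours):

```c
/* minimum inter-arrival time and maximum rate from a tick budget:
   ticks / clock = seconds per packet; clock / ticks = packets/s */
static double min_interval_us(unsigned ticks, double clock_hz)
{
    return ticks / clock_hz * 1e6;
}

static double max_rate_pps(unsigned ticks, double clock_hz)
{
    return clock_hz / ticks;
}
```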
9.4.5 Test 3
This test points out the performances of the used hash algorithm when one or
more hash collisions happen.
The test was performed in this way:
• at first we generated N different packets, to generate N-1 collisions;
• then we sent some packet bursts that involved one or more collisions with the
previously sent packets, and we measured the CPU load.
Each packet burst was formed of ten million identical packets of 100 bytes, with a packet
rate of 100.000 packets/s; each burst caused a fixed number of collisions for
each packet. The test was repeated with bursts that caused an increasing number of
collisions.
The results of the test are shown in Figure 46.
[Figure: CPU ticks per packet versus number of hash collisions – 0 collisions: 2334 ticks; 1: 2405; 2: 2482; 3: 2545; 4: 2618; 5: 2731.]
Figure 46 – CPU ticks vs. Hash collisions
The outcome of the test shows clearly that performances are not lowered much by
collisions; moreover, we must point out a detail: usually two or more collisions happen
when the hash table is quite full; this can be avoided by using large hash tables, and
limiting the fill rate by using a number of blocks (in the
shared_memory_segment) which is about half the number of entries (in the
lut_segment).
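The measured points suggest a roughly linear cost: about 79 extra ticks per collision on top of the 2334-tick base. A sketch of this model (the slope is our rough fit to the measured points, not a TME parameter):

```c
/* approximate cost model suggested by the collision measurements:
   base cost plus a near-constant per-collision overhead */
static unsigned estimated_ticks(unsigned collisions)
{
    return 2334 + 79 * collisions; /* 79 ≈ (2731 - 2334) / 5 */
}
```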
9.4.6 Test 4
This last test points out that performances do not depend on the packet size.
To demonstrate this fact, we sent some packet bursts with the same packet rate (50.000
packets/s) and different packet sizes. For convenience of measurement, this test is
expressed in terms of CPU load rather than CPU ticks.
The result of this test is shown in Figure 47.
[Figure: CPU load versus packet size at a fixed rate of 50.000 pkts/s – 60 bytes: 24%; 100: 25%; 150: 25%; 200: 25%; 500: 25%; 1000: 26%; 1514: 27%.]
Figure 47 – CPU vs. Packet size
In practice, the CPU load grows slightly; this is because the underlying network
drivers have to process a number of bytes per second that grows as the packet size
rises. Apart from this, the CPU load can be considered almost constant: the packet
is never copied, so the performances do not depend on the packet length.
10 Future work
At the moment of writing this paper, the architecture is somewhat immature: the
proposed extension, TME, is completely written, but many features that are necessary to
build a truly universal packet filtering architecture have still to be realized. Here is a
list of the new features whose development is still in progress:
• New LOOKUP and EXECUTE callbacks. At the moment there are only two
LOOKUP callbacks and two EXECUTE callbacks; in particular, it is necessary to
add a new lookup_fcn to deal with ACLs. As regards EXECUTE callbacks, the
existing function to deal with TCP sessions (tcp_session) is quite rough, as it
cannot deal with the latest TCP extensions (e.g. TCP SACK). Another interesting
feature, to be added through a LOOKUP callback, is the chance to perform pattern
matching, maybe in a way similar to Snort.
• Opportunity to bridge packets from one interface to another after having filtered
them at kernel level. At the moment it is possible to perform this task at user
level, that is to say transferring packets received by an interface up to user level and
then sending them out of another interface; this solution does not scale when the
packet rate grows, since the operation of receiving and sending each packet
involves at least two context switches and three or four copies of the packet.
• Test a kernel-buffered version of the driver. The current version of the driver
performs the packet filtering on the fly, by executing the BPF virtual processor,
that is to say without buffering them. Although a packet copy can be time
consuming, the buffering technique can make things better during network bursts.
• The problems related to timestamps were solved by using the RDTSC assembler
instruction. The open issues are how to obtain the processor frequency in an
elegant manner (the current technique is too complex and ‘dirty’), and how to
behave on particular processors (in particular the Intel processors implementing
the SpeedStep technology).
• Implement logging capabilities. This feature is quite simple to realize, since we
can exploit the circular buffer, usually used to store packets at kernel level, by
inserting a new working mode, MODE_LOG.
• On-the-fly filter modification. It would be very interesting to have the chance to
modify the packet filtering system while it is running; for example, if we are
monitoring some precise TCP ports (a close matrix), we may want to add a new TCP
port to the matrix. In practice, we want to inject new snippets of initialization
code that modify the behavior of the currently working filter program.
• Cascading filters. At the moment each instance of the driver uses only one filter;
we have to open several driver instances in parallel to manage multiple filters.
An optimization could be merging n filters into one big one, either in series or in
parallel, composed of modules that correspond to the original filters, each one
with a different working mode, for example one in capture mode, one in
statistical mode, one in logging mode.
11 Related Work
11.1 The statistical mode of WinPcap drivers
WinPcap offers the chance to work in a particular mode, the statistical mode
(MODE_STAT), in which packets are not captured and passed to a user-level
application, but rather counted.
From the BPF filters' point of view, WinPcap works in the same way as in capture
mode: each packet is filtered by analyzing its protocol headers. The value returned by
the filter (which is a sort of Boolean function applied to the packet) tells whether the
packet must be accepted or discarded. The difference between the original capture
mode of WinPcap (and of LIBPCAP) and the statistical mode comes after this phase: in
capture mode the accepted packet is copied into a kernel buffer and then passed to the
user-level application, while in statistical mode two counters, representing the number of
bytes and of packets, are incremented when the packet is accepted.
This mode performs a simple kernel level monitoring, that is to say packets are never
transferred up to user level, so the performances are generally excellent.
Another advantage of this working mode is that the filters are the same ones used in
capture mode; as a consequence, the compiler shipped with LIBPCAP (and therefore
with WinPcap) can be used seamlessly. The last advantage is that WinPcap is actually a
library, so users can write their own tools.
This working mode has a big drawback: it can take into account only one single
aspect of traffic at a time. If we want to analyze several aspects of network traffic (e.g.
three different TCP ports), we have to open several instances of the WinPcap driver,
each one considering a particular traffic feature (e.g. each instance filters a particular
TCP port). This approach does not scale when the number of driver instances grows, as
the overhead becomes non-negligible; moreover, the process of retrieving data from the
various driver instances to the application becomes quite time consuming, since a
context switch is involved for every single instance.
Another drawback (described in paragraph 3.3.1, too) is that it does not fit open
matrix statistics, in which the matrix dimension is not fixed.
11.2 NeTraMet
NeTraMet is free software developed by The University of Auckland's Information
Technology Systems & Services group, and it is the first implementation of the
Realtime Traffic Flow Measurement (RTFM) Working Group's Measurement
Architecture, outlined in RFC 1272, "Internet Accounting: Background".
It provides a means to obtain statistics about so-called flows, that is to say the
traffic, in terms of bytes and packets, from a source host to a destination host. The
sources and destinations of the flows can be identified at different OSI layers, in
particular at layer 2 (data-link), 3 (network) and 4 (transport).
NeTraMet is able to work over Ethernet at the data-link layer, over IP, DECnet phase IV,
IPX, Ethertalk and CLNS NSAP at the network layer, and over various transport
protocols, depending on the underlying network protocols.
It consists of two main parts, the meters, that interact directly with the network,
analyzing the packets flow, and the monitors, that provide an interface between the user
and the meters.
The meters are programmed through so-called rulesets; they are sequences of
instructions to be executed by the meters' Pattern Matching Engine.
NeTraMet is shipped with a compiler that is able to translate rulesets written in
SRL, the Simple Ruleset Language, a high-level procedural language which enables
users to create effective rulesets without having to understand the details of the meters'
architecture. The SRL syntax is documented in RFC 2723.
A big advantage of this architecture is that rulesets can be written using a high level
language; the second advantage is that the monitors are able to generate detailed reports
on the data retrieved by meters. The last advantage is that in theory meters and monitors
can work on different machines, and the communication is obtained through the SNMP
protocol.
One of the main disadvantages of the NeTraMet architecture is that it is able to deal
with flows only, that is to say it gives information relative to packets going from one
host to another; we cannot obtain statistics like "web traffic". Another issue is that, on
most of the supported platforms, it is based on LIBPCAP; as a consequence, it
captures all the packets, bringing them to user level to perform the needed processing.
This approach generally reduces performance greatly, since packets are copied at least
two times before reaching the so-called meters.
11.3 NTOP
NTOP is a free network tool developed by Luca Deri, that is based on LIBPCAP.
This tool is able to show several aspects of network usage, either at the command line or
through a Web interface.
The main information it can provide is:
• Data received and transmitted by each host of a network, divided into various
protocols, among which IP, TCP, UDP, IPX, ARP, ICMP, NetBios, OSPF, IGMP.
• Data transmitted over TCP and UDP, divided into application protocols, among
which FTP, Web, Telnet, Mail, NetBios, NFS.
• The average and peak throughput of each host.
• The hosts’ activity in the various moments of the day.
• Global statistics, for example the amount of broadcast relative to the overall traffic,
the average packet size.
• Matrices relative to data exchanged between the various hosts of the LAN.
• Some information about TCP sessions.
It is clear from the features described above that the strength of NTOP is that it
offers a variety of statistics about network traffic within a single tool; moreover, the
statistics are accessible with a standard Web browser, even remotely.
However, this tool has a big drawback, too: to obtain these statistics it captures the
first 350 bytes of each packet and brings them to user level; this approach has the
same consequences seen for NeTraMet: packets are copied several times before they are
processed into statistical data. In fact, the tool has proved to perform poorly on
heavily loaded networks, where the most frequent issue is packet loss. Moreover,
the user cannot define new statistics.
12 TME Extension Programming Manual
Some new instructions have been introduced into the original BPF architecture to deal
with the TME; basically, they can be divided into three main categories:
• extended memory management instructions
• TME operative instructions
• initialization instructions.
Each instruction is composed of four elements:
• code: it contains the opcode of the instruction;
• k: it contains a value passed to the instruction, for example in LOAD instructions it
can contain a constant or an address;
• jt: it contains the number of instructions to be skipped forward; it is used only in
JUMP on condition instructions; the program counter is incremented by this value
when the condition coupled with the instruction is true;
• jf: it is almost the same as jt, but the program counter is incremented when the
condition is false.
In the following chapters we will deal with each one, describing each opcode.
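The four-field layout above can be sketched as a C structure; the field names and sizes follow the classic BPF instruction format, which this extension inherits.

```c
#include <stdint.h>

/* Sketch of the four-field instruction layout described above; field
   sizes follow the classic BPF instruction format that the TME
   extension inherits. */
struct bpf_insn {
    uint16_t code; /* opcode of the instruction                              */
    uint8_t  jt;   /* added to the program counter if the condition is true  */
    uint8_t  jf;   /* added to the program counter if the condition is false */
    uint32_t k;    /* generic operand: a constant, an address or an offset   */
};
```

Each instruction therefore fits in 8 bytes, and conditional jumps always go forward, since jt and jf are unsigned offsets.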
12.1 Extended memory management instructions
The extended architecture is equipped with an additional memory area, the extended
memory, whose default size is 64 kB.
New instructions permit 8-, 16- and 32-bit loads and stores, from/to registers A and
X, in absolute mode (i.e. parameter k in the instruction contains the absolute address in
memory), and relative mode (i.e. k contains the offset in memory, base address is stored
in X); there is also an instruction to dynamically resize the extended memory.
Data is represented in big-endian mode, to preserve the order in which bytes traverse
the wire. A more detailed discussion on this problem can be found in paragraph 6.1.1.
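As a sketch of the big-endian convention just described, a 32-bit extended-memory load can be modeled as follows; the function name is illustrative, not part of the TME API.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative model of a 32-bit big-endian load from the extended
   memory: the byte at address k (the first one seen on the wire)
   becomes the most significant byte of the loaded word. */
static uint32_t ex_ld_be32(const uint8_t *mem, size_t k) {
    return ((uint32_t)mem[k]     << 24) |
           ((uint32_t)mem[k + 1] << 16) |
           ((uint32_t)mem[k + 2] <<  8) |
            (uint32_t)mem[k + 3];
}
```

This way the in-memory byte order of a captured packet field is preserved regardless of the endianness of the host CPU.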
The new instructions are
• SETMEMORY
• EX_LD
• EX_LDH
• EX_LDB
• EX_LD_REL
• EX_LDH_REL
• EX_LDB_REL
• EX_LDX
• EX_LDXH
• EX_LDXB
• EX_ST
• EX_STH
• EX_STB
• EX_ST_REL
• EX_STH_REL
• EX_STB_REL
• EX_STX
• EX_STXH
• EX_STXB
12.1.1 SETMEMORY
This instruction resizes the extended memory; k contains the new size, in bytes.
On failure (e.g. memory not available), the machine flow is interrupted and the virtual
processor returns 0 (see footnote 66).
NOTES:
• it is not possible to set a size of 0 bytes;
• when memory is resized, the content of memory, previous to the operation, is lost;
• SETMEMORY is not allowed in filter programs.
SETMEMORY (opcode BPF_MISC|BPF_TME|BPF_SETMEMORY): registers involved: none; allowed in filter programs: no; on failure: flow interrupted, return 0.
12.1.2 EX_LDx
These instructions load a value from memory into register A; k contains the absolute
address to load from; x identifies how many bytes to load: nothing means a 32-bit
word, h a 16-bit word stored in the least significant part of A, b an 8-bit word
stored in the least significant part of A.
EX_LD (opcode BPF_LD|BPF_MEM_EX_IMM|BPF_W): A ← M[k]..M[k+3]; registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0 (66).
EX_LDH (opcode BPF_LD|BPF_MEM_EX_IMM|BPF_H): A ← M[k]..M[k+1], upper 16 bits cleared; registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_LDB (opcode BPF_LD|BPF_MEM_EX_IMM|BPF_B): A ← M[k], upper 24 bits cleared; registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0.
66 The BPF virtual processor that executes the instructions returns 0 immediately if an error occurs, for example if the opcode is unknown or if an instruction tries to read or write a non-existent memory location.
12.1.3 EX_LDx_REL
These instructions load a value from memory into register A; here k contains the
relative address to load from, while register X contains the base address. x identifies
how many bytes to load: nothing means a 32-bit word, h a 16-bit word stored in the
least significant part of A, b an 8-bit word stored in the least significant part of A.
EX_LD_REL (opcode BPF_LD|BPF_MEM_EX_REL|BPF_W): A ← M[X+k]..M[X+k+3]; registers involved: A, X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_LDH_REL (opcode BPF_LD|BPF_MEM_EX_REL|BPF_H): A ← M[X+k]..M[X+k+1], upper 16 bits cleared; registers involved: A, X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_LDB_REL (opcode BPF_LD|BPF_MEM_EX_REL|BPF_B): A ← M[X+k], upper 24 bits cleared; registers involved: A, X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.1.4 EX_LDXx
These instructions are the same as EX_LDx, except that the value is stored in
register X.
EX_LDX (opcode BPF_LDX|BPF_MEM_EX_IMM|BPF_W): X ← M[k]..M[k+3]; registers involved: X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_LDXH (opcode BPF_LDX|BPF_MEM_EX_IMM|BPF_H): X ← M[k]..M[k+1], upper 16 bits cleared; registers involved: X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_LDXB (opcode BPF_LDX|BPF_MEM_EX_IMM|BPF_B): X ← M[k], upper 24 bits cleared; registers involved: X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.1.5 EX_STx
These instructions store a value from register A to memory; k contains the absolute
address where to store; x identifies how many bytes to store: nothing means a 32-bit
word, h means that the least significant 16 bits of A are stored, b means that the least
significant 8 bits of A are stored.
EX_ST (opcode BPF_ST|BPF_MEM_EX_IMM|BPF_W): M[k]..M[k+3] ← A; registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_STH (opcode BPF_ST|BPF_MEM_EX_IMM|BPF_H): M[k]..M[k+1] ← least significant 16 bits of A; registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_STB (opcode BPF_ST|BPF_MEM_EX_IMM|BPF_B): M[k] ← least significant 8 bits of A; registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.1.6 EX_STx_REL
These instructions store a value from register A to memory; k contains the relative
address where to store, and register X contains the base address; x identifies how many
bytes to store: nothing means a 32-bit word, h means that the least significant 16 bits of
A are stored, b means that the least significant 8 bits of A are stored.
EX_ST_REL (opcode BPF_ST|BPF_MEM_EX_REL|BPF_W): M[X+k]..M[X+k+3] ← A; registers involved: A, X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_STH_REL (opcode BPF_ST|BPF_MEM_EX_REL|BPF_H): M[X+k]..M[X+k+1] ← least significant 16 bits of A; registers involved: A, X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_STB_REL (opcode BPF_ST|BPF_MEM_EX_REL|BPF_B): M[X+k] ← least significant 8 bits of A; registers involved: A, X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.1.7 EX_STXx
These instructions are the same as EX_STx, except that they store the value in
register X.
EX_STX (opcode BPF_STX|BPF_MEM_EX_IMM|BPF_W): M[k]..M[k+3] ← X; registers involved: X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_STXH (opcode BPF_STX|BPF_MEM_EX_IMM|BPF_H): M[k]..M[k+1] ← least significant 16 bits of X; registers involved: X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
EX_STXB (opcode BPF_STX|BPF_MEM_EX_IMM|BPF_B): M[k] ← least significant 8 bits of X; registers involved: X; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.2 TME operative instructions
This class of instructions permits interaction with the TME core during the
filtering process; they are:
• LOOKUP
• EXECUTE
• SET_ACTIVE.
12.2.1 LOOKUP
This instruction performs a lookup in the table associated with the currently ACTIVE
TME data block; it is actually a front-end to a lookup-fcn, specifically the one whose
ID is stored in register TME_LOOKUP_CODE of the currently active TME data block.
The instruction is implemented as a jump: if the lookup-fcn returns TME_TRUE, the
instruction condition is true (so the program counter is incremented by the jt value of
the instruction); if it returns TME_FALSE, the condition is false (so the program
counter is incremented by the jf value). LOOKUP needs a key to perform its task; the
key must be stored in the extended memory, and k must contain the base address of the key.
If no TME data block is currently active, this instruction fails.
lookup-fcns usually save the address of the entry satisfying the key in the
TME_LAST_FOUND register of the currently ACTIVE TME data block, or
0xFFFFFFFF if the search was not successful. Nevertheless, it is up to the particular
lookup-fcn in use to behave this way.
LOOKUP (opcode BPF_MISC|BPF_TME|BPF_LOOKUP): registers involved: none; allowed in filter programs: yes; on failure: flow interrupted, return 0.
NOTES:
• at the moment, the normal_lut_w_insert lookup-fcn returns TME_FALSE if the
table is full (i.e. it is not possible to insert new entries), while the
normal_lut_wo_insert lookup-fcn returns TME_FALSE if the provided key was not found in the table;
• this instruction always refers to the ACTIVE TME data block;
• this instruction can be used during initialization phase to force the creation of some
entries in the lookup table.
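To make the LOOKUP contract concrete, here is a hedged sketch, in C, of a normal_lut_w_insert-style lookup-fcn. The types, the table size and the probing scheme are illustrative assumptions, not the actual TME implementation; only the contract (return TME_TRUE/TME_FALSE, update TME_LAST_FOUND, fail with 0xFFFFFFFF when full) follows the text above.

```c
#include <stdint.h>
#include <string.h>

#define TME_TRUE    1
#define TME_FALSE   0
#define LUT_ENTRIES 8   /* tiny table, for illustration only */

/* Illustrative stand-in for a TME data block: just the lookup table
   and a TME_LAST_FOUND-style register. */
typedef struct {
    uint32_t key[LUT_ENTRIES];  /* stored keys           */
    uint8_t  used[LUT_ENTRIES]; /* slot occupancy flags  */
    uint32_t last_found;        /* mimics TME_LAST_FOUND */
} tme_block_t;

/* normal_lut_w_insert-style lookup: find the key, inserting it into a
   free slot on a miss; TME_FALSE is returned only when the table is
   full, matching the note above. Linear probing stands in for the
   real rehashing scheme. */
static int lookup_w_insert(tme_block_t *b, uint32_t key) {
    uint32_t h = key % LUT_ENTRIES;
    for (uint32_t i = 0; i < LUT_ENTRIES; i++) {
        uint32_t slot = (h + i) % LUT_ENTRIES;
        if (b->used[slot] && b->key[slot] == key) {
            b->last_found = slot;            /* entry found       */
            return TME_TRUE;
        }
        if (!b->used[slot]) {                /* free slot: insert */
            b->used[slot] = 1;
            b->key[slot]  = key;
            b->last_found = slot;
            return TME_TRUE;
        }
    }
    b->last_found = 0xFFFFFFFF;              /* table full        */
    return TME_FALSE;
}
```

Note how a miss with a free slot still returns TME_TRUE: with this family of lookup-fcns, failure means "no room left", not "key absent".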
12.2.2 EXECUTE
This instruction executes the code associated with the entry stored at the address
pointed to by register TME_LAST_FOUND of the currently ACTIVE TME data block
(usually set by a previous call to LOOKUP); it is actually a front-end to an exec-fcn.
The exact operations it performs are:
1. if no block is currently active, fail;
2. look at the value stored in register TME_LAST_FOUND: it contains the address of
the entry found by the last call to LOOKUP, or 0xFFFFFFFF if no entry was found:
• if it contains 0xFFFFFFFF, execute exec-fcn whose ID is stored in register
TME_OUT_LUT_EXEC, on the first block of the shared_memory_segment
(the one reserved for this special case);
• otherwise, first check if it’s a valid address (it must be in the lut_segment),
then execute exec-fcn whose ID is stored at that location+4, on block whose
address is stored at that location (see Figure 48).
NOTE: a check is made on the validity of the block addresses (they must be in
the shared_memory_segment), prior to the call of the specific exec-fcn.
Figure 48 – Layout of a LUT entry: the location pointed to by TME_LAST_FOUND
(0x00004F30 in the example) holds the block address, and the next 32-bit word
(location + 4) holds the exec-fcn ID.
The exec-fcns may also need to receive some data, so k must contain the base
address of these data in the extended memory.
EXECUTE (opcode BPF_MISC|BPF_TME|BPF_EXECUTE): registers involved: none; allowed in filter programs: yes; on failure: flow interrupted, return 0.
NOTE: this instruction refers always to the ACTIVE TME data block.
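The dispatch steps above can be sketched as follows. The state structure, the bounds check and the count_packets exec-fcn are illustrative assumptions; only the two-word entry layout (block address, then exec-fcn ID) and the miss handling on shared memory block #0 follow the text and Figure 48.

```c
#include <stdint.h>

#define TME_NOT_FOUND 0xFFFFFFFFu

/* An exec-fcn receives the shared memory block to work on and the
   address k of its extra data in the extended memory. */
typedef void (*exec_fcn_t)(uint8_t *block, uint32_t k);

/* Illustrative subset of the TME state used by EXECUTE. */
typedef struct {
    uint32_t    last_found;   /* TME_LAST_FOUND: word index into lut[] */
    uint32_t    out_lut_exec; /* TME_OUT_LUT_EXEC: fcn for misses      */
    uint32_t   *lut;          /* lut_segment, pairs of 32-bit words    */
    uint32_t    lut_words;    /* size of the LUT segment, in words     */
    uint8_t    *shared_mem;   /* shared_memory_segment                 */
    exec_fcn_t *fcns;         /* table of registered exec-fcns         */
} tme_state_t;

/* count_packets-style exec-fcn used for illustration: bumps a counter
   at the start of the block. */
static void count_packets(uint8_t *block, uint32_t k) {
    (void)k;
    block[0]++;
}

static int tme_execute(tme_state_t *s, uint32_t k) {
    if (s->last_found == TME_NOT_FOUND) {
        /* Miss: run TME_OUT_LUT_EXEC on shared memory block #0,
           the block reserved for 'out-of-LUT' packets. */
        s->fcns[s->out_lut_exec](s->shared_mem, k);
        return 1;
    }
    /* Validity check: the entry must lie inside the lut_segment. */
    if (s->last_found + 1 >= s->lut_words)
        return 0;                                /* fail: flow interrupted */
    uint32_t block_addr = s->lut[s->last_found];     /* block address */
    uint32_t fcn_id     = s->lut[s->last_found + 1]; /* exec-fcn ID   */
    s->fcns[fcn_id](s->shared_mem + block_addr, k);
    return 1;
}
```

The real TME additionally checks that the block address falls inside the shared_memory_segment before calling the exec-fcn, as noted above.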
12.2.3 SET_ACTIVE
This instruction activates TME data block k. It fails if that block cannot be
activated, because either it was not validated during the initialization phase (i.e. bit k
in the VALIDATED_BLOCKS register is not set) or k is out of range (i.e. k is
greater than or equal to MAX_TME_DATA_BLOCKS).
This instruction is used in filter programs to switch between several active blocks.
Upon activation, register WORKING is set to k, too.
NOTE: when the machine is turned on, there is no block active.
SET_ACTIVE (opcode BPF_MISC|BPF_TME|BPF_SET_ACTIVE): registers involved: none; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.3 TME initialization instructions
The new co-processor, the TME, needs some instructions to be properly initialized; these
instructions have to be issued before the injection of the 'filter' program, during the
initialization phase; moreover, most of them can be used only in this phase (i.e. they are
not legal in filter code).
Here is the full list; the next paragraphs describe each of them in detail:
• SET_WORKING
• INIT
• VALIDATE
• GET_REGISTER_VALUE
• SET_REGISTER_VALUE
• SET_ACTIVE_READ.
12.3.1 SET_WORKING
This instruction sets register WORKING to k. It fails if k is out of range (k must be in
the range 0 to MAX_TME_DATA_BLOCKS-1). It can be issued only during the initialization
phase, since it must be usable to switch to a block that has not yet been validated (see
12.3.3).
SET_WORKING (opcode BPF_MISC|BPF_TME|BPF_SET_WORKING): registers involved: none; allowed in filter programs: no; on failure: flow interrupted, return 0.
12.3.2 INIT
This instruction initializes TME data block k with default values and makes it the
working block (i.e. sets register WORKING to k).
Here is a list of default values for each register:
REGISTER: DEFAULT VALUE
TME_LUT_ENTRIES: 32,007
TME_MAX_FILL_STATE: 15,000
TME_REHASHING_VALUE: 1
TME_KEY_LEN: 0 (1)
TME_SHARED_MEMORY_BLOCKS: 16,000
TME_FILLED_BLOCKS: 0
TME_BLOCK_SIZE: 64 bytes
TME_EXTRA_SEGMENT_SIZE: 0 bytes
TME_LOOKUP_CODE: default_lut_w_insert
TME_DEFAULT_EXEC: count_packets
TME_OUT_LUT_EXEC: count_packets
TME_LUT_BASE_ADDRESS: not defined (2)
TME_SHARED_MEMORY_BASE_ADDRESS: not defined (2)
TME_EXTRA_SEGMENT_BASE_ADDRESS: not defined (2)
TME_LAST_FOUND: 0
(1) The key length must be set upon initialization.
(2) These values are set when the block is validated (through instruction VALIDATE).
INIT (opcode BPF_MISC|BPF_TME|BPF_INIT): registers involved: none; allowed in filter programs: no; on failure: flow interrupted, return 0.
12.3.3 VALIDATE
This instruction validates the TME data block whose index is in register A, reserving
memory starting from address k; if the block is valid, it also makes it active (i.e. sets
register ACTIVE to A). This is the only instruction that can put a block into the pool of
validated ones (i.e. the only one that can set bit A in register VALIDATED_BLOCKS).
The exact operations it performs are:
1. Check that the registers contain consistent values, in particular
• TME_KEY_LEN > 0
• TME_LUT_ENTRIES > 0
• TME_SHARED_MEMORY_BLOCKS > 0
• TME_BLOCK_SIZE > 0
• TME_LOOKUP_CODE must exist
• TME_OUT_LUT_EXEC must exist
• TME_DEFAULT_EXEC must exist.
2. Check if there is enough memory; the amount of required memory is calculated
as follows:

required_memory = TME_LUT_ENTRIES * 4 +
                  TME_BLOCK_SIZE * TME_SHARED_MEMORY_BLOCKS +
                  TME_EXTRA_SEGMENT_SIZE

For example, if the default values are used (by issuing instruction INIT),
required_memory will be

required_memory = 32,007 * 4 + 64 * 16,000 + 0 = 1,152,028 bytes

Since these data are stored in the extended memory, the validation succeeds only if k
+ required_memory is less than or equal to the extended memory size. A small
example is shown in Figure 49.
3. If the previous steps succeed, the block is validated:
• bit A of register VALIDATED_BLOCKS is set to 1
• register ACTIVE is set to A
• register WORKING is set to A
• register TME_LUT_BASE_ADDRESS is set to k
• register TME_SHARED_MEMORY_BASE_ADDRESS is set to
k + 4*TME_LUT_ENTRIES
• register TME_EXTRA_SEGMENT_BASE_ADDRESS is set to
k + 4*TME_LUT_ENTRIES + TME_BLOCK_SIZE*TME_SHARED_MEMORY_BLOCKS
• register TME_FILLED_BLOCKS is set to 1 (this is because shared memory
block #0 is used for ‘out-of-counts’ packets).
Figure 49 – Validate memory check: starting from k = 0x00000040, the required
32,768 bytes (LUT_SEGMENT, SHARED_MEMORY_SEGMENT and EXTRA_SEGMENT) fit within
the 65,536-byte extended memory, so the validation succeeds.
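The memory check of step 2 can be reproduced with a few lines of C; the helper name is illustrative.

```c
#include <stdint.h>

/* Computes the amount of extended memory that VALIDATE reserves,
   following the formula given in step 2. */
static uint32_t required_memory(uint32_t lut_entries, uint32_t block_size,
                                uint32_t shm_blocks, uint32_t extra_size) {
    return lut_entries * 4u          /* lut_segment           */
         + block_size * shm_blocks   /* shared_memory_segment */
         + extra_size;               /* extra_segment         */
}
```

With the INIT defaults this yields 1,152,028 bytes, far more than the default 64 kB of extended memory, so such a block can only be validated after enlarging the extended memory with SETMEMORY.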
VALIDATE (opcode BPF_MISC|BPF_TME|BPF_VALIDATE): registers involved: A; allowed in filter programs: no; on failure: flow interrupted, return 0.
12.3.4 GET_REGISTER_VALUE
This instruction is used to retrieve the value of a register in the currently WORKING
TME data block (i.e. the one whose index is in the WORKING register).
It is possible to retrieve the value of any register except TME_LOOKUP_CODE.
This is due to the specific implementation of the TME and will be solved in future
releases. k contains the code of the register, and the retrieved value is saved in
register A. The instruction fails if k contains an unknown register code.
Valid register codes are
• TME_LUT_ENTRIES
• TME_MAX_FILL_STATE
• TME_REHASHING_VALUE
• TME_KEY_LEN
• TME_SHARED_MEMORY_BLOCKS
• TME_FILLED_ENTRIES
• TME_BLOCK_SIZE
• TME_EXTRA_SEGMENT_SIZE
• TME_FILLED_BLOCKS
• TME_DEFAULT_EXEC
• TME_OUT_LUT_EXEC
• TME_SHARED_MEMORY_BASE_ADDRESS
• TME_LUT_BASE_ADDRESS
• TME_EXTRA_SEGMENT_BASE_ADDRESS
• TME_LAST_FOUND_BLOCK.
NOTE: the exact values of these IDs are stored in a header file (*.h) which is included
by default in projects using this extension. Using the numeric IDs directly is strongly
discouraged, since they can change in future releases of the TME.
GET_REGISTER_VALUE (opcode BPF_MISC|BPF_TME|BPF_GET_REGISTER_VALUE): registers involved: A; allowed in filter programs: yes; on failure: flow interrupted, return 0.
12.3.5 SET_REGISTER_VALUE
This instruction is used to set a value in a specific register of the currently working
TME data block (i.e. the one whose index is in the WORKING register).
The value to be set must be stored in register A, and k contains the register code. Only
some registers can be set, and some of them can be set only during the initialization
phase, as explained in the following table.
For each register, the table below states whether it can be set in filter programs, whether it can be set during the initialization phase, and any additional constraints:
TME_LUT_ENTRIES: filter programs yes; initialization yes.
TME_MAX_FILL_STATE: filter programs yes; initialization yes.
TME_REHASHING_VALUE: filter programs yes; initialization yes.
TME_KEY_LEN: filter programs no; initialization yes. This register must be set before calling VALIDATE.
TME_SHARED_MEMORY_BLOCKS: filter programs no; initialization yes.
TME_FILLED_ENTRIES: filter programs yes; initialization yes.
TME_BLOCK_SIZE: filter programs no; initialization yes.
TME_EXTRA_SEGMENT_SIZE: filter programs no; initialization yes.
TME_FILLED_BLOCKS: filter programs yes; initialization yes. The register is set only if the new value is greater than the old one.
TME_DEFAULT_EXEC: filter programs yes; initialization yes. Only valid exec_fcn IDs are accepted.
TME_OUT_LUT_EXEC: filter programs yes; initialization yes. Only valid exec_fcn IDs are accepted.
TME_SHARED_MEMORY_BASE_ADDRESS, TME_LUT_BASE_ADDRESS, TME_EXTRA_SEGMENT_BASE_ADDRESS: the content of these registers cannot be changed; they are set by the VALIDATE instruction.
TME_LAST_FOUND_BLOCK: filter programs yes; initialization yes.
TME_LOOKUP_CODE: filter programs yes; initialization yes. Only valid lookup_fcn IDs are accepted.
Special care must be taken when changing register values after the TME data block has
been validated; although the TME core performs internal checks before every operative
instruction, putting malicious values in the registers can hang the entire processor.
SET_REGISTER_VALUE (opcode BPF_MISC|BPF_TME|BPF_SET_REGISTER_VALUE): registers involved: A; allowed in filter programs: yes, with the limitations listed in the table above; on failure: flow interrupted, return 0.
12.3.6 SET_ACTIVE_READ
This instruction sets which TME data block will be passed to the user. It sets register
ACTIVE_READ to k, but only if bit k in register VALIDATED_BLOCKS is set (i.e.
only if block k has been previously validated).
SET_ACTIVE_READ (opcode BPF_MISC|BPF_TME|BPF_SET_ACTIVE_READ): registers involved: none; allowed in filter programs: no; on failure: flow interrupted, return 0.
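Putting the initialization instructions together, a typical initialization program might look like the following sketch. The struct layout mirrors the classic BPF instruction format, but the numeric opcode and register-code values are placeholders (the real ones live in the extension's header file), so the sequence should be read for its order, not its constants.

```c
#include <stdint.h>

/* Placeholder opcode and register-code values: illustrative only, the
   real definitions come from the TME extension's header file. */
#define BPF_LD   0x00
#define BPF_IMM  0x00
#define BPF_MISC 0x07
#define BPF_TME  0x08
#define BPF_INIT               0x10
#define BPF_SET_REGISTER_VALUE 0x20
#define BPF_VALIDATE           0x30
#define BPF_SET_ACTIVE_READ    0x40
#define TME_KEY_LEN 0x03       /* register code, illustrative */

struct bpf_insn { uint16_t code; uint8_t jt, jf; uint32_t k; };

/* Initialization sequence for TME data block 0:
   1. INIT block 0 with default values;
   2. load the key length (8 bytes) into A and store it in TME_KEY_LEN,
      which must be set before VALIDATE;
   3. load the block index (0) into A and VALIDATE, reserving extended
      memory starting from address k = 0;
   4. mark block 0 as the one passed to the user (SET_ACTIVE_READ). */
static const struct bpf_insn tme_init_prog[] = {
    { BPF_MISC|BPF_TME|BPF_INIT,               0, 0, 0 },
    { BPF_LD|BPF_IMM,                          0, 0, 8 },
    { BPF_MISC|BPF_TME|BPF_SET_REGISTER_VALUE, 0, 0, TME_KEY_LEN },
    { BPF_LD|BPF_IMM,                          0, 0, 0 },
    { BPF_MISC|BPF_TME|BPF_VALIDATE,           0, 0, 0 },
    { BPF_MISC|BPF_TME|BPF_SET_ACTIVE_READ,    0, 0, 0 },
};
```

Note the two plain loads: SET_REGISTER_VALUE takes its value from A, and VALIDATE takes the block index from A, so A must be reloaded between the two.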
Summary of the new instructions (opcode; description; registers involved; whether the instruction is allowed in filter programs):
SETMEMORY (BPF_MISC|BPF_TME|BPF_SETMEMORY): resizes extended memory; registers: none; filter programs: no.
EX_LD (BPF_LD|BPF_MEM_EX_IMM|BPF_W): loads word from ext_mem to A; registers: A; filter programs: yes.
EX_LDH (BPF_LD|BPF_MEM_EX_IMM|BPF_H): loads half-word from ext_mem to A; registers: A; filter programs: yes.
EX_LDB (BPF_LD|BPF_MEM_EX_IMM|BPF_B): loads byte from ext_mem to A; registers: A; filter programs: yes.
EX_LD_REL (BPF_LD|BPF_MEM_EX_REL|BPF_W): loads word from ext_mem to A; registers: A, X; filter programs: yes.
EX_LDH_REL (BPF_LD|BPF_MEM_EX_REL|BPF_H): loads half-word from ext_mem to A; registers: A, X; filter programs: yes.
EX_LDB_REL (BPF_LD|BPF_MEM_EX_REL|BPF_B): loads byte from ext_mem to A; registers: A, X; filter programs: yes.
EX_LDX (BPF_LDX|BPF_MEM_EX_IMM|BPF_W): loads word from ext_mem to X; registers: X; filter programs: yes.
EX_LDXH (BPF_LDX|BPF_MEM_EX_IMM|BPF_H): loads half-word from ext_mem to X; registers: X; filter programs: yes.
EX_LDXB (BPF_LDX|BPF_MEM_EX_IMM|BPF_B): loads byte from ext_mem to X; registers: X; filter programs: yes.
EX_ST (BPF_ST|BPF_MEM_EX_IMM|BPF_W): stores word from A to ext_mem; registers: A; filter programs: yes.
EX_STH (BPF_ST|BPF_MEM_EX_IMM|BPF_H): stores half-word from A to ext_mem; registers: A; filter programs: yes.
EX_STB (BPF_ST|BPF_MEM_EX_IMM|BPF_B): stores byte from A to ext_mem; registers: A; filter programs: yes.
EX_ST_REL (BPF_ST|BPF_MEM_EX_REL|BPF_W): stores word from A to ext_mem; registers: A, X; filter programs: yes.
EX_STH_REL (BPF_ST|BPF_MEM_EX_REL|BPF_H): stores half-word from A to ext_mem; registers: A, X; filter programs: yes.
EX_STB_REL (BPF_ST|BPF_MEM_EX_REL|BPF_B): stores byte from A to ext_mem; registers: A, X; filter programs: yes.
EX_STX (BPF_STX|BPF_MEM_EX_IMM|BPF_W): stores word from X to ext_mem; registers: X; filter programs: yes.
EX_STXH (BPF_STX|BPF_MEM_EX_IMM|BPF_H): stores half-word from X to ext_mem; registers: X; filter programs: yes.
EX_STXB (BPF_STX|BPF_MEM_EX_IMM|BPF_B): stores byte from X to ext_mem; registers: X; filter programs: yes.
LOOKUP (BPF_MISC|BPF_TME|BPF_LOOKUP): performs a lookup in the ACTIVE block; registers: none; filter programs: yes.
EXECUTE (BPF_MISC|BPF_TME|BPF_EXECUTE): performs an execute in the ACTIVE block; registers: none; filter programs: yes.
SET_ACTIVE (BPF_MISC|BPF_TME|BPF_SET_ACTIVE): sets block k to ACTIVE; registers: none; filter programs: yes.
SET_WORKING (BPF_MISC|BPF_TME|BPF_SET_WORKING): sets block k to WORKING; registers: none; filter programs: no.
INIT (BPF_MISC|BPF_TME|BPF_INIT): initializes block k to default values; registers: none; filter programs: no.
VALIDATE (BPF_MISC|BPF_TME|BPF_VALIDATE): validates block A, reserving memory from address k; registers: A; filter programs: no.
GET_REGISTER_VALUE (BPF_MISC|BPF_TME|BPF_GET_REGISTER_VALUE): retrieves a register value from a TME block; registers: A; filter programs: yes.
SET_REGISTER_VALUE (BPF_MISC|BPF_TME|BPF_SET_REGISTER_VALUE): sets a register in a TME block; registers: A; filter programs: yes (67).
SET_ACTIVE_READ (BPF_MISC|BPF_TME|BPF_SET_ACTIVE_READ): sets block k to be read; registers: none; filter programs: no.
67 With some limitations, see paragraph 12.3.5.
13 Bibliography
[1] S. McCanne and V. Jacobson, The BSD Packet Filter: A New Architecture for User-
level Packet Capture. Proceedings of the 1993 USENIX Technical Conference (San
Diego, CA, Jan. 1993), USENIX.
[2] Gary R. Wright, W. Richard Stevens, TCP/IP Illustrated, Volume 2. Addison-
Wesley Professional Computing Series.
[3] Microsoft Software Development Kit and Driver Development Kit Examples,
Microsoft Corporation.
[4] Microsoft Windows NT and Windows 2000 Driver Development Kit
documentation, Microsoft Corporation.
[5] Peter G. Viscarola, W. Anthony Mason, Windows NT Device Driver Development,
Macmillan Technical Publishing.
[6] Microsoft MSDN Library, Microsoft Corporation, May 2001.
[7] A. Begel, S. McCanne, S. L. Graham, BPF+: Exploiting Global Data-flow
Optimization in a Generalized Packet Filter Architecture, 1999.
[8] T. H. Cormen, C. E. Leiserson, R. L. Rivest, Introduzione agli algoritmi Volume 1,
Jackson Libri, 1994.
[9] Donald E. Knuth, The Art of Computer Programming, Volume 3: Sorting and
Searching, Addison-Wesley, 1981.
[10] M. Roesch, Snort – Lightweight Intrusion Detection for Networks, presented at
the USENIX LISA conference in November 1999.
[11] N. Brownlee, C. Mills, G. Ruth, RFC 2063 - Traffic Flow Measurement:
Architecture, January 1997.
[12] WinPcap web site, available at http://www.netgroup.polito.it/winpcap
[13] IA-32 Intel Architecture Software Developer’s Manual, volume 2 Instruction Set
Reference, Intel Corporation, 2001.
[14] NeTraMet web site, available at
http://www2.auckland.ac.nz/net/Accounting/ntm.Release.note.html
[15] NTOP web site, available at http://www.ntop.org