POLITECNICO DI MILANO
Como Campus

M.Sc. in Computer Science and Engineering
Department of Electronics, Information and Bioengineering

BigEar: A LOW COST WIRELESS DISTRIBUTED AUDIO CAPTURING SYSTEM FOR UBIQUITOUS AUTOMATIC SPEECH RECOGNITION

Advisor: Prof. Fabio Salice
Tutor: Eng. Hassan Saidinejad

Master Graduation Thesis of:
Stefano Gorla, ID 745603
Academic Year 2014-2015
POLITECNICO DI MILANO
Polo Regionale di Como

Corso di Laurea Magistrale in Ingegneria Informatica
Dipartimento di Elettronica, Informazione e Bioingegneria

BigEar: SISTEMA A BASSO COSTO, WIRELESS, DISTRIBUITO DI ACQUISIZIONE AUDIO PER IL RICONOSCIMENTO AUTOMATICO DEL PARLATO

Relatore: Prof. Fabio Salice
Tutore: Ing. Hassan Saidinejad

Tesi di Laurea di:
Stefano Gorla, matricola 745603
Anno Accademico 2014-2015
Acknowledgements
First of all, I would like to deeply thank Prof. Fabio Salice for his valuable and helpful advice, support
and encouragement during this work. Special thanks to Eng. Ashkan Saidinejad for his useful and
important tutoring. Without their support and help, the pages of this work would be blank.
I must also mention Prof. Augusto Sarti and Prof. Giuseppe Bertuccio for their
willingness to help; special thanks to Eng. Marco Leone and Eng. Carlo Bernaschina for their help in the
realization and implementation of the prototype.
Thanks to Emanuele De Bernardi, Fabio Veronese, Simone Mangano, who shared with me important
moments of social brainstorming, but also lunch breaks and a lot of coffee.
During my long study career, I met many people, and many of them shared part of my journey. I
want to thank first my family, dad and mom, my brother Marco with his wife Sonia, my nephews Anna
and Giulia, my sisters Elena with her husband Pietro and Agnese, with Eleonora, Lucrezia and of course
in our hearts Mauro.
Thank you to the people with whom I share my passions: my choir, the diocesan school of music,
don Rinaldo, don Simone, don Nicholas, Celeste, Andrea and Carlo; Filippo, Laura and the friends from
Castel San Pietro. Thanks to Rita, Ugo and the friends of AdFontes for their infectious enthusiasm.
Thank you to the Non-violent guys group, to the Benzoni family, and to don Felice Rainoldi: not only
an outstanding teacher but, above all, a true friend.
Thank you to the music: a strange world where you can escape, but where, as always, the best
inspirations are found. The final thanks go to Paola. Thank you for having always believed in me and for giving me,
day by day, the right encouragement to continue my work through many hard times. Thank you.
Abstract
The ageing of the world's population will raise the demand and challenges of elderly care in coming years. Several approaches have been devised to deal with the needs of older people proactively. Assistive domotics represents a relatively recent effort in this direction; in particular, vocal interaction can be a favored way to control the smart home environment, provided that the interface fulfills requirements of transparency and unobtrusiveness. The absence of intrusive devices induces a more natural interaction model in which there is no need to wear a microphone or to issue commands to specific "hotspots" of the dwelling.

From these assumptions, a wireless, modular and low-cost speech capturing system has been implemented, in which a set of wireless audio sensors sends captured data to a Base Station, which in turn aggregates the received data in order to rebuild the speech captured in a distributed fashion by the BigEar Audio Sensors. The reconstruction algorithm performs a first stage of energy and delay analysis of the audio streams coming from the sensors; this stage is needed to compensate the energy and delay differences due to the different source-sensor distances. The streams are then superposed in order to merge each contribution into a unique output stream. Depending on the network interaction model, each sensor generates an audio stream that may be incomplete, presenting sequences of silence (holes). Holes in the reconstructed signal drastically decrease the accuracy of the speech recognition procedure. Although the BigEar system, working with the best network parameters, ensures high chances of successful speech recognition, four different methods have been tested to repair the audio signal before sending it to the speech recognition block.

The BigEar architecture has been simulated by means of a MATLAB-based simulator that allows studying the behavior of the whole system, from the acoustic simulation of the environment (room) up to the network interaction protocol. Once the best parameters were identified by simulation, a real-world prototype was realized.

From the results analysis it can be seen that BigEar qualifies as a minimum-cost, wireless and distributed system. Moreover, the ubiquitous approach enables Data Intelligence mechanisms, e.g. performing a coarse-grain localization (using sensor signal power and delay information) in order to add informative content that can disambiguate context-free vocal commands ("Turn off the light" or, better, "Turn off this light" could be integrated with localization information in order to determine which light has to be switched off).
Sommario
L'invecchiamento della popolazione mondiale nei prossimi anni causerà l'accrescimento della domanda e le relative sfide nella cura degli anziani. Sono stati studiati numerosi approcci per far fronte alle esigenze delle persone meno giovani. La domotica assistiva rappresenta un passo - seppur relativamente recente - in questa direzione; in particolare, l'interazione vocale può essere una via preferenziale per controllare l'ambiente domestico gestito dalla domotica, a patto che l'interfaccia uomo-macchina soddisfi requisiti di trasparenza e non-intrusività. L'assenza di dispositivi invasivi induce a una interazione più naturale in cui non sia necessario indossare un microfono o impartire comandi attraverso specifici punti di ascolto dell'abitazione.

A partire da queste premesse è stato implementato un sistema modulare, senza fili e a basso costo di cattura del parlato in cui un insieme di sensori audio invia i dati acquisiti a una Base Station, la quale provvede ad assemblare i dati ricevuti in modo da ricostruire il parlato precedentemente acquisito, in modo distribuito, dai sensori. L'algoritmo di ricostruzione esegue anzitutto l'analisi energetica e dei ritardi dei flussi audio; questa operazione è necessaria alla compensazione delle differenze energetiche e dei ritardi dovuti alle distanze tra i sensori e la sorgente. Dopodiché, i flussi audio vengono sovrapposti in modo da fondere i singoli contributi in un unico flusso audio. A seconda del modello di interazione di rete, ogni sensore genera un flusso audio che presenta sequenze di silenzio (buchi). I buchi nel segnale ricostruito diminuiscono drasticamente l'accuratezza della procedura di riconoscimento del parlato. Sebbene il sistema BigEar, quando configurato per operare con i migliori parametri di rete, assicuri alte probabilità affinché il riconoscimento dia risultati positivi, sono stati testati quattro diversi metodi per riparare il segnale audio prima che esso venga inviato al blocco di riconoscimento del parlato.

L'architettura è stata testata tramite un simulatore basato su codice MATLAB che permette di studiare il comportamento dell'intero sistema, dalla simulazione acustica dell'ambiente fino ai protocolli usati per le interazioni di rete. A partire dai parametri ottimi indicati dal simulatore è stato realizzato un prototipo reale.

Dall'analisi dei risultati si può notare che BigEar può essere identificato quale sistema a costo minimo, senza fili e modulare. Inoltre, l'approccio distribuito al problema permette meccanismi di Intelligenza dei Dati, ad esempio eseguendo una localizzazione sommaria della sorgente (sfruttando le informazioni di ritardo e di energia del segnale) che permetta di aggiungere contenuto informativo in grado di disambiguare comandi vocali privi di contesto (i comandi "Spegni la luce" o - meglio - "Spegni questa luce" potrebbero essere integrati con una localizzazione in grado di determinare quale luce debba essere spenta).
Contents
Acknowledgements i
Abstract iii
Sommario v
1 Introduction 1
1.1 BRIDGe - Behaviour dRift compensation for autonomous and InDependent livinG 1
1.2 Ubiquitous Computing 2
1.3 BigEar Application 3
1.4 Related Works 3
1.5 Structure of the Dissertation 5

2 BigEar Architecture 7
2.1 Overview of the System 7
2.2 Wixel Programmable USB Wireless Module 8
2.3 Audio Interface 9
2.3.1 Physical Characteristics of the Typical Situation 9
2.3.2 Sensitivity of the microphone 11
2.3.3 ADC characteristics 12
2.4 BigEar Receiver and Base station 13
2.5 Network Protocols 14
2.5.1 ALOHA Protocol 16

3 BigEar Modeling and Simulation 19
3.1 Overview 19
3.2 Audio Model 20
3.2.1 Physical Model 21
3.2.2 MCROOMSIM toolbox 23
3.3 Sensors network model 27
3.3.1 Radio Transmission Model 30
3.3.2 N-buffer Internal Model 32
4 Signal Reconstruction and Repair 35
4.1 Overview 35
4.2 Energy Compensation 39
4.2.1 Bias Removal 39
4.2.2 Normalization 40
4.3 Streams Superposition 41
4.3.1 Weighted Sum of Contributions 41
4.3.2 Holes Replacement 41
4.4 Cross-correlation 42
4.4.1 Cross-correlation Drawbacks 43
4.4.2 Envelopes Cross-correlation 44
4.5 Healing Signal Fragmentation 46
4.5.1 B-Spline Interpolation 46
4.5.2 Noise Addition 47
4.5.3 Packet Loss Concealing 47
4.5.4 LSAR - Least Squares Auto-Regressive Interpolation 48
4.5.5 Audio inpainting 49
4.5.6 Healing methods comparison 50
4.6 Summary 51

5 BigEar Implementation 53
5.1 Overview 53
5.2 Hardware Setup 54
5.2.1 BigEar Audio Capture Board 54
5.2.2 BigEar Receiver 57
5.3 Software Implementation 59
5.3.1 BigEar Audio Sensor 59
5.3.2 BigEar Input Calibration 66
5.3.3 BigEar Receiver 67
5.3.4 BigEar Base Station 67

6 BigEar Results Analysis 75
6.1 Overview 75
6.2 Metrics Definition 76
6.2.1 Reconstructed Signal Metrics 76
6.2.2 Software Performance Metrics 77
6.3 Simulation Setup 78
6.4 On-field Setup 78
6.5 Results Discussion 80
6.5.1 Reconstructed Signal Comparison 80
6.5.2 Software Performance Tests 83
6.5.3 Coarse-grain localization 87

7 Conclusions and Future Work 89
7.1 Conclusions 89
7.2 Strengths and Weaknesses 90
7.2.1 Strengths 90
7.2.2 Weaknesses 91
7.3 Future Work 91
A TI CC2511 MCU Key features 93
A.1 CC2511F32 Logic Scheme 94
A.2 CC2511F32 Radio Packet format 95

B BigEar simulator MATLAB implementation 97
B.1 Radio Transmission Model 97
B.2 N-Buffer Internal Model 98
B.3 Signal Reconstruction block 99

C BigEar Reconstruction MATLAB Script 103

D Cross-correlation convergence test 107

E BigEar Implementation schematics and pictures 111
E.1 BigEar Audio Capture board 111
E.2 BigEar Receiver 112
E.3 BigEar Audio Capture application 113
E.4 BigEar Input Calibration 117
E.5 BigEar Receiver application 118
E.6 BigEar SerialPort application GUI 120

F BigEar Simulation 121
F.1 Simulation setups 121

G Datasheets 125
G.1 Kingstone KECG2738PBJ-A Electret Microphone 125
G.2 Linear Technology LT1013 Dual Precision Op Amp 127

Bibliography 133
List of Figures
2.1 Overview of the BigEar architecture 8
2.2 Wixel module pinout and components 8
2.3 CC2511 ADC Block diagram 13
2.4 BigEar Receiver Logic 14
2.5 BigEar Application Reconstruction Logic 15
2.6 Aloha vs. Slotted Aloha throughput 17

3.1 Architecture model 20
3.2 Geometry of wave propagation from a point source x1 to a listening point x2 21
3.3 Point-to-point spherical wave simulator 21
3.4 Law of reflection 22
3.5 Geometry of an acoustic reflection caused by "multipath" propagation 22
3.6 Image source method using two walls, one source and one listening point 24
3.7 MCROOMSIM Data flow from configuration stage to the end of simulation 26
3.8 Simulation and reconstruction flowchart 29
3.9 Flowchart of the radio transmission model 31
3.10 Buffer Frames sliding 33
3.11 Flowchart of the N-buffer model 34

4.1 Flowchart of the signal reconstruction block 37
4.2 Alignment of audio packets considering their timestamp 37
4.3 Flowchart of the correlation analysis block 38
4.4 Representation of a holey signal 39
4.5 How summation of different biases might influence the reconstructed signal 40
4.6 Harmonic distortion due to unweighted sum of contributions 42
4.7 Cross-correlation test between noisy and holey signals 45
4.8 Cross-correlation analysis and alignment on signal's envelopes 46
4.9 B-Spline method for healing reconstructed signal 47
4.10 G.711.I Frame erasure concealment algorithm 48
4.11 Gabor spectrograms of original, reconstructed and inpainted streams 51

5.1 BigEar Audio Capture Board Biasing circuit 54
5.2 BigEar Audio Capture Board Signal Conditioning stage 56
5.3 LT1013 OpAmp Voltage VS. Frequency Gain 56
5.4 Frequency response of the simulated signal conditioning circuit 58
5.5 Flowcharts of the BigEar Audio Capture application 60
5.6 Signal and Saturation LED indicators policy 63
5.7 BigEar Data field structure 64
5.8 Flowchart of the BigEar Receiver application 67
5.9 BigEar SerialPort raw file formats 68
5.10 BigEar SerialPort Capture flowchart 69
5.11 BigEar SerialPort Raw-to-CSV decoding 71
5.12 BigEar SerialPort Class Diagram 72
5.13 BigEar Reconstructor Flowchart 73

6.1 Reconstructed signal metrics plotted as a function of TmaxDelay parameter 81
6.2 Fill_ratio as a function of TmaxDelay 82
6.3 BigEar Reconstructor RPR metric 85
6.4 BigEar Capture Board Application timing analysis 85
6.5 BigEar Reconstructor PAR metric 86
6.6 Signal power vs. alignment delay plot for coarse-grain localization 88

G.1 136
List of Tables
2.1 Physical characterization of the Typical Situation 11
2.2 ADC: Speed of Conversion VS. Decimation Rate 13

4.1 Noise Addition for Speech Signal Repair 48
4.2 Healing methods comparison 52

5.1 Design Choices for Signal Conditioning stage 58
5.2 BigEar Audio Capture application Design Choices and Parameters 66

6.1 BigEar Simulator parameters 79
1 Introduction
QUALITY of life depends heavily on the efficiency, comfort and cosiness of the place an individual
calls “home”. Thus, a wide range of products and systems have been invented in the past to
advance human control over the entire living space. Domotics is the field that specializes in
automation techniques for private homes, often referred to as "home automation" or "smart home
technology". In essence, home environment control and automation extend the various techniques typically
used in building automation, such as light and climate control, control of doors and window shutters,
surveillance systems, etc. through the networking of ICT in the home environment, including the inte-
gration of household appliances and devices. Such solutions are not only offering comfort and security,
but when serving an elderly or a person with disability (a fragile person in general) can leverage safety
and individual independence. Assistive domotics represents a relatively recent effort in this direction
that further specializes in the needs of people with disability, older persons, and people with little or no
technical affinity, and which seeks to offer such residents new levels of safety, security and comfort, and
thereby the chance to prolong their safe stay at home.
The ageing of the world's population will raise the demand and challenges of elderly care in coming
years. Based on a study of the US census, the number of people aged over 65 will increase by 101
percent between 2000 and 2030, at a rate of 2.3 percent each year; during that same period, the number
of family members who can provide support for them will increase by only 25 percent, at a rate of
0.8 percent each year. Several approaches have been devised to deal with the needs of older people
proactively.
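These census figures can be cross-checked with a quick compound-growth calculation (an illustrative sanity check only; the rates are the ones cited in the sentence above, not new data):

```python
# Sanity check of the cited census growth figures (illustrative only).
elderly_growth = (1 + 0.023) ** 30   # 2.3% per year, 2000-2030
family_growth = (1 + 0.008) ** 30    # 0.8% per year, 2000-2030

# Compounding 2.3% over 30 years gives roughly +98%, consistent with the
# cited 101% increase; compounding 0.8% gives roughly +27%, close to 25%.
print(f"over-65 population: +{(elderly_growth - 1) * 100:.0f}%")
print(f"supporting family members: +{(family_growth - 1) * 100:.0f}%")
```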
1.1 BRIDGe - Behaviour dRift compensation for autonomous and InDependent livinG
The BRIDGe¹ (Behaviour dRift compensation for autonomous InDependent livinG) project [17], being
done at Politecnico di Milano - Polo di Como, aims to build strong connections between a person living
independently at home and his or her social environment (family, caregivers, social services, and so on)
¹ atg.deib.polimi.it/projects
by implementing a system that provides focused interventions according to the user’s needs.
The target of BRIDGe is people with mild cognitive or physical impairments and, more generally,
fragile people whose weakness threatens their autonomy, health or other important aspects of life. Fragile
people need mutual reassurance: they typically want to be independent and autonomous, but they also
know that often somebody else must be present to help them with unexpected needs.
At BRIDGe's core is a wireless sensor-actuator network that supports house control and user behavior
detection through a rich and flexible communication system between the person and his or her social
environment, aimed at reassuring both the family and the user.
One of the simplest ways for the user to meet his or her needs is to ask someone to do something:
ask to switch off the light, to unlock the door, to increase the room temperature, etc. Inhabitants of the
"smart home" can vocally control parts of the dwelling (lights, doors, the principal gate, and so on) by
issuing commands to a Vocal Command Interface (VCI).
The inhabitant starts by activating the VCI, which is continuously listening and waiting for the
personalized activation keyword. Activation is confirmed to the user through a vocal message, and the VCI
is now ready to receive the inhabitant’s vocal command (for example, “turn on the light in the kitchen”).
The vocal command is analyzed in an ad hoc application and the result of the interaction is the execution
of the vocal command.
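The activation loop just described can be sketched as a small two-state machine (an illustrative sketch: the keyword, feedback messages and function names are invented for the example and are not part of the BRIDGe implementation):

```python
# Minimal sketch of the VCI interaction loop: LISTENING (waiting for the
# activation keyword) -> COMMAND (waiting for one vocal command) -> LISTENING.
# All keyword/command strings are illustrative.

ACTIVATION_KEYWORD = "hello house"          # hypothetical personalized keyword

def vci_step(state, utterance, execute):
    """Advance the VCI state machine with one recognized utterance."""
    if state == "LISTENING":
        if utterance == ACTIVATION_KEYWORD:
            return "COMMAND", "VCI activated, listening for a command"
        return "LISTENING", None            # ignore everything else
    elif state == "COMMAND":
        execute(utterance)                  # e.g. hand off to the home controller
        return "LISTENING", "command executed"
    return state, None

# Example run: chatter is ignored, the keyword arms the VCI, the next
# utterance is executed as a command.
log = []
state = "LISTENING"
for heard in ["some chatter", "hello house", "turn on the light in the kitchen"]:
    state, feedback = vci_step(state, heard, execute=log.append)
print(log)  # ['turn on the light in the kitchen']
```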
1.2 Ubiquitous Computing
Ubiquitous computing is a paradigm in which the processing of information is linked with each activity or
object as encountered. It involves connecting electronic devices, including embedded microprocessors,
so that they can communicate information. Devices that use ubiquitous computing have constant availability and are
completely connected.
Ubiquitous computing is one of the keys to the success of the BRIDGe system: the more the sensor network is
unobtrusive and transparent with respect to the interaction of the elder with his environment, the greater
is the sense of autonomy and independence felt by the user.
In the case of vocal interaction, a ubiquitous approach to the problem requires that the interaction
be as transparent as possible: no need to wear a microphone or to issue commands to specific
"hotspots" of the dwelling. The absence of intrusive devices induces a more natural interaction model that,
in turn, generates a chain of positive effects:
Learning curve performance: The elderly user has no need to wear devices in order to perform human-
computer interaction; he or she only needs to ask someone to perform an action. In this way, the
learning process will not present the typical high barrier that arises when the user approaches an
interaction model foreign to his or her user experience.
Technology acceptance: According to the Technology Acceptance Model [2], the two most important
attitudinal factors in explaining acceptance and usage of a new technology are perceived usefulness
and perceived ease of use. Perceived usefulness is described as “the degree to which a person
believes that using the particular technology would enhance his/her job performance”. Perceived
ease of use is defined as “the extent to which a person believes that using a technology is free of
effort". Shortcomings in ease of use, in particular when dealing with the elderly, can demotivate the user and
lead him to consider a technology useless or misleading. "Why do I have to wear a mobile
phone for switching off the light? It would be easier if I could use a remote control directly”.
Mastering of the living environment: As permanent medical devices can reduce the perceived quality
of life, so intrusive assistive technologies can be perceived as a gilded cage that lowers the sense
of freedom and induces the user to feel like a stranger in his own house. Ubiquitous and transparent
systems can help elderly or injured people to feel masters of their home.
1.3 BigEar Application
The target of this work is to model, realize and implement a distributed speech capturing system that meets
the expectations of ubiquitousness and transparency described above. Behind the acronym BigEar
(uBiquitous wIreless low-budGet spEech cApturing inteRface) is an audio acquisition system built following
Wireless Sensor Network [3] requirements:
Minimum cost: The adopted technology (hardware and software) has to consider the economic
possibilities of the people, according to the paradigm: "a good but costly solution is not a solution at
all”.
Wireless: The absence of power and/or signal cables is a strong requirement in order to lower costs
for house adaptation. Moreover, wireless systems can ensure a higher degree of flexibility and
configurability than wired systems.
Distributed: The key to pervasiveness is distributed computing; according to the definition [7, p. 2], the
devices concur in building a result whose information content is more than the sum of the single
contributions. Moreover, the sensors are completely independent, and a possible failure of one of them
will not completely compromise the result of the sensor network collaboration.
Modular: The system should be implemented using a modular approach in order to be scalable and
quickly configurable for matching environment characteristics and user’s needs.
Responsive: Speech recognition should be immediate, so processing speed is a crucial requirement
in order to give the user immediate feedback; assistive domotic interaction has to be as fast as
possible, since the user will not wait for Microsoft Windows' hourglass.
1.4 Related Works
Literature focusing on Wireless Low-Cost Speech Capturing systems is limited. In general, the approach
to the subject tends to emphasize one aspect (such as audio quality) at the expense of others (such as
flexibility), and none of the examined projects can be considered low cost.
SWEET-HOME Project
The SWEET-HOME project [15] aims at designing a smart home system based on audio technology
focusing on three main aspects: to provide assistance via natural man-machine interaction (voice and
tactile command), to ease social e-inclusion and to provide security reassurance by detecting situations
of distress. The targeted smart environments in which speech recognition must be performed thus include
multi-room homes with one or more microphones per room set near the ceiling.
To achieve the project goals, a smart home was set up: a thirty-square-meter suite flat including
a bathroom, a kitchen, a bedroom and a study. In order to acquire audio signals, seven microphones were
set in the ceiling, all connected to a dedicated PC embedding an 8-channel audio input card.
Strength The approach to multi-channel audio alignment proposed by the paper is the same adopted in
the BigEar Reconstruction Algorithm (Ch. 4); it is based on the alignment of each audio channel
with the one with the highest SNR.
Weaknesses The use of a multichannel audio card requires dedicated hardware, which increases costs and
reduces flexibility. Moreover, a wired approach is hard to implement, since it requires fixing wires
onto walls or running them through existing electrical conduits.
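The alignment strategy noted under Strength can be illustrated with a minimal NumPy sketch (not the thesis's MATLAB code; as a simplification, signal energy stands in for SNR when picking the reference channel, and shifts are applied circularly):

```python
import numpy as np

def align_channels(channels):
    """Align each channel to the highest-energy one via cross-correlation.

    channels: list of equal-length 1-D arrays. Returns the aligned channels.
    Illustrative sketch: energy is used as a stand-in for SNR, and the
    compensating shift is applied circularly with np.roll.
    """
    ref = max(channels, key=lambda c: np.sum(c ** 2))   # reference channel
    aligned = []
    for ch in channels:
        corr = np.correlate(ref, ch, mode="full")       # cross-correlation
        lag = int(np.argmax(corr)) - (len(ch) - 1)      # best-matching shift
        aligned.append(np.roll(ch, lag))                # apply (circular) shift
    return aligned

# Example: the second channel is a weaker copy of the first, delayed 5 samples.
t = np.linspace(0, 1, 200)
s = np.sin(2 * np.pi * 7 * t)
delayed = np.roll(s, 5) * 0.5
a, b = align_channels([s, delayed])
print(np.argmax(np.correlate(a, b, "full")) - (len(b) - 1))  # prints 0: aligned
```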
Wireless Sensor Networks for Voice Capture in Ubiquitous Home Environments
The work introduces a voice capture application using a hierarchical wireless sensor network: nodes in
the same area belong to the same group of nodes (cluster), and in each cluster a special node
coordinates the cluster activities (clusterhead). Each clusterhead collects audio data from its cluster
nodes and relays it to a base station instantiated by a personal computer. Each node is instantiated by a
MicaZ mote.
In order to capture audio signals from the environment, sensor nodes have a high-sensitivity micro-
phone, which is initially used to detect the presence of human voice in the environment. To do so, the
audio signal is continuously sensed at a sampling frequency of 2 kHz; if the sensed signal intensity
exceeds a predetermined threshold, the node sends a notification through its wireless interface to
the clusterhead, which in turn sends a capture command back to the sensor node. The sensor
node then enters a High Frequency Sampling (HFS) mode to capture three seconds of audio, storing
each sample in the on-board EEPROM. Once the node finishes sampling, it exits the HFS mode
and transfers the audio data to the clusterhead node, which in turn relays it to the base station,
where a more resourceful computer performs the speech recognition tasks.
The High Frequency Sampling state involves shutting down the communication interface and sampling audio through the microphone at a sampling frequency of 8 kHz. The microphone in the motes is wired to a 10-bit resolution ADC, but only 8 of the 10 bits of each sample were used. It was imperative to shut down the communication interface in order to sample at such a frequency, because the two components cannot be enabled simultaneously on the resource-limited microcontroller on board these nodes.
For speech recognition, the VR Stamp Toolkit from Sensory Inc. is used. The VR Stamp toolkit is an off-the-shelf tool that uses neural network-based algorithms for voice pattern matching. It consists of a daughterboard with an integrated RSC-4128 speech processor, 1 Mbit of flash memory and a 128 kB serial EEPROM for data; this board is attached to the base station (PC) via a USB port.
Strength The Low Frequency Sampling (LFS) state allows power saving and extends battery life.
Weaknesses The system is not fully real-time, since speech recognition starts only after 3 seconds of audio data have been recorded and sent; MicaZ motes have a limited lifetime, since the on-board EEPROM can bear far fewer read/write operations than RAM memory; and the system is very expensive, since each MicaZ mote costs about $80², to which the costs of the microphones and of the VR Stamp toolkit have to be added.
1.5 Structure of the Dissertation
This work is organized as follows: in Chapter 2 the proposed architecture of the BigEar system is described, focusing on the characterization of the context of use, on the features of the Wixel prototyping board (core of the system) and on the communication protocol between the components of the system. The following chapter illustrates the simplified model that makes it possible to investigate the architecture capabilities by means of a simulation environment. The core of the system is represented by the BigEar Reconstruction Algorithm described in Chapter 4, which presents all of the methods used for reconstructing a speech signal starting from the audio data captured by each sensor. A real-world working prototype of the system has then been built following the guidelines and the design choices discussed in Chapter 5. The audio data captured by means of the BigEar Prototype have been compared with the ones generated by the simulator, and the results of this comparison are discussed in Chapter 6. Finally, Chapter 7 concludes the dissertation with final remarks and an outlook on future work.
² Quotation from the first quarter of 2015.
2 BigEar Architecture
Introduction
In this chapter the architecture of the system is described. Basically, it is composed of a set of wireless audio sensors connected to a master receiver that collects data from the sensors and sends audio packets to the base station via USB. The base station is responsible for aggregating the received data in order to rebuild the speech captured by the sensors in a distributed fashion.
2.1 Overview of the System
The system is composed of a network of audio sensors that capture the audio in a room in a distributed fashion. The speech is sent to a main receiver, which basically acts as an interface that converts speech packets received over the radio channel into serial data to be sent to the base station. The base station contains the application logic for handling speech packets. Since the audio sensors basically perform a space-time sampling of the audio inside a room, the application logic tries to reconstruct a good-quality speech stream starting from packets that arrive at the base station with different timestamps and different physical characteristics. Indeed, each sensor samples the audio signal that reaches its microphone after undergoing variations due to the physical model of the environment: different delays and amplitudes that depend on the position of the person with respect to the sensors, and reflections/diffusions due to the geometry of the room and the materials of the walls and furniture. The granularity of the space-time sampling is influenced by:
• Number of audio sensors w.r.t. the dimensions of the room: the larger the number of sensors spread in the room, the finer the granularity of the space sampling.
• Audio sensor internal characteristics and constraints: each sensor needs time in order to sample
data (depending on ADC type), store them in buffers and send them to the main receiver.
• Network communication protocol characteristics and constraints: the number of packets sent to the main receiver is affected by the number of collisions that may happen on the channel and also by the protocols themselves (handshaking, request-response timings, timeslot allocations).

Figure 2.1: Overview of the BigEar architecture

Figure 2.2: Wixel module pinout and components (from www.pololu.com/wixel)

In the following sections the different components of the BigEar architecture will be explained.
2.2 Wixel Programmable USB Wireless Module
Leaf nodes of the architecture are represented by the sensors, which are built using Wixel Programmable USB Wireless Modules.
The Pololu Wixel is a general-purpose programmable module featuring a 2.4 GHz radio and USB
port. The Wixel is built around the CC2511F32 microcontroller from Texas Instruments, which has an
integrated radio transceiver, 32 KB of flash memory, 4 KB of RAM, and a full-speed USB interface.
The overall Wixel pinout and peripherals are depicted in Figure 2.2. A total of 15 general-purpose I/O
lines are available, including 6 analog inputs, and the 2.54 mm (0.1") pin spacing makes the Wixel easy
to use with breadboards and perfboards.
The Wixel can connect to a computer’s USB port via a USB A to Mini-B cable or a USB A to Mini-B
adapter. The USB connection is used to configure the Wixel and also to transmit and receive data. The
USB connection may also provide power to the Wixel.
On the side of the board opposite the USB connector, the Wixel has a 2.4 GHz PCB trace antenna.
This antenna, along with the other RF circuitry, forms a radio that allows the Wixel to send and receive
data packets in the 2.4 GHz band. The CC2511F32 core makes the Wixel compatible with the CC2500
transceiver, the CC2510Fx family, and the CC2511Fx family of chips from Texas Instruments. The
Wixel’s radio is not compatible with Wi-Fi, Zigbee, or Bluetooth. The antenna is a “meandered Inverted
F” design that provides optimal performance despite its small size.
TI CC2511F32 Microcontroller Unit
The CC2511F32 is a true low-cost 2.4 GHz system-on-chip (SoC) designed for low-power wireless
applications. The CC2511F32 combines the excellent performance of the state-of-the-art RF transceiver
CC2500 with an industry-standard enhanced 8051 MCU, up to 32 kB of in-system programmable flash
memory and 4 kB of RAM, and many other powerful features. The small 6x6 mm package makes it very
suited for applications with size limitations.
The CC2511F32 is highly suited for systems where very low power consumption is required. This
is ensured by several advanced low-power operating modes. The CC2511F32 adds a full-speed USB
controller to the feature set of the CC2510F32. Interfacing to a PC using the USB interface is quick
and easy, and the high data rate (12 Mbps) of the USB interface avoids the bottlenecks of RS-232 or
low-speed USB interfaces.
2.3 Audio Interface
The analog stage that converts pressure waves into an electric signal and that conditions it for feeding
the ADC has been designed in order to meet physical characteristics of a speech signal and to meet the
constraints due to MCU features and capabilities. Pressure waves are captured by a small microphone
that converts it into an electric signal whose amplitude is dependent on both the intensity of pressure
wave and the sensitivity of the microphone itself. Then, the electric signal has to be amplified in order
to make it readable (the output of the microphone is much smaller than the quantization step) and to
minimize quantization error. Moreover, the signal has to be polarized since the ADC works between 0
and a positive voltage Vref while microphone output works symmetrically around 0.
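As a rough illustration of this conditioning stage, the following sketch sizes the amplifier gain and DC bias needed to fit the microphone output into the ADC input range. The 3.3 V reference voltage is an assumed example value, not a BigEar design figure, and the 0.126 mV peak anticipates the result derived in Section 2.3.2.

```python
# Illustrative sketch (not the actual BigEar circuit): size the pre-amplifier
# gain and DC bias so the microphone output fills the ADC input range.
V_REF = 3.3            # assumed ADC reference voltage [V]
V_MIC_PEAK = 0.126e-3  # peak microphone output [V] for typical speech

# The amplified signal should swing within [0, Vref], centered at Vref/2,
# so the maximum allowed peak amplitude after amplification is Vref/2.
gain_max = (V_REF / 2) / V_MIC_PEAK
bias = V_REF / 2       # DC offset that centers the bipolar mic signal

print(f"max gain ~ {gain_max:.0f}x, bias = {bias:.2f} V")
```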
2.3.1 Physical Characteristics of the Typical Situation
In telephony, the usable voice frequency band ranges from approximately 300 Hz to 3400 Hz [8]. It is for
this reason that the ultra low frequency band of the electromagnetic spectrum between 300 and 3000 Hz
is also referred to as voice frequency, being the electromagnetic energy that represents acoustic energy
at baseband. The bandwidth allocated for a single voice-frequency transmission channel is usually 4
kHz, including guard bands, allowing a sampling rate of 8 kHz to be used as the basis of the pulse code
modulation system used in telephony.
Here, the simplest case of a single person speaking in a room is considered. According to SPL tables [9], the Sound Pressure Level (SPL) is assumed to be about 60 dB one meter away from the source. Typically the sensors are attached to the ceiling of the room, so it can be assumed that they will be placed at an average height of 2.80 m above the floor. According to standard heights used in architecture [31, p. 109], the mouth of a person is considered to be located between 1.12 m and 1.53 m above the floor, which gives an average vertical mouth-ceiling distance of 1.48 m. Taking into account that the person will probably not stand exactly below a sensor, an arbitrary mouth-sensor distance of about 2 m has been considered.
Sound in free field
A field is defined as free when all the reflections are negligible (e.g. an anechoic chamber, or a sound
field region with no adjacent reflecting surfaces). In this situation, Sound Pressure Level halves every
doubling of distance. From Hopkins-Stryker equation [22, p. 508]:
$$L_p = L_w + 10\log_{10}\underbrace{\left(\frac{Q}{4\pi r^2}\right)}_{\text{distance attenuation}} \tag{2.1}$$
where:
Lw is the level of the source sound power,
Q is its directivity,
r is the distance between source and receiver.
Evaluating the distance attenuation in Equation 2.1, it can be noted that doubling r increases the distance attenuation by about 6 dB. From these values, the effective pressure that reaches the microphone capsule
can be calculated:
$$L_p = 20\log_{10}\left(\frac{p}{p_0}\right) \;\Rightarrow\; p = p_0 \cdot 10^{L_p/20}, \quad\text{with } p_0 = 20\ \mu\text{Pa} \tag{2.2}$$

$$L_p = 60 - 6 = 54\ \text{dB} \qquad p_{\text{speech}} = 20\times10^{-6}\ \text{Pa} \cdot 10^{54/20} = 10.02\ \text{mPa} \tag{2.3}$$

where p is the sound pressure expressed in Pa and Lp is the corresponding Sound Pressure Level expressed in dB, evaluated w.r.t. the reference value p0.
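The SPL-to-pressure conversion of Equations 2.2 and 2.3 can be checked numerically with a few lines of Python (a worked example, not part of the thesis toolchain):

```python
import math  # not strictly needed here, kept for clarity of intent

def spl_to_pressure(lp_db, p0=20e-6):
    """Convert a sound pressure level in dB to a pressure in Pa (Eq. 2.2)."""
    return p0 * 10 ** (lp_db / 20)

# 60 dB SPL at 1 m, minus ~6 dB for doubling the distance to 2 m (Eq. 2.3)
lp_2m = 60 - 6
p_speech = spl_to_pressure(lp_2m)
print(f"{p_speech * 1e3:.2f} mPa")  # ~10.02 mPa
```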
Voice frequency bandwidth: from 300 to 3000 Hz (BW = 2700 Hz)
Speech intensity: 60 dB SPL at one meter from the mouth
Typical height of a room: 2.80 m
Mouth height (seated to upright position): 1.12 to 1.53 m
Considered mouth-sensor distance: 2 m
Speech pressure at 2 m: from 10.02 mPa (free field) up to 20 mPa

Table 2.1: Physical characterization of the Typical Situation
Semi-reverberant fields
A field is said to be semi-reverberant when it contains both free-field zones (near the source, where
the direct sound prevails) and reverberant field zones (near the walls, where the reflected field prevails).
Normally-sized rooms can be supposed as semi-reverberant fields. In this situation, Sound Pressure
Level can be defined by the Hopkins-Stryker equation as:
$$L_p = L_w + 10\log_{10}\left(\frac{Q}{4\pi r^2} + \frac{4}{A}\right) \tag{2.4}$$

$$A = \bar{\alpha}S = \sum_i \alpha_i \cdot S_i$$
where:
Lw is the level of the source sound power,
Q is its directivity,
r is the distance between source and receiver,
ᾱ is the average absorption coefficient,
S is the total interior surface.
In a semi-reverberant acoustic field, the density of sound energy at a point is therefore given by the sum of the direct and indirect acoustic fields. Depending on the acoustic absorption coefficients and on the size of the room, the overall attenuation could thus be much lower than 6 dB per distance doubling. The amplifier design will therefore take this range of pressure values into account in order to provide the best possible signal to the ADC stage. Table 2.1 summarizes the physical characteristics of the situation described above.
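A sketch of Equation 2.4, with illustrative room figures that are not taken from the thesis, shows how the reverberant term 4/A flattens the attenuation per distance doubling:

```python
import math

def hopkins_stryker(lw_db, q, r, alphas_surfaces=None):
    """Sound pressure level via the Hopkins-Stryker equation (Eq. 2.4).

    lw_db: source sound power level [dB]; q: directivity; r: distance [m];
    alphas_surfaces: list of (absorption coefficient, surface area) pairs.
    If omitted, the free-field form (Eq. 2.1) is evaluated instead.
    """
    direct = q / (4 * math.pi * r ** 2)
    if alphas_surfaces:
        A = sum(a * s for a, s in alphas_surfaces)  # room constant A = sum(alpha_i * S_i)
        return lw_db + 10 * math.log10(direct + 4 / A)
    return lw_db + 10 * math.log10(direct)

# Illustrative numbers (not from the thesis): omnidirectional source (Q = 1),
# Lw = 70 dB, a 4 x 3 x 2.8 m room with uniform average absorption 0.3.
surfaces = [(0.3, 2 * (4 * 3 + 4 * 2.8 + 3 * 2.8))]
lp_1m = hopkins_stryker(70, 1, 1.0, surfaces)
lp_2m = hopkins_stryker(70, 1, 2.0, surfaces)
print(f"attenuation per doubling: {lp_1m - lp_2m:.2f} dB (< 6 dB free-field)")
```

With these example figures the attenuation per doubling drops to roughly 1 dB, well below the 6 dB free-field value, which is why the amplifier must accept the whole pressure range of Table 2.1.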
2.3.2 Sensitivity of the microphone
The sensitivity of a microphone is the electrical response at its output pins to a given standard acoustic
input [16]. This is expressed as the ratio of the electrical output to the input pressure. The standard
reference input signal for microphone sensitivity measurements is a 1 kHz sine wave at 94 dB SPL, or
1 Pa. For analog microphones, the sensitivity is typically specified in units of dBV, that is, decibels with
reference to 1.0 V rms. Analog microphone sensitivity can also be expressed in mV/Pa units:

$$S_{dBV} = 20\log_{10}\left(\frac{S_{mV/Pa}}{Out_{REF}}\right), \quad\text{with } Out_{REF} = 1000\ \text{mV/Pa the reference output voltage} \tag{2.5}$$

$$S_{mV/Pa} = Out_{REF} \cdot 10^{S_{dBV}/20} \tag{2.6}$$
The microphones used for the prototypes have a sensitivity of −38 dBV, so their sensitivity expressed in mV/Pa units can be derived:

$$S_{dBV} = -38\ \text{dBV} \qquad S_{mV/Pa} = 1000\ \text{mV/Pa} \cdot 10^{-38/20} = 12.59\ \text{mV/Pa} \tag{2.7}$$
Using Equations 2.3 and 2.7, we can assert that in the presence of a typical speech signal the microphone capsule will produce an output signal whose amplitude will be:

$$V_{mic} = p_{speech} \cdot S_{mV/Pa} = 10.02\times10^{-3}\ \text{Pa} \cdot 12.59\ \text{mV/Pa} = 0.126\ \text{mV} \tag{2.8}$$
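Equations 2.6 to 2.8 can be reproduced numerically as follows (a worked check, with the −38 dBV sensitivity taken from the text):

```python
def sensitivity_dbv_to_mv_per_pa(s_dbv, out_ref=1000.0):
    """Eq. 2.6: convert a microphone sensitivity from dBV to mV/Pa."""
    return out_ref * 10 ** (s_dbv / 20)

s = sensitivity_dbv_to_mv_per_pa(-38)  # ~12.59 mV/Pa (Eq. 2.7)
v_mic_mv = 10.02e-3 * s                # Eq. 2.8: speech pressure times sensitivity
print(f"{s:.2f} mV/Pa -> {v_mic_mv:.3f} mV")
```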
2.3.3 ADC characteristics
The Analog-to-Digital Converter of CC2511F32 MCU, whose block diagram is depicted in Fig. 2.3, is
capable of converting an analog input into a digital representation with up to 12 bits resolution. The ADC
includes an analog multiplexer with up to six individually configurable channels and a reference voltage generator. All references to VDD apply to the voltage on the pin AVDD. The main features of the ADC are
as follows:
• Selectable decimation rates, which also set the resolution (7 to 12 bits).
• Six individual input channels, single-ended or differential
• Reference voltage selectable as internal, external single ended, external differential, or VDD.
• Interrupt request generation
Time of conversion
The time required to perform a conversion, Tconv, depends on the selected decimation rate:

$$T_{conv} = (\text{decimation rate} + 16) \cdot \tau \tag{2.9}$$

where τ = 0.25 µs.
When, for instance, the decimation rate is set to 128, the decimation filter uses exactly 128 ADC
clock periods to calculate the result. When a conversion is started, the input multiplexer is allowed 16
Figure 2.3: CC2511 ADC Block diagram
Dec. rate | Resolution | Time of conversion | Max. allowed frequency
64 | 7 bits | 16 µs | 62.50 kHz
128 | 9 bits | 32 µs | 31.25 kHz
256 | 10 bits | 64 µs | 15.63 kHz
512 | 12 bits | 128 µs | 7.81 kHz

Table 2.2: ADC speed of conversion vs. decimation rate. It is assumed that conversions are always performed on the same channel, so the multiplexer does not need to allocate clock periods for changing the input channel.
ADC clock periods to settle in case the channel has been changed since the previous conversion. The 16
clock cycles settling time applies to all decimation rates [12, p. 139].
The times required to perform a conversion as a function of the decimation rate are summarized in Table 2.2; the third column reports the maximum sampling frequency allowed by the ADC decimation timings. It is worth considering the speed of conversion together with the ADC resolution, since it represents an important design choice. The higher the resolution, the lower the quantization error. On one hand, according to Table 2.2, low resolutions allow a high sampling frequency and thus a higher temporal resolution, with the drawback of a lower number of quantization levels; on the other hand, high resolutions reduce the quantization error, which is a key factor in signal post-processing. ADC resolution and sampling frequency will be discussed in depth in Sections 5.2.1 and 5.3.1.
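The timings of Table 2.2 follow directly from Equation 2.9; the sketch below reproduces them under the same same-channel assumption stated in the table caption:

```python
TAU_US = 0.25  # ADC clock period tau [us]

def conversion_time_us(decimation_rate, channel_changed=False):
    """Eq. 2.9; the 16-cycle multiplexer settling time is only spent
    when the input channel has changed since the previous conversion."""
    settle = 16 if channel_changed else 0
    return (decimation_rate + settle) * TAU_US

# Reproduce Table 2.2 (same-channel conversions)
for rate, bits in [(64, 7), (128, 9), (256, 10), (512, 12)]:
    t = conversion_time_us(rate)
    print(f"{rate:4d} -> {bits:2d} bits, {t:5.0f} us, max {1e3 / t:.2f} kHz")
```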
2.4 BigEar Receiver and Base station
The Base Station is the device that collects data from the sensors and arranges packets in order to produce
a clear and intelligible speech signal. It consists of one Wixel connected via USB port to a machine that
runs the application logic.
The only task of the Wixel (BigEar Receiver) is to act as a wireless-USB interface (and vice versa).
It mainly receives radio packets from the sensors, transforms them into hexadecimal nibbles and sends
Figure 2.4: BigEar Receiver Logic
them to the machine via the USB port. When the computer needs to send commands to the sensors (or to reply to protocol messages), the BigEar Receiver receives hexadecimal nibbles through the USB port, converts them into bytes and sends them using the radio module of the CC2511 SoC (Fig. 2.4).
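The byte-to-nibble translation performed by the receiver can be sketched as follows. The actual firmware runs in C on the CC2511; this Python fragment only illustrates the encoding logic, and the function names are hypothetical:

```python
def bytes_to_nibbles(data: bytes) -> str:
    """Encode a radio packet as hexadecimal nibbles for the USB serial link."""
    return data.hex().upper()

def nibbles_to_bytes(nibbles: str) -> bytes:
    """Decode hexadecimal nibbles received over USB back into raw bytes."""
    return bytes.fromhex(nibbles)

# Round trip: a packet survives the radio-to-USB-to-radio translation
packet = bytes([0x12, 0xAB, 0xFF])
assert nibbles_to_bytes(bytes_to_nibbles(packet)) == packet
```

Encoding bytes as printable hex nibbles is a simple way to frame binary radio payloads over a byte-oriented serial link.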
The Wixel Programmable Module has been programmed to handle up to 256 different radio channels [6]. All of the sensors share the same channel used by the Wixel receiver, so if the network architecture requires channel separation (e.g. for reducing the number of collisions), a second Wixel connected to another USB port of the machine is needed.
The application logic (illustrated in Figure 2.5) that runs on the machine receives radio packets (Fig. 2.5a), each containing a set of buffered audio samples tagged with a timestamp and the sensor ID; for each sensor, audio samples are arranged according to their timestamp. In this way, a coherent but incomplete stream is obtained for each sensor: coherent because the audio samples are in the right time position with respect to the sensor timestamp, incomplete because there is no guarantee that the transmission time is less than or at most equal to the sampling time. Indeed, the procedures for elaborating, buffering and sending samples require a time overhead that does not allow a real-time transfer rate.
Once the samples have been arranged, the application performs a time delay or advance of the audio streams (Fig. 2.5b) coming from the sensors in order to remove the delays caused by the different distances between the mouth of the user and the sensors. In-phase audio contributions are thus obtained; they can be summed together in order to produce a seamless stream (Fig. 2.5c).
During the alignment process, the different energy contributions of the sensors must be taken into account: the closer the sensor is to the user, the larger the signal amplitude, and vice versa. Moreover, the sum of the contributions is weighted in order to prevent amplitude distortions in the resulting signal.
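The arrange-align-sum logic described above can be sketched as follows. This is a simplified illustration, not the BigEar Reconstruction Algorithm of Chapter 4: delays are assumed to be known integer sample counts, and the function names are hypothetical:

```python
def align_and_sum(streams, delays_samples, gains):
    """Advance each sensor stream by its estimated delay (integer samples),
    weight it by its gain, and sum the in-phase contributions (cf. Fig. 2.5c).
    Gains are normalized so that the sum does not distort the amplitude."""
    n = max(len(s) for s in streams)
    total = sum(gains)
    out = [0.0] * n
    for stream, d, g in zip(streams, delays_samples, gains):
        w = g / total                               # normalized weight
        for i, sample in enumerate(stream[d:]):     # drop the first d samples
            out[i] += w * sample                    # ...to undo the delay
    return out

# Toy example: the same ramp captured with delays of 0 and 2 samples,
# the second copy attenuated by one half
src = list(range(8))
s1 = [float(v) for v in src]
s2 = [0.0, 0.0] + [0.5 * v for v in src[:6]]
rec = align_and_sum([s1, s2], [0, 2], [1.0, 0.5])
```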
2.5 Network Protocols
Network protocols have a big impact on the efficiency of the whole system: collisions, granularity of the
network of sensors and presence of protocol messages can affect the number of audio packets transmitted
and received successfully.
In general, when the granularity of the network increases, the likelihood of collisions also grows. At
Figure 2.5: BigEar Application Reconstruction Logic. (a) Data received from sensors cannot be summed directly; (b) audio packets are arranged and aligned; (c) audio packets are summed into a unique stream.
the same time, the number of service messages needed to implement synchronization mechanisms has to be increased in order to reduce the number of collisions. This comes at the expense of channel availability and software complexity. The CC2511 offers neither Carrier Sense nor Collision Detection, so the scenario can be seen as a pure broadcast domain. Moreover, each Wixel receives everything that is being transmitted on its channel, and each transmission reaches every Wixel that is listening on the same radio channel.
2.5.1 ALOHA Protocol
The simplest protocol that can be adopted in this scenario is ALOHA [1]; this name refers to a simple communication scheme in which each source (transmitter) in a network sends data whenever there is a frame to send. ALOHA also specifies that if the frame successfully reaches the destination (receiver), the next frame is sent; otherwise, the frame is sent again. Since audio data are time-dependent, for BigEar application purposes it is pointless to retransmit audio packets, so the transmitter-side application does not wait for any acknowledgment from the base station.
Advantages:
• It doesn’t need any protocol or service message.
• Adapts to varying number of stations.
Disadvantages:
• A theoretically proven maximum throughput¹ of 18.4%.
By making a small restriction in the transmission freedom of the individual stations, the throughput
of the Aloha protocol can be doubled. Assuming constant length packets, transmission time is broken
into slots equivalent to the transmission time of a single packet. Stations are only allowed to transmit
at slot boundaries. When packets collide they will overlap completely instead of partially. This has the
effect of doubling the efficiency of the Aloha protocol and has come to be known as Slotted Aloha [20]
(Fig. 2.6).
The BigEar application has been tested using the pure ALOHA protocol (without acknowledgments) in order to exploit and examine the system capabilities with the simplest communication protocol.
¹ Number of packets successfully transmitted (without collision) per unit time, where the unit time is the (fixed) duration of a transmission.
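The 18.4% bound can be checked with a small Monte-Carlo experiment (an illustrative sketch, not part of the BigEar test campaign):

```python
import math
import random

def simulate_pure_aloha(g, n_frames=200_000, seed=1):
    """Monte-Carlo estimate of pure ALOHA throughput at offered load g
    (frames per frame-time). A frame succeeds only if no other frame
    starts within one frame-time before or after its own start."""
    random.seed(seed)
    span = n_frames / g  # observation window, measured in frame-times
    starts = sorted(random.uniform(0, span) for _ in range(n_frames))
    ok = 0
    for i, t in enumerate(starts):
        clear_before = i == 0 or t - starts[i - 1] >= 1.0
        clear_after = i == n_frames - 1 or starts[i + 1] - t >= 1.0
        ok += clear_before and clear_after
    return ok / n_frames * g  # successful frames per frame-time

# Theory: S = G * exp(-2G), maximized at G = 0.5 where S = 1/(2e) ~ 18.4%
print(simulate_pure_aloha(0.5), 0.5 * math.exp(-1))
```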
3 BigEar Modeling and Simulation
Introduction
In this chapter the methods and structures used for simulating the entire architecture within its environment are described. The dissertation starts by illustrating the physical model of the propagation of a sound within a room and the tools used for simulating it, then continues by describing a model that simulates the behavior of the network of sensors arranged in the room.
3.1 Overview
Figure 3.1 shows a schematic representation of the BigEar simulator. It is composed of four interconnected modules, from the last of which the reconstructed signal can be retrieved.
The Audio model block consists of the simulation of the environment: the room dimensions (and optionally other parameters such as the reflection and diffraction coefficients of the walls) and the locations of the sensors and of the audio source are set. Then an audio file is provided as input. The block produces as many
audio streams as there are sensors. Each stream differs from the others in terms of amplitude, delay and
diffusion, which change depending on the acoustics characteristics of the room and the positions of the
sensor and the source.
The second block, Radio network model, simulates the behavior of the transmitters according to a chosen network protocol. In the simplest case of the ALOHA protocol (see 2.5.1), it simulates random instants of transmission for each transmitter and tags each packet as received or collided. This kind of information is important because it affects the behavior of the following block, which models the internal buffer structure of the sensors.
The N-buffer model block implements the buffering system internal to each transmitter. In the real world, data are continuously sampled and buffered by each transmitter so as to be ready to be sent when needed; during the simulation, instead, the time instants at which transmissions occur are known, but the buffering structures need to be modeled in order to know which data are ready to be transmitted. This block outputs the audio samples packed as if they were coming from real transmitters:
Figure 3.1: Architecture model
each packet contains the ID of the transmitter, the timestamp of the first sample¹ of the packet and a number of samples that corresponds to the frame size of each buffer.
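A minimal model of such a buffering structure might look as follows; the class and its parameters are hypothetical and only illustrate the packing of samples into timestamped frames:

```python
from collections import deque

class NBufferSensor:
    """Simplified model of a sensor's internal buffering (hypothetical, not
    the actual firmware): samples fill fixed-size frames; at each transmission
    instant the oldest complete frame is emitted, tagged with the timestamp
    of its first sample."""

    def __init__(self, sensor_id, frame_size, n_buffers):
        self.sensor_id = sensor_id
        self.frame_size = frame_size
        self.ready = deque(maxlen=n_buffers)  # oldest frames dropped when full
        self.current = []
        self.current_ts = 0

    def sample(self, t, value):
        if not self.current:
            self.current_ts = t  # timestamp of the frame's first sample
        self.current.append(value)
        if len(self.current) == self.frame_size:
            self.ready.append((self.current_ts, self.current))
            self.current = []

    def transmit(self):
        """Return (id, timestamp, samples), or None if no full frame is ready."""
        if not self.ready:
            return None
        ts, frame = self.ready.popleft()
        return (self.sensor_id, ts, frame)

sensor = NBufferSensor("A", frame_size=4, n_buffers=3)
for t, v in enumerate(range(10)):
    sensor.sample(t, v)
pkt = sensor.transmit()  # ('A', 0, [0, 1, 2, 3])
```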
The fourth block has been reported here for completeness but it will be discussed in chapter 4. It
concerns the reconstruction of the source signal starting from output of the previous block.
3.2 Audio Model
In order to create an audio model of the typical use case, some assumptions need to be made about the typical environment where the system will work. The room model assumed is a rectangular enclosure,
a three-dimensional space bounded by six surfaces (walls, ceiling and floor). In section 3.2.2 this type
of room will also be called "empty shoebox". Each room surface has its own absorption and scattering (diffusion) coefficients. The scattering coefficients set, as a function of frequency, the ratio between the amounts of specular and diffuse reflection. Sound scattering due to furniture and other objects in the room can be approximated by higher levels of overall room diffuseness. This type of room has been chosen because it is the most common room shape, and also because of its ease of modeling and computability.
Other assumptions concern the directivity of the source and of the sensors. The directivity of the source is assumed to be negligible because in small-to-medium rooms the reflections give a large contribution to the diffusion of the sound. Small microphone capsules have an omnidirectional polar pattern, so directionality factors can be neglected.
¹ Timestamps of the other samples can be inferred as τᵢ = i · (1/Fs), where i is the (0-based) index of the ith sample in the packet.
Figure 3.2: Geometry of wave propagation from a point source x1 to a listening point x2.
3.2.1 Physical Model
A point source produces a spherical wave in an ideal isotropic (uniform) medium such as air [24]. To
a good first approximation, wave energy is conserved as it propagates through the air. In a spherical
pressure wave of radius r, the energy of the wavefront is spread out over the spherical surface area 4πr2.
Therefore, the energy per unit area of an expanding spherical pressure wave decreases as 1/r2. This is
called spherical spreading loss. Since energy is proportional to amplitude squared, an inverse square law
for energy translates to a 1/r decay law for amplitude.
The sound-pressure amplitude of a traveling wave is proportional to the square-root of its energy per
unit area. Therefore, in a spherical traveling wave, acoustic amplitude is proportional to 1/r , where r
is the radius of the sphere. In terms of Cartesian coordinates, the amplitude p(x2) at the point x2 =
(x2, y2, z2) due to a point source located at x1 = (x1, y1, z1) is given by
$$p(\mathbf{x}_2) = \frac{p_1}{r_{12}}$$

where p1 is defined as the pressure amplitude one radial unit from the point source located at x = x1, and r12 denotes the distance from the point x1 to x2:

$$r_{12} \triangleq \|\mathbf{x}_2 - \mathbf{x}_1\| = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$
This geometry is depicted for the 2D case in Fig. 3.2.
In summary, every point of a radiating sound source emits spherical traveling waves in all directions
which decay as 1/r , where r is the distance from the source. The amplitude-decay by 1/r can be
considered a consequence of energy conservation for propagating waves. (The energy spreads out over
the surface of an expanding sphere.) We can visualize such waves as “rays” emanating from the source,
and we can simulate them as a delay line along with a 1/r scaling coefficient (see Fig. 3.3).
Figure 3.3: Point-to-point spherical wave simulator. In addition to propagation delay, there is attenuation by g = 1/r.
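The delay-line-plus-gain structure of Fig. 3.3 can be sketched in a few lines (illustrative; c = 343 m/s and Fs = 8 kHz are assumed example values):

```python
def spherical_wave(x, r, fs=8000, c=343.0):
    """Simulate Fig. 3.3: a pure delay of M = round(r/c * fs) samples
    followed by a 1/r attenuation (free-field spherical spreading)."""
    m = round(r / c * fs)
    return [0.0] * m + [s / r for s in x]

# A source 3.43 m away: ~10 ms of propagation -> 80 samples of delay,
# and the amplitude is scaled by 1/3.43
y = spherical_wave([1.0, 0.5], r=3.43, fs=8000)
```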
Figure 3.4: The law of reflection states that the angle of incidence θi of a wave or stream of particles reflecting from a boundary, conventionally measured from the normal to the interface (not the surface itself), is equal to the angle of reflection θr, measured from the same interface.
Figure 3.5: Geometry of an acoustic reflection caused by "multipath" propagation. A direct signal and a "floor bounce" are received from the source S at the listening point L.
Reflection of Sound Waves
When a spreading spherical wave reaches a wall or other obstacle, it is either reflected or scattered. A
wavefront is reflected when it impinges on a surface which is flat over at least a few wavelengths in each
direction.2 Reflected wavefronts can be easily mapped using ray tracing, i.e., the reflected ray leaves at
an angle to the surface equal to the angle of incidence (see Fig. 3.4). Wavefront reflection is also called
specular reflection (especially when considering light waves).
A wave is scattered when it encounters a surface which has variations on the scale of the spatial
wavelength. A scattering reflection is also called a diffuse reflection.
The distinction between specular and diffuse reflections is dependent on frequency. Since sound trav-
els approximately 0.34 m per millisecond, a cube 0.34 m on each side will “specularly reflect” directed
“beams” of sound energy above 1 kHz, and will “diffuse” or scatter sound energy below 1 kHz.
Room Acoustics
The pressure wave that reaches the listening point has to be considered as the superposition of the direct
sound and the reflected waves due to “multipath” wave propagation.
Each contribution will have a different delay and amplitude, since these quantities depend on the distance traveled by the wave. In Fig. 3.5 a simple example is depicted, considering only the direct wave
and one reflection. The acoustic source is denoted by ‘S’, the listener by ‘L’, and they are at the same
height h meters from a reflecting surface. The direct path is d meters long, while the length of the single
reflection is 2r meters. These quantities are of course related by the Pythagorean theorem:
$$r^2 = h^2 + \left(\frac{d}{2}\right)^2$$
DIRECT SOUND reaches the listener delayed by ∆d = d/c [s], where c is the speed of propagation, and attenuated by a gain factor gd = 1/d due to the SL = d distance and to the spherical spreading loss described at the beginning of this section.
FIRST REFLECTION reaches the listener delayed by ∆r = 2r/c [s], with r = √(h² + (d/2)²), and attenuated by a gain factor gr = 1/(2r).
This process should be iterated for every reflection, obtaining a set of delay coefficients ∆i and gain coefficients gi. Every (∆i, gi) couple represents a delayed and attenuated version of the original pressure wave.
In the digital domain, after converting the pressure wave into an electric signal and sampling it at a certain frequency Fs, the pressure wave is rendered by a sequence of numbers x(n). The (∆i, gi) contributions can be condensed into a finite impulse response sequence h(n) in which the ith stem represents the ith (∆i, gi) couple: the height of the stem is given by gi, and its position along the x axis by ∆i. According to signal theory, in order to obtain the resultant sound perceived by the sensor at position L, we convolve the source signal x(n) with the room impulse response h(n):

$$y(n) = x(n) \otimes h(n) = \sum_{k=-\infty}^{\infty} x(k)\,h(n-k) \tag{3.1}$$
Calculations for the first-order reflections can easily be done, but the resulting impulse response will
be very poor, since it contains only 7 contributions: one for the direct sound and 6 for the first-order
reflections of the sound off the walls. To build a richer model the image source method must be adopted:
the sound source is mirrored in each surface to produce image sources which represent the corresponding
reflection paths. The next section explains the operating principles of this method, whose representation
is illustrated in Fig. 3.6.
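The expansion of Fig. 3.6 can be reproduced numerically in one dimension. The sketch below (a hypothetical helper, with an assumed wall spacing and source position) mirrors the source recursively in two parallel walls, exactly the S → Sa, Sb and Sa → Sab expansion described above:

```python
def image_sources(s, L, max_order):
    """1-D image source method: source at x = s between walls a (x = 0)
    and b (x = L). Each image is the previous one mirrored in the wall it
    was NOT produced by, matching the S -> Sa, Sb -> Sab expansion."""
    images = {0: [s]}
    frontier = [(s, None)]               # (position, wall of last reflection)
    for order in range(1, max_order + 1):
        nxt = []
        for pos, last in frontier:
            if last != 'a':
                nxt.append((-pos, 'a'))          # mirror in wall a (x = 0)
            if last != 'b':
                nxt.append((2 * L - pos, 'b'))   # mirror in wall b (x = L)
        frontier = nxt
        images[order] = [pos for pos, _ in nxt]
    return images

# walls 4 m apart, source 1 m from wall a (assumed values)
imgs = image_sources(s=1.0, L=4.0, max_order=2)
```

The distance of each image source from the listening point directly yields the delay and gain of the corresponding reflection path.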
3.2.2 MCROOMSIM toolbox
Multichannel Room Acoustics Simulator (MCROOMSIM) [30] is a simulator of recordings of arbitrary
microphone arrays within an echoic shoebox room. This software is written in the C²
²To determine whether a surface is effectively flat, it may first be smoothed so that variations less than a wavelength in size are
ignored; that is, waves do not "see" variations on a scale much smaller than a wavelength.
Chapter 3. BigEar Modeling and Simulation
(a) 0th order image (direct sound). (b) 1st order images (reflections of S over walls a and b).
(c) 2nd order images (reflections of Sa over b and Sb over a). (d) Sum of 0th, 1st and 2nd order reflections.
Figure 3.6: Image source method using two walls, one source and one listening point.
3.2. Audio Model
programming language. The software is freely available from the authors.
MCROOMSIM simulates both specular and diffuse reflections in an empty shoebox environment.
Effects of air absorption and distance attenuation on sound are also modeled by the simulator. The
empty shoebox room geometry simplifies the necessary calculations used during simulation, which in
addition to the simulator being implemented in C, enables a significant reduction in the computation
time of room impulse responses, resulting in a fast simulation package. The simplification of the room
geometry to a shoebox model is suitable when, as in this case, the aim is to create realistic room
impulse responses for evaluating three-dimensional audio applications in echoic environments, rather than
to model the architectural complexities of the room.
MCROOMSIM is designed to interface with MATLAB. The simulator receives all setup data from
MATLAB. Once the simulation has completed, MCROOMSIM returns the simulated room impulse
responses for all configured source and receiver combinations back to the MATLAB workspace. A set
of high-level MATLAB functions is provided to configure the simulator, making it easier for the user to
set up the room, sources and receivers with specific properties, as well as to configure specific properties
of the simulation routines. Figure 3.7 illustrates the dataflow of the toolbox.
Room simulation method
Impulse responses are simulated between each source and receiver combination configured in the echoic
room. MCROOMSIM uses two geometrical acoustic simulation algorithms to generate the impulse
response between a source and receiver, with each algorithm simulating a different part of the reverberant
sound field. The main or specular reflections are simulated using the image source algorithm [4] and the
diffuse sound field is simulated using the diffuse rain algorithm [11,21]. The output from both algorithms
are combined to make the final output response.
Image Source algorithm The image source algorithm provides accurate direction and timing of the
main reflections. The algorithm starts with the zero-th order sound which corresponds to the direct sound.
For first order reflections, the sound source is mirrored in the room’s surfaces to create a series of virtual
sources, which are then traced back to the receiver along with the inclusion of distance attenuation and
air absorption. These virtual sources are then expanded to create more virtual sources of higher order. By
knowing the distance of the virtual source from the receiver and absorption of the surfaces that it reflects
off, the contribution of the corresponding sound propagation path to the room impulse response can be
calculated exactly. This contribution and the receiver’s directional response are then convolved, with the
output being the room response to that propagation path as ‘recorded’ by the receiver. The image source
algorithm continues to expand the order of virtual sources up to a maximum order set by the user, or
ceases when the energy of all virtual sources of the same order drops below a predefined threshold.
Diffuse rain algorithm The diffuse rain algorithm is a fast stochastic ray tracing method. It aims to
provide a good approximation of the diffuse sound field in the room. This simulation technique models
the propagation of sound from the source to the receiver by a series of discrete rays that are traced around
25
Chapter 3. BigEar Modeling and Simulation
[Figure 3.7 reproduces the flow diagram from the MCROOMSIM reference paper (ISRA 2010). The recoverable flow is: on the MATLAB side, Configure Room (size, absorption, diffusion, etc.), Configure Sources (location, directivity, etc.), Configure Receivers (location, directivity, channels, etc.) and Set Simulation Options (thresholds, simulation order, etc.) feed a Generate Setup Structure step; on the MCROOMSIM side, the input parameters are validated, custom directivities are processed (FFT on directional impulse responses), the response length is estimated, specular reflections (Image Source Algorithm) and diffuse reflections (Diffuse Rain Algorithm) are simulated as configured, the specular and diffuse room responses are combined, and one room impulse response per source/receiver combination is returned to the MATLAB workspace.]
Figure 3.7: MCROOMSIM Data flow from configuration stage to the end of simulation
the room. Each ray trajectory is traced and is reflected in a random direction every time it hits a wall.
A ray’s energy decreases in time due to wall absorption, air absorption and distance attenuation. The
process of tracing a ray is continued until the ray’s energy falls below a predefined threshold.
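The energy bookkeeping of a single diffuse ray can be sketched as follows (a 0-D toy model with assumed absorption values and an exponential free-path length; the real simulator traces rays geometrically through the room):

```python
import random

def trace_ray(energy, wall_absorption, air_absorption_per_m,
              mean_free_path, threshold, rng=random.Random(0)):
    """Follow one diffuse ray until its energy drops below `threshold`.

    Between wall hits the ray travels a random distance (air absorption);
    at each hit the wall absorbs a fraction of the energy and the ray is
    re-emitted in a random direction (not tracked in this toy model).
    Returns the number of wall hits survived."""
    hits = 0
    while energy >= threshold:
        d = rng.expovariate(1.0 / mean_free_path)        # path to next wall [m]
        energy *= (1.0 - air_absorption_per_m) ** d      # air absorption
        energy *= (1.0 - wall_absorption)                # wall absorption
        hits += 1
    return hits

hits = trace_ray(energy=1.0, wall_absorption=0.3,
                 air_absorption_per_m=0.001, mean_free_path=3.0,
                 threshold=1e-6)
```

The tracing loop terminates exactly as described: the ray is followed until its energy falls below the predefined threshold.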
MATLAB interface
MCROOMSIM is configured using a group of high level MATLAB functions that ease the setting of
parameters for simulation. The parameters comprise four main parts:
1. Room setup: physical characteristics of the room such as the size, frequency dependent absorption
and scattering coefficients of the walls/ceilings, room temperature and humidity.
2. General simulation options: here the user is provided with an opportunity to configure various
features of the simulator such as the maximum order that the image source algorithm will iterate
up to, the minimum energy threshold for virtual image sources or rays, etc.
3. Receiver setup: the number of receivers to simulate, their locations, orientation, number of chan-
nels and directivity.
4. Source setup: same as the receiver setup with the limitation that sources can only be single channel.
Once the MCROOMSIM function is invoked, MATLAB combines all of the data into a single structure
which it then passes to MCROOMSIM. Once the simulation has completed, all of the room impulse
response data are provided as output to the MATLAB workspace. The output of MCROOMSIM is a time
domain room impulse response for each source and receiver combination. In the case of a multichannel
receiver array, a separate response is provided for each channel.
3.3 Sensors network model
The block that models the network of sensors recreates a realistic simulation of the behavior of the
network, in which each sensor samples the audio signal, buffers it in packets of a specific size and sends
them according to a network communication logic. To do this, the block is internally split in
two parts: the Radio network model, which implements the communication protocol including possible
interactions between the receiver and the transmitters or between the transmitters themselves, and the
N-buffer model, which carries out the internal buffer mechanism of each transmitter.
The outer block, implemented as the MATLAB function simulateWixels, receives
as input the audio streams generated by the audio simulator described in section 3.2, and provides as
output the packets of audio sampled and transmitted by each sensor. For performance measuring
purposes, it also provides the collision ratio for each radio channel. In particular, the block receives:
• samples: the audio samples from the audio simulator in matrix form: each column of the
samples matrix represents one sensor, so the audio stream of the ith sensor is the ith column
vector of the samples matrix.
• Model: the structure containing all of the parameters that characterize a transmitter:
– FRAMES: the number of frames composing the buffer system
– FRAME_SIZE: the number of samples that can be stored in a frame
– Tbusy: the duration of a radio transmission, expressed in milliseconds
– TmaxDelay: the maximum allowed delay for a radio transmission (assuming a
network protocol of the ALOHA family)
– Fs: the sampling frequency of the sensor
• Sensors: the structure that describes the individual characteristics of the sensors: their position in the room,
their radio channel and their name.
– X, Y, Z: column vectors containing the coordinates of the sensors
– Channel: column vector containing the radio channel of each sensor
– Name: (optional) column vector containing the names of the sensors
while the outputs are:
• output: M-by-N matrix, where M is the number of output samples and N the number of
sensors; the jth column holds the samples obtained from the jth sensor
• positions: M-by-N matrix specifying the position in time (expressed as a sample index) of the ith
sample transmitted by the jth sensor
• collisionRatioPerChannel: a vector in which the ith element represents the collision ratio
within the ith channel
The collision ratio is a performance metric that indicates protocol robustness. It is calculated
as follows:
\[ C_r(i) = \frac{N_{lost}(i)}{N_{total}(i)} \tag{3.2} \]
where i denotes the ith channel (supposing that the network exploits different radio channels in order
to reduce the number of collisions), $N_{lost}(i)$ is the number of packets lost due to collisions on the ith
channel and $N_{total}(i)$ is the total number of packets transmitted on the ith channel.
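Eq. (3.2) amounts to a per-channel division; as a trivial sketch (the packet counts are made up for illustration):

```python
def collision_ratio(n_lost, n_total):
    """Eq. (3.2): C_r(i) = N_lost(i) / N_total(i), one value per channel."""
    return [lost / total if total else 0.0
            for lost, total in zip(n_lost, n_total)]

# e.g. 12 of 480 packets lost on channel 1, 3 of 495 on channel 2 (made-up counts)
cr = collision_ratio([12, 3], [480, 495])
```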
The simulateWixels block (see dashed box in Fig. 3.8) performs the network simulation applying
a communication model to the transmitters that belong to the same broadcast domain. It is possible
to split the broadcast domain into several subdomains exploiting the capability of the CC2511 radio to work
on different 2.4 GHz carriers; in this way the number of collisions can be reduced by decreasing
the number of transmitters that share the same carrier. So simulateWixels first divides the sensor
set into K subsets, and then for each subset calls the generateCalls block, described in section
3.3.1, to obtain two matrices: times and valid_flags. In the times matrix each column
represents a transmitter; each element in a column is a number corresponding to the time instant (expressed
in milliseconds from the beginning of the simulation) at which the transmitter starts to transmit a packet.
[Figure 3.8 flowchart content: the audio streams are resampled (Fs_In → Fs) and an initial silence is added; inside simulateWixels, for each radio channel the transmitters belonging to that channel are selected, radio timings and flags are generated, and output packets are produced from the model parameters, looping while other channels remain; finally the signal is reconstructed and the initial silence removed, yielding the reconstructed signal.]
Figure 3.8: Simulation and reconstruction flowchart
The valid_flags matrix has the same dimensions as times; each element is a boolean value:
true when the packet has been received correctly, false when the packet has been transmitted but has not
been received because of a collision with another in-flight packet. The two matrices described above are
given as input to the following block (see section 3.3.2), which handles the internal buffer model of each
transmitter and determines which segment of audio samples has to be transmitted at each time instant.
At the end of this block a set of audio packets (stored in two matrices, as shown in the following
sections) is obtained; the audio packets are then ready for the reconstruction stage described in chapter 4.
The initial buffer conditions can affect the results of the simulation, especially when dealing with short
audio segments. To avoid this side effect, an initial silence of the arbitrary duration of 5 seconds
has been added to the original audio stream. Moreover, each audio stream has been re-sampled at the
sampling frequency adopted by the sensor, Fs. In this way the position of each audio sample can be
expressed using an integer index i such that the ith sample is located at (i − 1)/Fs seconds³
from the beginning of the simulation.
Since the initial silence can be removed only after the signal reconstruction, all of these operations (silence
insertion, re-sampling to Fs and silence removal) are done by the surrounding function
simulateAndReconstruct, which also takes care of calling the reconstruction block before removing the initial
silence (see Fig. 3.8).
3.3.1 Radio Transmission Model
The radio transmission model is implemented by means of the MATLAB generateCalls function.
It is a realization of the pure ALOHA protocol described in section 2.5.1, in which each transmitter in the
network sends data whenever there is a frame to send and then waits for a random delay before sending
another packet. Since audio data are time-dependent, for our purposes it is pointless to retransmit audio
packets, so the transmitter does not wait for any acknowledgment from the receiver. The random delay
that separates the transmissions is obtained from the internal random number generator of each transmitter
and is chosen between 0 and a maximum value TmaxDelay.
The model also checks for collisions, which happen when, given $t_{(i,j)}$ the time instant at which the jth
transmitter starts to transmit the ith packet, another device starts to transmit within the guard interval
$[t_{(i,j)} - T_{busy},\, t_{(i,j)} + T_{busy}]$. In this case all of the transmissions that happen in the interval are marked
as colliding.
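The timing generation and collision marking just described can be sketched as follows (a simplified pure-ALOHA model in Python rather than the thesis's MATLAB generateCalls; the names and parameter values are illustrative):

```python
import random

def generate_calls(n_tx, sim_ms, t_busy, t_max_delay, rng=random.Random(1)):
    """Simplified pure ALOHA: each transmitter sends a packet lasting t_busy ms,
    then backs off for a random delay in [0, t_max_delay] ms. A packet is
    flagged invalid if another device starts within +/- t_busy of its start."""
    starts = []                                   # (start time, transmitter id)
    for j in range(n_tx):
        t = rng.uniform(0, t_max_delay)
        while t < sim_ms:
            starts.append((t, j))
            t += t_busy + rng.uniform(0, t_max_delay)   # packet + random back-off
    starts.sort()
    valid = [True] * len(starts)
    for a in range(len(starts)):                  # flag colliding packets
        for b in range(a + 1, len(starts)):
            if starts[b][0] - starts[a][0] >= t_busy:
                break
            if starts[a][1] != starts[b][1]:      # overlap from another device
                valid[a] = valid[b] = False
    return starts, valid
```

With a single transmitter every packet survives; adding transmitters on the same carrier raises the collision count, which is exactly what splitting the sensors over several 2.4 GHz channels mitigates.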
The flowchart of the radio transmission block is depicted in Fig. 3.9.
The block receives as inputs:
• n_samples: the length, in samples, of the audio stream
• Fs: the sampling frequency adopted by the transmitters
• TRANSMITTERS: the number of transmitters⁴
³(i − 1) because MATLAB indexing is 1-based
⁴it refers to the number of transmitters belonging to the same broadcast subdomain; see page 28
[Figure 3.9 flowchart content: from the input parameters, random delays are generated and summed to obtain random time instants; for each time instant $t_{(i,j)}$, if it is valid, the guard interval $[t_{(i,j)} - T_{busy},\, t_{(i,j)} + T_{busy}]$ is determined and the colliding time instants are flagged; the process repeats while other time instants remain, producing the radio timings and flags.]
Figure 3.9: Flowchart of the radio transmission model
• Tbusy: the duration of a radio transmission, expressed in milliseconds
• TmaxDelay: the maximum allowed delay between two transmissions.
and produces as outputs:
• call_time: M-by-N matrix; each (i,j) element represents the ith time instant at which the jth
transmitter starts to transmit
• call_valid_flag: M-by-N matrix of boolean elements; each (i,j) element is false (invalid) if
the ith packet transmitted by the jth transmitter has collided.
3.3.2 N-buffer Internal Model
The transmitter-side application can be seen as divided into two parts: the sampler and the transmitter.
The sampler (as illustrated in section 5.3.1) continuously samples data and stores them into a framed
buffer. Then the transmitter, at the proper time, checks whether there is at least one full frame before sending
the packet to the receiver. These two parts are isolated from one another, since the sampler is
ruled by a timer interrupt routine while the transmission procedure is called in the main loop. In order to know
which segment of audio samples has to be transmitted at each time instant we have to model the internal
behavior of the buffers of each transmitter. This is the role of the N-buffer block, implemented by means of
the generateTransmitterOutputs MATLAB function.
Multiple buffering allows the sensor to work simultaneously on the read and write sides: the interrupt
routine acquires the signal and stores samples into the write frame, pointed by the irs_idx index,
while the main loop reads from the read frame, pointed by the main_idx index.
Frames are handled by means of incrementing pointers: the condition main_idx = irs_idx
means that the main loop would be reading from the same frame in which the interrupt routine is writing,
which is not allowed; indeed, this condition occurs when all of the buffers have been
emptied by previous readings, so there are no frames ready to be processed.
When a frame is completely read, the main loop increments the main_idx index by means of a modulo
operation:
main_idx = (main_idx + 1) % FRAMES_NUMBER
The same operation is done by the interrupt routine: once the frame is full and there is another free frame
to fill, the index is incremented by means of:
irs_idx = (irs_idx + 1) % FRAMES_NUMBER
In this way the temporal ordering of the frames is preserved. In case the transmission takes place
after ∆t > Tframe · FRAMES, the last frame (the one with the most recent data) slides towards the right
(along an ideal temporal axis pointing to the right) in order to keep itself updated. This mechanism
ensures that if the transmissions are spaced ∆t > Tframe · FRAMES apart, the output packets
will be spaced ∆t apart as well, as if the sampler were sampling only in correspondence with the
transmissions (see Fig. 3.10).
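The index arithmetic above can be captured in a small sketch (a Python toy model of the triple buffer; the real firmware uses the interrupt routine and main loop described in section 5.3.1, and the `full` counter here is an illustrative simplification of the main_idx = irs_idx test):

```python
FRAMES = 3                        # triple buffer, as in Fig. 3.10

class NBuffer:
    """Toy model of the framed buffer: the interrupt routine fills the frame
    at irs_idx, the main loop reads the frame at main_idx; both indices wrap
    modulo FRAMES."""
    def __init__(self):
        self.main_idx = 0         # read frame index
        self.irs_idx = 0          # write frame index
        self.full = 0             # frames currently holding unread data

    def frame_filled(self):       # called when the sampler completes a frame
        if self.full < FRAMES:
            self.irs_idx = (self.irs_idx + 1) % FRAMES
            self.full += 1
        # else: every frame is full -> the newest frame "slides", being
        # rewritten with fresh samples so it stays up to date (Fig. 3.10)

    def read_frame(self):         # called by the main loop before transmitting
        if self.full == 0:        # main_idx caught up with irs_idx: nothing ready
            return None
        idx = self.main_idx
        self.main_idx = (self.main_idx + 1) % FRAMES
        self.full -= 1
        return idx
```

Filling the buffer more than FRAMES times without a read leaves the first FRAMES − 1 frames intact while the newest frame keeps sliding, as in the figure.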
The generateTransmitterOutputs block simulates the behavior described above, with the
difference that the starting point is not the continuous sampling but the analysis of the difference between
the time instants at which transmissions occur for the same transmitter. Combining this time difference
with the knowledge of the previous frame status (how many frames are full, how many are free to be
filled), the buffer status can be updated. The flowchart in Figure 3.11 illustrates
the simulator.
The generateTransmitterOutputs block receives as inputs:
• call_time: M-by-N matrix; each (i,j) element represents the ith time instant at which the jth
transmitter starts to transmit
• call_valid_flag: M-by-N matrix of boolean elements; each (i,j) element is false (invalid) if
the ith packet transmitted by the jth transmitter has collided
• samples: the audio samples from the audio simulator in matrix form: each column of the
samples matrix represents one sensor, so the audio stream of the ith sensor is the ith column
vector of the samples matrix
• Fs: sampling frequency adopted by transmitters
• FRAMES: the number of frames of which the buffer is composed
• FRAME_SIZE: the size of each frame
and produces as outputs:
• output: M-by-N matrix, where M is the number of output samples and N the number of
sensors; the jth column holds the samples obtained from the jth sensor
• positions: M-by-N matrix specifying the position in time (expressed as a sample index) of
the ith sample transmitted by the jth sensor
[Figure 3.10 timeline: time axis in units of Tframe from 0 to 7; frames 1, 2 and 3 are filled in sequence, and frame 3 keeps sliding towards TNOW.]
Figure 3.10: Frame sliding occurs when all of the frames are full and no transmissions have
occurred for a time ∆t > Tframe · FRAMES. In this case FRAMES = 3. Frames 1 and 2 are
ready to be processed; frame 3 is also ready, but continuously slides.
[Figure 3.11 flowchart content: indices and buffers are initialized from the input parameters; for each time instant $t_{(i,j)}$, ∆t is computed as the time distance from the last transmission by sensor j; if irs_idx ≠ main_idx or ∆t > Tframe, and the current packet is valid, an audio packet is emitted and main_idx (the read frame index) is updated; irs_idx (the write frame index) is updated, and if ∆t/Tframe > FRAMES the current read frame slides; the process repeats while other time instants remain, producing the output packets.]
Figure 3.11: Flowchart of the N-buffer model
Chapter 4
Signal Reconstruction and Repair
Introduction
Once the audio signal has been acquired by the sensors and sent to the base station, the audio packets need
to be arranged and summed so as to make the audio signal suitable for speech recognition. This
is the role of the signal reconstruction block. First, the streams coming from the sensors are built using
the timestamp information carried by each audio packet; then, the streams are aligned in order to compensate
for the delays due to the distance of the speech source from the position of each sensor. When
the signals are properly aligned they can be superposed with several summing or replacing methods, chosen
so as to preserve the signal energy and not to introduce energy variations due to the different number of
contributions summed at the same time. Since the reconstructed signal is given by the superposition
of audio packets sampled by different sensors at random time instants, it can be affected by sequences of
empty samples. These holes influence the result of the speech recognition algorithm, so they have to be
removed by means of healing methods that generate segments of signal coherent with the information
content of the audio stream.
4.1 Overview
Each transmitter transmits a number of audio packets per time unit that is influenced by two factors:
• Audio sensor internal characteristics and constraints: each sensor needs time in order to sample
data, store them in buffers and send them to the main receiver.
• Network communication protocol characteristics and constraints: the number of packets sent to
the main receiver is affected by the number of collisions that may happen on the channel and also
by the protocols themselves (handshaking, request-response timings, timeslot allocations).
Audio streams generated by each transmitter contain a number of samples that is proportional to the
transmission rate fTX of the sensor. Defining Fill_ratio as the fraction of the audio samples received over
the system uptime, it is possible to write:
\[ Fill\_ratio = T_{frame} \cdot f_{TX} \tag{4.1} \]
where $f_{TX} = \frac{\text{number of packets}}{\text{total time}}$ and $T_{frame} = \frac{FRAME\_SIZE}{F_s}$.
Fill_ratio tends to 1 (complete audio stream) when Tframe and fTX tend to be the reciprocal of each
other, i.e., when the average time distance between adjacent transmissions equals the
duration of the audio packet. Using random access protocols of the ALOHA family (section 2.5.1), this
optimal condition is hard to reach, so the audio streams collected by the base station will be characterized
by 0 < Fill_ratio < 1. If Fill_ratio = 1 there are no holes in the reconstructed speech signal:
the higher the Fill_ratio, the lower the number of artifacts generated by holes in the reconstructed
signal.
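Plugging representative numbers into Eq. (4.1) (all values assumed, for illustration only):

```python
Fs = 8000                         # sampling frequency [Hz] (assumed)
FRAME_SIZE = 80                   # samples per frame (assumed)
T_frame = FRAME_SIZE / Fs         # 10 ms of audio per packet

f_tx = 50                         # packets received per second (assumed)
fill_ratio = T_frame * f_tx       # Eq. (4.1): 0.5 -> half the stream arrives
```

A transmitter would need fTX = 1/Tframe = 100 packets per second to reach Fill_ratio = 1; at 50 packets per second, half of the samples are missing and must be recovered by superposing the other sensors' streams.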
The flowchart in Fig. 4.1 illustrates the operations performed by the signal reconstruction block. It
receives as input the audio samples from the sensors. According to the previous stages (see 3.3.2), each
audio sample is represented by a pair (audioPackets(i,j), positions(i,j)): each element of the
matrix audioPackets represents the ith audio sample transmitted by the jth sensor, and the position in time of
the given sample is specified in positions(i,j). Using the position information, the audio samples are correctly
spaced on the sensor's timeline (see Fig. 4.2). The time instants at which no audio samples are present
are 0-filled.
The audio packets are first unbiased in order to remove the effects of input signal polarization.
Then the audio streams are normalized in order to have signals of comparable amplitudes. Although this is
an unnecessary step for applying cross-correlation, it is a needed transformation in order to obtain audio
streams with equal energy contributions.
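The two preprocessing operations can be sketched as follows (a numpy sketch; using the column-wise mean as the bias estimate is an assumption of this example, not necessarily the thesis's exact estimator):

```python
import numpy as np

def unbias_and_normalize(streams):
    """Column-wise DC removal and peak normalization.

    streams: 2-D array with one column per sensor (0-filled where no packet
    arrived). The bias is estimated as the column mean; amplitudes are then
    scaled so every non-empty stream peaks at 1."""
    x = streams - streams.mean(axis=0)       # remove the DC bias
    peak = np.abs(x).max(axis=0)
    peak[peak == 0] = 1.0                    # leave all-zero streams untouched
    return x / peak                          # comparable energy contributions

streams = np.array([[1.0, 2.0],
                    [3.0, 2.0],
                    [2.0, 2.0]])             # two toy sensor streams
clean = unbias_and_normalize(streams)
```

The second (constant) stream reduces to zero after bias removal, showing why differently polarized sensors must be unbiased before their contributions are summed.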
The alignment of the audio streams is obtained by using the cross-correlation function [19]. In order to
apply it efficiently, the audio streams are processed according to their informative contribution: they are
sorted by their normalized power in descending order, to allow the cross-correlation algorithm to work
in the best condition. Once a couple of streams has been aligned, they are summed in order to generate a
"superstream" whose information content is the sum of the single contributions.
The delays applied for stream alignment are proportional to the distance between the source and the
sensor. Due to the laws of sound propagation, the closer the sensor is to the source, the shorter the time
of arrival of the pressure wave at the microphone. The same holds for the normalized power of the signals: the
closer the sensor is to the source, the higher the power of the captured audio signal; alignment
delays and normalized powers therefore give information about the localization of the source.
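The lag estimation and superposition can be sketched with numpy's cross-correlation (the `align` helper is hypothetical, and its sign convention, positive lag meaning the second stream leads, is an assumption of this sketch):

```python
import numpy as np

def align(ref, sig):
    """Estimate the lag of `sig` relative to `ref` via cross-correlation and
    shift `sig` so the two streams line up (zero-padding the exposed end)."""
    corr = np.correlate(ref, sig, mode='full')
    lag = int(np.argmax(corr)) - (len(sig) - 1)
    if lag > 0:                      # sig leads ref: delay it by `lag` samples
        sig = np.concatenate([np.zeros(lag), sig[:len(sig) - lag]])
    elif lag < 0:                    # sig trails ref: advance it
        sig = np.concatenate([sig[-lag:], np.zeros(-lag)])
    return lag, sig

sig = np.zeros(20); sig[5] = 1.0     # pulse captured early (sensor near source)
ref = np.zeros(20); ref[8] = 1.0     # same pulse 3 samples later (farther sensor)
lag, aligned = align(ref, sig)
superstream = ref + aligned          # summed "superstream" of the two partials
```

The estimated lag (3 samples here) is exactly the quantity that, together with the normalized power, hints at the position of the source.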
At the end of the process a global stream is built and returned by the block, whose flowchart is
illustrated in Fig. 4.3. In addition to the output stream and to some statistics and metrics that will be
discussed in Chapter 6, the block also produces a vector of the same length as the output stream in which
each element totalCoverage(i) maps the coverage of the sample output(i), that is, the number of
audio packets that have been superposed in order to obtain that sample. totalCoverage contains values
totalCoverage(i) ∈ [0, N], where N is the number of sensors; totalCoverage(i) = 0 means that
[Figure 4.1 flowchart content: from the input parameters a coverage map is generated; the audio samples are unbiased, a(i,j) = a(i,j) − (1/N)·Σ_{k=1..N} a(k,j), and normalized, a(i,j) = a(i,j) / max_{1≤k≤N} a(k,j); the samples are then time-aligned (creation of partials), the correlation analysis of Fig. 4.3 is applied, the coverage map is updated and the reconstructed signal is produced.]
Figure 4.1: Flowchart of the signal reconstruction block
[Figure 4.2 content: the packets of Sensor 1 at times t1,1 ... t4,1 and of Sensor 2 at times t1,2 ... t4,2 placed on a common time axis.]
Figure 4.2: Alignment of audio packets considering their timestamp
[Figure 4.3 flowchart content: the partials are sorted by their in-band normalized power; for each couple of partials the cross-correlation is applied to obtain a delay ∆ (recorded among the delays between signals); if ∆ < 0 the first partial is delayed by ∆, if ∆ > 0 the second partial is delayed by ∆ (∆ = 0 requires no shift), updating the coverage map in both cases; the partials are then superposed according to the chosen policy, and the process repeats while other partials remain.]
Figure 4.3: Flowchart of the correlation analysis block
Figure 4.4: Representation of a speech signal that presents holes that need to be filled (original signal
vs. holey signal; amplitude over samples).
element output(i) does not belong to any audio packet of any transmitter and will therefore be equal to 0.
Sequences of zeros in totalCoverage are called holes. A typical holey signal is represented in
Figure 4.4. The size and number of holes affect the quality of the audio signal and hence the speech
recognition: the larger the number and size of the holes, the less accurate the speech recognition is likely to be.
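The coverage bookkeeping described above can be sketched as follows (an illustrative Python/NumPy sketch, not the thesis MATLAB code; all names are assumptions):

```python
import numpy as np

def coverage_map(streams):
    """Given an (N_sensors x L) array of aligned partials, where 0 marks a
    missing sample, count the contributors of each output sample."""
    return np.count_nonzero(streams, axis=0)

def find_holes(total_coverage):
    """Return (start, length) of every run of zeros in the coverage map."""
    holes, in_hole, start = [], False, 0
    for i, c in enumerate(total_coverage):
        if c == 0 and not in_hole:
            in_hole, start = True, i
        elif c != 0 and in_hole:
            holes.append((start, i - start))
            in_hole = False
    if in_hole:
        holes.append((start, len(total_coverage) - start))
    return holes

streams = np.array([[0.1, 0.0, 0.0, 0.2, 0.3],
                    [0.2, 0.0, 0.0, 0.0, 0.1]])
cov = coverage_map(streams)
print(cov.tolist())        # -> [2, 0, 0, 1, 2]
print(find_holes(cov))     # -> [(1, 2)]
```

Here totalCoverage(i) ∈ [0, N] exactly as in the text, and a hole is any maximal run of zero-coverage samples.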
4.2 Energy Compensation
The following steps are performed:
• Bias removal, in order to eliminate any incorrect polarization of the input stage.
• Normalization of the input signals, to remove the amplitude attenuation due to the different distances
between the speech source and the sensors.
• Policy-driven superposition of sensor contributions after performing correlation and alignment.
4.2.1 Bias Removal
Incorrect polarization of the input signal can affect the result of the reconstruction block, which is based
on the summation of contributions that vary randomly in time. Audio signals coming from different
sensors are affected by different polarizations. The DC contribution of each audio packet can be seen as
a rectangular function whose length corresponds to the length of the audio packet Tframe, and whose
amplitude Aj is given by the bias point of the jth sensor:

rect(t(i,j)) ≜ Aj (H(t − t(i,j)) − H(t − (t(i,j) + Tframe)))    (4.2)

where

H(t) ≜ { 0, if t < 0
         1, otherwise

is the Heaviside step function.
Figure 4.5: How the summation of different biases might influence the reconstructed signal (panels:
original signal, bias components, biased sum of contributions).
The summation of rectangular functions corresponds to superposing onto the audio signal a square
wave whose frequency and amplitude change randomly, thereby introducing harmonic distortion into the
speech signal:

y(t) = x(t) + Σ_{j=1}^{N} Σ_{i=1}^{Pj} rect(t(i,j))    (4.3)
where N is the number of sensors and Pj is the number of audio packets sent by the jth transmitter.
The influence of bias summation on the reconstructed signal is illustrated in Figure 4.5.
Moreover, bias removal is needed because the holes in every audio stream coming from the sensors are
filled with zeros. If the signal were biased, an arbitrary DC component different from the mean of the
signal would be introduced; this component might influence the behavior of the cross-correlation function.
4.2.2 Normalization
The second step of energy compensation is the normalization of the audio samples. The samples are scaled
by the maximum value recorded by each sensor. This is done after removing the signal polarization in
order not to introduce distortions:
a(i,j) = a(i,j) / max_{1≤k≤N} a(k,j)    ∀j ∈ (1, N)    (4.4)
Normalization, although not needed by the cross-correlation function itself, removes the dependence on
the distance between the speech source and the sensor. In this way, neglecting differences in the frequency
response of the microphones and small variations in spectral content due to room acoustics, the contributions
of the different sensors can be summed without compensation coefficients.
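The two energy-compensation steps (bias removal, then normalization) can be sketched per partial as follows; a NumPy sketch, with the one assumption that the peak is taken in absolute value so that the result stays in [−1, 1]:

```python
import numpy as np

def energy_compensate(partial):
    """Remove the DC bias A_j of one sensor stream, then normalize by the
    stream's own peak (cf. Eq. 4.4), so partials sum without coefficients."""
    unbiased = partial - partial.mean()
    peak = np.max(np.abs(unbiased))
    return unbiased / peak if peak > 0 else unbiased

x = np.array([1.5, 2.0, 2.5])        # stream biased around 2.0
print(energy_compensate(x).tolist()) # -> [-1.0, 0.0, 1.0]
```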
4.3 Streams Superposition
Once the audio streams obtained from sensor acquisition have been made uniform by means of unbiasing
and normalization, and delayed in order to make them coherent, they need to be superposed in order to
reconstruct the recorded speech signal. Two methods have been tested: Weighted Sum of Contributions
and Holes Replacement.
4.3.1 Weighted Sum of Contributions
Contributions are summed and scaled in order to prevent amplitude artifacts.
Let y(i,j) be the ith sample of the audio stream coming from the jth sensor and totalCoverage(i) the
number of sensors that contribute to the ith sample:

totalCoverage(i) = Σ_{j=1}^{N} (y(i,j) ≠ 0)    (4.5)

The samples coming from different sensors are summed and scaled by the weighting factor w(i):

w(i) = { totalCoverage(i), if totalCoverage(i) ≠ 0
         1, otherwise

ysum(i) = ( Σ_{j=1}^{N} y(i,j) ) / w(i)    (4.6)
The weighted sum is needed for energy preservation and for avoiding harmonic distortion due to the
summation of contributions. Figure 4.6 illustrates an example of distortion caused by the sum of multiple
contributions without weighting.
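Equations 4.5 and 4.6 can be sketched directly (a NumPy sketch with illustrative names):

```python
import numpy as np

def weighted_sum(streams):
    """Eq. 4.5-4.6: sum the aligned partials and divide each sample by its
    number of non-zero contributors (weight 1 where nobody contributes)."""
    total_coverage = np.count_nonzero(streams, axis=0)
    w = np.where(total_coverage != 0, total_coverage, 1)
    return streams.sum(axis=0) / w

streams = np.array([[0.2, 0.0, 0.4],
                    [0.4, 0.0, 0.0]])
print(np.round(weighted_sum(streams), 3).tolist())  # -> [0.3, 0.0, 0.4]
```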
4.3.2 Holes Replacement
Weighted Sum of Contributions, as illustrated in the previous paragraph, presents some drawbacks: it does
not take into account the big differences in signal spectrum and environment contribution between
sensors located in different places. Each BigEar Audio Sensor is subject to an environment
contribution that depends on:
• the distance between the sensors and the speech source;
• the position of the sensors in the environment;
Contributions can differ greatly among them in terms of signal spectrum and reverberation. In
general, the closer the sensors, the lower the overall effect of the environment-induced artifacts,
since the spectra of the signals will be similarly colored and the reverberation tails will be alike.
For this reason, an alternative superposition policy has been tested: instead of performing a weighted
sum of each contribution, only the holes in the first audio stream are filled with contributions coming from
the other sensors. This method reduces the number of summation artifacts, provided that the reference signal
Figure 4.6: Harmonic distortion due to the unweighted sum of contributions (panels: original signal,
number of contributors, superposition of contributions).
(the one whose holes will be replaced with samples coming from other sensors) has the highest
Fill_ratio; otherwise, there is the risk that replacement artifacts become prominent with respect to
summation artifacts.
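A minimal sketch of the Holes Replacement policy (NumPy; the caller is assumed to have already aligned the partials and ordered them by decreasing Fill_ratio):

```python
import numpy as np

def holes_replacement(reference, others):
    """Fill the zero-runs (holes) of the reference partial with samples
    taken from the other partials, instead of summing all contributions."""
    out = reference.copy()
    for other in others:
        mask = (out == 0) & (other != 0)   # still-empty samples only
        out[mask] = other[mask]
    return out

ref    = np.array([0.5, 0.0, 0.0, 0.2])
other1 = np.array([0.4, 0.3, 0.0, 0.1])
other2 = np.array([0.0, 0.9, 0.6, 0.0])
print(holes_replacement(ref, [other1, other2]).tolist())
# -> [0.5, 0.3, 0.6, 0.2]
```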
4.4 Cross-correlation
Cross-correlation is a measure of similarity of two series as a function of the lag of one relative to the
other. For discrete functions it is defined as:
Rfg[n] = (f ⋆ g)[n] ≜ Σ_{m=−∞}^{+∞} f*[m] g[m + n]    (4.7)

where f* denotes the complex conjugate of f.
Consider two real-valued functions f and g differing only by an unknown shift along the x-axis.
Cross-correlation is used to find how much g must be shifted along the x-axis to make it identical to f .
The equation essentially slides the g function along the x-axis, calculating the integral of their product
at each position. When the functions match, the value of (f ? g) is maximized. This is because when
peaks (positive areas) are aligned, they make a large contribution to the integral. Similarly, when troughs
(negative areas) align, they also make a positive contribution to the integral because the product of two
negative numbers is positive.
The cross-correlation can be applied to audio streams in order to find the optimum delay n* for which
two audio signals best overlap:

n* = arg max_n ((f ⋆ g)[n])    (4.8)
Since audio streams can be affected by holes, the audio streams are chosen according to their Fill_ratio,
i.e., the number of real samples over the length of the stream (in samples).
The MATLAB function [r, lags] = xcorr(x, y) takes as input two sequences x and y,
slides y along the x-axis, and calculates the sum of the products at each position. The function returns
two vectors: r, which contains the cross-correlation values, and lags, which contains the lags at which the
correlations are computed. The delay that maximizes the correlation can thus be found at lags(i), where i
denotes the position at which r is maximum.
A negative delay implies that the second stream has to be advanced, a positive delay indicates that the
second stream has to be delayed, and a zero delay is returned when the two streams are already aligned.
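The same lag search can be sketched with NumPy in place of MATLAB's xcorr (illustrative; np.correlate implements the same sliding-product definition, and the sign convention matches the one just described):

```python
import numpy as np

def best_lag(x, y):
    """Lag maximizing the cross-correlation of x and y (cf. Eq. 4.8):
    positive -> y must be delayed, negative -> y must be advanced."""
    r = np.correlate(x, y, mode="full")
    lags = np.arange(-(len(y) - 1), len(x))
    return lags[np.argmax(r)]

x = np.array([0.0, 0.0, 1.0, 0.5, 0.0])
y = np.array([1.0, 0.5, 0.0, 0.0, 0.0])   # same pulse, two samples early
print(best_lag(x, y))                     # -> 2
```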
4.4.1 Cross-correlation Drawbacks
A first drawback of the cross-correlation function is that it has to operate on superposable audio segments,
since it works on the signals' similarity. If the audio segments are not superposable, the resulting stream will
be incoherent, since cross-correlation merely suggests the best way to superpose them.
The second drawback is the inability of the cross-correlation function to discriminate
between the true signal and noise or holes. The cross-correlation function operates on signals that,
by their origin, are noisy and holey. If holes and noise are negligible, cross-correlation gives the expected
results. Conversely, e.g., if the holes are much bigger than the signal itself, cross-correlation will produce
wrong results, since it returns the delay value for which the holes are best superposed. Therefore
it is necessary to analyze the behavior of cross-correlation as a function of the size and number of holes in
the input signals. The test procedure has been performed in the following way:
1. Load an audio file x
2. Select a fixed delay ∆ (in samples)
• Test correlation by changing the size of the holes:
1. Set a fixed average time distance τ between holes
2. Determine the number of holes N = ⌊length(x)/(τ · Fs) − 1⌉
3. Set M = maximum size of the holes
4. Create two copies of the original stream; delay the second copy by ∆
5. Punch N holes into the two signals, randomly spaced over the whole length of the signals. Each
hole i has size di ∼ U([0, M])
6. Apply the cross-correlation function Rfg[n] = (f ⋆ g)[n] ≜ Σ_{m=−∞}^{+∞} f*[m] g[m + n]
7. Compute ∆m = arg max_n ((f ⋆ g)[n])
8. Plot the difference |∆ − ∆m|, where ∆m is the measured delay that maximizes the cross-
correlation between the two signals
9. Increase M
10. Repeat steps 4 to 9.
• Test correlation by changing the number of holes:
1. Set a fixed hole size d
2. Set the average time distance τ between holes
3. Determine the number of holes N = ⌊length(x)/(τ · Fs) − 1⌉
4. Create two copies of the original stream; delay the second copy by ∆
5. Punch N holes into the two signals, randomly spaced over the whole length of the signals; each
hole has fixed size d
6. Apply the cross-correlation function Rfg[n] = (f ⋆ g)[n] ≜ Σ_{m=−∞}^{+∞} f*[m] g[m + n]
7. Compute ∆m = arg max_n ((f ⋆ g)[n])
8. Plot the difference |∆ − ∆m|, where ∆m is the measured delay that maximizes the cross-
correlation between the two signals
9. Decrease τ
10. Repeat steps 3 to 9.
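Both test loops above can be sketched in a few lines (a NumPy sketch with illustrative parameters, not the thesis test script):

```python
import numpy as np

rng = np.random.default_rng(0)

def punch_holes(x, n_holes, max_size):
    """Zero out n_holes runs whose sizes are drawn from U([0, max_size])."""
    x = x.copy()
    for start in rng.integers(0, len(x), size=n_holes):
        x[start:start + rng.integers(0, max_size + 1)] = 0.0
    return x

def measured_delay(f, g):
    """Delay maximizing the cross-correlation of f and g (Eq. 4.8)."""
    r = np.correlate(f, g, mode="full")
    return np.arange(-(len(g) - 1), len(f))[np.argmax(r)]

x = rng.standard_normal(4000)            # broadband test signal
true_delay = 25                          # the fixed delay of step 2
b = punch_holes(np.roll(x, true_delay), n_holes=40, max_size=30)
a = punch_holes(x, n_holes=40, max_size=30)
print(abs(true_delay - measured_delay(b, a)))   # 0 while holes stay small
```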
Figure 4.7 plots the results of the convergence test. It can be noted that the cross-correlation function
works correctly for

di|τ=3 ms ≤ 6.5 ms, since di ∼ U([0, M]) and M|τ=3 ms ≤ 13 ms

and

τ|M=12 ms = (length of the stream [seconds]) / (number of holes) ≥ 4 ms

where di|τ=3 ms is the average size of the holes evaluated with a constant average time distance between
holes τ = 3 ms, and τ|M=12 ms is the average time distance between holes evaluated with holes whose
size is uniformly distributed within the range [0, 12 ms].
As will be seen in Chapter 6, the values described above correspond to borderline system conditions,
so cross-correlation convergence is reached.
4.4.2 Envelopes Cross-correlation
As illustrated in Section 4.4.1, if the sequences of zeros (holes) are much bigger than the signal itself, or if
the signal is subject to particular types of noise such as impulse trains, the cross-correlation function
produces wrong results. To overcome this problem, instead of applying the cross-correlation function
directly to the noisy or holey signals, it has been applied to the positive envelopes of the signals themselves.
A positive envelope is a particular representation of a signal that evidences the shape of the signal. It
is obtained with these steps:
1. Generate a copy of the original signal whose negative samples are truncated at 0;
2. Filter the copy with a low-pass filter having a very low cutoff frequency.
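The two steps can be sketched as follows (a NumPy sketch; the single-pole smoother and the 20 Hz cutoff are illustrative assumptions, since the thesis does not specify the filter):

```python
import numpy as np

def positive_envelope(x, fs, cutoff_hz=20.0):
    """Half-wave rectify the signal, then smooth it with a single-pole
    low-pass filter with a very low cutoff frequency."""
    rectified = np.maximum(x, 0.0)
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / fs)
    env = np.empty_like(rectified)
    acc = 0.0
    for i, s in enumerate(rectified):
        acc += alpha * (s - acc)   # one-pole IIR smoothing
        env[i] = acc
    return env

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)          # 200 Hz tone, 1 s
env = positive_envelope(x, fs)
print(env.min() >= 0.0 and env.max() <= 1.0)   # -> True
```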
Figure 4.7: Results of the cross-correlation test between noisy and holey signals. (a) |∆ − ∆m| varying
the size of the holes; (b) |∆ − ∆m| varying the number of holes; (c) max Rfg[n*] varying the size of the
holes; (d) max Rfg[n*] varying the number of holes.
Figure 4.8: Cross-correlation analysis and alignment on the signals' envelopes (reconstruction algorithm
via envelope analysis: envelopes from Sensor 4 (reference), Sensor 2, Sensor 1, Sensor 3, and the
reconstructed signal; amplitude over samples).
Then, cross-correlation has been applied to the positive envelopes of the signals in order to find the right
alignment between the envelopes of the audio streams. Figure 4.8 illustrates the result of the alignment
step on the envelopes1. It can be noted that the peaks and valleys of the signals are globally aligned. This
alignment technique offers higher robustness with highly noisy or highly depleted streams, although the
effort spent on a better alignment can be frustrated by the lower intelligibility of the speech signal.
4.5 Healing Signal Fragmentation
The holes in the reconstructed signal drastically decrease the accuracy of the speech recognition proce-
dure; in fact, speech recognition rarely succeeds on very depleted audio signals. Although the BigEar
system, working with the best network parameters, ensures high chances of successful speech recognition,
four different methods have been used to repair the audio signal before sending it to the speech
recognition block: the graphical B-Spline Interpolation, Noise Addition, Packet Loss Concealment and
Audio Inpainting.
4.5.1 B-Spline Interpolation
B-Spline interpolation represents a graphical approach to the problem: holes in the reconstructed signal
can be seen as gaps in the plot of the signal itself, so a basis spline function has been used to interpolate
the curve inside the holes using the real samples that lie around them. Although a human listener
can appreciate a significant improvement of the speech signal after applying B-Spline interpolation, this
method does not help in increasing the accuracy of speech recognition, which can rarely recognize the
1In the figure, the envelopes of the streams coming from different sensors have been shifted along the y-axis in order to make
the plot readable.
Figure 4.9: B-Spline method for healing the reconstructed signal (original signal, reconstructed holey
signal, B-Spline interpolated signal).
speech audio. Moreover, restoring wide holes by means of splines introduces low-frequency spectral
components and amplitude distortions. Figure 4.9 depicts an example of B-Spline-based speech signal
repair. It can be seen that in the case of big holes the result can be very different from the original signal.
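The idea can be sketched with SciPy's B-spline interpolation (illustrative, not the thesis code; it assumes SciPy is available and that holes are exact zero-runs):

```python
import numpy as np
from scipy.interpolate import make_interp_spline

def bspline_fill(signal, k=3):
    """Fill the zero samples (holes) by evaluating a degree-k B-spline
    fitted on the surviving non-zero samples."""
    idx = np.arange(len(signal))
    good = signal != 0.0
    spline = make_interp_spline(idx[good], signal[good], k=k)
    out = signal.copy()
    out[~good] = spline(idx[~good])
    return out

x = np.sin(np.linspace(0.1, 6.0, 50))
holey = x.copy()
holey[20:24] = 0.0                       # a 4-sample hole
healed = bspline_fill(holey)
print(np.max(np.abs(healed - x)[20:24]) < 0.05)   # small hole: close fit
```

As the section notes, the fit degrades quickly as the holes widen, which is exactly the low-frequency/amplitude distortion observed in Figure 4.9.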
4.5.2 Noise Addition
For all the speech recognition tests, the audio data is sent directly to Google Speech Recognition (GSR)
through the Google Speech API. The following observation led to the idea of adding noise to the speech
signal before sending it to GSR: when the reconstructed speech signal with holes was played loudly
through the speaker of a laptop computer and an instance of GSR on a mobile phone was used to recognize
the played speech signal, the result was significantly better than when the reconstructed speech signal was
sent directly to GSR.
The first experiment was to add office noise to the reconstructed speech signal before sending it to GSR.
As higher-frequency components seemed to contribute to the better recognition accuracy, the same
experiments were conducted with white and pink noise. Table 4.1 summarizes the
GSR accuracy results for a word pronunciation (accendi = keyword) in the case of the reconstructed speech
signal with holes, B-Spline interpolation, and noise addition (office noise, white noise, pink noise). As
can be seen, the addition of noise significantly improves the recognition accuracy. In the case of the optimal
parameter, a maximum transmission delay of 15 ms, all 50 repetitions result in successful recognition
(most of them correctly recognized and some incorrectly recognized).
4.5.3 Packet Loss Concealment
Packet Loss Concealment (PLC) algorithms, also known as frame erasure concealment algorithms, hide
transmission losses in an audio system where the input signal is encoded and packetized at a transmitter,
sent over a network, and received at a receiver that decodes the packet and plays out the output. The
algorithm is described in the ITU-T G.711 Appendix I Recommendation [14].
Tmax (ms) | reconstructed | B-Spline | Office Noise | White Noise | Pink Noise
5         | 0H/0M*        | 0H/0M    | 7H/21M       | 8H/20M      | 7H/21M
8         | 0H/1M         | 1H/1M    | 32H/13M      | 32H/13M     | 32H/13M
15        | 0H/1M         | 2H/0M    | 44H/6M       | 43H/7M      | 44H/6M
25        | 0H/0M         | 0H/0M    | 41H/8M       | 41H/8M      | 41H/8M
* H (hit) / M (miss): recognized the same word / recognized another word

Table 4.1: Noise Addition for the word accendi with different max transmission delays, for 50 repetitions
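The noise-mixing step can be sketched as follows (a NumPy sketch; the SNR parameterization and the crude 1/f pink-noise shaping are illustrative assumptions, since the thesis does not give the exact mixing procedure):

```python
import numpy as np

rng = np.random.default_rng(1)

def add_noise(signal, snr_db=10.0, color="white"):
    """Mix white or roughly pink noise into the signal at the given SNR."""
    noise = rng.standard_normal(len(signal))
    if color == "pink":
        spectrum = np.fft.rfft(noise)
        f = np.arange(1, len(spectrum) + 1)
        noise = np.fft.irfft(spectrum / np.sqrt(f), n=len(signal))
    gain = np.sqrt(np.mean(signal ** 2) /
                   (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return signal + gain * noise

x = np.sin(np.linspace(0.0, 100.0, 8000))
y = add_noise(x, snr_db=10.0)
snr = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
print(round(snr, 1))   # -> 10.0
```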
Figure 4.10: G.711 Appendix I frame erasure concealment algorithm
The objective of PLC is to generate a synthetic speech signal to cover missing data (erasures) in a
received bit stream. Ideally, the synthesized signal will have the same timbre and spectral characteristics
as the missing signal, and will not create unnatural artifacts. Since speech signals are often locally
stationary, it is possible to use the signals’ past history to generate a reasonable approximation to the
missing segment.
Figure 4.10 illustrates the PLC approach. The algorithm uses sequences of good frames to perform pitch
detection; when a bad frame is received, the erasure is filled by prolonging the last good frames, using
overlap-and-add and scaling methods in order to prevent unnatural artifacts (beeps).
The drawbacks of this method derive from the nature of the holes: PLC needs long sequences of good frames
in order to perform pitch detection, while, given the nature of the audio packets and the way in which they
are summed, this condition is almost impossible to reach. As a result, the output of the PLC block is very
unnatural, and in most of the tests the signal-to-noise ratio decreased quickly.
4.5.4 LSAR - Least Squares Auto-Regressive Interpolation
An interpolation procedure which has proven highly successful is the autoregressive (AR) model-based
method. This was devised first for the concealment of uncorrectable errors in CD systems [29], but was
independently derived and applied to the audio restoration problem [26–28].
The method considers a set of samples drawn from an auto-regressive (AR) process with parameters
a. The excitation vector e can be written in terms of the data vector:

e = Ax
  = A(U x(i) + K x−(i))
  = A(i) x(i) + A−(i) x−(i)

where x has been re-expressed in terms of its partitions: x−(i) is the column-wise partition of known
(uncorrupted) samples of the audio signal, taken from both sides around the corrupt segment, and x(i)
is the column-wise partition of unknown (corrupted) samples.
Thus, the estimated unknown audio partition can be obtained by minimizing the cost function E with
respect to the unknown vector x(i):

E = Σ_{n=P+1}^{N} e_n² = eᵀe

The least squares (LS) interpolator is now obtained as the interpolated data vector x(i) that minimizes
the sum of squared prediction errors E, since E can be regarded as a measure of the goodness of ‘fit’
of the data to the AR model. In other words, the solution is found as the unknown data vector x(i)
that minimizes E:

xLS(i) = arg min_{x(i)} E
which can be solved for x(i), obtaining [10]:

xLS(i) = −(A(i)ᵀ A(i))⁻¹ A(i)ᵀ A−(i) x−(i)
This solution for the missing data involves the solution of a set of l linear equations, where l is the number
of missing samples in the block. Although the approximation is accurate even in bad audio conditions,
the processing speed does not allow real-time usage.
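The derivation above can be sketched numerically as follows (a NumPy sketch under simplifying assumptions: the AR(p) model is fitted by least squares on the clean run before the hole, which is only one possible choice, not necessarily the thesis procedure):

```python
import numpy as np

def ar_coeffs(x, p):
    """Fit AR(p) coefficients by least squares on a clean segment."""
    rows = np.array([x[n - p:n][::-1] for n in range(p, len(x))])
    a, *_ = np.linalg.lstsq(rows, x[p:], rcond=None)
    return a

def lsar_fill(x, missing, p=8):
    """LSAR sketch: minimize the AR(p) prediction-error energy E = e^T e
    over the missing samples, with e = A x."""
    x = x.copy()
    N = len(x)
    known = np.setdiff1d(np.arange(N), missing)
    a = ar_coeffs(x[:missing.min()], p)    # model from the clean run
    A = np.zeros((N - p, N))               # prediction-error operator
    for n in range(p, N):
        A[n - p, n] = 1.0
        A[n - p, n - p:n] = -a[::-1]       # -a_k at column n - k
    Ai, Ak = A[:, missing], A[:, known]
    # x_LS(i) = -(Ai^T Ai)^(-1) Ai^T A-(i) x-(i): the LSAR solution
    x[missing] = -np.linalg.solve(Ai.T @ Ai, Ai.T @ (Ak @ x[known]))
    return x

x = np.sin(0.3 * np.arange(200))           # signal with an exact AR model
missing = np.arange(90, 96)                # a 6-sample hole
holey = x.copy()
holey[missing] = 0.0
healed = lsar_fill(holey, missing)
print(np.max(np.abs(healed - x)) < 1e-3)   # -> True
```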
4.5.5 Audio inpainting
Inpainting is the process of reconstructing lost or deteriorated parts of images, videos or sounds [23].
Inpainting is rooted in the restoration of images. Traditionally, inpainting has been done by professional
restorers. The underlying methodology of their work is as follows:
• The global picture determines how to fill in the gap. The purpose of inpainting is to restore the
unity of the work.
• The structure of the gap surroundings is supposed to be continued into the gap. Contour lines that
arrive at the gap boundary are prolonged into the gap.
• The different regions inside a gap, as defined by the contour lines, are filled with colors matching
those of its boundary.
• The small details are painted, i.e. “texture” is added.
This method can be migrated to the sound domain using different kinds of analysis, such as the evolution
of the spectrum over time or the analysis of formant edges.
UNLocBoX toolbox
The UNLocBoX [18] is a toolbox designed to solve convex optimization problems of the form

min_{x∈R^N} f1(x) + f2(x)    (4.9)

where the fi are lower semi-continuous convex functions from R^N to (−∞, +∞]. When both f1 and
f2 are smooth functions, gradient descent methods can be used to solve 4.9; however, gradient descent
methods cannot be used when f1 and/or f2 are not smooth. In order to solve such problems
more generally, the toolbox implements several algorithms falling into the class of proximal
splitting algorithms.
The term proximal refers to their use of proximity operators, which are generalizations of convex
projection operators. The proximity operator of a lower semi-continuous convex function f : R^N → R
is defined by

prox_f(x) ≜ arg min_{y∈R^N} ( (1/2)‖x − y‖₂² + f(y) ).    (4.10)

Note that the minimization problem in 4.10 has a unique solution for every x ∈ R^N, so
prox_f : R^N → R^N is well-defined. The proximity operator is a useful tool because x* is a minimizer
of 4.9 if and only if, for any γ > 0,

x* = prox_{γ(f1+f2)}(x*).    (4.11)
The term splitting refers to the fact that the proximal splitting algorithms do not directly evaluate
the proximity operator proxγ(f1+f2)(x), but rather try to find a solution to 4.11 through sequences of
computations involving the proximity operators proxγf1(x) and proxγf2(x) separately.
In order to test the toolbox, the demo MATLAB script proposed by the authors has been modified to
load BigEar data; the script then tries to solve the sound inpainting problem expressed as in 4.12:

arg min_x ‖A G* x − b‖₂ + τ‖x‖₁    (4.12)

where b is the signal in the non-holey part, A is an operator representing the mask that selects the
non-holey part of the signal, and G* is the Gabor synthesis operator. The general assumption here is that
the signal is sparse in the Gabor domain.
This is the healing method that gives the best results; indeed, the recognition accuracy of the speech
signals repaired with inpainting is the same as that of the original signals (see Fig. 4.11a). However, the
main drawback of the inpainting method is its time complexity, which makes it inappropriate for
“real-time” speech recognition for domotic control.
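The flavor of a proximal splitting solver for a problem like 4.12 can be sketched with ISTA (proximal gradient descent), replacing the Gabor synthesis operator G* with an orthonormal inverse FFT; this is an illustrative stand-in, not the UNLocBoX implementation:

```python
import numpy as np

def soft_threshold(z, t):
    """prox of t*||.||_1: shrink complex magnitudes toward zero."""
    mag = np.abs(z)
    return np.where(mag > t, (1.0 - t / np.maximum(mag, 1e-12)) * z, 0.0)

def inpaint_ista(b, mask, tau=0.1, iters=300):
    """ISTA sketch of the inpainting problem: the signal is assumed
    sparse in the plain frequency domain."""
    c = np.fft.fft(b, norm="ortho")        # initial coefficients
    step = 0.5                             # 1 / Lipschitz(gradient)
    for _ in range(iters):
        resid = mask * np.fft.ifft(c, norm="ortho").real - b
        grad = 2.0 * np.fft.fft(mask * resid, norm="ortho")
        c = soft_threshold(c - step * grad, step * tau)
    return np.fft.ifft(c, norm="ortho").real

n = np.arange(256)
x = np.sin(2.0 * np.pi * 8.0 * n / 256.0)   # sparse in frequency
mask = np.ones(256)
mask[100:120] = 0.0                          # a 20-sample hole
healed = inpaint_ista(mask * x, mask)
print(np.max(np.abs(healed - x)[100:120]) < 0.1)   # hole filled closely
```

Each iteration is one gradient step on the data term followed by the ℓ1 proximity operator, which is exactly the splitting described above: prox of the sum is never evaluated directly.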
4.5.6 Healing methods comparison
A comparison of the healing methods has been performed. Excluding Packet Loss Concealment (Section
4.5.3) because of its strong dependence on the nature of the holes, and Noise Addition since it cannot be
Figure 4.11: Gabor spectrograms of the soggiorno and tapparelle words from the original audio stream,
the holey reconstructed file coming from the BigEar simulator (see Chapter 3), and the stream restored
by means of Audio Inpainting. (a) Original stream; (b) Reconstructed stream; (c) Inpainted stream.
considered a healing technique in the strict sense, the other methods have been tested in terms of processing
speed. The depleted input file was the recording of the word “Soggiorno” (duration: 1.70 s) ob-
tained by the BigEar Simulator described in Chapter 3. The results of the tests, performed on an Intel Core
i7 machine, are illustrated in Table 4.2. Note the wide difference in processing speed, given by the
Realtime Performance Ratio (RPR) index (see Section 6.2.2).
RPR < 1 states that the healing method cannot be used directly in real-time processing. Results of
speech recognition are reported for completeness: the Speech Recognition API, fed with the depleted record-
ing, returned the word “Giorno” with a 0.69 confidence level2. Comparing this value with the confidence
values reported in Table 4.2, it can be noted that Audio Inpainting and LSAR are the techniques that return
the best results. In any case, the recognized word itself should not be considered a good indicator, since
speech recognition depends strongly on the position of the holes within the word. The more the holes
overlap consonant positions, the less accurate the result will be.
4.6 Summary
In this chapter the Reconstruction Algorithm has been described. The algorithm receives as input the
audio samples obtained from the BigEar Simulator (Chapter 3) or from the acquisition stage (as will
be described in Chapter 5); each audio sample is labeled with timing information that is exploited for
arranging the samples into partial signals. One partial per sensor is generated.
The partials are processed to perform energy compensation: to avoid the effects of wrong input po-
larization, each partial is unbiased and then normalized to compensate for intensity differences due to
source-sensor distance.
Time-of-arrival compensation is performed by means of cross-correlation analysis, which finds the delay
values for which the partial signals best superpose. Two methods for performing correlation
2The confidence level is the probability that the given word has been correctly decoded. It is a value from 0 to 1 returned by
the Google Speech API when the API is not 100% confident in its transcription.
Method                           | Time of processing (s) | RPR  | ASR result (confidence)
Audio inpainting (30 iterations) | 7.17                   | 0.24 | Giorno (0.83)
Audio inpainting (15 iterations) | 3.99                   | 0.42 | Giorno (0.88)
Audio inpainting (5 iterations)  | 1.80                   | 0.94 | Giorno (0.73)
LSAR (5+5 samples)               | 15.57                  | 0.11 | Giorno (0.84)
LSAR (4+4 samples)               | 14.68                  | 0.12 | Giorno (0.86)
LSAR (3+3 samples)               | 13.93                  | 0.12 | Giorno (0.79)
B-spline (30+30 samples)         | 0.22                   | 7.72 | Giorno (0.72)
B-spline (10+10 samples)         | 0.21                   | 8.1  | Giorno (0.72)
B-spline (5+5 samples)           | 0.21                   | 8.1  | Giorno (0.72)

Table 4.2: Healing methods comparison
analysis have been tested: Direct Cross-correlation and Envelope Cross-correlation; the latter performs the
cross-correlation analysis on the envelope of the signals' waveforms in order to avoid bad alignments due
to the presence of extremely regular noises such as impulse trains or electromagnetic radio interference.
Once the partials have been aligned, they are superposed in order to sum the contributions coming from
each sensor. Two superposition methods have been tested: Weighted Sum of Contributions, which takes
into account the number of contributors for each sample in order to prevent amplitude artifacts, and Holes
Replacement, which, given a reference stream, replaces sequences of empty samples with samples extracted
from the other partials.
The result of partials processing, alignment and superposition is a reconstructed speech signal that,
depending on the Network Interaction model, presents residual sequences of empty samples. Although
the BigEar system, working with the best network parameters, ensures high chances of successful speech
recognition, different methods for Signal Healing have been tested in order to explore signal improvement
when working with highly depleted (holey) speech signals.
The power of the partial streams and the delays applied to them are returned as output; these data will
be discussed in Chapter 6 in order to examine how they can be exploited for coarse-grain localization.
The Reconstruction Algorithm is used both in the BigEar Simulator, described in Chapter 3, and
at the end of the acquisition chain that will be described in the following chapter.
5. BigEar Implementation
Introduction
In this chapter the implementation of the system is described. In the first section the hardware implemen-
tation is illustrated, whose prominent part is the represented by audio interface design. This part focuses
on physical aspects of the signal that it is needed to be acquired and on interface constraints between the
signal conditioning stage and the MCU. In the second part of the chapter, the application logic will be
described starting from leaves of the system, represented by the audio sensors, and arriving to the Base
Station that deals with data collection and audio reconstruction.
5.1 Overview
As already mentioned in Chapter 2, the system is composed of a set of audio sensors that perform a space-
time sampling of a room. Before sampling, the audio signal converted by each microphone capsule has to
be amplified and biased in order to match the ADC characteristics. Each audio sensor samples the signal
and packs the data into frames in order to send them to the receiver. The multi-buffer internal structure of
each transmitter allows an efficient application logic in which the sampling stage is managed by means of
a timer-driven interrupt routine, while the network logic is handled by the main loop of the application.
The network structure can be layered onto several radio channels in order to reduce the number of collisions;
a separate BigEar Receiver is needed for each radio channel.
Once the packets arrive at the BigEar Receiver, they are converted into hexadecimal nibbles and sent
serially to the Base Station through the USB port. The Base Station, in its experimental form, is
composed of an acquisition application that listens to each BigEar Receiver connected to the USB ports
of the machine and stores the data into a local raw data file. When the acquisition stage is stopped by the
user, the raw data are converted into CSV format in order to feed a MATLAB script that reconstructs the
audio by means of the reconstruction method described in Chapter 4.
Figure 5.1: BigEar Audio Capture Board biasing circuit (VDD, GND, R_BIAS, V_BIAS, C_OUT, V_ADC)
5.2 Hardware Setup
The hardware design of the BigEar architecture can be divided into two parts: the design of the transmitters
(BigEar Audio Capture boards) and the design of the BigEar Receiver.
The most relevant part of the hardware setup is the Audio Capture board, since an analog stage is
needed for signal conversion and conditioning. This stage has been implemented following
the criterion of circuit simplicity in order to rapidly discover strengths and weaknesses and to minimize
prototyping costs. The analog input stage is then interfaced with the Wixel board. The Wixel board is a
complete system that only needs to be connected to two pushbuttons to allow the RESET procedure and to
put the Wixel into BOOTLOADER mode. Besides a power supply source, no other hardware is needed.
5.2.1 BigEar Audio Capture Board
As seen in Section 2.3, the considered typical speech signal has an intensity level of 60 dB, measured
one meter from the point source. According to the conditions exposed in Section 2.3.1, and in particu-
lar referring to semi-reverberant fields and to the Hopkins-Stryker equation (Eq. 2.4), a pressure intensity
pspeech ∈ (10.02 mPa; 20 mPa) can be measured 2 meters from the source1.
The chosen microphones have a sensitivity of −38 dB (Appendix G), that is, 12.59 mV/Pa (Eq.
2.7). This sensitivity implies that, under the specified conditions, the microphones produce an output signal
whose amplitude is Vmic ∈ (0.126 mV; 0.252 mV).
Interfacing with Wixel ADC
The datasheet of the CC2511 SoC reports: "The positive reference voltage for analog-to-digital conversions
is selectable as either an internally generated 1.25 V voltage, VDD on the AVDD pin, an external voltage
applied to the AIN7 input pin, or a differential voltage applied to the AIN6-AIN7 inputs (AIN6 must
have the highest input voltage)" [12, p. 138]. Using the Wixel Programmable Module, some limitations
are imposed by the board design: the Wixel SDK documentation reports that the reference voltage can
be selected between two values only, the internally generated 1.25 V reference and the VDD = 3.3 V
value [6, adc.h documentation]. It is therefore necessary to bias the signal in order to make it symmetric
with respect to the center of the ADC range, Vref/2. This can be done by feeding the ADC input through
a voltage divider, connected to the amplifier by means of a large capacitor. Figure 5.1 illustrates the
biasing circuit. The bias point is determined by:

Vbias = Vref/2 = (α·RV) / (α·RV + (1 − α)·RV) · VDD  ⇒  Vbias = α · VDD,  α ∈ [0, 1]   (5.1)

where α is the position of the moving cursor of the potentiometer, expressed in the normalized interval
[0, 1], and VDD is the regulated power supply (3.3 V) of the Wixel board, available on the 3V3
pin of the board itself.
1 The lower limit is given only by the contribution of the direct wave, as if the system were operating in anechoic conditions; the upper limit is calculated as if the speaker were talking 1 m from the sensor, which is a quite unlikely condition.
Choosing Vref = 1.25 V, the position α of the potentiometer can be calculated as follows:

Vbias = Vref/2 = 0.625 V

α = Vbias / VDD = 0.625 V / 3.3 V = 0.189   (5.2)
As mentioned in section 5.3.1, a calibration procedure will help the operator in finding, by means of a
screwdriver, the right position of the moving cursor of the potentiometer.
In order to correctly dimension the C1 capacitor in Fig. 5.1, we have to consider that, performing the AC
analysis, the network can be seen as a passive high-pass RC filter having

C = C1  and  R = (1 − α)RV ∥ αRV = ((1 − α)RV · αRV) / ((1 − α)RV + αRV)

The cutoff frequency of the filter is given by

fHP = 1 / (2πRC)   (5.3)

so R and C must be chosen in such a way that the cutoff frequency is out of the voice band; in particular,
fHP has to be less than the lower limit of the band.
Considering

fHP < 300 Hz2  and  RV = 200 kΩ

from Eq. 5.3:

C > 1 / (2πR · fHP)  ⇒  C > 17.28 nF
Signal Conditioning
In order to exploit the whole [0, Vref] voltage range of the ADC, a very high gain of the input stage is
required. Supposing that Vref = 1.25 V and that the desired ADC input voltage is Vadc = Vref − 6 dB =
1/2 · Vref:

A = Vadc / Vmic = (Vref/2) / Vmic = (0.5 · 1.25 V) / (0.126 mV; 0.252 mV) = (2480; 4960)  ⇒  AdB ≈ (68 dB; 74 dB)   (5.4)

2 The signal is considered in the [300, 3000] Hz band. See Section 2.3.1.
Figure 5.2: BigEar Audio Capture Board Signal Conditioning stage
Figure 5.3: LT1013 Operational Amplifier - Voltage Gain vs. Frequency [25, p. 9]
With such a high gain, it is important to filter the microphone signal in order to suppress out-of-
band noise and to prevent operational amplifier saturation.
Frequency compensation techniques adopted in operational amplifier design - especially the most
commonly used dominant-pole compensation method - do not allow obtaining high gains on signals
having a relatively wide spectrum, such as audio signals in general or speech signals in particular. Figure 5.3
illustrates the maximum reachable gain with respect to the frequency of the input signal for the LT1013
Operational Amplifier; it can be seen that for a signal having maximum frequency fH = 3 kHz the
maximum reachable gain is around 40 dB. For this reason, the simplest circuital schema that can both
amplify and filter the signal is a double-stage amplifier, composed of a first Active Band-Pass Filter - which
filters the input signal and applies part of the total desired gain - and of a second Inverting Amplifier
- which applies the residual gain. The circuit is depicted in Figure 5.2. This configuration exhibits some
interesting design features:
• Configurable input gain at the first stage (R1)
• Independent control of the cutoff frequencies and of the first stage gain
56
5.2. Hardware Setup
• Variable overall gain (replacing R5 with a variable resistor)
The parameters that drive the signal conditioning stage are:

Acb = (1 + R2/R3) · R5/R4   Overall gain, omitting the phase inversion due to the second stage.   (5.5)

fL = 1 / (2π R1 C1)   Cutoff frequency of the high-pass filter given by C1 and R1.   (5.6)

fH = 1 / (2π R2 C2)   Cutoff frequency of the low-pass feedback network C2 ∥ R2.   (5.7)
According to technical specifications of the Microphone and of the Operational Amplifier chosen for the
realization of the prototype - described in Appendices G.1 and G.2 - the following design choices have
been implemented:
• Since the microphone has an output impedance Zout = 2.2 kΩ, R1 must satisfy R1 ≫ Zout, so the chosen
value for R1 is 120 kΩ.
• The first stage amplifies the microphone signal with a partial gain of about 33 dB, so the chosen
values for R2 and R3 are respectively 100 kΩ and 2.2 kΩ.
• The residual gain varies from 0 up to 40 dB and can be set by the operator by means of a variable
feedback resistor, so R4 = 1 kΩ and R5 is replaced by the series of a 1 kΩ resistor with a 100 kΩ
potentiometer.
• The first stage performs filtering at fL ≈ 300 Hz and fH ≈ 3 kHz; in order to reach these
cutoff frequencies the selected values are C1 = 4.7 nF and C2 = 470 pF.
• Since the whole circuit will be powered from a battery source, the input pins of the Operational
Amplifier that in Figure 5.2 are referenced to ground need to be referenced to Vcc/2 in order to
avoid the use of a dual supply.
Table 5.1 summarizes the design choices and their effects on the circuit. The circuit has been tested
using PSpice - the well-known analog electronic circuit simulator - before its realization, in order to
verify the frequency response of the system. Results are depicted in Figure 5.4; it can be seen that the
simulation results are close to the desired behavior both in frequency response and in in-band gain. The
prototypes were then realized with the addition of power decoupling capacitors and the power supply
circuit for the microphone. The complete schematic and a picture of the realized prototype can be found
in Appendix E.1.
5.2.2 BigEar Receiver
The Hardware design of the BigEar Receiver is limited to a simple board that hosts the Wixel module
and connects it with two pushbuttons for allowing RESET procedure and for putting Wixel into BOOT-
LOADER mode. Since the Receiver is connected to a PC by means of a USB port, BigEar Receiver is
directly powered by means of the USB connection.
Design choice                                            Effect
ADC reference voltage                                    Vref = 1.25 V (CC2511 internal reference)
Input impedance: R1 = 120 kΩ                             Zin = 120 kΩ
First stage gain: R2 = 100 kΩ, R3 = 2.2 kΩ               A1 ≈ 33 dB
High-pass filter: C1 = 4.7 nF                            fL = 282 Hz
Low-pass feedback filter: C2 = 470 pF                    fH = 3.39 kHz
Second stage gain: R4 = 1 kΩ, R5 = 1 kΩ up to 101 kΩ     A2 ≈ 0 up to 40 dB

Table 5.1: Design Choices for Signal Conditioning stage
Figure 5.4: Frequency response of the simulated signal conditioning circuit (Vmic = 126 µV; R5 = 1 kΩ, 5 kΩ, 100 kΩ)
In addition to these elements, the experimental board hosts two additional buttons connected re-
spectively to the P0_0 and P0_1 inputs of the Wixel module. Although their presence is not needed for the
operation of the receiver, they can be used for debugging purposes or for experimenting with new features.
Since the CC2511 Microcontroller Unit - the core of the Wixel programmable module - offers both pull-up
and pull-down internal resistors that can be configured via software, a jumper allows the user to config-
ure the pushbuttons as active HIGH or active LOW. Also for experimental reasons, two pins for
connecting the module to a power source have been provided.
The complete schematic and a picture of the realized prototype can be found in Appendix E.2.
5.3 Software Implementation
As previously discussed in Chapter 2, the software implementation can be split into three different blocks:
• the BigEar Audio Sensor application, which runs on the BigEar Audio Capture board and is designed to
sample audio data, arrange them into packets and send them to the BigEar Receiver;
• the BigEar Receiver application, which runs on the namesake board and performs a conversion between
radio packets and HEX nibbles to be sent via USB port to the machine;
• the BigEar Base Station, which captures serial USB data and performs the reconstruction of the captured
audio stream.
In its experimental form, the Base Station application is mainly devoted to stream analysis rather than to
real-time reconstruction of the speech signal, so it is divided into two blocks:
• BigEar SerialPort, a C# application that captures serial USB data and arranges them into a CSV
data file;
• BigEar Reconstructor, a MATLAB script that attempts to recreate the speech signal starting from the
captured data, previously converted into CSV format by the BigEar SerialPort application.
5.3.1 BigEar Audio Sensor
The application that runs on the BigEar Audio Capture boards can be seen as the superposition of two
routines that perform different tasks and are handled in two different ways. Once the system has been
initialized, a timer-driven interrupt routine samples the analog signal read on the P0_0 input of the Wixel
module and stores it into an N-Buffer structure; at the same time, the main loop handles the transmission of
the packet data to the receiver. The data transmission is not continuous but depends on the chosen
network policy. Figure 5.5 illustrates the flowcharts of the main application. The C code is reported in
Appendix E.3.
Figure 5.5: Flowcharts of the BigEar Audio Capture application: (a) Main loop; (b) Time-driven interrupt routine
N-Buffer Data Storage
In order to efficiently store data before sending them to the receiver, audio samples are written and
arranged in frames. In paragraph Samples Processing and Transmission it will be explained that each
frame contains 20 consecutive samples, and the frames are in turn arranged into an array of 3 elements.
In this way it is possible to buffer data without conflicts caused by reading and writing the same memory
location simultaneously. The reading index irs_index and the writing index main_index are
written respectively by the Sampling routine and by the Main loop, so - by checking that irs_index
!= main_index - the main loop will never access a frame that is empty or incomplete.
Furthermore, each frame stores not only the sample values but also timing information, in
order to allow a good reconstruction of the sample positions in time.
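The buffering scheme above can be sketched as follows. The constants and the two indices appear in the thesis text; the frame field names are illustrative, not the actual firmware code:

```c
#define FRAMES 3
#define FRAME_SIZE 20

/* One frame: timing info of its first sample plus FRAME_SIZE samples.
   Field names are illustrative. */
typedef struct {
    unsigned long timestamp_ms;            /* board time of the first sample */
    unsigned char t4cnt;                   /* fractional ms, units of 1/187 ms */
    unsigned char samples[FRAME_SIZE];     /* 8-bit mu-Law samples */
} frame_t;

static frame_t buffers[FRAMES];
static volatile unsigned char irs_index;   /* advanced by the sampling routine */
static volatile unsigned char main_index;  /* advanced by the main loop */

/* The main loop may consume a frame only while the two indices differ,
   so it never touches the frame currently being filled. */
int frame_ready(void) { return irs_index != main_index; }
```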
Time-driven Sampling Routine
The sampling routine needs to be called periodically with frequency fs. In choosing fs, the Nyquist-
Shannon sampling theorem states that, in order to prevent aliasing, the sampling frequency must be at
least twice the upper limit of the band of the signal. Since, as illustrated in Section 4.1, one of the goals
of the Signal Reconstruction stage is to prevent holes in the reconstructed speech signal, minimizing fs
maximizes the duration covered by each sample and therefore the likelihood that audioPackets overlap.
The design choice is thus fs ≈ 6 kHz.
Flowchart 5.5b illustrates the work of the Sampling routine, which samples the P0_0 input and stores
the data into the buffers. The function is handled as an Interrupt Service Routine raised by the internal
T3 timer of the microcontroller. T3 is an 8-bit timer which supports typical timer/counter functions such
as output compare and PWM functions [12, p. 126]. It consists of an 8-bit counter that increments
(or decrements) at each active clock edge. The frequency of the active clock edges is given by the internal
registers CLKCON.TICKSPD and T3CTL.DIV. With the default value of CLKCON.TICKSPD (initial-
ized by the Wixel SDK libraries) and setting the prescaler T3CTL.DIV to its maximum value (128), the
frequency of the active edges is ftick = 187.5 kHz.
In order to obtain a sampling frequency fs ≈ 6 kHz, the timer is set to work in Modulo
Mode [12, p. 127] and the register T3CC0 is set to a value N such that:

fs = ftick / N

and then:

N = ⌊ftick / fs⌋ = ⌊187.5 kHz / 6 kHz⌋ = 31

Some empirical tests have been performed with the help of a logic analyzer in order to verify the
value and the stability of the sampling frequency. During these tests it was found that the best
value for T3CC0 is N = 30, which results in a sampling frequency of fs = 6.04 ± 0.005 kHz.
The routine basically reads from the ADCH and ADCL registers the result of the AD conversion and
stores it using the already described N-Buffer structure. For each sample the exact timing information has
to be known, in order to perform a precise in-time reconstruction of the speech signal. The Wixel
SDK exposes the uint32 getMs() function, which returns a 32-bit unsigned integer representing the
number of milliseconds elapsed since the system was initialized; since the time distance between
two samples taken at fs = 6.04 kHz is 0.166 ms, a granularity of 1 millisecond is not sufficient.
In order to obtain a higher precision, in addition to the timestamp expressed in milliseconds, the value
of the T4CNT register is stored. This register contains the value of the T4 counter used by the system for
incrementing the milliseconds counter. The T4 counter is programmed to operate in Modulo Mode and
to count up to 187, in such a way that every millisecond the timer raises an interrupt, whose Service
Routine simply increments the milliseconds counter by 1. In this way, storing the T4CNT value means
storing a fractional part of a millisecond expressed in units of 1/187 ms.
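Reconstructing the precise sample instant from the two stored values then amounts to (function name illustrative):

```c
/* Reconstruct a sample instant in milliseconds from the stored getMs()
   timestamp and the T4CNT snapshot (T4 counts 0..186 per millisecond). */
double sample_time_ms(unsigned long ms, unsigned char t4cnt) {
    return (double)ms + (double)t4cnt / 187.0;
}
```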
Another task of the Interrupt Service Routine is to handle the yellow and red LEDs, used as
level indicators as illustrated in Figure 5.6:
• Signal LED: when the amplitude of the sample is greater than −6 dB3, the yellow LED is switched on;
• Saturation LED: when the amplitude of the sample is equal to 0 dB, the red LED is switched on.
Handling the Saturation LED requires only a simple check, while handling the Signal LED requires
setting a boolean flag that will be checked in the main loop to switch on the yellow LED. This
limitation is due to the fact that the yellow LED shares the same GPIO pin of the CC2511 MCU with
the BOOTLDR button, so it is not possible to modify the configuration of that specific pin during the
execution of an Interrupt Service Routine; otherwise the Wixel would interpret the state change as a
bootloader request.
ADC Improved Reading
The Wixel SDK provides the function uint16 adcRead(uint8 channel), which performs an Analog-
to-Digital Conversion and returns the sampled value. It is a blocking function, and its use within the
Interrupt Service Routine described above causes a strong dependence on the speed of the CPU. Indeed,
as illustrated in Table 2.2, using adcRead for a 12-bit reading locks the application
for 128 µs; this means that, working at fs = 6.04 kHz, only (166 − 128) µs = 38 µs remain for data
storage, processing and transmission. To overcome this problem a non-blocking method has been adopted,
exploiting the capability of the ADC to perform conversions without any interaction with the CPU.
The non-blocking method consists of three steps (highlighted in Figure 5.5):
1. set the ADC registers before the main loop starts;
2. in the Sampling routine, wait for the end of the conversion before reading the sampled value;
3. after saving the data, re-arm the ADC.
Since a 12-bit conversion lasts 128 µs < 166 µs, the Sampling routine will never actually wait for the end
of the conversion (except for the very first AD conversion). Moreover, since the ADC is re-armed by a time-driven
3 With respect to 0 dB as the maximum amplitude value.
Figure 5.6: Signal and Saturation LED indicator policy (saturation limit at 0 dB, signal threshold at −6 dB)
routine, all of the ADC samplings will be synchronized with the same granularity and precision of the T3
timer.
Using 12-bit conversion yields a +6 dB increase in SNR with respect to 10-bit conversion.
Given M, the ADC resolution in bits, the quantization step is:

vqn = Vref / 2^M

If M = 10 and Vref = 1.25 V4:

vqn(10) = 1.25 V / 2^10 = 1.221 mV

SNR10 = Vrms / vqn(10)

Increasing the ADC resolution to M = 12 and according to the ADC Conversion Results limitations illustrated
in the CC2511 datasheet [12, p. 139]:

vqn(12) = 1.25 V / (2^12 / 2) = 0.610 mV

vqn(12) = (1/2) · vqn(10)

and so

SNR12 = Vrms / vqn(12) = Vrms / ((1/2) · vqn(10)) = 2 · Vrms / vqn(10) ≈ SNR10 + 6 dB   (5.8)
4 See Table 5.1.
ID        Timestamp   T4CNT     1st sample ... N-th sample
1 byte    4 bytes     1 byte    FRAME_SIZE · size(sample) bytes
(Transmitter ID; timing data of the first sample; FRAME_SIZE samples)

Figure 5.7: BigEar Data field structure
Samples Processing and Transmission
The Main loop, after system initialization and startup, performs two tasks: it handles the network pro-
tocol, determining the time instants at which packet data are sent to the receiver, and it sends the audio
samples. The network protocol is handled by changing the status of the isTimeToTX flag, which is read
by the doProcessing() function in order to process and send the buffered samples.
For the pure ALOHA protocol, as already described in Section 2.5.1, the time instants of the transmissions
are determined by means of a random assignment that identifies a transmission delay ∆t ∈ U(0, ∆max),
where U represents a uniform random distribution on the discrete interval [0, ∆max]. As soon as the
system timer indicates that ∆t milliseconds have elapsed since the time instant of the last transmission,
the isTimeToTX flag is set to true.
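The random delay draw can be sketched as follows. Requiring the upper limit to be of the form 2^n − 1 (as Table 5.2 prescribes for radio_max_delay) lets a simple bitmask yield a uniform draw; rand() stands in for the board's random number source in this host-side sketch:

```c
#include <stdlib.h>

/* Draw the pure-ALOHA transmission delay: delta_t uniform on the discrete
   interval [0, radio_max_delay] ms, with radio_max_delay = 2^n - 1 so that
   masking the random value preserves uniformity. Illustrative, not the
   firmware's actual RNG call. */
unsigned next_tx_delay_ms(unsigned radio_max_delay) {
    return (unsigned)rand() & radio_max_delay;
}
```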
Then, samples are arranged within a radio data packet having the structure illustrated in Figure 5.7.
The size of the BigEar Data field structure is:

N = 6 + FRAME_SIZE · size(sample) bytes   (5.9)

According to the CC2511 Radio Packet format [12, p. 191] - reported in Appendix A.2 - the duration of the
transmission Ttx is given by:

Ttx = (P + Sw + 8·L + 8·A + 8·N + 16·C) · 1/Drate   (5.10)

where:
P = 64 bits           Preamble length
Sw = 32 bits          Sync Word length
L = 1                 Length field enabled
A = 0                 Address field disabled
Df = N (from Eq. 5.9) Data field size (expressed in bytes)
C = 1                 2-byte CRC enabled
Drate = 350 kbit/s    Data rate
As can be seen in the next paragraph, in order to reduce the packet size and to decrease the transmission
duration, each 16-bit sample is compressed to an 8-bit µ-Law sample. An important design choice is to
correctly dimension FRAME_SIZE, i.e. the number of samples stored in one frame and transmitted
in a single transmission. The idea is to exploit the minimum granularity of 1 ms of the time-handling
functions in order to transmit the maximum amount of data without overstepping that value.
From Equations 5.10 and 5.9, the maximum number of samples per frame FRAME_SIZEmax can be
derived as:

FRAME_SIZEmax : Ttx ≤ 1 ms

FRAME_SIZEmax ≤ ⌊(1/8)(Drate · 1 ms − 168 bits)⌋ = 22   (5.11)

FRAME_SIZE = 20 has been chosen in order to keep a small guard time. Indeed:

Ttx |FRAME_SIZE=20 = 0.9371 ms
µ-Law Compression and Expansion
The CC2511F32 SoC provides an industry-standard I2S interface. The application exploits the capability
of the I2S interface to perform fast5 µ-Law compressions and expansions in order to efficiently code the
audio samples.
µ-Law is an audio compression scheme (codec) defined by the Consultative Committee for International
Telephony and Telegraphy (CCITT) G.711 Recommendation [13], which compresses 16-bit linear PCM
data down to eight bits of logarithmic data. The compression process is logarithmic: the compression
ratio increases as the sample signal increases. In other words, larger sample signals are compressed
more than smaller ones. This causes the quantization noise to increase as the sample signal
increases. A logarithmic increase in quantization noise throughout the dynamic range of an input sample
signal keeps the SNR constant throughout this dynamic range.
Analog µ-Law compression   For a given input x, the equation for µ-Law encoding is:

F(x) = sgn(x) · ln(1 + µ|x|) / ln(1 + µ),   −1 ≤ x ≤ 1   (5.12)

Analog µ-Law expansion   µ-Law expansion is then given by the inverse equation:

F⁻¹(y) = sgn(y) · (1/µ) · ((1 + µ)^|y| − 1),   −1 ≤ y ≤ 1   (5.13)
The digital counterparts of the described equations are defined in ITU-T Recommendation G.711. [13]
5 CC2511 I2S interface µ-Law compression and expansion take one clock cycle to perform.
Parameter                         Value                  Note
ADC reference voltage             Vref = 1.25 V          Vref3V3 parameter (0 = 1.25 V reference; 1 = 3.3 V reference)
ADC input                         P0_0                   Encoded in ADC_CONFIGURATION variable
ADC resolution                    12 bit                 Encoded in ADC_CONFIGURATION variable
Sampling frequency                fs = 6.04 kHz          Parameter T3_counter = 30
No. of buffers                    FRAMES = 3
No. of samples per buffer         FRAME_SIZE = 20
Upper limit for random TX delay   radio_max_delay        Parameter; must be in the form (2^n − 1), n ∈ N
Transmitter's ID address          2 ≤ address ≤ 255

Table 5.2: BigEar Audio Capture application Design Choices and Parameters
Considering FRAME_SIZE = 20, if each sample were stored as an unsigned 16-bit integer, N = 46 bytes
and Ttx = 1.3943 ms. Using µ-Law compression, N = 26 bytes and Ttx = 0.9371 ms, which corre-
sponds to a reduction of the transmission duration of about 33%.
Table 5.2 summarizes the design choices implemented in the BigEar Audio Capture application and
its configurable parameters.
5.3.2 BigEar Input Calibration
BigEar Input Calibration is a tool used to calibrate the DC polarization of the input
pin P0_0 of the Wixel module in order to find the right bias point. The application performs sequential
readings of the P0_0 pin and calculates the average value over 255 consecutive readings; the obtained
value is then used to generate a report mask that visualizes the data not only numerically but also in a
simple graphical way that helps the user find the right bias point.
The application does not use radio protocols; it simply generates an ASCII mask via the USB serial port,
using the VT100 command <ESC>[0;0H] [5] to move the cursor to the top-left corner of the terminal
window and rebuild the mask.
The choice to write a separate stand-alone application for the calibration is justified by the need to
keep the business-logic applications as lightweight as possible. The C code of the application is
listed in Appendix E.4.
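The averaging step described above can be sketched as follows (function name and buffer handling are illustrative, not the tool's actual code):

```c
/* Average of 255 consecutive ADC readings, as displayed by the
   calibration report mask. Illustrative helper, not the firmware code. */
unsigned adc_average_255(const unsigned short readings[255]) {
    unsigned long sum = 0;                 /* wide accumulator: no overflow for 12-bit values */
    for (int i = 0; i < 255; i++)
        sum += readings[i];
    return (unsigned)(sum / 255);
}
```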
Figure 5.8: Flowchart of the BigEar Receiver application
5.3.3 BigEar Receiver
The role of the BigEar Receiver - illustrated in Figure 2.4 - is to act as an interface between the
BigEar Audio Capture boards and the Base Station. The flowchart of the application is illustrated
in Figure 5.8. After system initialization, the application checks the radio buffers for incoming
data, converts the received packets into hexadecimal nibbles and sends them serially by calling the
usbComTxSend() function.
Since the pure ALOHA protocol does not require the Base Station to send messages to the transmitters,
the BigEar Receiver implements application logic only for Transmitter-to-Base-Station communication.
The C code of the application can be found in Appendix E.5.
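The byte-to-nibble conversion the Receiver performs before forwarding packets over USB can be sketched as follows (helper name illustrative):

```c
/* Convert one received byte into two ASCII hexadecimal nibbles, high
   nibble first, as the Receiver does before forwarding packets over USB. */
void byte_to_hex_nibbles(unsigned char b, char out[2]) {
    static const char hex[] = "0123456789ABCDEF";
    out[0] = hex[b >> 4];
    out[1] = hex[b & 0x0F];
}
```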
5.3.4 BigEar Base Station
The Base Station application executes two tasks:
• capture data received through the USB serial ports (each BigEar Receiver is connected to a USB port)
and save the captured data into a RAW file;
• open a RAW file and decode its content in order to create a CSV file suitable for the reconstruction
performed by the BigEar Reconstructor MATLAB script.

Figure 5.9: BigEar SerialPort raw file formats
(a) BigEar Audio packet RAW format: Transmitter ID (2 chars) | Timestamp (8 chars) | T4CNT (2 chars) | FRAME_SIZE samples (2 · FRAME_SIZE · size_in_bytes(sample) chars) | CR termination (1 char)
(b) BigEar New Transmitter Protocol message: protocol word "#TS#" (4 chars) | Transmitter ID (2 chars) | "#" (1 char) | system time (16 chars) | termination string (2 chars)
In its experimental form, the application focuses on data and packet analysis, so it does not allow real-
time operation. However, as can be seen in Section 6.5.2, performance metrics confirm that processing
is considerably faster than acquisition.
Data Collection: BigEar SerialPort
The application logic of the data capture stage is illustrated in Figure 5.10. The user is asked to select
the USB virtual serial ports to which the BigEar Receiver(s) is (are) connected. Then, one ComWorker
thread per BigEar Receiver is instantiated; each instance listens for incoming data and saves the re-
ceived BigEar Audio Packets in a raw temporary file. The ComWorker thread analyzes the first two hex-
adecimal nibbles of each packet; whenever a new BigEar Audio Capture ID is identified, ComWorker
stores the system timestamp at which the packet has been received. In this way, the application cre-
ates a correlation between the board time of each BigEar Audio Capture board and the system capture
time given by the application. This relationship is used in the Raw-to-CSV decoding stage to allow the
BigEar Reconstructor algorithm to work on packets belonging to the same temporal domain.
When the user stops the capture, the ComWorker instances are terminated and the temporary files are
merged into a unique .BgrData file. In this way each thread is completely independent from the
others, since there is no need to synchronize access to common resources.

Figure 5.10: BigEar SerialPort Capture flowchart
The raw file contains information encoded in HEX format: each received packet is stored as a se-
quence of hexadecimal nibbles (Figure 5.9a) terminated by a line termination. In addition to the audio
packets, the raw file also stores protocol messages that will be used in the Raw-to-CSV decoding stage.
The only protocol message generated during the capture stage is the communication of the system time
at which a transmitter has been seen for the first time, as already explained above. The
structure of this protocol message is described in Figure 5.9b.
Raw-To-CSV Conversion
The BigEar Reconstructor MATLAB script reads data from a CSV file, which has to be built
starting from the raw data stored in the .BgrData file. Each line of the raw file is passed as a parameter
to the constructor of the Frame class, so each Frame object contains the set of samples corresponding to
one BigEar Audio packet.
Each Frame object is added to a FrameList object, i.e. the container that represents the list of
all the audio packets captured and saved into the raw file. The FrameList class also offers some
methods for performing statistical analysis on the captured data.
As illustrated in the previous paragraph, the application creates a correlation between the board time of
each BigEar Audio Capture board and the system capture time given by the application. The Raw-to-
CSV conversion procedure transforms (by means of a timeshift) the temporal axis of each Audio packet
stream6 in order to give all the sources the same temporal domain. This re-alignment is carried out by
means of the applySystemTime() method of each Frame object.
Given:
• systemTimei: the system time instant at which an Audio packet is received for the first time from the i-th transmitter;
• initalTxDelayi: the board time instant of the first Audio packet received from the i-th transmitter;
• delay(i,j): the board time instant of the j-th Audio packet received from the i-th transmitter, which corresponds to the time instant of the first sample of the packet;
the re-aligned timestamp delay*(i,j) is given by:

delay*(i,j) = delay(i,j) − initalTxDelayi + systemTimei   (5.14)

The complete Raw-to-CSV conversion procedure is illustrated in Figure 5.11.
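Eq. 5.14 can be expressed as a one-line helper (a sketch in C rather than the application's C#; identifier names follow the thesis text):

```c
/* Eq. 5.14: shift a packet timestamp from its board time axis onto the
   common system time axis. Identifier names follow the thesis text. */
unsigned long realign_timestamp(unsigned long delay_ij,
                                unsigned long initalTxDelay_i,
                                unsigned long systemTime_i) {
    return delay_ij - initalTxDelay_i + systemTime_i;
}
```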
As can be seen in the Class Diagram in Figure 5.12, the Frame class is a complete representation of a
BigEar Audio packet in which each sample is stored in its 8-bit µ-Law form.
6 An Audio packet stream is the set of all the Audio packets belonging to the same BigEar Audio Capture board.
Figure 5.11: BigEar SerialPort Raw-to-CSV decoding
uLaw: − expandedValue : int[]; + GetExpandedValue : int

ComWorker: − comPort : SerialPort; − DataCounter : uint; − known : bool[]; − port : string; − shouldStop : bool; − stream : FileStream; − sw : StopWatch; + DoWork : void; + GetDataCounter : uint; + RequestStop : void

Frame: − msFract : byte; − Source : byte; − samples : byte[]; + StartingMillisecond : uint; + ApplySystemTime : void; + GetDelayOfSample : double; + GetFrameCSV : string; + GetSample : int; + GetSamplesCSV : string

FrameList (contains 0..* Frame): − avgFramesDistance : double[]; − finalDelay : uint[]; − framesTransmitted : uint[]; − hasBeenSeen : bool[]; − initialDelay : uint[]; + Add : void; + Clear : void; + GetAvgFrameDistance : double; + GetFramesTransmitted : uint; + GetInitialDelayOfSource : uint; + GetMaxDuration : float; + HasBeenSeen : bool; + Remove : bool; + ToString : string

Figure 5.12: BigEar SerialPort Class Diagram
When the getSample(byte index) method is called to return the value of the indexth sample, the
method returns the expanded 16-bit value. The expansion is performed by means of a lookup table (see
uLaw class in Figure 5.12). Moreover, the Frame class exposes the getSamplesCSV() method that
returns the Comma-Separated-Value representation of the Frame.
The BigEar SerialPort application GUI is reported in Appendix E.6.
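The lookup-table expansion performed by the uLaw class can be sketched as follows. This is a standard G.711-style µ-law expander written in Python for illustration; the thesis application is a .NET program, and the exact table used by BigEar is not reproduced here, so the usual G.711 convention is an assumption:

```python
def ulaw_expand(code):
    """Expand one 8-bit mu-law code into a linear PCM value (G.711 convention)."""
    u = ~code & 0xFF                    # mu-law bytes are stored complemented
    sign = u & 0x80                     # top bit of the complemented byte
    exponent = (u >> 4) & 0x07          # 3-bit segment number
    mantissa = u & 0x0F                 # 4-bit step within the segment
    magnitude = ((mantissa << 3) + 0x84) << exponent   # 0x84 = 132, the bias
    return (0x84 - magnitude) if sign else (magnitude - 0x84)

# Precompute the 256-entry table once, as the uLaw class does
ULAW_TABLE = [ulaw_expand(i) for i in range(256)]
```

With this convention, code 0x00 expands to -32124, code 0x80 to +32124, and 0xFF to 0; a per-sample expansion then becomes a single table lookup.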
Audio Reconstruction: BigEar Reconstructor MATLAB Script
The MATLAB script devoted to speech reconstruction prepares the audio data to be processed by the Reconstruction block already discussed in Chapter 4. Starting from the CSV file generated during the previous stage, the script builds the audioPackets and positions matrices, in which each audioPackets(i,j) element represents the ith audio sample transmitted by the jth sensor; the position in time of the given sample is specified in positions(i,j).
After the data setup, the reconstruct function is called and the reconstructed signal is filtered by means of a third-order Type I Chebyshev filter (passband ripple) in order to suppress out-of-band noise and artifacts. At the end of the signal processing part, some statistics are generated in order to analyze the performance. The list of metrics and statistical measures is discussed in Section 6.2. The flowchart of the BigEar Reconstructor MATLAB Script is illustrated in Figure 5.13.
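The data-setup step can be sketched as follows. The sketch is in Python rather than MATLAB, and the row layout is an assumption (each decoded CSV row is taken to carry a sensor index, a time position and a sample value):

```python
def build_matrices(rows, n_sensors):
    """Group decoded CSV rows per sensor: audio_packets[j] collects the samples
    transmitted by sensor j in arrival order, positions[j] their time positions.
    (The MATLAB script stores the same data column-wise as audioPackets(i,j).)"""
    audio_packets = [[] for _ in range(n_sensors)]
    positions = [[] for _ in range(n_sensors)]
    for sensor, time_position, sample in rows:
        audio_packets[sensor].append(sample)
        positions[sensor].append(time_position)
    return audio_packets, positions
```

For example, three rows from two sensors are split into one sample/position list per sensor, ready to be placed on a common timeline by the Reconstruction block.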
[Flowchart: generate the audioPackets and positions matrices → Reconstruction block → generate a filtered copy of the reconstructed speech → generate metrics and statistic data.]
Figure 5.13: BigEar Reconstructor Flowchart
6 BigEar Results Analysis
Introduction
In this chapter the results of the system simulation and the measurements obtained with the real prototype are discussed. After defining the adopted metrics and describing the experimental setups, the consistency between the theoretical model, implemented in the BigEar Simulator, and the real-world implementation of the prototypes is shown; then, the reconstructed signal metrics are discussed, focusing on relevant properties such as the number of holes and the overlapping between contributors. Finally, software metrics allow evaluating the robustness of the system in terms of processing speed and reconstruction quality.
6.1 Overview
Once the system has been implemented and the prototype realized, some metrics are needed in order to examine the system behavior and to perform a direct comparison between the data captured by means of the BigEar prototype and the data obtained by means of the BigEar Simulator described in Chapter 3; the defined metrics are related to those speech signal characteristics that are identified as relevant or critical for the application.
One of the critical aspects is the holeyness of the reconstructed speech signal. As already seen, since the reconstructed signal is produced by the superposition of audio packets sampled by different sensors at random time instants, it can be affected by sequences of empty samples. These holes influence the result of the speech recognition algorithm, so the defined metrics help in finding the best parameters and the best working conditions.
The parameters yielding the best output speech signal are not evaluated only by looking at the number of holes in the reconstructed stream, but also by paying attention to the degree of overlapping between the audio streams received from the transmitters and to the quality of the reconstruction, since the superposition algorithms affect the quality of the final result. Another important factor is the speed of processing: reconstruction algorithms have to be fast in order to be used, in the future, for real-time audio capture.
6.2 Metrics Definition
In order to compare the audio speech signal generated by the BigEar Simulator with the audio stream captured and reconstructed by means of the prototypes described in Chapter 5, some metrics have been defined. These measures characterize the results in terms of the quality of the reconstructed speech signal and of the amount of informative contribution given by each BigEar Audio Capture board. Other metrics are used to measure software performance.
6.2.1 Reconstructed Signal Metrics
The metrics defined in this section provide quality measures concerning the reconstructed speech signal. As already mentioned in Section 4.5, the success of speech recognition is influenced by the number and the size of the holes in the reconstructed signal. Moreover, the convergence of the BigEar Reconstruction algorithm is influenced by the amount of information that can be overlapped for the Cross-correlation alignment.
Fill ratio

Fill_ratio = (No. of samples) / N, where N = length of the stream (in samples)   (6.1)

Already defined in Eq. 4.1. Referring to the reconstructed signal, it represents the number of samples present with respect to the total length of the stream. The closer the value is to 1, the more complete the reconstructed signal.
Number of holes

NoH = (No. of 0-ed sequences in totalCoverage) / N   (6.2)
Size of holes

SoH = average size of the 0-ed sequences in totalCoverage   (6.3)

In conjunction with NoH, this metric characterizes the distribution of the empty samples (holes) in the reconstructed signal. For a constant Fill_ratio, SoH and NoH make it possible to assess whether the empty samples are gathered into a few big blocks or spread over many small blocks.
Support factor

Sf = E[totalCoverage≠0], where totalCoverage≠0 = {ti ∈ totalCoverage : ti ≠ 0}   (6.4)

Sf gives a measure of the contribution of each single transmitter to the construction of the final speech signal. Sf ∈ (0, NTX], where NTX is the number of transmitters. The higher Sf, the higher the overlapping between the streams obtained from the different transmitters.
Support factor standard deviation

σSf = √( (1/N) · Σ_{i=1}^{N} (ti − Sf)² ), where ti ∈ totalCoverage≠0   (6.5)

σSf measures the dispersion of totalCoverage≠0. A σSf close to 0 indicates that the number of contributors per sample tends to be very close to Sf, while a high σSf indicates that the number of contributors is volatile.
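The metrics of Eqs. 6.1-6.5 can all be computed from the totalCoverage vector (the per-sample count of contributing transmitters). The following Python sketch is illustrative, not the thesis code; in particular, it assumes that N in Eq. 6.5 counts the non-empty samples:

```python
def signal_metrics(total_coverage):
    """Compute Fill_ratio, NoH, SoH, Sf and sigma_Sf from totalCoverage."""
    n = len(total_coverage)
    covered = [c for c in total_coverage if c != 0]   # totalCoverage != 0
    fill_ratio = len(covered) / n                     # Eq. 6.1

    # Collect the lengths of the runs of empty samples (the holes)
    holes, run = [], 0
    for c in total_coverage:
        if c == 0:
            run += 1
        elif run:
            holes.append(run)
            run = 0
    if run:
        holes.append(run)

    noh = len(holes) / n                              # Eq. 6.2
    soh = sum(holes) / len(holes) if holes else 0.0   # Eq. 6.3
    sf = sum(covered) / len(covered)                  # Eq. 6.4
    sigma_sf = (sum((t - sf) ** 2 for t in covered) / len(covered)) ** 0.5  # Eq. 6.5
    return fill_ratio, noh, soh, sf, sigma_sf
```

For instance, a coverage vector [0, 0, 2, 2, 0, 1, 3, 0] has Fill_ratio = 0.5, three holes of average size 4/3, and Sf = 2 contributors per covered sample.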
6.2.2 Software Performance Metrics
Each BigEar Audio Capture board samples a signal that is the summation of the direct sound and of the environmental contributions given by the room reflections, so each audio stream differs more or less from the others. Where two or more streams are superposed, artifacts are generated at the beginning and at the end of the overlapping areas (see Section 4.3.2). While it is impossible to estimate the extent of the artifacts, it is possible to count the positions at which an artifact could be generated: the higher the number of potential artifacts, the higher the probability of degrading the speech quality.
Potential Artifacts Ratio

Aws = ( Σ_{k=1}^{NTX} edges_k ) / N   (6.6)
where edges_k = 2 · NoH_k
and NoH_k = no. of holes in the stream produced by the kth sensor

Ahr = ( Σ_{k=1}^{NTX} (edges_k − edges_k^{h<k}) ) / N   (6.7)
where edges_k^{h<k} = edges in the kth stream covered by samples of previous streams
Since the number of potential artifacts depends on the chosen superposition policy, two different calculation methods are needed. Aws is the metric used for Weighted Sum reconstruction; in this case the method considers all the edges in each stream, since all of the contributions are summed up. On the other hand, reconstruction by Holes Replacement starts from a given stream (selected by means of signal power analysis), and only the samples needed to fill its holes are taken from the other streams (see Section 4.3.2); in this case, therefore, Ahr takes into account, for each stream, only the edges that are not covered by samples provided in the previous steps.
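Eqs. 6.6 and 6.7 reduce to simple edge counting; a minimal sketch (illustrative Python, with hypothetical names):

```python
def artifacts_ratio_ws(noh_per_stream, n):
    """Eq. 6.6: every hole contributes two edges (its start and its end), and
    the Weighted Sum policy counts all the edges of all the streams."""
    return sum(2 * noh_k for noh_k in noh_per_stream) / n

def artifacts_ratio_hr(edges_per_stream, covered_edges_per_stream, n):
    """Eq. 6.7: Holes Replacement discounts the edges already covered by
    samples taken from previously processed streams."""
    return sum(e - c for e, c in zip(edges_per_stream, covered_edges_per_stream)) / n
```

By construction Ahr ≤ Aws for the same capture, which matches the comparison discussed below for Figure 6.5.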
Realtime Performance Ratio
RPR = ∆Trec / ∆Telab = ( (length of reconstructed signal) · 1/Fs ) / ∆Telab   (6.8)
The Realtime Performance Ratio measures the ability of the system to operate in real time, giving the ratio between the duration of the reconstructed stream and the time needed to process the data. This measure depends on the number of transmitters, since the higher the number of transmitters, the higher the data flow. The metric can therefore be used as a global tradeoff parameter: RPR > 1 states that the whole system is able to buffer, send and process data faster than it samples them.
6.3 Simulation Setup
Simulations have been performed using the BigEar Simulator described in Chapter 3, changing some parameters in order to study the system behavior under different configurations. The fixed parameters are given by the design choices illustrated in Chapter 5, while the varied parameters are the ones related to:
• Number of BigEar Capture boards used for acquisition
• BigEar Audio Capture boards positions in the room
• Radio channel configuration of each BigEar Audio Capture board (how many transmitters com-
municating on the same radio channel, as seen in Section 2.4)
• Maximum delay between adjacent transmission of the same transmitter (See ALOHA Protocol
Family, Section 2.5.1)
Table 6.1 summarizes the parameters given to the BigEar Simulator. It can be noted that the selected values of the TMAXDELAY parameter are the same values that can be used in the BigEar Audio Capture Board Application, as illustrated in Table 5.2. For each set of simulation parameters, in order to give statistical relevance to the data, 50 experiments have been performed using a speech signal with a duration of 16 seconds. Transmitter positions and channel configurations are summarized in Appendix F.
From these simulations, statistic data and metrics have been calculated according to Section 6.2.1 and plotted onto charts that are discussed in Section 6.5.
6.4 On-field Setup
Once the prototypes were realized, they were tested in order to verify their correct realization. Then, in order to set the overall gain and to find the right bias point at the interface between the amplifier stage and the P0_0 input of the Wixel board (Section 5.2.1), each transmitter has been calibrated using the BigEar Input Calibration application (Section 5.3.2).
On-field tests have been divided into two stages: Near-field and Far-field tests.
78
6.4. On-field Setup
Description                                     Parameter        Value
N-buffer: number of frames                      FRAMES           3
N-buffer: frame size (Eq. 5.11)                 FRAME_SIZE       20
Duration of radio transmission (Sec. 5.10)      Tbusy            1 [ms]
Max delay between transmissions (Sec. 2.5.1)    TMAXDELAY        1, 3, 7, 15, 31, 63 [ms]
Sampling frequency (Sec. 5.3.1)                 Fs               6040 [Hz]
Room dimensions (x, y, z) (Sec. 3.2.2)          ROOM_DIMENSIONS  5, 3, 2.8 [m]
Source position (x, y, z) (Sec. 3.2.2)          SOURCE_X         3.6 [m]
                                                SOURCE_Y         1.25 [m]
                                                SOURCE_Z         1.45 [m]
Number of transmitters                                           1 up to 4, in different positions
Radio channels (Sec. 2.4)                                        1 or 2 channels, in different configurations
Audio input                                                      16 seconds speech signal
Number of experiments per parameter set                          50

Table 6.1: BigEar Simulator parameters
Near-field Tests During Near-field tests, the consistency between the model and the real world has been probed. In this setup, the BigEar Audio Capture boards were placed side by side on a flat surface, and the speaker was asked to talk at a distance of about 0.6 m from the microphones. Data have then been captured using different configurations:
• Number of transmitters and channel configuration. Character sequences indicate the number of channels and how many transmitters are transmitting on each channel; e.g. AAB means three transmitters: the first two transmitters on radio channel A, the third one on radio channel B:
– One transmitter: A
– Two transmitters: AA - AB
– Three transmitters: AAA - AAB
– Four transmitters: AAAA - AAAB - AABB
• Maximum delay between adjacent transmissions from the same transmitter (TmaxDelay): 1 - 3 -
7 - 15 - 31 - 63 ms
Far-field Tests During Far-field tests, the focus shifted to the Reconstruction Algorithm. This test stage is close to a real situation, since the BigEar Audio Capture boards were fixed to poles 1.60 m above ground level and placed in a medium-size room. The talker was asked to speak from an asymmetric position, in order to examine the signal power differences between the different streams. Data have then been captured using different configurations:
• Number of transmitters and channel configuration:
– One transmitter: A
– Two transmitters: AA - AB
– Three transmitters: AAB
– Four transmitters: AABB
• Maximum delay between adjacent transmissions from the same transmitter (TmaxDelay): 1 ms
6.5 Results Discussion
In this section, the results obtained from simulations and On-field tests are compared in order to verify the consistency between the BigEar Simulator and the behavior of the BigEar prototype in the real world; then, the metrics defined in Section 6.2.1 are examined in order to analyze how the number of transmitters and the TmaxDelay parameter affect the reconstructed speech signal. Software performance metrics (Section 6.2.2) are discussed in terms of processing speed and reconstruction quality.
6.5.1 Reconstructed Signal Comparison
In order to examine the similarity of the BigEar model to the realized prototype, the Fill_ratio, NoH, SoH and Sf ± σSf metrics have been plotted varying the TmaxDelay parameter, thus obtaining as many plots as there are transmitter configurations. In all of the plots, black asterisks mark the real values obtained from the prototypes, while lines indicate the simulated ones.
Figure 6.1 illustrates that, varying both the TmaxDelay parameter and the number of transmitters, the curves of prototype and simulation are asymptotic. Differences are notable where TmaxDelay ∈ {1, 3, 7}, i.e. where the average distance between adjacent transmissions¹ of the same transmitter is comparable with the duration of a frame of samples (20 · 1/6040 = 3.31 ms).
This difference is given by the modular structure of the BigEar Simulator: the N-buffer Internal Model (Sec. 3.3.2) does not communicate buffer-status information to its predecessor, the Radio Transmission Model (Sec. 3.3.1). In the real world, if the buffer is empty, no transmission happens; the Radio Model, instead, makes no consideration of the buffer status, with the result that virtual transmitters that have no data to transmit also contribute to the saturation of the radio channel and hence to the loss of valid packets.
Looking at Fill_ratio, it can be observed that in most cases the real Fill_ratio is slightly higher than the simulated one. The reason is that the model adopts Tbusy = 1 ms as the duration of a transmission, while in Equation 5.10 it has already been observed that the actual duration of a transmission is 0.9371 ms, not 1 ms.
In general, comparing Figure 6.1a with Figure 6.1b, it can be observed that doubling the number of transmitters and working on 2 channels instead of 1 brings a big increment in Fill_ratio and in Sf (support factor), thus improving the quality of the signal (in terms of size of holes) and the support factor, i.e. the quantity of overlapped samples between the streams.

¹Since with the ALOHA protocol the delay between transmissions is chosen in U(0, TmaxDelay), the mean of this uniform distribution is TmaxDelay/2.

[Figure: for each test case, four panels plot Collisions ratio and Fill_ratio (normalized 0:1 ratio), SoH (no. of samples), Sf ± σSf (no. of transmitters) and NoH against TmaxDelay [ms].]
(a) Test case: 2 transmitters on the same channel (AA)
(b) Test case: 4 transmitters on two channels (AABB)

Figure 6.1: Reconstructed signal metrics plotted as a function of the TmaxDelay parameter

[Figure: Fill_ratio (normalized 0:1 ratio) plotted against TmaxDelay [ms] for each series.]
(a) From 1 to 4 transmitters on the same radio channel (1 Tx, 2 Tx AA, 3 Tx AAA, 4 Tx AAAA)
(b) 4 transmitters with different radio channel configurations (AAAA, AAAB, AABB)

Figure 6.2: Fill_ratio as a function of TmaxDelay
Figure 6.2 illustrates different plots of Fill_ratio as a function of TmaxDelay. In the first panel (Fig. 6.2a) the series are drawn incrementing the number of transmitters that share the same radio channel, while in the second panel (Fig. 6.2b) the number of transmitters is constant but the number of transmitters per radio channel varies. It can be observed that for small values of TmaxDelay, the higher the number of transmitters on the same channel, the lower the Fill_ratio. This is due to the number of collisions, which increases as the channel gets more crowded. Note also the big difference, in Fig. 6.2b, between 4 transmitters on one channel and 2 transmitters on channel A plus 2 transmitters on channel B. For big values of TmaxDelay the curves tend to be asymptotic, since the occupation of the channel(s) in time decreases, and with it the number of collisions.
A special case: TmaxDelay = 1 ms It is worth commenting on the particular condition TmaxDelay = 1. In this case the delay is chosen at random within the uniform distribution U(0, 1), whose mean is µ(∆) = 0.5. Since the delay can be expressed only as an integer value, ∆ ∈ {0, 1}. Under these preconditions the system is not working with a pure ALOHA protocol: the application simply sends audio packets as soon as possible, i.e. as soon as a buffer is ready, it is sent. Since the duration of a single transmission is 0.9371 ms and an audio packet carries 20 · 1/Fs = 3.3112 ms of audio, the time occupation of the radio channel per transmitter is about 28%. This means that for a small number of transmitters the system can work efficiently with a small number of collisions. However, this buffer-and-send protocol is subject to the risk of synchronization: if two transmitters start transmitting simultaneously, it is very likely that all of their packets will be systematically lost.
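The 28% figure follows directly from the frame timing; a quick check (illustrative Python, using the values from Table 6.1):

```python
FS = 6040            # sampling frequency [Hz] (Table 6.1)
FRAME_SIZE = 20      # samples per audio packet
T_BUSY = 0.9371e-3   # measured duration of one radio transmission [s]

frame_duration = FRAME_SIZE / FS     # audio carried by one packet: ~3.3112 ms
occupancy = T_BUSY / frame_duration  # channel time used per transmitter: ~0.283
```

With four transmitters on one channel, this per-transmitter occupancy already sums to more than 100% of the channel time, which is consistent with the high collision ratios observed for crowded channels at small TmaxDelay.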
6.5.2 Software Performance Tests
Speed of Processing Metrics
For each test case discussed in the previous sections, the time needed by the BigEar Reconstructor MATLAB Script to build the speech signal has been measured in order to estimate the Realtime Performance Ratio metric (Section 6.2.2). Results are plotted in Figure 6.3, divided into different TmaxDelay sets; within each set the number of transmitters is in increasing order, starting from 1 transmitter and arriving at 4 transmitters, following this pattern:
A 1 transmitter, 1 channel
AA 2 transmitters, 1 channel
AB 2 transmitters, 2 channels
AAA 3 transmitters, 1 channel
AAB 3 transmitters, 2 channels
AAAA 4 transmitters, 1 channel
AAAB 4 transmitters, 2 channels
AABB 4 transmitters, 2 channels
Tests have been performed using:
• Desktop PC - Intel Core2 Quad CPU Q9400 @ 2.66 GHz, 2 GB RAM. Only O.S. services and MATLAB running.
• Same machine under full load conditions (15 working applications).
• Laptop - Intel Core i7 CPU L640 @ 2.13 GHz, 8 GB RAM. Only O.S. services and MATLAB running.
• Same machine under full load conditions (15 working applications) and in power saving mode.
All of the tests showed high values of RPR, i.e. the processing speed of the BigEar Reconstructor MATLAB Script is higher than the sampling speed. In Figure 6.3, the red horizontal dashed/dotted line close to the abscissa axis indicates the unity value: if RPR were less than 1, BigEar Reconstructor could not be used for realtime capturing. In this case, RPR is much greater than 1.
The RPR plot points out that the processing speed depends on the number of packets received: the higher TmaxDelay, the lower the number of packets received per time unit and the faster the algorithm. It can be noted that there is a big gap between the first measure of each set (RPR measured with one transmitter) and the second one (RPR measured with two transmitters); then the measures tend to form steps in decreasing order as the number of transmitters increases. The big gap mentioned above is due to the fact that, in the presence of only one transmitter, the BigEar Reconstructor MATLAB Script only has to arrange the audio data onto its timeline and perform the energy compensation steps (Section 4.2), so there is no need for internal cycles or for the Cross-correlation function.
Clock Tests During the implementation of the hardware part of the BigEar prototypes, the timer stability has been probed, since one of the crucial points of the application is that every BigEar Audio Capture board samples the analog signal at the right sampling frequency Fs. Moreover, it is important to verify that the N-Buffer mechanism works correctly, in order to avoid corrupted data that could generate unexpected behaviors in the following stages.
In order to check timer stability, an output pin of the Wixel Programmable Module has been programmed to switch its state every time a sampling occurs; the pin has then been connected to a
[Figure: Trec/Telab ratio plotted for each test case; minimum RPR = 23.4167. A red dashed/dotted line near the abscissa marks RPR = 1.]
Figure 6.3: Realtime Performance Ratio plotted for different test cases, divided by TmaxDelay sets
[Figure: logic analyzer traces of the Sampler, Radio TX and Buffer change pins over a 0-0.05 s window.]
Figure 6.4: BigEar Capture Board Application timing analysis
Logic Analyzer. The application ran on different Wixels for up to 3 hours; then the data have been exported and analyzed, with the result that no significant variation has been observed during the inspection period.
Then the N-Buffer robustness has been tested using not 1 but 3 output pins of the Wixel Programmable Module: the first pin was connected to the sampler, as described above; the second pin was used to inspect the duration of the radio transmission (when the radio module is in TX mode, the pin is LOW [12, pp. 225-226]); the third pin was programmed to change its state whenever a frame of the N-Buffer system is full and the internal counter jumps to the next free frame. Figure 6.4 illustrates the correct behavior of the BigEar Capture Board Application: every time the main loop flushes a frame, the irs_index frame counter jumps to the next free frame.
[Figure: normalized ratio of potential artifacts for each test case, comparing the Aws (Weighted Sum) and Ahr (Holes Replacement) superposition policies.]
Figure 6.5: Potential Artifacts Ratio plotted for different test cases, divided by TmaxDelay sets
Reconstruction Quality Metric
During Far-field tests, the speech signal was reconstructed using both the Weighted Sum method (Sec. 4.3.1) and the Holes Replacement method. Listening tests have shown big differences in the reconstructed speech signal depending on the superposition policy adopted. As explained in Section 4.3.2, the larger the distances between the BigEar Audio Capture boards, the bigger the differences between the audio signals, due to different environment reflections and diffusions. These differences cause discontinuity artifacts in the reconstructed signal at the positions where different contributions are superposed in the attempt to fill the holes in the reconstructed signal (described in Section 4.1).
In order to examine how the superposition methods affect the presence of artifacts, the Potential Artifacts Ratio metric counts the number of positions where artifacts could be generated and normalizes it with respect to the length of the signal, thus obtaining a comparable metric.
Figure 6.5 shows that, for each TmaxDelay set, the Weighted Sum method (whose artifacts ratio is denoted with Aws) is more prone to artifact creation than the Holes Replacement method. Moreover, as expected, the Potential Artifacts Ratio grows with the number of transmitters that compose the system, in particular when multiple transmitters operate on multiple channels: since there is high overlapping between audio packets, Weighted Sum has more data to superpose.
The approach of the Holes Replacement policy (Section 4.3.2) is different: it adopts the most powerful signal as reference, then uses the other streams for hole replacement; in this way the Potential Artifacts Ratio metric gives better results, keeping low the number of points at which an artifact can be generated.
6.5.3 Coarse-grain localization
As mentioned in Chapter 4, the power of the signals and the delays applied for stream alignment are related to the distances between the source and the sensors, so if the positions of the BigEar Audio Capture Boards are known, this information can be exploited for a coarse-grain localization of the source. Figure 6.6 shows the relation between the signal power and the delay used for stream alignment. In order to give significance to the graphical representation of the data, values have been normalized between 0 and 1; furthermore, since the delay grows with distance while the power decreases, a flipped version of the delay (called advance) has been plotted:

advance[i] = 1 − delay[i] / max(delay)

It can be noted that the alignment delays and the signal powers provide coherent information that can be used for source localization.
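The flip from delays to advances is a one-line transformation (illustrative Python sketch):

```python
def advances(delays):
    """Turn alignment delays into 'advances': sensors closer to the source get
    values closer to 1, matching the trend of the normalized signal power."""
    m = max(delays)
    return [1 - d / m for d in delays]
```

For example, delays of 0, 2 and 4 ms map to advances of 1.0, 0.5 and 0.0, so both curves in Figure 6.6 decrease with the sensor-source distance.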
[Figure: normalized advance (1 − normalized delay) and normalized signal power per transmitter.]
(a) Test case: 3 transmitters (S2, S3, S4)
(b) Test case: 4 transmitters (S2, S3, S4, S5)

Figure 6.6: Signal power and advance plot. Since the delay grows with distance while the power decreases, advances have been chosen instead of delays in order to make the time-of-arrival measure coherent with the signal power measure
7 Conclusions and Future Work
7.1 Conclusions
In this thesis, starting from the requirements of Minimum cost, Wireless, Distribution and Modularity, the prototype of a voice capture application has been introduced.
The BigEar application has been implemented on top of a distributed Wireless Sensor Network that performs a space-time audio sampling of an environment. The core of the application is the Base Station, the device devoted to data collection. The Wireless Sensor Network can be multi-layered through the use of multiple radio channels; for each radio channel, a BigEar Receiver acts as an interface between the Base Station and the BigEar Audio Capture Boards belonging to the same layer.
In order to analyze and evaluate the architecture features, a simulator has been implemented. The BigEar Simulator adopts an interaction model composed of different sequential blocks: the Audio Model block performs a physical characterization of the speech signal captured by each sensor; the Sensor Network Model simulates the behavior of the architecture by means of a Network Interaction Model and an N-Buffer Internal Model, which account respectively for handling the time instants at which each sensor transmits an Audio Packet, and for simulating the behavior of the internal buffering system of each transmitter.
Once the audio packets have been received by the BigEar Base Station, they need to be superposed in order to build a unique speech signal from the contributions of each sensor. Superposition is performed by the Reconstruction Block, which exploits the Cross-correlation function to find the right time alignment of the audio data. Before superposition, the audio packets need to be unbiased and normalized, to compensate for the intensity differences due to the different sensor-source distances and to correct wrong sensor calibrations. Then, two different methods have been tested for the superposition of samples: the Weighted Sum of samples is robust in terms of handling local issues (like saturation of the input stage), at the expense of a higher number of discontinuity points (artifacts); conversely, the Holes Replacement method is less prone to artifact generation, but additional methods are required for handling local issues. The reconstructed speech signal is subject to the presence of sequences of empty samples (holes) that could influence the Speech Recognition stage; several methods have been tested in order to fill these sequences.
The system has been implemented in order to examine the real-world behavior of the previously modeled architecture. The BigEar Audio Capture board has been implemented using a cheap microphone and a signal
conditioning circuit composed of a double-stage operational amplifier with a fixed band-pass filter and a variable overall gain. The application running on each BigEar Capture module performs ADC readings at full 12-bit resolution and compresses each 12-bit sample into an 8-bit µ-Law sample, optimizing the data rate and increasing the quantization SNR. The BigEar Receiver application acts as a wireless-USB dongle between the BigEar Audio Capture modules and the BigEar Base Station. Finally, the BigEar Base Station has been implemented as a modular application, with a focus on debugging and testing purposes.
7.2 Strengths and Weaknesses
Result analysis has exposed strengths and weaknesses of the BigEar system. The strengths can be related to the Minimum cost, Wireless, Distribution and Modularity requirements discussed in Section .
7.2.1 Strengths
Minimum cost
Low costs can be seen from different points of view:
• The architecture is based on Wixel prototyping boards, whose cost is around 20 $ each; the cost of the speech acquisition circuit is under 10 $ per board¹;
• The BigEar Simulator can be used to perform an a-priori analysis in order to identify the best parameters (no. of sensors, position of sensors, no. of channels, software-configurable parameters) for a specific use case, minimizing production and installation costs.
Wireless
The use of wireless transmission between nodes gives the system flexibility and fast reconfigurability, while reducing the effort of house adaptation by eliminating the need for signal or power cables.
Distributed
The distributed approach overcomes issues related to temporary sensor failures and, in particular, to local signal problems like saturation.
Data Intelligence
The ubiquitous approach enables Data Intelligence mechanisms in integration with the BRIDGe architecture, e.g. performing a coarse-grain localization in order to add informative content that can disambiguate context-free vocal commands ("Turn off the light" or, better, "Turn off this light" could be integrated with localization information in order to determine which light has to be switched off).
¹Quotations: second quarter of 2015
Modular
The whole architecture is scalable and can easily be reconfigured by adding or removing sensors from the sensor network.
Responsive
The Software Performance Metrics show that the BigEar Reconstructor algorithm, even in its experimental form, can perform real-time speech reconstruction, so the responsiveness of the system is bound to the speed of the Automatic Speech Recognition algorithms.
7.2.2 Weaknesses
Hardware-related weaknesses are the electrical noise generated by the Wixel antenna and injected through the Wixel VIN pin, despite the presence of decoupling capacitors, and the lack of mechanisms for preventing long-term sensor clock drift, which can cause time warping and long-term misalignments between the audio streams.
A model-related weakness can be identified in the lack of a backward communication mechanism between the Radio Transmission Model and the N-Buffer Model used in the BigEar Simulator; this is the cause of the big differences between the simulated data and the real-world measures when 0 < TmaxDelay < N · 1/Fs.
From the reconstruction point of view, an aspect to take into account is the generation of superposition artifacts when dealing with signals having big differences in spectral content (e.g. signals coming from sensors far away from each other); moreover, the Audio Inpainting method for healing the reconstructed signal (by heuristic filling of sequences of empty samples) needs to be optimized for real-time use.
7.3 Future Work
Buffered reconstruction
BigEar Reconstructor is a monolithic algorithm implemented for experimental purposes; at present, the algorithm operates off-line on a CSV file that contains previously captured audio data. It needs to be reimplemented in order to perform on-line reconstruction.
Reconstructed Signal Processing
To neutralize the effects of superposition artifacts, Filtering or Far-field Speech Processing methods can be integrated into the BigEar Reconstructor algorithm; moreover, periodic training stages can be adopted to identify the physical and spectral characteristics of the ambient noise.
Network Interaction Models and Protocols
The Network Interaction Model could be extended to network protocols other than the pure ALOHA family, in order to explore how the Reconstructed Signal Metrics are influenced by different Network Interactions. In particular, different Network Protocols might help reduce superposition artifacts; furthermore, the Network Protocol could include synchronization mechanisms to prevent sensor clock drift.
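As a point of comparison for such an exploration, the classical throughput formulas of the ALOHA family already quantify how much channel contention a different access scheme could remove. The sketch below evaluates the textbook expressions for pure and slotted ALOHA (S = G·e^(−2G) and S = G·e^(−G), with G the offered load); it is a general result, not a BigEar measurement.

```python
import math

def pure_aloha_throughput(G):
    """Expected fraction of successful transmissions at offered load G (pure ALOHA)."""
    return G * math.exp(-2 * G)

def slotted_aloha_throughput(G):
    """Expected fraction of successful transmissions at offered load G (slotted ALOHA)."""
    return G * math.exp(-G)

# classic maxima: ~18.4% for pure ALOHA (G = 0.5), ~36.8% for slotted ALOHA (G = 1)
print(round(pure_aloha_throughput(0.5), 3), round(slotted_aloha_throughput(1.0), 3))
```

Moving from the pure ALOHA scheme used by the BigEar sensors to a slotted scheme would thus roughly double the theoretical peak throughput, at the cost of requiring exactly the slot synchronization mechanism discussed above.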
I2S Interface Library
The Texas Instruments CC2511F SoC provides an industry-standard I2S interface, which can be used to transfer digital audio samples between the CC2510Fx/CC2511Fx and an external audio device; it would therefore be possible to replace the analog signal conditioning stage with a single-chip digital microphone. At the time of writing, the Wixel SDK does not expose any library in this direction.
BRIDGe Integration
Once the BigEar System is implemented for real-time operation, the interface that allows BigEar to feed Automatic Speech Recognition applications needs to be implemented.
Appendix A. TI CC2511 MCU Key features
• Radio
– High-performance RF transceiver based on the market-leading CC2500
– Excellent receiver selectivity and blocking performance
– High sensitivity (-103 dBm at 2.4 kBaud)
– Programmable data rate up to 500 kBaud
– Programmable output power up to 1 dBm for all supported frequencies
– Frequency range: 2400 - 2483.5 MHz
– Digital RSSI / LQI support
• Current Consumption
– Low current consumption (RX: 17.1 mA @ 2.4 kBaud; TX: 16 mA @ -6 dBm output power)
– 0.3 µA in PM3 (the operating mode with the lowest power consumption)
• MCU, Memory, and Peripherals
– High performance and low power 8051 microcontroller core.
– 8/16/32 kB in-system programmable flash, and 1/2/4 kB RAM
– Full-Speed USB Controller with 1 kB USB FIFO
– I2S interface
– 7 to 12 bit ADC with up to eight inputs
– 128-bit AES security coprocessor
– Powerful DMA functionality
– Two USARTs
– 16-bit timer with DSM mode
– Three 8-bit timers
– Hardware debug support
– 19 GPIO pins
• General
– Wide supply voltage range (2.0V - 3.6V)
– Green package: RoHS compliant and no antimony or bromine, 6x6mm QFN 36
A.2 CC2511F32 Radio Packet format
In TX, the data are XOR-ed with a 9-bit pseudo-random (PN9) sequence before being transmitted, as shown in Figure 50. At the receiver end, the data are XOR-ed with the same pseudo-random sequence. This way, the whitening is reversed, and the original data appear in the receiver. The PN9 sequence is reset to all 1's.
Data whitening can only be used when PKTCTRL0.CC2400_EN=0 (default).
The first TX_DATA byte is shifted in before doing the XOR operation, providing the first TX_OUT[7:0] byte. The second TX_DATA byte is then shifted in before doing the XOR operation, providing the second TX_OUT[7:0] byte.
Figure 50: Data Whitening in TX Mode
13.8.2 Packet Format
The format of the data packet can be configured and consists of the following items:
• Preamble
• Synchronization word
• Length byte or constant programmable packet length
• Optional Address byte
• Payload
• Optional 2 byte CRC
Preamble bits (1010...1010) | Sync word | Length field | Address field | Data field | CRC-16
8 x n bits | 16/32 bits | 8 bits | 8 bits | 8 x n bits | 16 bits

The preamble and sync word are inserted automatically in TX, and processed and removed in RX. The length and address fields are optional user-provided fields, processed in TX, processed but not removed in RX. The data field is unprocessed user data (apart from FEC and/or whitening). The CRC-16 calculation is optional, the fields after the sync word can optionally be FEC encoded/decoded, and data whitening is optional.
Figure 51: Packet Format
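From the packet format above, the on-air duration of a BigEar packet can be estimated by summing the field lengths and dividing by the data rate. The sketch below assumes, for illustration, a 4-byte preamble, a 32-bit sync word, and 1 bit per baud; the actual preamble/sync lengths are configurable on the CC2511 and are not taken from the BigEar configuration.

```python
def onair_time_ms(payload_bytes, baud, preamble_bytes=4, sync_bytes=4,
                  length_bytes=1, address_bytes=1, crc_bytes=2):
    """On-air duration (ms) of one packet, assuming 1 bit per baud (2-FSK)."""
    total_bits = 8 * (preamble_bytes + sync_bytes + length_bytes
                      + address_bytes + payload_bytes + crc_bytes)
    return 1000.0 * total_bits / baud

# a 26-byte BigEar payload (6 bytes of overhead + 20 u-Law samples) at 250 kBaud
print(onair_time_ms(26, 250000))  # 1.216 ms
```

Such an estimate is what bounds the Tbusy parameter used by the Radio Transmission Model in the simulator.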
Appendix B. BigEar simulator MATLAB implementation
B.1 Radio Transmission Model
function [ call_time, call_valid_flag, call_differentials ] = generatecalls( n_samples, Fs, TRANSMITTERS, Tbusy, TmaxDelay )

%% generate a number of calls that certainly covers the duration of the sampled set
duration = 1000 * n_samples/Fs; % in ms, because all other time constants are in ms
N_CALLS = ceil(duration/Tbusy);

%% generate random time instants for calls
call_differentials = rand(N_CALLS,TRANSMITTERS)*TmaxDelay+Tbusy;
call_time = cumsum(call_differentials,1);

%% truncate calls that exceed the duration of sampled data.
% in order to avoid exceeding the limits of audio data, truncate the call_time
% vector at the position where the first transmitter reaches the duration
positions = zeros(1, TRANSMITTERS);
for i=1:TRANSMITTERS
    positions(i) = find(call_time(:,i)<duration, 1, 'last');
end
call_time = call_time(1:min(positions), 1:end);
call_differentials = call_differentials(1:min(positions), 1:end);

%% generate a matrix that, for each radio transmission call_time(x,y),
% indicates the colliding packets. This will be used in the buffer model
% to keep internal buffers updated.

call_valid_flag = true(size(call_time,1), TRANSMITTERS);
temp1 = false(TRANSMITTERS*size(call_time,1),1);
temp2 = temp1;

for i=1:size(call_time,1)
    for j=1:TRANSMITTERS
        if call_valid_flag(i,j)
            temp1 = false(TRANSMITTERS*size(call_time,1),1);
            temp2 = temp1;

            % conditions of collision
            lBound = call_time(i, j) - Tbusy; % - Tguard
            uBound = call_time(i, j) + Tbusy; % + Tguard

            % flag timings that generate colliding situations
            temp1(call_time >= lBound) = 1;
            temp2(call_time <= uBound) = 1;
            temp1 = reshape(temp1, size(call_time,1), TRANSMITTERS);
            temp2 = reshape(temp2, size(call_time,1), TRANSMITTERS);

            temp = temp1 & temp2;

            % if sum(temp(:)==1)>1 then the element is colliding. This is
            % because "find" does not exclude the tested element itself, so
            % there will be at least one "1" in the matrix (the element is
            % wrongly denoted as colliding with itself)
            if sum(temp(:)==1) <= 1
                temp(i,j) = 0;
            end
            % update valid packets matrix
            call_valid_flag = call_valid_flag & ~temp;
        end
    end
end
end
B.2 N-Buffer Internal Model
function [ output, positions ] = generatetransmitteroutputs( call_time, call_valid_flag, samples, Fs, FRAME_SIZE, FRAMES )
%GENERATETRANSMITTEROUTPUTS generates output packets for transmitters
%   Simulate buffer models

TRANSMITTERS = size(call_time, 2);
%% Initialize buffer and variables
% each element represents the number of the frame used by the x-th wixel to store samples
irs_idx = ones(TRANSMITTERS, 1);
% each element represents the number of the frame used by the x-th wixel to push samples for transmission
main_idx = ones(TRANSMITTERS, 1);
% buffer(x,y): index of the first sample pointed in the y-th frame by the x-th transmitter
buffer = zeros(TRANSMITTERS, FRAMES);
buffer(:,1) = ones(TRANSMITTERS,1);
% TODO: rename this variable
last = ones(TRANSMITTERS, 1);

% output matrices
output = zeros(size(samples,1), TRANSMITTERS);
positions = zeros(size(samples,1), TRANSMITTERS);
% support indices
out_idx = ones(TRANSMITTERS, 1);

% FRAME_SIZE duration (in milliseconds)
Tframe = number2ms(FRAME_SIZE, Fs);

%% Scan array of calls
for transmitter = 1:TRANSMITTERS
    for iCall = 1:size(call_time,1)
        call = call_time(iCall, transmitter);
        lastTime = number2ms(last(transmitter), Fs);

        Dt = call - lastTime;

        % generate output values (output samples + one timestamp for each sample)
        if (irs_idx(transmitter) ~= main_idx(transmitter)) || (Dt > Tframe)
            start_idx = buffer(transmitter, main_idx(transmitter));
            end_idx = start_idx + FRAME_SIZE - 1;
            out_start = out_idx(transmitter);
            out_end = out_idx(transmitter) + FRAME_SIZE - 1;

            % output only if the packet is valid. Otherwise, samples are discarded.
            if call_valid_flag(iCall, transmitter)
                output(out_start:out_end, transmitter) = samples(start_idx:end_idx, transmitter);
                out_idx(transmitter) = out_idx(transmitter) + FRAME_SIZE;
                positions(out_start:out_end, transmitter) = (start_idx:end_idx)';
            end
            % regardless of the validity of the packet, the buffer has been flushed
            main_idx(transmitter) = mod(main_idx(transmitter), FRAMES)+1;
        end

        framedistance = fix(Dt/Tframe);

        if framedistance >= 1
            % update 'last' index
            last(transmitter) = buffer(transmitter, irs_idx(transmitter)) + FRAME_SIZE;
            % update IRS index and prepare new buffer
            temp = buffer(transmitter, irs_idx(transmitter));
            irs_idx(transmitter) = mod(irs_idx(transmitter), FRAMES)+1;
            buffer(transmitter, irs_idx(transmitter)) = temp + FRAME_SIZE;
        end

        for i = 2 : min(framedistance-1, FRAMES-1)
            % move buffers trying to cover the distance
            temp = buffer(transmitter, irs_idx(transmitter));
            if (mod(irs_idx(transmitter), FRAMES)+1) ~= main_idx(transmitter)
                % there is a free buffer, so use it to store data
                irs_idx(transmitter) = mod(irs_idx(transmitter), FRAMES)+1;
            end % otherwise reuse the same frame, moving it (intermediate data are lost)
            buffer(transmitter, irs_idx(transmitter)) = temp + FRAME_SIZE;
        end

        % When the last call is earlier than FRAMES*FRAME_SIZE, the last
        % buffer has to be moved to the right position (the last element of
        % the frame corresponds to the last sampling)
        if Dt/Tframe > FRAMES
            if (mod(irs_idx(transmitter), FRAMES)+1) ~= main_idx(transmitter)
                % there is a free buffer, so use it to store data
                irs_idx(transmitter) = mod(irs_idx(transmitter), FRAMES)+1;
            end % otherwise reuse the same frame, moving it (intermediate data are lost)
            buffer(transmitter, irs_idx(transmitter)) = ms2number(call, Fs) - FRAME_SIZE + 1;
        end
    end
end
end
B.3 Signal Reconstruction block
function [ reconstructed, totalCoverage ] = reconstructsignal( audioPackets, positions, Model, Room, bypass )
%RECONSTRUCTSIGNAL Reconstruct signal
%   audioPackets and positions are two M-by-N matrices. Each column of
%   audioPackets contains audio samples transmitted by the N-th sensor; each
%   M-th element in positions represents the time location of the corresponding
%   M-th audio sample in audioPackets

% Inputs:
%   - audioPackets = audio frames transmitted by sensors
%   - positions    = positions of the audio samples
%   - bypass       = flag that allows to bypass the reconstruction of
%                    the signal by means of cross-correlation analysis.
bypassCorrelation = 0;
if nargin==5
    bypassCorrelation = bypass;
end

%% map samples coverage of each transmitter output
coverage = zeros(size(positions), 'single');
for i=1:size(audioPackets,2)
    coverage(positions(positions(:,i)>0,i),i) = 1;
end

%% unbias signals
bias = mean(audioPackets, 2);
if size(audioPackets, 2) == 1
    bias = mean(audioPackets);
end
audioPackets = bsxfun(@minus, audioPackets, bias);

%% normalize energy
maxAmplitude = max(abs(audioPackets),[],1);
audioPackets = bsxfun(@rdivide, audioPackets, maxAmplitude);

%% generate partials
partials = zeros(size(audioPackets));

for i = 1 : size(audioPackets,2)
    partials(positions(positions(:,i)>0,i),i) = audioPackets(1:size(positions(positions(:,i)>0)),i);
    % try to fill empty positions with the mean of partials, since the signal could be non zero-mean
    %avgSignal = mean(audioPackets(1:size(positions(positions(:,i)>0)),i));
    %partials(coverage(:,i)==0,i) = avgSignal;
end

if bypassCorrelation
    % simple sum of signals:
    reconstructed = sum(partials, 2);
    totalCoverage = sum(coverage, 2);
    % scale the sum by the number of partials
    divider = totalCoverage;
    divider(divider==0) = 1;
    reconstructed = reconstructed./divider;

else
    % use correlation to find delays between audio streams

    %% establish order for xcorr analysis
    % the order for testing cross-correlation is given by the number of
    % samples in a stream: the bigger the number of samples in the stream,
    % the better xcorr will work.
    numberOfSamples = sum(coverage, 1);
    [~, order] = sort(numberOfSamples, 'descend');

    % limits for xcorr = maximum delay given by room dimensions + 2 seconds of margin
    timeMargin = 2;
    Ds = norm(Room.Dim);
    Dt = timeMargin + Ds / SoundSpeed(Room.Temp);
    maxLag = ceil(Dt * Model.Fs);

    %% analyze couples of streams
    reconstructed = partials(:, order(1));
    partialCoverage = coverage(:, order(1));
    for i = 2 : size(numberOfSamples,2)
        temp2 = partials(:, order(i));
        [acor,lag] = xcorr(reconstructed, temp2, maxLag);
        [~,I] = max(abs(acor));
        delay = lag(I);

        if delay < 0
            % temp2 is delayed --> anticipate it
            display(['Sensor ' num2str(order(i)) ' is delayed by ' num2str(abs(delay)) ' samples --> ANTICIPATE IT'])
            temp2 = [temp2(abs(delay)+1:end); zeros(abs(delay),1)];
            partials(:, order(i)) = temp2;
            coverage(:, order(i)) = [coverage(abs(delay)+1:end, order(i)); zeros(abs(delay),1)];
        elseif delay > 0
            % temp2 is anticipated --> delay it
            display(['Sensor ' num2str(order(i)) ' is anticipated by ' num2str(delay) ' samples --> DELAY IT'])
            temp2 = [zeros(delay,1); temp2(1:end-delay)];
            partials(:, order(i)) = temp2;
            coverage(:, order(i)) = [zeros(delay,1); coverage(1:end-delay, order(i))];
        end
        % sum signals
        reconstructed = reconstructed + temp2;
        % scale signals where the samples are summed
        partialCoverage = partialCoverage + coverage(:, order(i));
        divider = partialCoverage;
        divider(divider==0) = 1;
        reconstructed = reconstructed./divider;
        % once samples are scaled, restore the original totalCoverage
        partialCoverage(partialCoverage > 0) = 1;
    end
    % store in totalCoverage the distribution of sensor data coverage
    totalCoverage = sum(coverage, 2);
end

%display('Generated reconstructed mono signal.');
end
Appendix C. BigEar Reconstruction MATLAB Script
function [ reconstructed, totalCoverage, stats ] = ...
    reconstructsignal( audioPackets, positions, Fs, bypass, aPolicy, sPolicy )
%RECONSTRUCTSIGNAL Reconstruct signal
%   audioPackets and positions are two M-by-N matrices. Each column of
%   audioPackets contains audio samples transmitted by the N-th sensor; each
%   M-th element in positions represents the time location of the corresponding
%   M-th audio sample in audioPackets

% Inputs:
%   - audioPackets = audio frames transmitted by sensors
%   - positions    = positions of the audio samples
%   - Fs           = sampling frequency of signal
%   - aPolicy      = Delays Analysis policy
%                    1 for Samples Correlation, 2 for Envelopes Corr.
%   - sPolicy      = Superposition policy
%                    1 for Weighted Sum, 2 for Holes Replacement
%   - bypass       = flag that allows to bypass Delays Analysis

coverage = false(size(positions));
sigPwr = zeros(1, size(positions,2));
for i=1:size(audioPackets,2)
    %% map samples coverage of each transmitter output
    coverage(positions(positions(:,i)>0,i),i) = 1;
    %% detrend signals and store signal powers
    audioPackets(positions(:,i)>0,i) = detrend(audioPackets(positions(:,i)>0,i));
    %% store signal powers for finding the best reference signal. Take only
    %% in-band power, and limit analysis to non zero-padded streams
    sigPwr(i) = bandpower(audioPackets(positions(:,i)>0,i), Fs, [300 3000]);
    % normalize power measures (because they depend on the length of each stream)
    sigPwr(i) = sigPwr(i) / length(audioPackets(positions(:,i)>0,i));
end

%% normalize energy
maxAmplitude = max(abs(audioPackets),[],1);
audioPackets = bsxfun(@rdivide, audioPackets, maxAmplitude);

%% generate partials
partials = zeros(size(audioPackets));
for i = 1 : size(audioPackets,2)
    partials(positions(positions(:,i)>0,i),i) = audioPackets(1:size(positions(positions(:,i)>0)),i);
end

delays = zeros(1, size(positions,2));

if bypass
    % simple sum of signals:
    reconstructed = sum(partials, 2);
    totalCoverage = sum(coverage, 2);
    % scale the sum by the number of partials
    divider = totalCoverage;
    divider(divider==0) = 1;
    reconstructed = reconstructed./divider;
else
    display('Performing correlation analysis between signals')
    % use correlation to find delays between audio streams

    %% establish order for xcorr analysis
    % the order for testing cross-correlation is given by signal power.
    [~, order] = sort(sigPwr, 'descend');

    [a,b] = butter(2,0.004);

    %% analyze couples of streams
    display(['Use as reference Sensor ' num2str(order(1))])
    reconstructed = partials(:, order(1));
    reference = reconstructed;
    partialCoverage = coverage(:, order(1));
    for i = 2 : size(sigPwr,2)
        temp2 = partials(:, order(i));
        %% Delays Analysis method
        if aPolicy == 1
            % Samples Crosscorrelation
            [acor,lag] = xcorr(reference, temp2);
        elseif aPolicy == 2
            % Envelopes Crosscorrelation
            % GENERATE ENVELOPES
            rec_up = reconstructed;
            rec_up(rec_up<0) = 0;
            rec_up = filter(a, b, rec_up);
            temp2_up = temp2;
            temp2_up(temp2_up<0) = 0;
            temp2_up = filter(a, b, temp2_up);

            % PERFORM CORRELATION BETWEEN ENVELOPES
            [acor,lag] = xcorr(rec_up, temp2_up);
        else
            error('Error. \naPolicy must be 1 or 2.')
        end
        % evaluate cross-correlation for finding the optimal delay
        [~,I] = max(abs(acor));
        delay = lag(I);
        delays(order(i)) = delay;
        %% shift streams
        if delay < 0
            % temp2 is delayed --> anticipate it
            display(['* Sensor ' num2str(order(i)) ' is delayed by ' num2str(abs(delay)) ' samples --> ANTICIPATE IT'])
            temp2 = [temp2(abs(delay)+1:end); zeros(abs(delay),1)];
            partials(:, order(i)) = temp2;
            coverage(:, order(i)) = [coverage(abs(delay)+1:end, order(i)); zeros(abs(delay),1)];
        elseif delay > 0
            % temp2 is anticipated --> delay it
            display(['* Sensor ' num2str(order(i)) ' is anticipated by ' num2str(delay) ' samples --> DELAY IT'])
            temp2 = [zeros(delay,1); temp2(1:end-delay)];
            partials(:, order(i)) = temp2;
            coverage(:, order(i)) = [zeros(delay,1); coverage(1:end-delay, order(i))];
        end
        %% Superposition policy
        if sPolicy == 1
            % Weighted Sum
            reconstructed = reconstructed + temp2;
            % scale signals where the samples are summed
            partialCoverage = partialCoverage + coverage(:, order(i));
            divider = partialCoverage;
            divider(divider==0) = 1;
            reconstructed = reconstructed./divider;
            % once samples are scaled, restore the original totalCoverage
            partialCoverage(partialCoverage > 0) = 1;
        elseif sPolicy == 2
            % Holes Replacement
            % determine fill mask
            fillMask = (partialCoverage == 0) & coverage(:, order(i));
            % replace holes with signal
            reconstructed = reconstructed + fillMask .* temp2;
            % update coverage map
            partialCoverage = (partialCoverage==1) + fillMask;
            % diagnostic display
            display(['Sensor ' num2str(order(i)) ' has filled ' num2str(sum(fillMask)) ' samples']);
        else
            error('Error. \nsPolicy must be 1 or 2.')
        end
    end
    % store in totalCoverage the distribution of sensor data coverage
    totalCoverage = sum(coverage, 2);
end
%% Calculate Potential Artifacts Ratio
if sPolicy == 1
    % Weighted Sum PAR
    edges = false(size(coverage));
    if size(coverage, 2)>1
        for i=1:size(coverage,2)
            edges(:,i) = [0; diff(coverage(:,i))~=0];
        end
    end
    PotentialArtifacts = sum(sum(edges,2)~=0);
elseif sPolicy == 2
    % Holes Replacement PAR
    PotentialArtifacts = 0;
    a = coverage(:,1)>0;
    for i=1:size(coverage, 2) - 1
        d1 = [false; diff(a)~=0];
        b = coverage(:, i+1).*~coverage(:, i);
        d2 = [false; diff(b)~=0];
        PotentialArtifacts = PotentialArtifacts + sum(d1.*d2);
        a = (coverage(:, 1) + b)>0;
    end
else
    error('Error. \nsPolicy must be 1 or 2.')
end

stats = struct('maxAmplitudes', maxAmplitude, ...
               'signalPower', sigPwr, ...
               'delays', delays, ...
               'PAR', PotentialArtifacts/length(reconstructed));
end
Appendix D. Cross-correlation convergence test
clear all
close all
clc

%% GLOBAL PARAMETERS
% number of experiments for each simulation
experiments = 50;
% delay in seconds to be applied to the second copy of the signal
Tau = 0.1;
% SNR between signal and noise
SNR = 0.01;

%% READ AUDIO FILE
[x, Fs] = audioread('comandi_audio_8000.wav');
% Tau delay expressed in samples
Ds = round(Fs * Tau);

%% FIRST SIMULATION: VARY SIZE OF HOLES

% MAX dimension of holes in seconds
HoleMaxSize_V = 0.0005 : 0.001 : 0.060;
% AVG distance between holes in seconds
AvgTimeDistanceOfHoles = 0.003;

D_measured = zeros(size(HoleMaxSize_V));
r_measured = zeros(size(HoleMaxSize_V));

for j = 1 : length(HoleMaxSize_V)
    for e = 1 : experiments

        Mask = true(size(x));
        MaxSize = HoleMaxSize_V(j) * Fs; % Max size of holes
        NoH = round( ( length(x) / (AvgTimeDistanceOfHoles * Fs) ) - 1 );

        %% generate first holey copy
        CoH = round(length(x) * rand(NoH, 1));
        CoH = sort(CoH);

        SoH = round(MaxSize * rand(NoH, 1));

        for i=1:length(CoH)
            lbound = CoH(i) - round(SoH(i)/2);
            ubound = CoH(i) + round(SoH(i)/2);
            if lbound < 1
                lbound = 1;
            end
            if ubound > length(x)
                ubound = length(x);
            end
            Mask(lbound : ubound) = 0;
        end
        y1 = x .* Mask;

        %% generate second holey copy
        Mask = true(size(x));
        CoH = round(length(x) * rand(NoH, 1));
        CoH = sort(CoH);

        SoH = round(MaxSize * rand(NoH, 1));

        for i=1:length(CoH)
            lbound = CoH(i) - round(SoH(i)/2);
            ubound = CoH(i) + round(SoH(i)/2);
            if lbound < 1
                lbound = 1;
            end
            if ubound > length(x)
                ubound = length(x);
            end
            Mask(lbound : ubound) = 0;
        end
        y2 = x .* Mask;

        %% delay second signal
        y2d = [zeros(Ds,1); y2(1 : end - Ds)];

        %% noise signals
        noiseAmplitude = max(abs(y1)) * SNR;
        y1n = y1 + noiseAmplitude * randn(size(y1));
        scale = max(abs(x))/max(abs(y1n));
        y1n = y1n .* scale;

        noiseAmplitude = max(abs(y2d)) * SNR;
        y2n = y2d + noiseAmplitude * randn(size(y2d));
        scale = max(abs(x))/max(abs(y2n));
        y2n = y2n .* scale;

        %% perform correlation
        [r, lags] = xcorr(y1n,y2n);
        [~, I] = max(abs(r));
        % save measured data
        D_m = lags(I);

        D_measured(j) = D_measured(j) + D_m / experiments;
        r_measured(j) = r_measured(j) + max(abs(r)) / experiments;
    end
end
D_measured = abs(D_measured);
D_difference = abs(D_measured - Ds);

figure
stem(HoleMaxSize_V.*1000, D_difference);
title('Delay difference D - D_m in function of the size of holes')
xlabel('Maximum size of holes (ms)')
ylabel('Measured difference (s)')
figure
plot(HoleMaxSize_V.*1000, r_measured, 'r');
title('Maximum cross-correlation value R_f_g[n^*] in function of the size of holes')
xlabel('Maximum size of holes (ms)')
ylabel('Maximum cross-correlation value')

%% SECOND SIMULATION: VARY AVERAGE TIME DISTANCE BETWEEN HOLES

HoleMaxSize = 0.012; % MAX dimension of holes in seconds
AvgTimeDistanceOfHoles_V = 0.030 : -0.001 : 0.001;

%% read audio file
[x, Fs] = audioread('comandi_audio_8000.wav');
Ds = round(Fs * Tau);

D_measured = zeros(size(AvgTimeDistanceOfHoles_V));
r_measured = zeros(size(AvgTimeDistanceOfHoles_V));

for j = 1 : length(AvgTimeDistanceOfHoles_V)
    for e = 1 : experiments

        Mask = true(size(x));
        MaxSize = HoleMaxSize * Fs; % Max size of holes
        NoH = round( ( length(x) / (AvgTimeDistanceOfHoles_V(j) * Fs) ) - 1 );

        %% generate first holey copy
        CoH = round(length(x) * rand(NoH, 1));
        CoH = sort(CoH);

        SoH = round(MaxSize * rand(NoH, 1));

        for i=1:length(CoH)
            lbound = CoH(i) - round(SoH(i)/2);
            ubound = CoH(i) + round(SoH(i)/2);
            if lbound < 1
                lbound = 1;
            end
            if ubound > length(x)
                ubound = length(x);
            end
            Mask(lbound : ubound) = 0;
        end
        y1 = x .* Mask;

        %% generate second holey copy
        Mask = true(size(x));
        CoH = round(length(x) * rand(NoH, 1));
        CoH = sort(CoH);

        SoH = round(MaxSize * rand(NoH, 1));

        for i=1:length(CoH)
            lbound = CoH(i) - round(SoH(i)/2);
            ubound = CoH(i) + round(SoH(i)/2);
            if lbound < 1
                lbound = 1;
            end
            if ubound > length(x)
                ubound = length(x);
            end
            Mask(lbound : ubound) = 0;
        end
        y2 = x .* Mask;

        %% delay second signal
        y2d = [zeros(Ds,1); y2(1 : end - Ds)];

        %% noise signals
        noiseAmplitude = max(abs(y1)) * SNR;
        y1n = y1 + noiseAmplitude * randn(size(y1));
        scale = max(abs(x))/max(abs(y1n));
        y1n = y1n .* scale;

        noiseAmplitude = max(abs(y2d)) * SNR;
        y2n = y2d + noiseAmplitude * randn(size(y2d));
        scale = max(abs(x))/max(abs(y2n));
        y2n = y2n .* scale;

        %% perform correlation
        [r, lags] = xcorr(y1n,y2n);
        [~, I] = max(abs(r));
        % save measured data
        D_m = lags(I);

        D_measured(j) = D_measured(j) + D_m / experiments;
        r_measured(j) = r_measured(j) + max(abs(r)) / experiments;
    end
end
D_measured = abs(D_measured);
D_difference = abs(D_measured - Ds);

figure
stem(AvgTimeDistanceOfHoles_V.*1000, D_difference);
title('Delay difference D - D_m in function of the number of holes')
xlabel('Average time distance between holes (ms)')
ylabel('Measured difference (s)')
figure
plot(AvgTimeDistanceOfHoles_V.*1000, r_measured, 'r');
title('Maximum cross-correlation value R_f_g[n^*] in function of the number of holes')
xlabel('Average time distance between holes (ms)')
ylabel('Maximum cross-correlation value')
Appendix E. BigEar Implementation schematics and pictures
E.1 BigEar Audio Capture board
Circuit schematic
[Schematic figure: Wixel pin header (GND, Vin, P0_x, P1_x, P2_x, RST, 3.3v); electret microphone bias network (R_BIAS, V_mic); LT1013N dual op-amp signal conditioning stage (IC1A/IC1B) producing V_adc from V_bias; RST and BOOT pushbuttons; battery supply with 1N4007 diode (D1) and decoupling capacitors.]
Circuit prototype
E.2 BigEar Receiver
BigEar Receiver schematic
[Schematic figure: Wixel pin header (GND, Vin, P0_x, P1_x, P2_x, RST, 3.3v); battery B1; RST and BOOTLDR pushbuttons with resistor R1 (1k); jumper JP1 - jumper on 3-2: active HIGH; jumper on 1-2: active LOW.]
BigEar Receiver prototype
E.3 BigEar Audio Capture application
1 #include <board.h>2 #include <usb.h>3 #include <usb_com.h>4 #include <random.h>5 #include <gpio.h>6 #include <radio_queue_mod.h>7 #include <adc.h>8 #include <stdio.h>9 #include <string.h> //needed by memset
10
11 #define FRAME_SIZE 2012 #define FRAMES_NUMBER 313 #define OVERHEAD_SIZE 614 #define START_TIME 500015 #define BLINK_PERIOD 50016 #define UPPER_LIMIT 204717 #define LOWER_LIMIT 018 #define ZERO_LEVEL 102419 #define SAT_BOUND 020 #define SIG_BOUND 51221
22 int32 CODE param_T3_counter = 30; //30 corresponds to 6.04 kHz (tolerance = 0.005)23 int32 CODE param_address = 2; //ID of the transmitter24 int32 CODE param_Vref3V3 = 0; //when 1, Vref for ADC = 3v3 pin. otherwise, Vref ...
is 1.25v (internal)25
26 BIT signal;27
28 uint8 ADC_CONFIGURATION;29
30 typedef struct 31 uint16 samples[FRAME_SIZE];32 uint32 tss[FRAME_SIZE];33 uint8 t4cnt[FRAME_SIZE];34 uint8 start;35 Frame;36
37 Frame XDATA frames[FRAMES_NUMBER];38 uint8 main_index;39 uint8 irs_index;40 uint8 irs_count;41 uint8 msFraction;42 uint32 timestamp;43 uint32 lastLedSwitch;44
45 BIT isTimeToTX;46 uint32 nextTX;47
48 void pinsInit();49 void appInit();50 void handleYellowLed();51 void determineTXInstant();52 void doProcessing();53 void switchOnTimerAndStartFirstConversion();54 void waitStartup();55 int8 ulawCompression(int sample);56 int ulawExpansion(int8 ulawcode);57
58 void putchar(char c) 59 usbComTxSendByte(c);60 61
62 void main(void)
113
Appendix E. BigEar Implementation schematics and pictures
63 appInit();64 waitStartup();65 //begin sampling66 switchOnTimerAndStartFirstConversion();67 while (1) 68 determineTXInstant();69 doProcessing();70 handleYellowLed();71 //system service routines72 boardService();73 usbComService();74 75 76
77 ISR(T3, 0)78 Frame XDATA *cFrame;79 uint8 nextIndex;80 uint16 result;81 //wait until ADC conversion is completed82 while(!ADCIF);83 if (ADCH & 0x80) 84 // Despite what the datasheet says, the result was negative.85 result = 0;86 else 87 // Note: Despite what the datasheet says, bits 2 and 3 of ADCL are not88 // always zero (they seem to be pretty random). We throw them away89 // here.90 result = ADC >> 4;91 92 //Handle LEDs93 LED_RED(result ≤ LOWER_LIMIT + SAT_BOUND | result ≥ UPPER_LIMIT - SAT_BOUND);94 signal = (result ≤ ZERO_LEVEL - SIG_BOUND | result ≥ ZERO_LEVEL + SIG_BOUND);95 //rearm ADC and start new conversion96 ADCIF = 0;97 ADCCON3 = ADC_CONFIGURATION;98 // save timestamp and fraction of millisecond (1/187) of new conversion99 msFraction = T4CNT;
100 timestamp = getMs();101
102 nextIndex = ((irs_index + 1) ≥ FRAMES_NUMBER) ? 0 : irs_index + 1;103
104 // if I read at least FRAME_SIZE samples and there is a free frame105 if ((irs_count == FRAME_SIZE) && (nextIndex != main_index)) 106 irs_index = nextIndex; //go to next frame107 irs_count = 0; // reset number of samples read108 109 cFrame = &frames[irs_index]; //obtain the frame110
111 cFrame->samples[cFrame->start] = result;112 cFrame->tss[cFrame->start] = timestamp; // save timestamp113 cFrame->t4cnt[cFrame->start] = msFraction; // save ticks (1 tick = 1/187-th ...
of millisecond)114 cFrame->start = ((1 + cFrame->start) ≥ FRAME_SIZE) ? 0 : 1 + cFrame->start; ...
//increment starting point115 // if frame is not full116 if (irs_count < FRAME_SIZE) 117 irs_count++;118 119 120
121 void pinsInit()122 // Disable pull-ups and pull-downs for all pins on Port 0.123 P0INP = 0x3F;124 // Configure pin 0 on Port 0 to be used as an analog pin,125 ADCCFG |= 0b00000011;126 127
128 void appInit()
114
E.3. BigEar Audio Capture application
129 //init vars130 main_index = 0;131 irs_index = 0;132 irs_count = 0;133 memset(frames, 0, sizeof(frames));134 signal = 0;135 //init system136 systemInit();137 usbInit();138 pinsInit();139 randomSeedFromAdc();140 radioQueueInit();141 //init ADC configuration142 if (param_Vref3V3 !=0 )143 //use external 3V3 pin as Vref144 ADC_CONFIGURATION = 0b10110000;145 else146 //use internal REF147 ADC_CONFIGURATION = 0b00110000;148 // 10 = 3V3 (AVDD) ref. **------149 // 00 = 1V25 int. ref. **------150 // 11 = 12 bit resol. --**----151 // 0000 = AIN0 (P0_0) ----****152 153
154 void handleYellowLed()155 if (getMs() - lastLedSwitch > BLINK_PERIOD && signal == 0)156 lastLedSwitch = getMs();157 LED_YELLOW_TOGGLE();158 159 else160 if (signal==1)161 LED_YELLOW(signal);162 163
void determineTXInstant()
{
    if (getMs() > nextTX)
    {
        isTimeToTX = 1;
        nextTX = getMs() + (randomNumber() & param_radio_max_delay);
    }
}
void doProcessing()
{
    uint8 XDATA * txBuf;
    // There is a frame ready for being sent
    if (main_index != irs_index)
    {
        if ((isTimeToTX == 1) && (txBuf = radioQueueTxCurrentPacket()))
        {
            Frame XDATA *cFrame = &frames[main_index];
            uint8 i;
            int out;
            uint32 XDATA * ptrTimeStamp = (uint32 XDATA *) &txBuf[2];
            uint8 XDATA * ptrData = (uint8 XDATA *) &txBuf[7];
            // set source address
            *(txBuf + 1) = param_address;
            // timestamp of the first sample
            *(ptrTimeStamp) = cFrame->tss[cFrame->start];
            // fraction of timestamp of the first sample
            *(txBuf + 6) = cFrame->t4cnt[cFrame->start];
            // samples
            for (i = 0; i < FRAME_SIZE; i++)
            {
                uint8 index = i + cFrame->start;
                if (index >= FRAME_SIZE)
                    index -= FRAME_SIZE;
                // maximize signal in order to exploit u-law compression
                out = (cFrame->samples[index] << 5) - 32767;
                *(ptrData + i) = ulawCompression(out);
            }
            // radio content length
            *txBuf = OVERHEAD_SIZE + FRAME_SIZE;
            // send packet
            radioQueueTxSendPacket();
            isTimeToTX = 0;
            // flush frame
            main_index = (main_index + 1 >= FRAMES_NUMBER) ? 0 : main_index + 1;
        }
    }
}
void switchOnTimerAndStartFirstConversion()
{
    T3CC0 = param_T3_counter;
    // DIV=111:  1:128 prescaler
    // START=1:  Start the timer
    // OVFIM=1:  Enable the overflow interrupt
    // CLR=0:    Disable counter reset
    // MODE=10:  Modulo
    T3CTL = 0b11111010;
    T3IE = 1; // Enable Timer 3 interrupt (IEN1.T3IE = 1)
    EA = 1;   // Globally enable interrupts (IEN0.EA = 1)
    // save timestamp and fraction of millisecond (1/187) of the beginning of the conversion
    msFraction = T4CNT;
    timestamp = getMs();
    // start first AD conversion of P0_0
    ADCIF = 0;
    ADCCON3 = ADC_CONFIGURATION;
}
void waitStartup()
{
    uint32 appStart, lastLedSwitch;
    // wait for START_TIME milliseconds, blinking the red LED
    LED_RED(1);
    LED_YELLOW(1);
    appStart = getMs();
    lastLedSwitch = appStart;
    while (getMs() - appStart < START_TIME)
    {
        if (getMs() - lastLedSwitch > BLINK_PERIOD)
        {
            lastLedSwitch = getMs();
            LED_RED_TOGGLE();
        }
        boardService();
        usbComService();
    }
    LED_RED(0);
    LED_YELLOW(0);
    // save LED switch timer
    lastLedSwitch = getMs();
    // determine instant of next (first) transmission
    nextTX = getMs() + (randomNumber() & param_radio_max_delay);
    isTimeToTX = 0;
}
// Enable mu-law compression (I2SCFG0.ULAWE cleared & I2SCFG0.ULAWC set).
// Write uncompressed data to I2SDATH:I2SDATL registers. Result in I2SDATH.
int8 ulawCompression(int sample)
{
    I2SCFG0 = 0x10;
    I2SDATL = (uint8) sample;
    I2SDATH = (uint8) (sample >> 8);
    return I2SDATH;
}
// Enable mu-law expansion (I2SCFG0.ULAWE set & I2SCFG0.ULAWC cleared).
// Write compressed data to I2SDATH register. Result in I2SDATH:I2SDATL.
int ulawExpansion(int8 ulawcode)
{
    uint16 expandedValue;
    I2SCFG0 = 0x20;
    I2SDATH = ulawcode;
    expandedValue = I2SDATL;
    expandedValue |= I2SDATH << 8;
    return expandedValue;
}
E.4 BigEar Input Calibration
#include <board.h>
#include <usb.h>
#include <usb_com.h>
#include <gpio.h>
#include <adc.h>
#include <stdio.h>
#include <string.h> // needed by memset
#define FRAME_SIZE 255
#define BLINK_PERIOD 500
uint32 lastLedSwitch;
int32 CODE param_Vref3V3 = 0;
int32 CODE param_ADCTolerance = 2;
uint8 ADC_CHANNEL;
void pinsInit();
void doProcessing();
void putchar(char c)
{
    usbComTxSendByte(c);
}
void main(void)
{
    // init system
    systemInit();
    usbInit();
    pinsInit();
    // ADC configuration
    if (param_Vref3V3 != 0)
    {
        // use external 3V3 pin as Vref
        ADC_CHANNEL = 0 | ADC_REFERENCE_VDD | ADC_BITS_12;
    }
    else
    {
        // use internal reference
        ADC_CHANNEL = 0 | ADC_REFERENCE_INTERNAL | ADC_BITS_12;
    }
    // begin
    lastLedSwitch = getMs();
    while (1)
    {
        doProcessing();
        if (getMs() - lastLedSwitch > BLINK_PERIOD)
        {
            lastLedSwitch = getMs();
            LED_RED_TOGGLE();
        }
        boardService();
        usbComService();
    }
}
void pinsInit()
{
    // Disable pull-ups and pull-downs for all pins on Port 0.
    P0INP = 0x3F;
    // Configure pin 0 on Port 0 to be used as an analog pin.
    ADCCFG |= 0b00000011;
}
void printBar(const char * name, uint16 adcResult)
{
    uint8 i, width;

    printf("%-4s %4d mV |", name, adcConvertToMillivolts(adcResult));
    width = adcResult >> 5;
    for (i = 0; i < width; i++) { putchar('#'); }
    for (; i < 63; i++) { putchar(' '); }
    putchar('|');
    putchar('\r');
    putchar('\n');
}
void doProcessing()
{
    if (usbComTxAvailable() >= 128)
    {
        uint8 i;
        int32 val;
        val = 0;
        // calibrate ADC
        adcSetMillivoltCalibration(adcReadVddMillivolts());
        // sample
        for (i = 0; i < FRAME_SIZE; i++)
        {
            val += adcRead(ADC_CHANNEL);
        }
        val /= FRAME_SIZE;
        // yellow LED ON when the signal is correctly biased
        LED_YELLOW((1024 - val) > -param_ADCTolerance && (1024 - val) < param_ADCTolerance);
        printf("\x1B[0;0H"); // VT100 command for "go to 0,0"
        printBar("P0_0", val);
        printf("Measured = %8lu. Difference: %5ld", val, 1024 - val);
    }
}
E.5 BigEar Receiver application
#include <board.h>
#include <usb.h>
#include <usb_com.h>
#include <random.h>
#include <gpio.h>
#include <radio_queue_mod.h>
#include <stdio.h>
#define FRAME_SIZE 20
#define FRAMES_NUMBER 3
#define OVERHEAD_SIZE 6
#define TERMINATION_SIZE 2
#define START_TIME 5000
#define BLINK_PERIOD 500
uint32 lastLedSwitch;
/* *** RADIO PDU STRUCT *** */
typedef struct
{
    uint8 length;
    uint8 source;
    uint32 timestamp;
    uint8 msFract;
    uint8 sample[FRAME_SIZE];
} radioPacket;
void doProcessing();
void waitStartup();
void uint8ToHex(char *buf, uint8 val);
void uint32ToHex(char *buf, uint32 val);
void main(void)
{
    // init system
    systemInit();
    usbInit();
    radioQueueInit();
    waitStartup();
    lastLedSwitch = getMs();
    while (1)
    {
        doProcessing();
        // blink red LED
        if (getMs() - lastLedSwitch > BLINK_PERIOD)
        {
            lastLedSwitch = getMs();
            LED_RED_TOGGLE();
        }
        boardService();
        usbComService();
    }
}
void doProcessing()
{
    radioPacket XDATA * rxPacket;
    uint8 XDATA bufferForUSB[(FRAME_SIZE * 2 + OVERHEAD_SIZE * 2 + TERMINATION_SIZE)];
    if ((rxPacket = (radioPacket XDATA *) radioQueueRxCurrentPacket()))
    {
        uint8 i;
        LED_RED(1);
        // prepare USB buffer
        uint8ToHex(bufferForUSB, rxPacket->source);             // 1) ID of source
        uint32ToHex(bufferForUSB + 1 * 2, rxPacket->timestamp); // 2) timestamp in milliseconds of first sample
        uint8ToHex(bufferForUSB + 5 * 2, rxPacket->msFract);    // 3) fraction of millisecond in (1/187) units

        for (i = 0; i < FRAME_SIZE; i++)
        {
            uint8ToHex(bufferForUSB + OVERHEAD_SIZE * 2 + 2 * i, rxPacket->sample[i]); // 4) data
        }
        *(bufferForUSB + (FRAME_SIZE * 2 + OVERHEAD_SIZE * 2)) = '\r';
        *(bufferForUSB + (FRAME_SIZE * 2 + OVERHEAD_SIZE * 2 + 1)) = '\n';
        // If the USB buffer is free, send the data; otherwise, discard it.
        if (usbComTxAvailable() >= (FRAME_SIZE * 2 + OVERHEAD_SIZE * 2 + TERMINATION_SIZE))
        {
            usbComTxSend(bufferForUSB, (FRAME_SIZE * 2 + OVERHEAD_SIZE * 2 + TERMINATION_SIZE));
        }
        else
        {
            // Yellow LED indicates that a packet is discarded due to temporary USB overload
            LED_YELLOW_TOGGLE();
        }
        radioQueueRxDoneWithPacket();
        LED_RED(0);
    }
}
void waitStartup()
{
    uint32 appStart, lastLedSwitch;
    LED_RED(1);
    LED_YELLOW(1);
    appStart = getMs();
    lastLedSwitch = appStart;
    while (getMs() - appStart < START_TIME)
    {
        if (getMs() - lastLedSwitch > BLINK_PERIOD)
        {
            lastLedSwitch = getMs();
            LED_RED_TOGGLE();
        }
        boardService();
        usbComService();
    }
    LED_RED(0);
    LED_YELLOW(0);
}
// fast conversion from uint8 to hex char[]
void uint8ToHex(char *buf, uint8 val)
{
    char hexDigit[] = {
        '0', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
    };
    *buf = hexDigit[(val >> 4) & 0x0F];
    *(buf + 1) = hexDigit[val & 0x0F];
}

void uint32ToHex(char *buf, uint32 val)
{
    char hexDigit[] = {
        '0', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
    };
    *buf = hexDigit[(val >> 28) & 0x0F];
    *(buf + 1) = hexDigit[(val >> 24) & 0x0F];
    *(buf + 2) = hexDigit[(val >> 20) & 0x0F];
    *(buf + 3) = hexDigit[(val >> 16) & 0x0F];
    *(buf + 4) = hexDigit[(val >> 12) & 0x0F];
    *(buf + 5) = hexDigit[(val >> 8) & 0x0F];
    *(buf + 6) = hexDigit[(val >> 4) & 0x0F];
    *(buf + 7) = hexDigit[val & 0x0F];
}
E.6 BigEar SerialPort application GUI
F BigEar Simulation
F.1 Simulation setups
Common data are reported in Table 6.1. The variables listed in the following subsections are:
• X, Y, Z: Vectors defining the position of each sensor in the room.
• CH: Vector defining the channel number of each sensor.
• NAMES: Vector defining sensor names.
5 transmitters on 2 channels
X = [2 1 2 8 6]'/3;
Y = [2 8 11 3 7]'/4;
Z = 2.75*ones(5,1);
CH = [1 1 1 2 2]';
NAMES = ['S1'; 'S2'; 'S3'; 'S4'; 'S5'];
3 transmitters on 1 channel
X = [2 1 2]'/3;
Y = [2 8 11]'/4;
Z = 2.75*ones(3,1);
CH = [1 1 1]';
NAMES = ['S1'; 'S2'; 'S3'];
3 transmitters on 2 channels
X = [2 1 2]'/3;
Y = [2 8 11]'/4;
Z = 2.75*ones(3,1);
CH = [1 2 1]';
NAMES = ['S1'; 'S2'; 'S3'];
4 transmitters on 1 channel
X = [2 1 2 8]'/3;
Y = [2 8 11 3]'/4;
Z = 2.75*ones(4,1);
CH = [1 1 1 1]';
NAMES = ['S1'; 'S2'; 'S3'; 'S4'];
4 transmitters on 2 channels
X = [2 1 2 8]'/3;
Y = [2 8 11 3]'/4;
Z = 2.75*ones(4,1);
CH = [1 2 1 2]';
NAMES = ['S1'; 'S2'; 'S3'; 'S4'];
2 transmitters on 1 channel
X = [2 1]'/3;
Y = [2 8]'/4;
Z = 2.75*ones(2,1);
CH = [1 1]';
NAMES = ['S1'; 'S2'];
1 transmitter
X = [2]'/3;
Y = [2]'/4;
Z = 2.75*ones(1,1);
CH = [1]';
NAMES = ['S1'];
G.2 Linear Technology LT1013 Dual Precision Op Amp
LT1013/LT1014
10134fd
Typical Application

Description

Quad Precision Op Amp (LT1014)
Dual Precision Op Amp (LT1013)
The LT®1014 is the first precision quad operational amplifier which directly upgrades designs in the industry-standard 14-pin DIP LM324/LM348/OP-11/4156 pin configuration. It is no longer necessary to compromise specifications, while saving board space and cost, as compared to single operational amplifiers.
The LT1014’s low offset voltage of 50µV, drift of 0.3µV/°C, offset current of 0.15nA, gain of 8 million, common mode rejection of 117dB and power supply rejection of 120dB qualify it as four truly precision operational amplifiers. Particularly important is the low offset voltage, since no offset null terminals are provided in the quad configuration. Although supply current is only 350µA per amplifier, a new output stage design sources and sinks in excess of 20mA of load current, while retaining high voltage gain.
Similarly, the LT1013 is the first precision dual op amp in the 8-pin industry standard configuration, upgrading the performance of such popular devices as the MC1458/MC1558, LM158 and OP-221. The LT1013's specifications are similar to (even somewhat better than) the LT1014's.
Both the LT1013 and LT1014 can be operated off a single 5V power supply: input common mode range includes ground; the output can also swing to within a few millivolts of ground. Crossover distortion, so apparent on previous single-supply designs, is eliminated. A full set of specifications is provided with ±15V and single 5V supplies.
Features

• Single Supply Operation: Input Voltage Range Extends to Ground; Output Swings to Ground While Sinking Current
• Pin Compatible to 1458 and 324 with Precision Specs
• Guaranteed Offset Voltage: 150µV Max
• Guaranteed Low Drift: 2µV/°C Max
• Guaranteed Offset Current: 0.8nA Max
• Guaranteed High Gain: 1.5 Million Min at 5mA Load Current; 0.8 Million Min at 17mA Load Current
• Guaranteed Low Supply Current: 500µA Max
• Low Voltage Noise, 0.1Hz to 10Hz: 0.55µVP-P
• Low Current Noise (Better than OP-07): 0.07pA/√Hz

Applications

• Battery-Powered Precision Instrumentation: Strain Gauge Signal Conditioners, Thermocouple Amplifiers, Instrumentation Amplifiers
• 4mA to 20mA Current Loop Transmitters
• Multiple Limit Threshold Detection
• Active Filters
• Multiple Gain Blocks
[Figure: 3-Channel Thermocouple Thermometer application circuit (LT1014 with a 1.2V LT1004 reference; type K thermocouples, all resistors 1% film; cold junction compensation accurate to ±1°C from 0°C to 60°C) and LT1014 Distribution of Offset Voltage histogram (VS = ±15V, TA = 25°C; 425 LT1014s, 1700 op amps, tested from three runs, J package).]
Electrical Characteristics
TA = 25°C, VS = ±15V, VCM = 0V unless otherwise noted.

                                                         LT1013AM/AC            LT1013C/D/I/M
                                                         LT1014AM/AC            LT1014C/D/I/M
SYMBOL PARAMETER                 CONDITIONS              MIN    TYP    MAX      MIN    TYP    MAX     UNITS
VOS    Input Offset Voltage      LT1013                         40     150             60     300     µV
                                 LT1014                         50     180             60     300     µV
                                 LT1013D/I, LT1014D/I                                  200    800     µV
       Long-Term Input Offset                                   0.4                    0.5            µV/Mo.
       Voltage Stability
IOS    Input Offset Current                                     0.15   0.8             0.2    1.5     nA
IB     Input Bias Current                                       12     20              15     30      nA
en     Input Noise Voltage       0.1Hz to 10Hz                  0.55                   0.55           µVP-P
en     Input Noise Voltage       fO = 10Hz                      24                     24             nV/√Hz
       Density                   fO = 1000Hz                    22                     22             nV/√Hz
in     Input Noise Current       fO = 10Hz                      0.07                   0.07           pA/√Hz
       Density
       Input Resistance          (Note 2)
       – Differential                                    100    400              70    300            MΩ
       – Common Mode                                            5                      4              GΩ
AVOL   Large-Signal Voltage Gain VO = ±10V, RL = 2k      1.5    8.0              1.2   7.0            V/µV
                                 VO = ±10V, RL = 600Ω    0.8    2.5              0.5   2.0            V/µV
       Input Voltage Range                               13.5   13.8             13.5  13.8           V
                                                         –15.0  –15.3            –15.0 –15.3          V
CMRR   Common Mode Rejection     VCM = 13.5V, –15.0V     100    117              97    114            dB
       Ratio
PSRR   Power Supply Rejection    VS = ±2V to ±18V        103    120              100   117            dB
       Ratio
       Channel Separation        VO = ±10V, RL = 2k      123    140              120   137            dB
VOUT   Output Voltage Swing      RL = 2k                 ±13    ±14              ±12.5 ±14            V
       Slew Rate                                         0.2    0.4              0.2   0.4            V/µs
IS     Supply Current            Per Amplifier                  0.35   0.50            0.35   0.55    mA
TA = 25°C, VS+ = 5V, VS– = 0V, VOUT = 1.4V, VCM = 0V unless otherwise noted.

                                                              LT1013AM/AC          LT1013C/D/I/M
                                                              LT1014AM/AC          LT1014C/D/I/M
SYMBOL PARAMETER                CONDITIONS                    MIN   TYP   MAX      MIN   TYP   MAX     UNITS
VOS    Input Offset Voltage     LT1013                              60    250            90    450     µV
                                LT1014                              70    280            90    450     µV
                                LT1013D/I, LT1014D/I                                     250   950     µV
IOS    Input Offset Current                                         0.2   1.3            0.3   2.0     nA
IB     Input Bias Current                                           15    35             18    50      nA
AVOL   Large-Signal Voltage     VO = 5mV to 4V, RL = 500Ω     1.0                  1.0                 V/µV
       Gain
       Input Voltage Range                                    3.5   3.8            3.5   3.8           V
                                                              0     –0.3           0     –0.3          V
VOUT   Output Voltage Swing     Output Low, No Load                 15    25             15    25      mV
                                Output Low, 600Ω to Ground          5     10             5     10      mV
                                Output Low, ISINK = 1mA             220   350            220   350     mV
                                Output High, No Load          4.0   4.4            4.0   4.4           V
                                Output High, 600Ω to Ground   3.4   4.0            3.4   4.0           V
IS     Supply Current           Per Amplifier                       0.31  0.45           0.32  0.50    mA
Typical Performance Characteristics
[Graphs: Offset Voltage Drift with Temperature of Representative Units (VS = ±15V); Warm-Up Drift; Offset Voltage vs Balanced Source Resistance; Common Mode Rejection Ratio vs Frequency; Power Supply Rejection Ratio vs Frequency; 0.1Hz to 10Hz Noise; Noise Spectrum; 10Hz Voltage Noise Distribution; Supply Current vs Temperature; Input Bias Current vs Common Mode Voltage; Input Bias Current vs Temperature; Input Offset Current vs Temperature; Output Saturation vs Sink Current vs Temperature; Large-Signal and Small-Signal Transient Response (VS = ±15V and VS = 5V, 0V); Voltage Gain vs Frequency; Voltage Gain vs Load Resistance; Output Short-Circuit Current vs Time; Gain and Phase vs Frequency; Channel Separation vs Frequency.]
Applications Information
Single Supply Operation
The LT1013/LT1014 are fully specified for single supply operation, i.e., when the negative supply is 0V. Input common mode range includes ground; the output swings within a few millivolts of ground. Single supply operation, however, can create special difficulties, both at the input and at the output. The LT1013/LT1014 have specific circuitry which addresses these problems.
At the input, the driving signal can fall below 0V, inadvertently or on a transient basis. If the input is more than a few hundred millivolts below ground, two distinct problems can occur on previous single supply designs, such as the LM124, LM158, OP-20, OP-21, OP-220, OP-221, OP-420:

a) When the input is more than a diode drop below ground, unlimited current will flow from the substrate (V– terminal) to the input. This can destroy the unit. On the LT1013/LT1014, the 400Ω resistors in series with the input (see Schematic Diagram) protect the devices even when the input is 5V below ground.