
Dipartimento di Informatica, Bioingegneria, Robotica ed Ingegneria dei Sistemi

A Weakly Supervised Framework for Adaptive Signal Representation

by

Alessandra Staglianò

Theses Series DIBRIS-TH-2014-01

DIBRIS, Università di Genova
Via Opera Pia, 13
16145 Genova, Italy
http://www.dibris.unige.it/


Università degli Studi di Genova

Dipartimento di Informatica, Bioingegneria,

Robotica ed Ingegneria dei Sistemi

Dottorato di Ricerca in Informatica

Ph.D. Thesis in Computer Science

A Weakly Supervised Framework for Adaptive Signal Representation

by

Alessandra Staglianò

February, 2014


Dottorato di Ricerca in Informatica
Dipartimento di Informatica, Bioingegneria, Robotica ed Ingegneria dei Sistemi

Università degli Studi di Genova

DIBRIS, Univ. di Genova
Via Opera Pia, 13

I-16145 Genova, Italy
http://www.dibris.unige.it/

Ph.D. Thesis in Computer Science (S.S.D. INF/01)

Submitted by Alessandra Staglianò
DIBRIS, Univ. di Genova

[email protected]

Date of submission: February 2014

Title: A Weakly Supervised Framework for Adaptive Signal Representation

Advisors:

Alessandro Verri
Dipartimento di Informatica, Bioingegneria, Robotica ed Ingegneria dei Sistemi

Università di Genova
[email protected]

Francesca Odone
Dipartimento di Informatica, Bioingegneria, Robotica ed Ingegneria dei Sistemi

Università di Genova
[email protected]

Ext. Reviewers:
Christine de Mol

Department of Mathematics and ECARES
Université Libre de Bruxelles

[email protected]

Domenico Tegolo
Dipartimento di Matematica e Informatica

Università di Palermo
[email protected]


Abstract

The capability of extracting information and knowledge from raw data is a key aspect of intelligence. In the case of artificial systems, recent success stories are based on the availability of a large number of labeled data. Unlike unlabeled data, labeled data are annotated by experts and, typically, do not take into explicit account the specific structural properties of the problem at hand. Inspired by the strategy apparently implemented by biological systems, in this thesis we explore the potential of methods leveraging the use of massive amounts of unlabeled data, often in very high dimension. In the case of image sequences, for example, spatial and temporal correlation provide natural cues for developing methods aiming at finding the most appropriate representations for categorization tasks.

By using an overcomplete dictionary of prototype elementary signals, called atoms, we represent data by sparse, linear combinations of atoms. The coefficients of the linear combinations are then used as features. In the last twenty years these types of sparse representations, mostly derived from fixed dictionaries designed to satisfy certain requirements (e.g. wavelets), proved to be very effective in a number of applications. However, a fixed dictionary may not be able to capture the relevant characteristics of different kinds of data. A different approach is to build adaptive dictionaries by learning them directly from data, a method known as dictionary learning, which has been shown to yield adaptive representations that may perform better than fixed ones.

In this thesis we present a dictionary learning paradigm based on the optimization of a cost functional consisting of a data term and sparsity-inducing penalties. We have used sparse codes for applications with no labels, proving that they are very powerful data descriptors.

We have applied dictionary learning in three different scenarios. In the biomedical domain, we have developed a method to segment tissues in Dynamic Contrast Enhanced Magnetic Resonance Imaging. After the injection of a contrast agent, we consider the behavior of each single voxel and learn dictionaries of curves describing the different tissues. In time-varying images from a video camera, we exploited the robustness of the obtained representations to build a background modeling system that is adaptive and completely unsupervised. The method can be the basis for a fully adaptive surveillance system. Finally, we have applied the scheme to the automatic recognition of emotions from body gestures. We learned dictionaries of emotion-related features that helped in the discrimination of basic emotions. This application can be very useful in the context of serious game development.


To Gianna and Giorgio


Hey man, slow down.

(Radiohead)


Acknowledgements

I wish to thank my advisors, prof. Francesca Odone and prof. Alessandro Verri, for their support and guidance. Thanks to Matteo Santoro and Curzio Basso, who introduced me to the world of Dictionary Learning, and to Nicoletta Noceti for all the help she gave me. Thanks to prof. Mario Bertero, prof. Massimiliano Pontil, and prof. Antonio Camurri for giving me the chance to work on new topics and in new environments. Thanks to all the SlipGuru and InfoMusLab teams. A special thanks goes to my PhD siblings: Gabriele Chiusano, Giovanni Fusco, Salvatore Masecchia, and Luca Zini. Finally I reached you! Last but not least, thanks to my family and Ale, for all their support and patience.


Table of Contents

List of Figures 5

List of Tables 9

Chapter 1 Introduction 11

1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.2 Contributions of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

Chapter 2 Sparse Coding and Dictionary Learning 19

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.2.1 Fixed and adaptive dictionaries . . . . . . . . . . . . . . . . . . . . . . . 20

2.2.2 Dictionary learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.2.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.3 Sparse representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.4 Learning a dictionary from data . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.4.1 K-SVD algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.4.2 ℓ1-DL algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.4.3 Minimization via proximal methods algorithm . . . . . . . . . . . . . . 28

2.5 A case study: denoising of medical images . . . . . . . . . . . . . . . . . . . 32

2.5.1 Experimental assessment . . . . . . . . . . . . . . . . . . . . . . . . . . 33


2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

Chapter 3 Dynamic Contrast Enhanced MRI Analysis 41

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.3 The proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.3.1 Motion compensation . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.3.2 Learning the representation . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.3.3 Tissue segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.4.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.4.2 Real data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Chapter 4 Background Modeling 61

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.3 The proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.3.1 Dictionary learning for background modeling . . . . . . . . . . . . . . . 65

4.3.2 The BMTDL method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.3.3 Run time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.3.4 Foreground detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.3.5 Computational costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.4 The method at work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.4.1 Uniform illumination changes . . . . . . . . . . . . . . . . . . . . . . . 71

4.4.2 Background stable changes and local shadows . . . . . . . . . . . . . . . 72

4.5 Method assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.5.1 The SABS dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76


4.5.2 The Change Detection dataset . . . . . . . . . . . . . . . . . . . . . . . 77

4.6 Experiments across long time observations . . . . . . . . . . . . . . . . . . . . . 79

4.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

Chapter 5 Automatic Emotion Recognition 87

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.3 The proposed method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5.3.1 Movement features for emotion recognition . . . . . . . . . . . . . . . . 89

5.3.2 Data representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

5.3.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.4 Method assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.4.1 Data validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.4.2 Classification results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5.5 Application to the design of serious games . . . . . . . . . . . . . . . . . . . . . 106

5.5.1 Graphical user interface . . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.5.2 Body emotion game demonstration . . . . . . . . . . . . . . . . . . . . 107

5.5.3 Emotional charades . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

Chapter 6 Conclusions 109

Bibliography 111


List of Figures

1.1 Examples of dynamic data we process in this thesis. Top row: DCE-MRI kidney data; middle row: video streams from a fixed camera; bottom row: tracking of points on the body of a person . . . 14

2.1 On the left, a fixed DCT dictionary. On the right, an adaptive dictionary obtained with the Lee method [LBRN06]. . . . 21

2.2 Examples of denoising results on two MR images: on the top row a slice from the Brainweb dataset, on the bottom row the image of a wrist. From left to right, the first column shows the original image, the second the image with Gaussian noise (µ = 0, σ = 10), the third column the denoised image (with ℓ1-DL top and K-SVD bottom), and the fourth the difference between noisy and denoised image. . . . 34

2.3 Examples of dictionaries learned by (a) K-SVD and (b) ℓ1-DL, compared with (c) the fixed DCT dictionary. . . . 35

2.4 Comparison of PSNR achieved by K-SVD (initialized with DCT or randomly) and by DCT, for (a) varying sizes of the patches and (b) varying levels of sparsity. . . . 36

2.5 The noise-free images used to train a dictionary based on the anatomical region of the kidneys. . . . 38

2.6 Denoising performance of K-SVD (top) and ℓ1-DL (bottom) for varying levels of noise. We show the PSNR achieved by training the dictionary on the same image to be denoised, and on a different training set. . . . 39

2.7 Denoising on an MRI of kidneys. (a) Original image, (b) image with noise, (c) image denoised with K-SVD trained on the same noisy image, (d) image denoised after training on a different set of noise-free images. Results with ℓ1-DL were qualitatively similar. . . . 40


3.1 Pipeline of the proposed DCE-MRI analysis process. . . . 43

3.2 An example of motion compensation. (a): image at time t1. (b): image at time t2, in which a wide movement of the right kidney in the ROI is highlighted. (c): the computed displacement field in the ROI, displayed at a larger scale. (d): the image at time t2 after motion compensation. . . . 44

3.3 Synthetically generated enhancement curves. (a): templates of the enhancement curves corresponding to the five different types of simulated tissue. (b): the five atoms most used in the reconstruction of synthetic data corrupted by noise. (c): the five principal components computed through principal component analysis. The vertical axes have been rescaled manually. . . . 48

3.4 Percentage segmentation accuracy of our method on synthetic data as a function of the percentage of non-null entries in the code matrix U. The accuracy, computed as in Equation (3.2), has been averaged over ten runs obtained by adding zero-mean Gaussian noise with σ = 20 and for K = 70. The grey band shows the standard deviation. . . . 49

3.5 Denoising and signal reconstruction on synthetic data. The dashed curve in light grey is the noise-corrupted version of the template curve depicted in black. The dashed-dotted and dashed grey curves are the reconstructions obtained by our method and by the method described in the work by Agner et al. [AXF+09], respectively. . . . 50

3.6 Qualitative assessment of kidney segmentation. The black and white curves denote the manual segmentation and the segmentation obtained by our method on the left, (a), and right, (b), kidney respectively. Exact superposition is shown in grey. . . . 52

3.7 Qualitative assessment of parenchyma segmentation. (a): manual segmentation. (b): segmentation obtained by our method. . . . 53

3.8 Renal dysfunction dataset dictionary. Histogram of atom usage and plot of the two atoms most used in the reconstruction of the enhancement curves of the clusters corresponding to the parenchymal, (a) and (b), and the conductive duct tissue, (c) and (d), respectively. . . . 54

3.9 Bland-Altman plots of the SRF between our method and the first expert, (a), our method and the second expert, (b), and the first and second expert, (c). In each plot the continuous line is the average of all the differences, whilst the two dashed lines delimit the confidence interval, computed as the average ±1.96 × σ, with σ the corresponding standard deviation. . . . 55


3.10 Qualitative comparison of the segmentation obtained by means of our method, (a), and by the method by Agner [AXF+09], (b). Vessels and synovial volume are in white and black respectively. . . . 57

3.11 JIA dataset dictionary. The black and grey atoms, mostly used in the reconstruction of the enhancement curves in the clusters corresponding to synovial volume and vessel tissue, resemble the expected behavior of the contrast uptake as a function of the time steps. . . . 58

4.1 The structure of the proposed system. . . . 67

4.2 The three steps needed to obtain a pixel-precision binary map from the patch-precision map. Left: the coarse patch-precision binary map. Center: the absolute difference between the current patch and the most similar background atom. Right: the pixel-precision binary map obtained by thresholding. . . . 68

4.3 The BMTDL method applied to a simple outdoor sequence from the Change Detection dataset. The changes are detected correctly and the method does not detect any false positives due to illumination changes. . . . 72

4.4 An analysis of the coefficients used for the reconstruction of five image patches, under different levels of illumination and using a fixed dictionary. Above: three frames corresponding to different illumination conditions; the 5 patches we consider are marked in red. Below: bar plots of the u vectors for the 5 patches under the three different illuminations show consistent variations in the coefficients. . . . 73

4.5 The effect of illumination changing slowly in the scene, because of a blade of sunlight cast on the floor. Left: a meaningful set of atoms of the dictionary associated with an unstable image patch, accumulated at the end of the phenomenon (to be read column-wise). Right: three samples of the scene as the illumination change occurs; the patch used in the analysis is marked with the black square. . . . 74

4.6 An example of a stable change to be stored in the background model. On the right we report sample frames and their corresponding patch-based change detection maps: at Frame #100, the initial empty scene; Frame #1096, some people enter the scene; Frame #2153, a door is opened, inducing a stable background change; Frame #3241, the modified empty scene. Black squares mark patches we use as a reference in our discussion. On the left, the three dictionaries corresponding to the highlighted patches; the numbers below refer to the frame index at which the atoms have been included in the models. . . . 75

4.7 The effect of varying the amount of overlap (left) and the patch size (right) on the F-measure obtained on the baseline videos (see text). Sample frames are reported below as a reference. . . . 78


4.8 A comparison of the trends of the average reconstruction error over Sunday (left) and Monday (right) morning. . . . 81

4.9 A zoom in on the reconstruction error in a small time span on a region of interest marked with a red rectangle shows how the peaks correspond to interesting events only. . . . 82

4.10 A zoom in on the reconstruction error to compare the average reconstruction errors during Sunday (left) and Monday (right) morning. Here we notice how, under similar illumination conditions, the Monday morning trend is much more stable and the peaks correspond to actual dynamic events. . . . 83

4.11 Dictionary size after one hour of acquisition on Sunday, at the end of acquisition on Sunday, and at the end of acquisition on Monday. . . . 83

4.12 Percentage of patches in an anomaly state on Sunday and Monday. . . . 84

5.1 3D skeleton and associated labels. . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.2 Kinetic energy-based segmentation of gesture primitives . . . 96

5.3 30-bin cumulative histograms of the Contraction Index of 100 segments per emotion, produced from the recordings of 12 people: anger (top left), sadness (top right), happiness (middle left), fear (middle right), surprise (bottom left), and disgust (bottom right). . . . 98

5.4 An example of the stimuli used for data validation with humans. . . . 100

5.5 RS - Comparison of the classification results obtained with histograms (blue bars), Us (red bars), and Cxs (yellow bars), for the 4-class (left) and 6-class (right) problems. . . . 103

5.6 The accuracy values obtained varying τ and η. The maximum is achieved for τ = 0.9 and η = 0.7. . . . 104

5.7 LOSO - Comparison of the classification results obtained with histograms (blue bars), Us (red bars), and Cxs (yellow bars), for the 4-class (left) and 6-class (right) problems. . . . 105

5.8 Two different screens of the GUI. . . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.9 An example of interaction between two players . . . . . . . . . . . . . . . . . . 108


List of Tables

2.1 On the left, comparison between the PSNR achieved by the three dictionary methods on the different images we chose (σ = 20). On the right, PSNR achieved by the different methods with growing levels of noise. . . . 37

3.1 Segmentation accuracy of the proposed method on synthetic data for different noise levels and dictionary sizes. With the sparsity rate kept fixed at 30%, each entry shows the percentage accuracy - computed as in Equation (3.2) - for additive zero-mean Gaussian noise with σ = 5, 10, 15 and 20 and for the number of atoms K = 7, 14, 28 and 70. . . . 47

3.2 Segmentation accuracy for the kidneys as a whole. Comparison of the Dice metric computed as in Equation (3.3) for our method (top row) and for the technique based on raw data described by Zoellner [ZSR+09]. From left to right the columns report the Dice metric plus or minus its standard deviation for the left (L), the right (R), and both kidneys (L+R). . . . 52

3.3 Quantitative assessment of the correlation between SRF scores. Top row: Pearson coefficient for our method (OM) vs. the first expert, 1st, our method vs. the second expert, 2nd, and the first vs. the second expert. Bottom row: same for the Spearman coefficient. . . . 53

3.4 Comparison of the signal recovery performance and computational time between our method (top row) and two other approaches (bottom rows), on the wrist dataset. From the leftmost to the rightmost column: PSNR average and standard deviation, avgPSNR and σPSNR, and computational time, avgtime and σtime, over the entire dataset. . . . 56

4.1 F-measure on a validation set taken from the "Baseline" group of the Change Detection dataset for different values of the regularization parameter. A higher τ corresponds to a sparser solution. . . . 71


4.2 Overall reconstruction error of the three images in Figure 4.4 with respect to the fixed dictionary. Notice the stability of the reconstruction error. . . . 73

4.3 Comparison (F-measures) on the SABS dataset as in [BHH11]. (1) basic sequences; (2) dynamic background; (3) bootstrap; (4) darkening; (5) light switch; (6) noisy night; (7) camouflage; (8) no camouflage; (9) H.264 compression (40 kbps) . . . 77

4.4 Detailed results of BMTDL over the whole Change Detection dataset . . . . . . 79

4.5 BMTDL compared with the most promising methods (according to http://www.changedetection.net) . . . . . . . . . . . . . . . . . . . . . . . . . 80

4.6 BMTDL compared with the most promising methods, ranking obtained to the 5th and the 3rd figure (see text). . . . 80

5.1 List of all the features and their provenance . . . 95

5.2 Human validation results. First column: label of the emotion. Second column: results obtained with 6 different emotions. Third column: results obtained with 4 different emotions. . . . 100

5.3 Comparison of the early results obtained with data captured with Qualisys and Kinect, represented as 30-bin histograms . . . 101

5.4 Comparison of the accuracy obtained optimizing the functional without or with the C term . . . 102

5.5 RS - summary of the best classification rates . . . . . . . . . . . . . . . . . . . . 102

5.6 LOSO - summary of the best classification rates . . . . . . . . . . . . . . . . . . 103


Chapter 1

Introduction

1.1 Motivations

The amount of data we come in contact with every day is amazing. Modern technologies, media, and storage models allow us to capture, share, and download practically everything we need to know. As a result, we are totally overwhelmed by this cascade of information, which may be precious but also difficult to understand and appreciate. This is true not only in everyday life, but also in data analysis research. It is not easy to move randomly inside the jungle of information composed by your data, and, if you are lucky enough to find a good path, by the time you achieve your goal you will have spent so much time walking on it that I bet you will not simply say "Dr. Livingstone, I presume?"1. Unless you are British, of course.

Looking at the literature [HCF+08, BC11, Loh12], it is clear that the data analysis problem is at present a hot topic. The scientific community is not unanimous in judging the phenomenon: a part thinks that we are gathering too much data already, and that reducing the amount of information would lead to better lives and decisions [She98], while others are more optimistic about the benefits we can get from the data at our disposal [D+00].

A key for managing all the information coming from data is to select only the part relevant to the task that is being tackled. This can be done manually, a costly, slow, and highly subjective procedure, or automatically. The latter approach has many advantages: being automatic, it does not need the intervention of an expert, saving time and money, and it is not biased by a subjective idea of the problem that can interfere with the result. Obviously the knowledge of a domain expert can be very precious, because it carries important information, but, since we are talking about big data, we have to deal with the fact that a manual study could even be unfeasible. The best thing that can be done is to combine a good automatic selection of the information with some knowledge of an expert, leading to a weakly supervised framework, where only a part of

1 Henry Stanley, How I Found Livingstone


the data are provided with labels.

In signal processing it is a common practice to devise a set of atomic signals and use them as a basis to approximate other, possibly more complicated, signals living in the same space. The procedure, in this case, is to fix or choose the basis and then estimate the weights that approximate the signal correctly. The same tools are used in machine learning, but in this case the goal is also to have good generalization power. This means that an input-output relationship learnt from training data should also be effective on new, unknown data. The basis can be derived analytically or, more in general, fixed a priori. In this case the problem is related to the computation of the decomposition of the signal with respect to the dictionary. Examples of representations of this kind are the popular Fourier Transform, the Cosine Transform (for an insight on the topic see, for instance, the book by Bertero and Boccacci [BB98]) and the Wavelets [Mal89]. If we want only a few basis signals to participate in the reconstruction of the new signal, we talk about sparse coding. In this case, a sparsity constraint is added to the problem.

Computer Vision is another field in which data representations are crucial to achieve good results. In this field, methods often rely on fixed dictionaries that are tailored to fit the specific application. Viola and Jones presented rectangular features derived from Haar wavelets [VJ01], which were used to form dictionaries effective for face detection [DDMOV09b]. Other descriptors of local patterns which have been used to build dictionaries are HOG and SIFT [DDMOV09a, ZO11]. Since the seminal work of Olshausen and Field [O+96], there has been a growing interest in biologically inspired representations of data which can be obtained by selecting meaningful elements from overcomplete dictionaries. Biological visual systems initially decompose the input signal coming from the eyes into small oriented parts, and after this step they reconstruct the images at different levels, up to all the details. The cells that perform this initial decomposition behave very much like Gabor filters. This observation has inspired research in this field: in order to improve the descriptive power of the representations, it is possible to learn the basis set, or dictionary, directly from data, rather than fixing it a priori. This technique is called dictionary learning, and the idea is to use as basis signals some prototyped versions of the input data, so that their relevant characteristics are stored in the dictionary [AEB05, LBRN06, AEA06, EA06, BSVV11, MBPS09, BDE09, RB11].

In this thesis we address the problem of learning dictionaries from data and deriving sparse representations to facilitate further processing. We consider the setting of dictionary learning and sparse coding in different applications, like data approximation, clustering, and classification. All the applicative fields are very different from one another, because we aimed at checking the generality of the method we chose for data representation. In the next paragraph we describe in detail the data we have faced and their characteristics.

What Data? There are many different classes of data that would benefit from a selection of the information they carry with them. We have focused our attention on medical images, video streams, and 3D coordinate streams. The problems we address have been inspired by


collaborations with domain experts. The data are all very different from one another, which lets us argue for the generality of the data representation model we consider.

In the Medical Image analysis field, after a case study on medical image denoising, we have developed a system to automatically segment different tissues inside a 3D MR volume varying over time (3D+t), basing our analysis on the behavior of each voxel with respect to a contrast agent. In this case the input data are curves describing the different grey values assumed by the voxels over time.

The second kind of data we have been working on is video streams coming from a fixed camera. The application, in this case, is background modeling, and the input data are patches of frames varying over time. Even in this case the application can be seen as data approximation/reconstruction.

Finally, we have been working on streams of 3D coordinates acquired with a motion capture system (e.g. Microsoft Kinect). The coordinates represented points on the body of a person, and in this case the goal was to automatically recognize the emotion the person was expressing.

Figure 1.1 shows examples of the data we have been working on, varying over time.

Although they look very different from one another, these data have important common points. They have large dimensionality and they are highly available. They are difficult to deal with, in the sense that it is not easy to understand what piece of information is useful for the problem we are tackling, since they are redundant and noisy. Despite these difficulties, the data we process carry with them some prior knowledge that should be exploited to improve the results we can obtain. We can say that the whole framework we present in this thesis relies on the presence of this prior/hidden information that characterizes the data we want to process. The prior knowledge exploited by our methods is described below:

• 3D+t MR images: in this case we have different priors. First, the human body has a general structure that does not change very much from person to person. Selecting a specific anatomical region, e.g. the kidneys, the brain, or the heart, most of the structures and shapes will repeat themselves across the different patients. Second, the way a person breathes is quite standard. Since part of the artifacts in DCE-MRI data are due to movements of the patient, part of the noise can be removed by compensating for the breathing movements, which can be quite easily modeled. Thanks to the motion compensation, each voxel has a precise position in space. Third, we are helped by experts. They gave us the information that different tissues react differently to a contrast agent, which is the core of the application we developed and led us to use our data as curves, rather than volumes or images.

• Video streams from a fixed camera: in this case the main prior comes from the fact that the camera acquiring the scene is not moving. Because of this, we can say that things change smoothly over time, meaning that from frame f to frame f + 1 there will not be many differences. Another effect is that background objects are well localized in space.


Figure 1.1: Examples of dynamic data we process in this thesis (columns correspond to t = 1, 2, 3). Top row: DCE-MRI kidney data; middle row: video streams from a fixed camera; bottom row: tracking of points on the body of a person.


Moreover, we can say that the vast majority of the scene is background, meaning that only a few changes, caused by foreground objects, will appear in the scene once the background model is stable.

• 3D coordinate streams: in this case the prior comes from psychological studies that have identified a set of emotion-related dimensions. Those dimensions take different values when people express different emotions. One example of these dimensions is the global direction of the motion of the arms: if a person is happy, her or his arms will tend to move upwards, while if the person is expressing anger, the arms will tend to move downwards. This psychological research has been the starting point for a sort of digitalization of the emotions: we have translated the emotion-related dimensions into precise measures that can be extracted from the coordinate streams. Without this prior knowledge it would have been very difficult to understand what characterizes each emotion. In this case as well, the motion capture system is fixed.

1.2 Contributions of the thesis

The main contribution of this thesis is to devise data representations based on sparse coding and dictionary learning in three different application domains. The models are effective and competitive since they do not simply cast the general representation scheme onto the different input data, but try to exploit all the prior knowledge available on the data and on the application. In this section I summarize the main contributions of my work, and I present the structure of the thesis.

Chapter 2 In this chapter we present the mathematical tools and the principal methods concerning dictionary learning and sparse coding. A review of the relevant literature is given as well. In the first part of the chapter we introduce the main ideas behind such methods, and then we describe in detail the formulation of the problems and possible solutions. In the second part we present related literature, both from the mathematical and the applicative point of view. In the third part of the chapter we present a case study on the application of the described techniques to the denoising of medical images, which is one of the first filtering steps that must be performed on raw data. The noise is usually part of the high frequencies of an image, together with the small details. It has been shown that one of the side effects of dictionary learning is a smoothing/denoising of the data: this is because the reconstruction procedure tends to maintain the low frequencies, discarding the noise. The denoising of natural images via dictionary learning is well known (see for instance [AEA06]). The case study presented in this chapter has been published in Medical Computer Vision. Recognition Techniques and Applications in Medical Imaging [SCBS11]. We have shown that learning a dictionary over a noisy magnetic resonance image, or over denoised images of the same anatomical region, leads to very good denoising results in terms of PSNR. Moreover,


the majority of the internal structures are maintained, meaning that the removed part is almost only noise.

Chapter 3 In this chapter we present a system for automatic tissue segmentation of 3D MR images varying over time. The work that led to this system has been published in Machine Learning in Medical Imaging [CSBV11] and is going to appear in Artificial Intelligence in Medicine [CSBV]. In this case we want to understand the relations that exist between the different tissues that compose an anatomical region. Our idea was to learn a dictionary of prototype curves and then cluster the representations of the voxel curves over the dictionary. The learning phase is preceded by a motion compensation step, because the patient may move during the acquisition. The assessment of this method has been made on kidney and wrist data: in the former case the tissues are much softer and the registration step is more clearly needed than in the latter. The results show that it is possible to better discriminate the tissues by learning a dictionary of curves.
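As a rough, hedged illustration of this pipeline (not the ℓ1-DL solver and clustering procedure actually used in Chapter 3), one could prototype it with off-the-shelf scikit-learn components; the dictionary size, sparsity weight, number of tissue classes, and the choice of k-means are illustrative assumptions only:

import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.cluster import KMeans

def segment_tissues(curves, n_atoms=28, alpha=1.0, n_tissues=5):
    # curves: one enhancement curve per row, i.e. (n_voxels, n_time_steps),
    # extracted after motion compensation.
    dl = DictionaryLearning(n_components=n_atoms, alpha=alpha, random_state=0)
    codes = dl.fit_transform(curves)       # sparse code of every voxel curve
    labels = KMeans(n_clusters=n_tissues, random_state=0, n_init=10).fit_predict(codes)
    return labels                          # one tissue label per voxel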

Chapter 4 Video streams are another example of big data with an intrinsic coherence that can help in their analysis. We focus on video streams acquired with fixed cameras, as in the video surveillance setting. In this setting we may assume the observations at time t − 1 and t will be rather similar, unless some important change occurs in the scene. The contribution of this thesis in this field is related to background modeling. The method we propose has been presented at the International Conference on Image Processing (oral presentation) [SNOV13], while a longer version with a deeper experimental assessment has been submitted to a major journal in this field [SNOV]. Background modeling is a crucial pre-processing step useful for a plethora of methods that analyze the video. In general, it consists in the creation of a model of the background of the scene, so that when foreground objects enter the scene they can be easily detected. The background of a natural scene in general is not static; it presents different challenges that can lead to the generation of false positives. For instance, illumination changes must not be mistaken for foreground events. There can be parts of the scene where the background is dynamic, presenting different configurations over time, with different frequencies. In an outdoor scene we should include the leaves of a tree shaken by the wind in the model, just as in an indoor scene we want to consider a door that is opened and closed several times as background. Starting from these considerations, we built a dictionary-based framework that divides the video into patches and processes each patch independently of the others. The model is made up of dictionaries that are learnt and kept up to date online, to capture all the possible background configurations. A cleaning procedure ensures that dictionaries do not grow too much, and that only relevant information about the background is maintained. Thanks to its adaptivity, our method does not require initial frames free from foreground objects.
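To make the detection principle concrete (this is only a sketch under simplifying assumptions, not the BMTDL algorithm of Chapter 4, which uses proper sparse coding and online dictionary updates), a minimal per-patch foreground test against a set of background atoms could look like this:

import numpy as np

def is_foreground(patch, atoms, threshold):
    # patch: flattened patch vector; atoms: (n_atoms, patch_dim) background atoms.
    # The patch is flagged as foreground when no background atom (used one at a
    # time here, for simplicity) reconstructs it with an error below the threshold.
    coeffs = atoms @ patch / np.maximum(np.sum(atoms ** 2, axis=1), 1e-12)
    residuals = np.linalg.norm(patch[None, :] - coeffs[:, None] * atoms, axis=1)
    return residuals.min() > threshold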


Chapter 5 The goal of the analysis of these data is to automatically detect the emotions that a person is expressing through body gestures. In this case we do not focus on visual stimuli directly, but we start from streams of 3D coordinates coming from the body of a person and then extract a set of emotion-related features to describe the movements performed by the body. The 3D coordinates are acquired through a motion capture system that tracks points on the body, or body joints. We focus on the six basic emotions, namely happiness, anger, fear, sadness, disgust, and surprise. Two short works about the feature extraction and classification results have been published in the proceedings of the International Workshop on Intelligent Digital Games for Empowerment and Inclusion 2013 [PSCO13] and 2014 [PSO+]. Examples of such features are the kinetic energy, the velocity of the hands, the symmetry, and so on. Analysing these features, which can be averaged over the whole body or specific to one or two joints, leads to data with very large dimensionality. It is not easy to understand which dimension is relevant for which emotion. The study made in this thesis on automatic emotion recognition compares the classification results obtained on simple data representations (histograms of each feature for a given gesture) with the ones obtained by learning a dictionary for each feature. Adaptive representations systematically outperform histograms, a result that points out how the learned dictionaries are capable of extracting the relevant characteristics of the data.

Chapter 6 This chapter is devoted to conclusions and future work.


Chapter 2

Sparse Coding and Dictionary Learning

In this chapter we present the mathematical tools related to Sparse Coding (SC) and Dictionary Learning (DL), which are at the basis of all the applications described in the remainder of the thesis. We start by reviewing the most relevant related literature, then discuss the theory of SC and DL, also considering optimization issues. We conclude by presenting the application to the problem of medical image denoising as a case study.

2.1 Introduction

During the past several decades, much effort in the signal processing research field has been devoted to finding a good model to represent signals. This can be viewed as the fundamental basis to process signals and to retrieve different information from them, since the idea behind a good representation is to maintain the relevant information carried by the data while discarding noise. This can help in many applications, in particular in cases where data come in very large amounts and/or in high dimensionality. In these situations it is not easy to understand what piece of information is useful to the problem we are dealing with, and an automatic reduction of the data can help in finding a solution.

In particular, there has been a growing interest in the use of sparse and redundant representations: the idea is to decompose the signal with respect to an overcomplete dictionary made up of basic stimuli, or atoms. This means that we assume that a signal can be seen as a combination of (few) atoms that compose the dictionary, plus some noise. The resulting representation is the vector containing the coefficients of the sparse linear combination whose result is (an approximation of) the original signal. The information reduction in this case is not relative to the dimension of the representations (which can even be larger than that of the original data), but to the fact that only a few atoms of the basis will be used to represent the signal, resulting in a sparse vector,


that is, a vector with only a few non-zero entries. Actually, the redundancy can be understood both with respect to the dimension of the data and with respect to the dimension of the problem, considered as, for instance, the number of classes the data can be divided into.
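In formulas, and only as a generic sketch of the model just described (the exact functionals used in this thesis are given in Sections 2.3 and 2.4), a signal y in R^d is approximated over an overcomplete dictionary D in R^(d×K), with K > d, by solving

\[
u^{\ast} \in \arg\min_{u \in \mathbb{R}^{K}} \; \|y - Du\|_{2}^{2} + \tau \|u\|_{1},
\qquad y \approx D u^{\ast},
\]

where the penalty weight τ > 0 controls how many entries of u* are non-zero; the sparse vector u* is what is then used as a feature vector.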

The representation of signals through linear combinations of basis elements is not a novelty: it is at the basis, for instance, of the popular Fourier Transform, which represents (possibly) complex signals as the sum of simple waves mathematically represented by sines and cosines.

The chapter is organized as follows: in Section 2.2 we give an overview of the work that has been done in this field, from the use of fixed dictionaries to the use of adaptive ones, passing through the methods developed to build representations and learn dictionaries, the introduction of reconstructive and/or discriminative dictionaries, and a look at the applications that can benefit from these representations. In Section 2.3 we describe the techniques to compute such representations with fixed dictionaries, and in Section 2.4 we see how to learn dictionaries directly from data, with an in-depth description of three different algorithms. In Section 2.5 we present a case study carried out to assess the possibility of applying dictionary learning methods to the medical image analysis field, and Section 2.6 is devoted to conclusions.

2.2 Related work

In this section we review the relevant literature about Sparse Coding and Dictionary Learning, both from the methodological and the applicative point of view.

2.2.1 Fixed and adaptive dictionaries

The best-known example of signal decomposition is the Fourier transform, where there is a unique representation of each signal and the atoms of the dictionary are frequencies and form an orthonormal basis (ONB).

Actually, the interest today is mostly focused on overcomplete dictionaries, since there are several works that justify their use [Mal99, KC07]. For instance, overcomplete analytic (or fixed) representations can be multiscale Gabor functions [MZ93, QC94], systems defined by algebraic codes [SH03], amalgams of wavelets and sinusoids [Che95, CDS01, EB02], libraries of windowed cosines with a range of different widths and locations [Wic92, Wic94], multiscale windowed ridgelets [SCD02], systems generated at random [DE03], and amalgams of wavelets and linelike generating elements [Huo99].

Overcompleteness leads to a wider range of generating signals and thus, intuitively, to more flexibility and a greater power of expression in the representation of signals. Concerning the stability of sparse representations with respect to noise, it has been studied in the


work of Donoho et al. [DET05], where the authors show that under certain sparsity constraints the optimal representation of a noisy signal can differ from the optimal representation of the noiseless signal by at most a constant multiple of the noise level.

Analytic dictionaries are in general formulated as tight frames (meaning that DD^T x = x for every x) [BF03]: the transposed dictionary can be used as an encoder or a bank of filters, to directly obtain a representation of the signal over the dictionary.
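A simple way to see how an overcomplete dictionary can still be a tight frame (an illustrative construction, not taken from the thesis) is to concatenate two orthonormal bases B_1 and B_2 and rescale:

\[
D = \frac{1}{\sqrt{2}}\,[\,B_{1}\;\;B_{2}\,],
\qquad
DD^{T} = \frac{1}{2}\left(B_{1}B_{1}^{T} + B_{2}B_{2}^{T}\right) = I,
\]

so that every signal can be recovered directly as x = D(D^T x): the analysis operator D^T acts as the encoder (bank of filters) mentioned above, with no optimization step required.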

An even more powerful and flexible representation can be achieved by using not a fixed dictionary, but one that is derived directly from data. This idea was first introduced by the work of Olshausen and Field [OF97], who suggested an algorithm to learn the dictionary from data: this idea was born from the observation of the visual system of mammals. From that moment on, several methods have been proposed to perform dictionary learning, described in the following subsection.

2.2.2 Dictionary learning

Figure 2.1 shows two dictionaries, the first fixed and the second adaptive. The main difference is that in the adaptive case highly characteristic structures of the signal are captured, which are not included in the fixed, more general, one.

Figure 2.1: On the left, a fixed DCT dictionary. On the right, an adaptive dictionary obtained with the Lee method [LBRN06].

Dictionary Learning (DL) methods attempt to solve an optimization problem whose objective function is based on the reconstruction error obtained by reconstructing the training data as a linear combination of the unknown atoms. Sparsity is induced by adding a prior on the decomposition coefficients.

In general, most of the DL methods follow the schema of minimizing the cost function in two steps, the first fixing the dictionary and finding the coefficients, the second using these


coefficients to find the optimal dictionary, but different formulations can be found, for instance, in the work of Maggioni and colleagues [CM10, ACM12].
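A minimal NumPy sketch of this two-step schema is given below. It is illustrative only: the coding step uses ISTA (iterative soft-thresholding) for the ℓ1-penalized problem and the dictionary step uses a MOD-style least-squares update with atom renormalization, whereas the algorithms actually adopted in this thesis are described in Sections 2.4.1-2.4.3; parameter values are placeholders.

import numpy as np

def sparse_coding_step(X, D, tau, n_iter=100):
    # ISTA for min_U 0.5*||X - D U||_F^2 + tau*||U||_1, with X (d, n) and D (d, k).
    L = np.linalg.norm(D, 2) ** 2                            # Lipschitz constant of the gradient
    U = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        G = U - D.T @ (D @ U - X) / L                        # gradient step on the data term
        U = np.sign(G) * np.maximum(np.abs(G) - tau / L, 0)  # soft-thresholding
    return U

def dictionary_step(X, U):
    # MOD-style update: least-squares fit of D, followed by atom renormalization.
    D = X @ U.T @ np.linalg.pinv(U @ U.T)
    return D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)

def learn_dictionary(X, n_atoms, tau, n_outer=20, seed=0):
    rng = np.random.default_rng(seed)
    D = X[:, rng.choice(X.shape[1], n_atoms, replace=False)]  # init atoms with data samples
    D = D / np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    for _ in range(n_outer):
        U = sparse_coding_step(X, D, tau)                    # step 1: fix D, find coefficients
        D = dictionary_step(X, U)                            # step 2: fix U, update dictionary
    return D, U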

The main difference among DL methods is the specific technique used to induce the coefficient vectors to be sparse. The most intuitive prior is the so-called ℓ0 norm, that is, the count of the nonzero coefficients in a feature vector. This constraint is used in the works of Engan [EAHH99, ERKD99] and Aharon [AEA06], which aim at minimizing the same cost function. The first two introduced the Method of Optimal Directions (MOD), a very effective method that suffers from the relatively high complexity of the matrix inversion needed for the dictionary update. The third introduced the K-SVD algorithm, which, instead of inverting the matrix, updates the dictionary atom by atom in a simple and efficient process.

Another well-known prior, used in the works of Lee et al. [LBRN06] and Mairal et al. [MBPS10], is the ℓ1 norm, introduced to avoid the non-convexity of the ℓ0 penalty. This prior is also used in another work by Mairal and colleagues [MBPS09], where the authors present an online technique that exploits one example (or a small batch of examples) at a time, to avoid the problems coming from the greedy processing of a huge amount of data.
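For completeness, an off-the-shelf way to experiment with this online, ℓ1-penalized flavor of dictionary learning (assuming scikit-learn is available; this is a stand-in, not the implementation used in this thesis) is:

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# X holds one training signal per row, e.g. 5000 vectorized 8x8 patches (placeholder data).
X = np.random.default_rng(0).standard_normal((5000, 64))

dl = MiniBatchDictionaryLearning(
    n_components=128,   # overcomplete: more atoms than signal dimensions
    alpha=1.0,          # weight of the l1 sparsity penalty
    batch_size=256,     # signals processed per online update
    random_state=0,
)
dl.fit(X)               # dictionary learned from mini-batches of X
U = dl.transform(X)     # sparse codes of the signals
D = dl.components_      # learned atoms, one per row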

In general, once the dictionary has been learned, every time we want to decompose a signal we have to solve an optimization problem to find the sparse vector of coefficients needed for the task we are interested in. This is different from the formulation of tight frames, introduced with the analytic model. In recent years, some methods have been proposed to avoid this costly step of the framework.

In the works of Ranzato et al. [RPCL06, RHBL07, BCL+07] the authors propose to learn a pair of dense encoding and decoding transformations. This means that, once the dictionary has been trained, we do not need to perform an optimization step to represent a new signal. Sparsity is induced by a non-linear transformation between the encoding and decoding modules.

In the work by Basso and colleagues [BSVV11] the authors propose an algorithm, PADDLE, capable of learning from data a dictionary endowed with properties similar to those of tight frames. The output of this algorithm contains both the optimal dictionary and its (approximate) dual, a linear operator that decomposes new signals into their optimal sparse representations.

A good review of dictionaries for sparse representation modeling can be found in the work of Rubinstein et al. [RBE10].

The dictionaries we have seen so far are purely reconstructive and they are mostly used for signal restoration. As we will see in the next section, the work of Mairal et al. [MBP+08a] introduced discriminative dictionaries. The first idea is to have training signals belonging to different classes and to learn a (reconstructive) dictionary for each class, then using the residual obtained by reconstructing new data with respect to every dictionary as a discriminative criterion. The main innovation of discriminative dictionaries is that they are not only good at reconstructing the signals of their class, but they are also bad at reconstructing

22

Page 31: Dipartimento di Informatica, Bioingegneria, Robotica ed ...Universita degli Studi di Genova` Dipartimento di Informatica, Bioingegneria, Robotica ed Ingegneria dei Sistemi Dottorato

signals belonging to the other classes. This is achieved adding a discriminative term to thefunctional to be minimized, the softmax discriminative cost functions, that have properties similarto the hinge loss function coming from the SVM literature.

2.2.3 Applications

Sparse and redundant representations, with learned or analytic dictionaries, can be applied to a great number of tasks in image processing and analysis. In general these methods deal with natural images.

Elad and colleagues [EA06] apply them to image denoising. The authors learn the dictionary over the noisy image and then perform denoising by reconstructing the signal with respect to this dictionary. The underlying idea is that noise is not a characteristic part of the overall signal, and is therefore discarded during the reconstruction. This process is performed over small overlapping patches of the image, which are averaged at the end to prevent border effects.

The same authors, in [EFM10], apply sparse and redundant representations to linear inverse problems such as inpainting, deblurring and super-resolution reconstruction. All of these applications are on natural grayscale images.

This approach has been extended to the multiscale setting [MSE08], to color images [MES08], and to videos [PE09].

The use of sparse and redundant representations for classification can be found in the work of Huang et al. [HA07], where the authors combine the characteristics of reconstructive and discriminative representations, exploiting the power of the latter for classification and the robustness of the former to handle corruptions of the signals, such as noise, occlusions or missing parts.

Discriminative dictionaries were introduced by Mairal et al. [MBP+08a]. In the work proposed in [MLB+08], the authors extend the method by proposing a multiscale approach to minimize least-squares reconstruction errors and discriminative cost functions under `0 or `1 regularization constraints, with applications to edge detection, category-based edge selection and image classification tasks.

In the work proposed in [MBP+08b] the goal is to learn jointly a single dictionary adapted to the classification task and a function (or a set of functions) that should discriminate between the different classes. This is a step forward, since there is no longer one dictionary per class, and signals belonging to different classes are allowed to share some features. The drawback of this work is that it operates in a supervised learning context, and therefore requires a large amount of labeled training data.

Yang and colleagues [YYGH09] develop an extension of the spatial pyramid matching (SPM) method, in order to reduce the complexity of the SVMs using SPM kernels. This is done by using selective sparse coding instead of traditional vector quantization to extract salient properties of appearance descriptors of local image patches. In this way, they propose a linear SPM kernel based on SIFT sparse codes, which can be applied to a variety of image classification tasks.

This kind of functional can also be used for image deblurring when the PSF is not known. In this case we talk about blind deconvolution, and we look for the deblurred image and the PSF together, as in the work of Lecharlier and De Mol [LDM13].

Dictionary Learning and Sparse Coding are widely used for many other tasks, quite different from one another, for instance transfer learning ([MPRP12, RBL+07]), video compression ([CKL10]), MR image denoising [SCBS11], reconstruction ([RB11]), and segmentation [CSBV11, CSBV].

The common feature of all these tasks is the availability of large amounts of data that are difficult to analyze.

The idea behind this thesis is to exploit the power of Sparse Coding and Dictionary Learning to solve different problems sharing the issues mentioned above.

2.3 Sparse representations

Let D = [d_1|d_2|...|d_K] ∈ R^{d×K} be an overcomplete (i.e., K > d) dictionary matrix, whose columns d_i are K basis signals, or atoms. We can represent a signal x ∈ R^d as a sparse linear combination of these atoms. We can look for the exact representation of x, x = Du, or an approximate one, x ≈ Du, satisfying ‖x − Du‖_2 ≤ ε (see [AEB05]).

The representation of the signal is the sparsest vector u containing the coefficients of this combination, and can be obtained as the solution of either
\[
(P_0)\qquad \min_u \|u\|_0 \quad \text{s.t.}\quad x = Du, \tag{2.1}
\]
or
\[
(P_{0,\varepsilon})\qquad \min_u \|u\|_0 \quad \text{s.t.}\quad \|x - Du\|_2 \le \varepsilon. \tag{2.2}
\]
Here ‖·‖_0 denotes the `0 pseudo-norm, defined as
\[
\|u\|_0 = \#\{i \mid u_i \neq 0\}, \tag{2.3}
\]
which simply counts the nonzero entries of a vector.

More generally, we are interested in representations that are both sparse and have good reconstructive power.


The reconstructive power can be achieved by minimizing a reconstruction error:
\[
\min_u \|x - Du\|_2^2 \tag{2.4}
\]
It can be readily seen [Jol05] that, for K ≤ d, viable methods to compute the minimizer of such an equation are Principal Component Analysis (PCA) and its derivatives. In particular, among the possibly different solutions of (2.4), PCA picks the one with minimal ‖U‖_F^2. Indeed, although not yielding overcomplete dictionaries, PCA and its derivatives are at the root of such approaches, based on the minimization of the error in reconstructing the training data as a linear combination of the basis elements (see e.g. [BN06]).

We are interested in overcomplete settings, where K > d, with sparse representations, that is, when the vectors u_i have few coefficients different from zero.

We can substitute Equation 2.2 with Equation 2.5, where τ is a parameter that controls the weight of the penalty term J:
\[
\min_u \|x - Du\|_2^2 + \tau J(u) \tag{2.5}
\]
There is now a wide literature on methods for solving this problem with different kinds of penalties; we refer the interested reader to the works in [BBC11a, BT09a, BBC11b, CW05, DDDM04, Nes05], and to [DS09] for an in-depth dissertation. Substituting the `0-norm with the `1-norm, defined in Equation 2.6, is customary in compressed sensing, and it can be shown to be equivalent to the former norm under appropriate conditions on D [CRT06, Tro06, BDE09], while being more easily tractable:
\[
\|u\|_1 = \sum_{i=1}^{n} |u_i| \tag{2.6}
\]
It has often been applied successfully to dictionary learning, replacing (2.2) with the minimization of Equation 2.7
\[
\min_u \|x - Du\|_2^2 + \tau \|u\|_1 \tag{2.7}
\]
where τ > 0 is a regularization parameter and ‖U‖_1 = Σ_i ‖u_i‖_1.

Considering X = [x_1|x_2|...|x_n] ∈ R^{d×n} as the matrix containing all the n data points we want to process, we can rewrite the functional in Equation 2.7 in matrix form:
\[
\min_U \|X - DU\|_F^2 + \tau \|U\|_1 \tag{2.8}
\]


Since our goal is to have a dictionary that is adaptive with respect to the data under consideration, we can learn it together with the representations. In the next section, techniques to solve this task are described.

2.4 Learning a dictionary from data

In this section we introduce some common properties of the algorithms for dictionary learning and their application to sparse image representation.

If the set of basis signals has to be learned from data, we have two unknowns: the dictionary D and the matrix of coefficients U. We must add constraints on the norm of the atoms, the columns of D, to avoid degenerate solutions in which the norm of the coefficients is kept small simply by having atoms with very large values. Let B = {D : ‖d_j‖_2^2 ≤ 1}. Starting from Equation 2.8, we want to minimize:
\[
E(D, U) = \|X - DU\|_F^2 + \tau \|U\|_1 \quad \text{s.t.}\ D \in \mathcal{B} \tag{2.9}
\]
Although this formulation is not convex with respect to both variables at the same time, it is still possible to find a solution by considering one variable at a time. Such an approach has proved empirically successful [LBRN06], and its convergence towards a critical point of the functional is guaranteed by Corollary 2 of [GS00].

This is the reason why almost all the methods proposed so far for learning the dictionary share a basic scheme, in which the problem is solved by an alternating optimization with respect to the two variables. This means that we can initialize D, keep it fixed, and minimize with respect to U. We then fix U as the solution of the first optimization step and minimize with respect to D. The two steps are repeated alternately until convergence. This kind of scheme is summarized in Algorithm 1.

Algorithm 1 Alternating optimization scheme
Require: D_0 ∈ R^{d×K}, U_0 ∈ R^{K×N}, τ > 0; t = 0
repeat
    U_{t+1} = argmin_U ‖X − D_t U‖_F^2 + τ‖U‖_1
    D_{t+1} = argmin_{D∈B} ‖X − D U_{t+1}‖_F^2 + τ‖U_{t+1}‖_1
    t = t + 1
until convergence


The former, where D is kept fixed, is the sparse coding step, and is the only one directly affected by the choice of the penalty.

The latter optimization step, with U fixed, performs a dictionary update. In principle this is an unconstrained Least Squares (LS) problem that could be solved in closed form, as is done by the Method of Optimal Directions (MOD) [EAHH99]. However, choosing the `1 norm as sparsity penalty forces the introduction of a constraint on the norm of the atoms, to avoid degenerate solutions where the atoms reach very high values and the coefficients are all close to zero, leading to a quadratically constrained LS problem.

In the remainder of the section, we present three different methods to learn D and U. The first one is based on the `0 norm; the second and the third are based on the `1 norm, with the third relying on proximal methods to achieve the optimal solution. Implementations of the three algorithms are publicly available, and we believe this may be beneficial to prospective practitioners of dictionary learning in many different fields. In particular, the implementation of the third method is the one used in all the applications presented in this thesis (see [BSVV11]).

2.4.1 K-SVD algorithm

The first algorithm we review is named K-SVD because its dictionary update step is essentially based on solving K times a singular value decomposition problem. K-SVD was proposed in [AEA06], and it aims at minimizing a variant of the functional (2.5) in which the sparsity is enforced by a constraint on the `0 norm of the columns of U:
\[
\min_{D,U} \|X - DU\|_F^2 \quad \text{s.t.}\quad \|u_i\|_0 \le T_0. \tag{2.10}
\]
Among the nice features of K-SVD, those we deem worth mentioning are:

• the dictionary update is performed efficiently atom-by-atom;

• the iterations are shown empirically to be accelerated by updating both the current atom and its associated sparse coefficients simultaneously.

Since the algorithm uses an `0 constraint, Orthogonal Matching Pursuit (OMP) can be used for solving the sparse coding step [TG07].


2.4.2 `1-DL algorithm

The second algorithm we considered was proposed by Lee and colleagues in [LBRN06] and aims at minimizing the functional (2.5) with the standard `1 penalty:
\[
\min_{D,U} \|X - DU\|_F^2 + \tau \sum_i \|u_i\|_1 \quad \text{s.t.}\quad \|d_i\|_2 \le 1 \tag{2.11}
\]
The algorithm is based on alternating convex optimization over U and D. The optimization over the first subset of variables is an `1-regularized least squares problem, while the one over the second subset of variables is an `2-constrained least squares problem. In practice, there are quite a number of different algorithms available for solving this type of problem. In their paper, the authors proposed two novel fast algorithms: one, named the feature-sign search algorithm, for the `1 problem, and a second, based on the Lagrange dual, for the `2 constraint. According to the reported results, both algorithms are computationally very efficient; in particular, the former outperformed the LARS and basis pursuit algorithms, well known for their superior performance in many previous works.

2.4.3 Minimization via proximal methods algorithm

The third algorithm is described in more detail because it is the core of all the applications presented in the next chapters. This algorithm aims at finding the optimal solutions of Equation 2.9, basing the optimization of U on the use of a proximal operator.

In summary, a proximal (or forward-backward splitting) algorithm minimizes a function of the type E(ξ) = F(ξ) + J(ξ), where F is convex and differentiable, with Lipschitz continuous gradient, while J is lower semicontinuous, convex and coercive. These assumptions on F and J, required to ensure the existence of a solution, are fairly standard in the optimization literature (see e.g. [CCP+05]) and are always satisfied in the setting of dictionary learning. The non-smooth term J is involved via its proximity operator, which can be seen as a generalized version of a projection operator:
\[
P(x) = \arg\min_y \left\{ J(y) + \frac{1}{2}\|x - y\|^2 \right\}. \tag{2.12}
\]
Combining the projection step with a forward gradient descent step we obtain the proximal algorithm, as follows:
\[
\xi^p = P\!\left(\xi^{p-1} - \frac{1}{2\sigma}\nabla F(\xi^{p-1})\right), \tag{2.13}
\]

Also in this case the optimization is achieved separately with respect to the two variables.


Sparse coding

With fixed D we can apply algorithm (2.13) to minimize the functional E(U) = F(U) + J(U), where
\[
F(U) = \frac{1}{d}\|X - DU\|_F^2 \quad\text{and}\quad J(U) = \frac{2\tau}{K}\sum_{i=1}^{N}\|u_i\|_1. \tag{2.14}
\]
The gradient of the differentiable term F is
\[
\nabla_U F = -\frac{2}{d}\, D^\top (X - DU),
\]
while the proximity operator corresponding to J is the soft-thresholding operator S_τ, defined component-wise as
\[
(S_\tau[U])_{ij} = \mathrm{sign}(U_{ij}) \max\{|U_{ij}| - \tau, 0\}. \tag{2.15}
\]
Inserting the gradient and the proximal operator in the general equation (2.13), we obtain the following update rule:
\[
U^p = S_{\tau/(K\sigma_U)}\!\left[ U^{p-1} + \frac{1}{\sigma_U}\left(\frac{1}{d}\, D^\top (X - DU^{p-1})\right)\right] \tag{2.16}
\]
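As a concrete illustration of this step, the following minimal NumPy sketch iterates the update rule (2.16) with D kept fixed (an ISTA-style loop). The function names, the number of inner iterations and the way the step-size σ_U is passed in are assumptions made for the example; σ_U can be chosen as discussed in the step-size paragraphs below.

```python
import numpy as np

def soft_threshold(U, thr):
    # Component-wise soft-thresholding operator S_thr of Equation (2.15).
    return np.sign(U) * np.maximum(np.abs(U) - thr, 0.0)

def sparse_coding_step(X, D, U, tau, sigma_u, n_iter=100):
    # Sparse coding with D fixed: repeated application of the update rule (2.16).
    d, K = D.shape
    for _ in range(n_iter):
        grad = -(2.0 / d) * D.T @ (X - D @ U)           # gradient of F(U)
        U = soft_threshold(U - grad / (2.0 * sigma_u),   # forward step of (2.13)
                           tau / (K * sigma_u))          # backward (proximal) step
    return U
```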

Dictionary update

The quadratic constraint on the columns of D is equivalent to an indicator function J. Due to this fact, the proximity operator is a projection operator. Denoting by π(d) = d / max{1, ‖d‖} the projection on the unit ball in R^d, let π_D be the operator applying π to the columns of D. Plugging the appropriate gradients and projection operators into equation (2.13) leads to the update step
\[
D^p = \pi_D\!\left(D^{p-1} + \frac{1}{d\,\sigma_D}\,(X - D^{p-1}U)\,U^\top\right). \tag{2.17}
\]
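A matching sketch of the dictionary update (2.17), again written as a small projected-gradient loop; as above, the function names and the inner iteration count are illustrative assumptions, and σ_D is assumed to be provided.

```python
import numpy as np

def project_columns(D):
    # Projection pi_D: each atom d_j is mapped to d_j / max(1, ||d_j||).
    return D / np.maximum(1.0, np.linalg.norm(D, axis=0))

def dictionary_update_step(X, D, U, sigma_d, n_iter=100):
    # Dictionary update with U fixed: repeated application of Equation (2.17).
    d = X.shape[0]
    for _ in range(n_iter):
        D = project_columns(D + (X - D @ U) @ U.T / (d * sigma_d))
    return D
```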

Gradient descent step

The choice of the step-sizes σ_U and σ_D is very important to achieve fast convergence. In the following, different techniques to choose the step-sizes are described.


Fixed step-size

In general, for E = F + J, one can choose 2σ to be equal, for all iterations, to the Lipschitz constant of ∇F. For ∇_D F this constant can be evaluated explicitly, leading to σ_D = 2‖UU^T‖_F / d. Such choices ensure linear rates of convergence in the values of E [BT09b] and convergence of the sequence D^p towards a minimizer [CCP+05], although no convergence rate is available.

A similar derivation would lead to σ_U = 2‖(1/d) D^T D + (η/K) I‖, but in this case the positive definiteness of the matrix allows us to choose a step-size with faster convergence [RVS+09]. Denoting by a and b the minimum and maximum eigenvalues of D^T D, choosing
\[
\sigma_U = \frac{a + b}{2d} + \frac{\eta}{K},
\]
will lead to linear convergence rates in the value, as well as linear convergence of the sequence U^p to the unique minimizer of E(U) [RVS+09].
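The two fixed step-sizes above are straightforward to compute; the helper below implements them literally (the η term only matters for the encoding-matrix variant described later in this chapter, and defaults to zero). The function and argument names are illustrative.

```python
import numpy as np

def step_sizes(D, U, d, K, eta=0.0):
    # Fixed step-size choices: sigma_D from the Lipschitz constant of grad_D F,
    # sigma_U from the extreme eigenvalues of D^T D (plus the eta/K term, if any).
    sigma_d = 2.0 * np.linalg.norm(U @ U.T, 'fro') / d
    eigs = np.linalg.eigvalsh(D.T @ D)          # D^T D is symmetric positive semidefinite
    a, b = eigs.min(), eigs.max()
    sigma_u = (a + b) / (2.0 * d) + eta / K
    return sigma_u, sigma_d
```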

FISTA

Improved convergence rates at each iteration can be achieved in two ways: either through adaptive step-size choices (the Barzilai-Borwein method), or by slightly modifying the proximal step as in FISTA [BT09b]. This method makes use of the latter approach.

The FISTA update rule is based on evaluating the proximity operator at a weighted sum of the previous two iterates. More precisely, defining a_1 = 1 and φ^1 = ξ^1, the proximal step (2.13) is replaced by
\[
\xi^p = P\!\left(\varphi^p - \frac{1}{2\sigma}\nabla F(\varphi^p)\right), \tag{2.19}
\]
\[
a_{p+1} = \left(1 + \sqrt{1 + 4a_p^2}\,\right)/2, \tag{2.20}
\]
\[
\varphi^{p+1} = \xi^p + \frac{a_p - 1}{a_{p+1}}\left(\xi^p - \xi^{p-1}\right). \tag{2.21}
\]
Choosing σ as in the fixed step-size case, this simple modification allows one to achieve a quadratic convergence rate with respect to the values [BT09b],

which is known to be the optimal convergence rate achievable using a first order method [Ne83].
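A FISTA variant of the sparse coding step can be obtained with a few extra lines. The sketch below reuses the soft_threshold helper and the gradient of the plain ISTA sketch above; it is, again, an illustrative assumption rather than the reference implementation.

```python
import numpy as np

def sparse_coding_fista(X, D, U, tau, sigma_u, n_iter=100):
    # Same gradient and soft-thresholding as in Equation (2.16), but the proximal
    # step is evaluated at an extrapolated point phi (Equations 2.19-2.21).
    # soft_threshold is the helper defined in the earlier sparse coding sketch.
    d, K = D.shape
    phi, U_prev, a = U.copy(), U.copy(), 1.0
    for _ in range(n_iter):
        grad = -(2.0 / d) * D.T @ (X - D @ phi)
        U_new = soft_threshold(phi - grad / (2.0 * sigma_u), tau / (K * sigma_u))
        a_new = (1.0 + np.sqrt(1.0 + 4.0 * a * a)) / 2.0
        phi = U_new + ((a - 1.0) / a_new) * (U_new - U_prev)
        U_prev, a = U_new, a_new
    return U_prev
```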

Considering the two steps together, the complete algorithm can be found in Algorithm 2.

Algorithm 2 DL with Proximal Operators
Require: X ∈ R^{d×N} {input data}
Require: U_0 ∈ R^{K×N}, D_0 ∈ R^{d×K} {initialization}
Require: rtol, τ > 0, T_max ∈ Z^+, 1 ≤ H < T_max {algorithm parameters}
1: E^0 = E(D_0, U_0; X, τ)
2: for t = 1 to T_max do
3:     U_t = argmin_U E(D_{t−1}, U; X, τ), see Equation (2.16)
4:     possibly replace under-used atoms
5:     D_t = argmin_D E(D, U_t; X, τ), see Equation (2.17)
6:     E^t = E(D_t, U_t; X, τ)
7:     E^H = (Σ_{i=max{t−H,0}}^{t−1} E^i) / H
8:     if |E^t − E^H| / E^H < rtol then
9:         break
10:    end if
11: end for
12: return U_t, D_t

During the iterations it may happen that, after the optimization with respect to U, some atoms of D are used only for a few reconstructions, or not at all. As suggested in [AEA06], if the i-th atom d_i is under-used we can replace it with an example that is not reconstructed well (see line 4). If x_j is such an example, this can be achieved by simply setting u_j to the canonical vector e_i, since at the next step D is estimated from U and X.
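Putting the pieces together, the following sketch assembles the previous helpers (soft_threshold, sparse_coding_fista, dictionary_update_step, step_sizes) into a loop with the same structure as Algorithm 2, including the replacement of under-used atoms. It is a simplified, illustrative reading of the scheme, not the reference PADDLE implementation, and the initialization and stopping details are assumptions.

```python
import numpy as np

def proximal_dl(X, K, tau, T_max=100, H=5, rtol=1e-5, eta=0.0, seed=0):
    # Illustrative version of Algorithm 2 built on the sketches above.
    rng = np.random.default_rng(seed)
    d, N = X.shape
    D = X[:, rng.choice(N, size=K, replace=False)].astype(float)
    D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)       # atoms in the unit ball
    U = np.zeros((K, N))
    energies = []
    for t in range(T_max):
        sigma_u, _ = step_sizes(D, U, d, K, eta)             # sigma_U depends on D only
        U = sparse_coding_fista(X, D, U, tau, sigma_u)
        # line 4 of Algorithm 2: replace under-used atoms with badly reconstructed examples
        residual = np.linalg.norm(X - D @ U, axis=0)
        for i in np.where(np.count_nonzero(U, axis=1) == 0)[0]:
            j = int(np.argmax(residual))
            U[:, j] = 0.0
            U[i, j] = 1.0                                    # u_j set to the canonical vector e_i
            residual[j] = 0.0
        _, sigma_d = step_sizes(D, U, d, K, eta)             # sigma_D depends on U only
        D = dictionary_update_step(X, D, U, max(sigma_d, 1e-12))
        E = np.linalg.norm(X - D @ U, 'fro')**2 / d + (2.0 * tau / K) * np.abs(U).sum()
        energies.append(E)
        E_H = np.mean(energies[max(t - H, 0):t]) if t > 0 else E
        if t > 0 and abs(E - E_H) / E_H < rtol:
            break
    return D, U
```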

The encoding matrix

In this section we show a variant of the functional in Equation 2.9 that includes a new variable, the encoding matrix C, such that U ≈ CX ([BSVV11, RPCL06, RHBL07, BCL+07]). Let C = [c_1, . . . , c_K]^T ∈ R^{K×d} be a matrix, also known as the encoding or analysis operator, whose rows c_i can be seen as filters that are convolved with an input signal x to encode it into a vector u ∈ R^K. Our first objective is to recover a pair (D, C) such that they are approximately dual (in the frame sense), that is

DC ∼ I,

as well as providing a sparse coding of the data.

We set out to learn both D and its dual C by adding to the cost function (2.9) a coding error that depends on how well the dual can recover the optimal representation:
\[
\min_{D,C,U} \|X - DU\|_F^2 + \tau\|U\|_1 + \eta\|U - CX\|_F^2 \quad \text{s.t.}\quad \|d_i\|_2 \le 1,\ \|c_i\|_2 \le 1, \tag{2.22}
\]
where η ≥ 0 weights the coding error with respect to the other terms, and both the atoms and the filters are constrained to have bounded norm to avoid trivial solutions to the problem.


The optimization scheme is the same described in Algorithm 2, but this time we have a new unknown, the matrix C. In this case we proceed iteratively, fixing two variables at a time and optimizing with respect to the third one.

With fixed D and C, we can apply algorithm (2.13) to minimize the functional E(U) = F(U) + J(U), where
\[
F(U) = \frac{1}{d}\|X - DU\|_F^2 + \frac{\eta}{K}\|U - CX\|_F^2 \quad\text{and}\quad J(U) = \frac{2\tau}{K}\sum_{i=1}^{N}\|u_i\|_1. \tag{2.23}
\]

The gradient of the (strictly convex) differentiable term F becomes

\[
\nabla_U F = -\frac{2}{d}\, D^\top (X - DU) + \frac{2\eta}{K}(U - CX),
\]
while the proximity operator corresponding to J is the soft-thresholding operator S_τ defined in Equation 2.15. Plugging the gradient and the proximal operator into the general equation (2.13), we obtain the following update rule:
\[
U^p = S_{\tau/(K\sigma_U)}\!\left[\left(1 - \frac{\eta}{K\sigma_U}\right)U^{p-1} + \frac{1}{\sigma_U}\left(\frac{1}{d}\,D^\top(X - DU^{p-1}) + \frac{\eta}{K}\,CX\right)\right] \tag{2.24}
\]

When U is fixed, the optimization problems with respect to D and C are decoupled and can be solved separately.

The same assumptions made in Section 2.4.3 for the dictionaries hold for the encoding matrix.

Plugging the appropriate gradients and projection operators into Equation (2.13) leads to the update step
\[
C^p = \pi_C\!\left(C^{p-1} + \frac{1}{K\,\sigma_C}\,(U - C^{p-1}X)\,X^\top\right). \tag{2.25}
\]
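A small projected-gradient sketch of the encoder update (2.25), in the same style as the dictionary update above; the rows of C are constrained to the unit ball, mirroring the constraint on the atoms. The names, the inner iteration count and the way σ_C is supplied are assumptions.

```python
import numpy as np

def encoder_update_step(X, C, U, sigma_c, n_iter=100):
    # Encoding-matrix update with U fixed: repeated application of Equation (2.25),
    # followed by the projection of every row of C onto the unit ball.
    K = C.shape[0]
    for _ in range(n_iter):
        C = C + (U - C @ X) @ X.T / (K * sigma_c)
        C = C / np.maximum(1.0, np.linalg.norm(C, axis=1, keepdims=True))
    return C
```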

2.5 A case study: denoising of medical images

The idea behind this case study is to promote a discussion on the potential of adaptive sparse coding for addressing different problems of interest to the medical imaging world. In order to make the discussion concrete, we focus on two state-of-the-art dictionary learning algorithms: the K-SVD algorithm, described in Section 2.4.1, and the method described in Section 2.4.2 to solve `1-regularized Dictionary Learning (`1-DL). We provide details on the two methods and compare their performances on the task of denoising, using a collection of MR images representing different anatomical regions. In this context, by using the two learned dictionaries we succeeded in improving the results of a state-of-the-art denoising method based on the Discrete Cosine Transform [GWE09]. We also stress that denoising is only one of the possible applications of these techniques, and that higher-level tasks such as detection or classification also benefit from them, as shown in the next chapters.

Remarks

Before presenting the whole case study, we summarize some considerations, some methodological and some more practical, about a number of issues we have been confronted with while working on this topic.

• The K-SVD algorithm proved to be very effective in practice and has the nice property that the level of sparsity of the representations is directly controlled through the `0 constraint T_0. However, one should keep in mind that the procedure adopted to solve the sparse coding step is greedy, and this may lead to some stability problems when there is a lot of correlation between the elements of the dictionary.

• In our experiments the `1-DL algorithm also proved to be quite efficient and effective. However, we observed some slight, unexpected dependence on the initialization of the dictionary. Of course this may be due to non-optimal convergence criteria used in the implementation. Nonetheless, we believe this is an important point that should be kept in mind while using the framework.

• Finally, we believe that the ability to be easily adapted to an online setting is a valuable further property of an algorithm for sparse overcomplete dictionary learning in the context of medical imaging. In this respect, the `1-DL algorithm is surely better suited than K-SVD, as several works extending the `1-based minimization to the online setting have already been proposed, and their derivation may be reused.

2.5.1 Experimental assessment

There are several specific applications where sparse and redundant representations of natural images have been shown to be very effective. The present section demonstrates the application of such representations to the denoising of medical images, to show that, even in this well-studied domain, better results can be achieved with an adaptive dictionary rather than a fixed one. However, we emphasize that our work aims at assessing the power of the representation rather than the performance on the specific application.


Figure 2.2: Examples of denoising results on two MR images: on the top row a slice from the Brainweb dataset, on the bottom row the image of a wrist. From left to right, the first column shows the original image, the second the image with Gaussian noise (µ = 0, σ = 10), the third column the denoised image (with `1-DL on top and K-SVD on the bottom), and the fourth the difference between noisy and denoised image.

Experimental setup

The tests were carried out on MR images of three different anatomical regions: wrist, brain and kidneys (see Figure 2.2 and Figure 2.7). The images of the wrist were acquired with a T1-weighted 3D Fast Gradient Echo sequence, the images of the kidneys with a T2-weighted sequence, while the brain images are simulated T2-weighted images belonging to the Brainweb synthetic dataset.

The experiments were based on the MATLAB code for denoising with K-SVD available on the page of Ron Rubinstein¹. The software was adapted to work with both K-SVD and `1-DL. All experiments compared the performances of the K-SVD dictionary, the `1-DL dictionary (both learned from data) and a data-independent DCT dictionary. The denoising is performed by iterating over all patches of a given size in the image, finding the optimal code u* of each patch via OMP, and reconstructing the denoised patch as Du*, where D is the dictionary under use. To remove more noise, and to avoid border effects, the patches are always overlapping, and the whole image is reconstructed by averaging the overlapping regions between patches. If the patch size is p, two consecutive patches have an overlap of n/2 pixels.

¹http://www.cs.technion.ac.il/~ronrubin/software.html

Figure 2.3: Examples of dictionaries learned by (a) K-SVD and (b) `1-DL, compared with (c) the fixed DCT dictionary.
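The patch-based procedure just described can be sketched in a few lines of Python. The OMP routine, the patch size, the stride and the mean-subtraction of each patch are illustrative assumptions (the actual experiments used the adapted MATLAB code mentioned above), but the block shows the overall code-and-average logic.

```python
import numpy as np

def omp(D, x, n_nonzero):
    # Minimal Orthogonal Matching Pursuit: greedily pick the most correlated atom,
    # then re-fit the coefficients on the selected support by least squares.
    residual, support = x.copy(), []
    u = np.zeros(D.shape[1])
    coef = np.zeros(0)
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    u[support] = coef
    return u

def denoise_patches(noisy, D, patch=8, step=4, n_nonzero=20):
    # Sparse-code every (overlapping) patch over D, reconstruct it as D u*,
    # and average the overlapping reconstructions.
    H, W = noisy.shape
    out = np.zeros((H, W))
    weight = np.zeros((H, W))
    for r in range(0, H - patch + 1, step):
        for c in range(0, W - patch + 1, step):
            p = noisy[r:r + patch, c:c + patch].astype(float).ravel()
            mean = p.mean()                                   # code the zero-mean patch
            u = omp(D, p - mean, n_nonzero)
            out[r:r + patch, c:c + patch] += (D @ u + mean).reshape(patch, patch)
            weight[r:r + patch, c:c + patch] += 1.0
    return out / np.maximum(weight, 1.0)
```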

The images to be denoised were obtained by adding Gaussian noise to the original ones. The experiments were made varying different parameters: the size of the patches, the level of sparsity of the coefficients and the level of noise to be removed. The sizes of all dictionaries are always four times the dimension of the training data, to keep a constant level of redundancy. In a first set of experiments, the training data are 40,000 patches coming from the image to be denoised. In a second type of experiment, denoising was performed with a dictionary learned from a set of noise-free images of the same anatomical region as the image to be denoised, with a total of 60,000 training data. Denoising results were compared by measuring the peak signal-to-noise ratio (PSNR), defined as:
\[
\mathrm{PSNR} = 20 \cdot \log_{10}\!\left(\frac{255\sqrt{n}}{\|I - I_{\mathrm{den}}\|_F}\right), \tag{2.26}
\]
where n is the total number of pixels, I is the original image and I_den is the output of the denoising procedure; the PSNR is measured in decibels (dB).
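Equation (2.26) translates directly into code; the small function below assumes 8-bit images (intensities in [0, 255]) stored as NumPy arrays.

```python
import numpy as np

def psnr(original, denoised):
    # PSNR of Equation (2.26): 20 * log10(255 * sqrt(n) / ||I - I_den||_F), in dB.
    n = original.size
    err = np.linalg.norm(original.astype(float) - denoised.astype(float))
    return 20.0 * np.log10(255.0 * np.sqrt(n) / err)
```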

Qualitative assessment

A first qualitative comparison of these different dictionaries is shown in Figure 2.3. Compared to DCT, the two adaptive dictionaries, both learned from the brain images, store more information about the structures of the district we are considering. Examples of the denoising results obtained with the two dictionaries on the wrist and brain images are shown in Figure 2.2. In both cases most of the difference image is made up of noise, a sign that these methods do not disrupt the original structures of the images.


Figure 2.4: Comparison of the PSNR achieved by K-SVD (initialized with DCT or randomly) and by DCT, for (a) varying sizes of the patches and (b) varying levels of sparsity.

Adaptive vs fixed

We first compared the denoising performance of an adaptive dictionary, K-SVD in this case, with respect to a non-adaptive one. Figure 2.4 shows the variation of the PSNR of the images denoised with the DCT dictionary and with two versions of the K-SVD dictionary, obtained with different initializations of the dictionary and the coefficient matrix: in the first one we start from a random selection of the (normalized) input data, in the second we start from a matrix whose entries are random numbers. The two plots show that the adaptive dictionaries always outperformed the fixed one.

Employing an adaptive dictionary adds to the computing time an overhead due to the learning process. The parameter that most affects it in the K-SVD case is the patch size: even so, it took less than a minute to compute a dictionary with a large patch size of 18x18. The patch size must be chosen not only with an eye to the computational cost, but also as a trade-off between the accuracy of the details (decreased by a big patch size) and the actual denoising effect (decreased by a small patch size). If the patches are too big there is a tendency to skip the reconstruction of the details, with an oversmoothing effect, while if the patches are too small the overall denoising effect is not evident. Generally the patch size used is 8x8 and the number of non-zeros is 20; on average, the time taken to compute the dictionary was less than 15 seconds.

Table 2.1: On the left, comparison between the PSNR achieved by the three dictionary methods on the different images we chose (σ = 20). On the right, PSNR achieved by the different methods with growing levels of noise.

          Kidney   Wrist   Brain   |   σ = 5   σ = 10   σ = 15
`1-DL     29.97    33.89   29.71   |   37.61   33.92    31.50
K-SVD     30.37    34.14   29.04   |   35.48   32.65    30.59
DCT       29.88    33.82   28.67   |   34.95   32.55    30.37

K-SVD vs `1-DL

The second comparison was performed between the two algorithms for dictionary learning, for a fixed patch size of 8 × 8 and 20 nonzero coefficients of the representation (the parameter τ in `1-DL was tuned accordingly). In Table 2.1 we report the results obtained on all three images for a fixed level of noise (σ = 20) and on the brain image for three other noise levels. The results of the DCT method are reported for the sake of completeness.

On the brain image the highest PSNR is always reached using the `1-DL dictionary, while on the other two images K-SVD achieves better results. The lowest PSNR is obtained with the DCT dictionary.

Learning general dictionaries of anatomical regions

All the tests above were performed using the same image for training and denoising, i.e. we learned dictionaries from noisy data, and we denoised data using such dictionaries. We then tried to construct a different kind of dictionary, where the training set came from a set of medical images of the same anatomical region as the image to denoise: namely, the following experiments were performed on the kidneys. We used the noise-free images shown in Figure 2.5. As can be seen from the figure, there is some variability among the different images, but the general structure of the region is quite stable. The idea behind this experiment was to stress the fact that it is possible to learn a dictionary from a certain class of data and still have good reconstructive power on a new image of the same class.

Figure 2.5: The noise-free images used to train a dictionary based on the anatomical region of the kidneys.

The plots in Figure 2.6 show the PSNR of the previous and the new version of a K-SVD dictionary (left) and of a `1-DL dictionary (both with patch size 8x8) as the level of noise grows. In Figure 2.7 we show a qualitative comparison between the reconstructions made with the two K-SVD dictionaries. In both cases, as the noise level grows the behavior of the two dictionaries becomes more and more similar; in fact, for σ > 10 both dictionaries could be applied to images different from the training ones without significant loss in performance. This is an important result, since it could remove the overhead due to the process of learning the dictionary on the specific image, by learning it once on a wider class of modality- and district-specific images.

Figure 2.6: Denoising performance of K-SVD (top) and `1-DL (bottom) for varying levels of noise. We show the PSNR achieved by training the dictionary on the same image to be denoised, and on a different training set.

2.6 Discussion

In this chapter we have presented the dictionary learning paradigm, along with a review of the relevant literature and the details of three different methods to learn dictionaries and representations of data. The case study of Section 2.5, proposed in [SCBS11], represents a first attempt to establish a more robust connection between medical image analysis and recent advances on sparse overcomplete coding and unsupervised dictionary learning. We have presented the experimental results of the selected algorithms for image denoising, which are very encouraging as they outperform the more standard technique based on DCT. We also remark that, although the experiments were run on 2D images, the extension to 3D is straightforward, a further advantage of these techniques.


Figure 2.7: Denoising on an MRI of the kidneys. (a) Original image, (b) image with noise, (c) image denoised with K-SVD trained on the same noisy image, (d) image denoised after training on a different set of noise-free images. Results with `1-DL were qualitatively similar.


Chapter 3

Dynamic Contrast Enhanced MRI Analysis

This chapter presents a data-driven method for unsupervised tissue segmentation from DCE-MRI acquisitions. Aside from the manual selection of a region of interest (ROI), the method is automatic. Given a DCE-MRI acquisition, after a preliminary stage in which signal distortions due to patient movement are attenuated by means of a motion compensation technique, a sparse representation is obtained from a dictionary of basis signals learned from the data. Since the basis signals resemble the prototypical behavior of the enhancement curves corresponding to different tissues, tissue segmentation is effectively achieved by applying standard clustering techniques on the obtained representation. By computing a different dictionary of basis signals for each dataset, our method exploits in full the adaptivity of dictionary learning.

3.1 Introduction

Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is a powerful imaging modality which, by allowing for the investigation of biological processes along the temporal axis, provides information about perfusion, capillary permeability, and tissue vascularity [EOS+08]. Typically, DCE-MRI is obtained by acquiring a series of T1-weighted sequences before, during, and after the injection of a contrast agent, such as gadoterate meglumine (Gd-DOTA). The contrast agent uptake is higher where the vascularization is stronger, resulting in signal enhancement and brighter image intensities after the injection. Clinically relevant, quantitative information can be extracted by a voxel-wise analysis of the time-varying intensity signal, also known as the enhancement curve, which shows a stereotyped behavior across voxels of the same tissue. The work described in this chapter presents an automated method for tissue segmentation from a given DCE-MRI acquisition, consisting of three stages. In the first stage, signal distortions due to patient movements are attenuated by means of a conventional motion compensation algorithm. In the second stage, a dictionary of prototypical curves and a sparse adaptive representation of the enhancement curves are obtained by solving a suitable optimization problem in a fully unsupervised setting (apart from the manual selection of a region of interest). In the last stage, tissue segmentation is achieved by applying spectral clustering to the obtained non-parametric representation of the enhancement curves. The method has been successfully tested on synthetic data and on annotated kidney data, for paediatric patients affected by renal dysfunction (RD). Preliminary results on the wrist, for patients affected by Juvenile Idiopathic Arthritis (JIA), are also reported.

The core of the method we propose consists of learning a sparse coding of the enhancement curves obtained from an adaptive dictionary of prototypical curves in an unsupervised setting. Like many of the methods found in the literature, we learn both the dictionary and the sparse codes at the same time by solving an optimization problem (see the works in [AEA06, BDE09, MBPS10, MSE08, RPCL06, RHBL07, LBRN06] for example). The use of temporal curves instead of image patches has been proposed by Kim et al. [KSU10] on electromyographic data and, independently, by us [CSBV11], in a preliminary conference version of this chapter, for DCE-MRI data. By computing a different dictionary for each dataset, our method fully exploits the adaptivity of dictionary learning. As a result, the number of prototypical curves remains relatively small without compromising the achievement of accurate, sparse representations.

The rest of this chapter is organized as follows: in Section 3.2 we discuss DCE-MRI and the relevant literature in more detail, and in Section 3.3 the proposed technique is described. In Section 3.4 experimental results obtained on synthetic and real data are presented. We also show a comparison with data annotated by radiologists. Finally, we draw our conclusions in Section 3.5.

3.2 Related work

In the literature, DCE-MRI analysis is tackled by means of two different parametric approaches: the first approach relies on a pharmacokinetic model of the contrast agent dynamics tuned to the specific process (or disease) under study [APA07, SWPY09, HGF+02]. Consequently, the estimated parameters have a direct physiological interpretation. The second approach parametrizes the shape of the enhancement curves with no direct link to the problem physiology [AXF+09, KBB+07, KBP05, GR09]. Segmentation is achieved after fitting a geometric model on the acquired enhancement curves.

From the computational viewpoint, the analysis of DCE-MRI poses several problems arising from the large amount of noise affecting the signal, patient movements during acquisition, and the need to discriminate between different tissues. Recently, several computational methods based on DCE-MRI have been proposed for the quantitative assessment of several diseases such as prostate cancer [APA07], breast cancer [AXF+09, SWPY09], cardiac and cerebral ischemia [HGF+02], renal dysfunction [ZSR+09], and rheumatoid arthritis [KBB+07, KBP05].


Figure 3.1: Pipeline of the proposed DCE-MRI analysis process.

In other methods, feature vectors extracted from the raw data [LLC+08], or the raw data directly [ZSR+09], are used as a basis for discriminating among different tissues, by means of supervised and unsupervised classification methods respectively. In all these cases, the proposed algorithms are fine-tuned to the specific medical context and often require a fair amount of work in the construction of the feature models.

Unsupervised feature selection and clustering have been used for segmenting medical images in various application domains and modalities: vessel segmentation [LT11a, LT11b], brain tissue segmentation [CE14], anomaly retrieval [BDM+12], and brain disease classification [DSS+12].

Over the last decades, the signal processing community has shown a growing interest in adaptive sparse coding, starting from the seminal work of Olshausen and Field [OF97]. Instead of using overcomplete, fixed dictionaries, like wavelets [Mal99], an adaptive dictionary and the corresponding sparse codes are learnt from data within an optimization framework. Very good results have been reported in denoising [AEA06], compression [AEA06], scene categorization, object recognition (see the works in [RBL+07], [MBPS10] for more examples) and image super-resolution. Other works cast the dictionary learning problem as a factor-analysis problem, with the factor loading corresponding to the number of dictionary elements used: in this case the number of atoms is obtained automatically. Other works use non-parametric Bayesian methods [ZCP+09, PC09] and the Indian buffet process [KG07, GG05].

In the work by Kim et al. [KSU10] the proposed method computes dictionary and sparse codes of 1-D signals acquired over time, in order to learn interpretable spatio-temporal primitives from motion capture data and to differentiate between spatio-temporal primitives by using the obtained atoms. The problem is formulated as a tensor factorization problem with tensor group norm constraints over the atoms as well as smoothness constraints.


Figure 3.2: An example of motion compensation. (a): image at time t1. (b): image at time t2, in which a wide movement of the right kidney in the ROI is highlighted. (c): the computed displacement field in the ROI, displayed at a larger scale. (d): the image at time t2 after motion compensation.

3.3 The proposed method

In this section we describe the three stages of the proposed method: motion compensation, learning the representation, and tissue segmentation. Figure 3.1 shows the complete pipeline of the method.

3.3.1 Motion compensation

During the past three decades several registration techniques have been developed and widely applied to medical imaging. Motion correction of DCE-MRI time series is a particular case of image registration, in which tissue motion and deformation are the result of breathing and sudden, jerky movements combined with the perfusion of the contrast agent: strong deformations and large displacements are especially prominent in dynamic cardiac imaging [CGMCAFAL13, MCS+02], breast imaging [GSL+06], and abdominal imaging [RMJOZ04]. In all these cases, a registration procedure is mandatory and may pose difficult computational problems.

In this work, even though the acquisition process lasts from 6 to 20 minutes, deformation and displacement are usually quite small. Spatial registration is necessary to compensate for the slight motion artifacts produced by the presence of soft tissues in the kidney, whilst it is almost never needed in the wrist. Consequently, in this work we adopt a simple motion compensation method based on standard optical flow computation: in particular, we employ a 2-dimensional optical flow method available in the Insight Toolkit Library (ITK). Given the simplicity of the registration tasks at hand, the adopted procedure produces results adequate for our purpose. As shown in Figure 3.2, motion compensation is only performed within a ROI outlined by hand.


3.3.2 Learning the representation

We now instantiate the problem of learning a dictionary and a sparse representation from the data to the case of time-varying signals, like DCE-MRI data. We denote an enhancement curve sampled at times t = 1, ..., p by means of a p-dimensional vector x = (x_1, . . . , x_p)^T. Without loss of generality we also assume that, for all t, x_t measures the difference between the samples at time t and time 1. This is equivalent to setting x_1 = 0 for every x.

Given n signals {x_1, . . . , x_n} ∈ R^p, we aim at finding a dictionary of K atoms {d_1, . . . , d_K} ∈ R^p such that x_i^m, the m-th component of the vector x_i, is given approximately by Equation 2.2 with only a few non-null u_i^j for each x_i, where superscripts run through rows and subscripts through columns. Thus u_i = (u_i^1, ..., u_i^K)^T is the coefficient vector for the curve x_i, corresponding to the i-th column of the K×n code matrix U. In a more convenient matrix notation, Equation (2.2) can thus be rewritten as X ≈ DU, with X denoting the p × n data matrix and D the p × K dictionary matrix.

In this work we use PADDLE¹, an optimization algorithm proposed in the work of Basso et al. [BSVV11] and based on first-order proximal methods, for both steps (ii) and (iii). The columns of the data matrix X correspond to the relevant voxels within the ROI. A voxel x is considered relevant depending on the value of the standard deviation σ computed on the samples x_1, x_2, ..., x_p. Since the uptake of the contrast agent is responsible for a high standard deviation, voxels are considered relevant, or irrelevant, if the standard deviation is higher, or lower, than a fixed threshold T. We compute the threshold T by means of the Otsu method [Ots75] under the assumption that, in the ROI, the standard deviation follows a bimodal distribution with the first peak very close to 0. Empirical evidence corroborates the validity of this assumption in all of our experiments. A further advantage of this preliminary pruning stage is a considerable speed-up of the entire procedure.
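The voxel pruning step can be sketched as follows. The array layout (time on the first axis), the function name and the use of scikit-image's threshold_otsu are assumptions made for the sake of the example; any Otsu implementation would do.

```python
import numpy as np
from skimage.filters import threshold_otsu

def relevant_curves(series, roi_mask):
    # series: array of shape (p, H, W) holding the DCE-MRI time series;
    # roi_mask: boolean (H, W) mask of the manually outlined ROI.
    curves = series[:, roi_mask].astype(float)   # one column per ROI voxel
    curves -= curves[0]                          # set x_1 = 0 for every curve
    stds = curves.std(axis=0)
    T = threshold_otsu(stds)                     # bimodal-std assumption
    keep = stds > T
    return curves[:, keep], keep                 # data matrix X and relevance mask
```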

The dictionary matrix D is initialized by randomly picking K columns from the data matrix X in the first loop of the alternating optimization between D and U. In our experiments, this choice led to a good trade-off between effective convergence of the algorithm and adaptiveness of the resulting dictionary to the data. No significant dependency of the obtained solution on the initialization step has been noted.

As a measure of the effectiveness of the obtained reconstruction, we compute the Peak Signal to Noise Ratio (PSNR) defined in Equation 2.26.

¹The open-source Python implementation is freely available at http://slipguru.disi.unige.it/research/PADDLE (retrieved October 12, 2011).


3.3.3 Tissue segmentation

In each given medical context, the physician is required to identify the number of different tissues of interest, each corresponding to a different type of enhancement curve. Relying on the faithful reconstruction obtained in the previous stage, segmentation of the various tissues can be effectively computed through standard clustering techniques. In this work we use k-means with the number of classes k set a priori by an expert. The k centroids are computed by minimizing the k-means distortion measure
\[
\min_{r_{ij}, \mu_j} \sum_{i=1}^{n} \sum_{j=1}^{k} r_{ij}\, \|u_i - \mu_j\|^2 \tag{3.1}
\]
in which the r_ij ∈ {0, 1}, with j = 1, ..., k, indicate which cluster j the vector u_i belongs to, while µ_j denotes the j-th cluster centroid. For the initialization of the k centroids we apply spectral clustering [vL07].
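A possible implementation of this stage, using scikit-learn, runs spectral clustering once to obtain initial labels, turns them into centroids, and then refines them with k-means. The function name and the default affinity of the spectral step are assumptions; the thesis only specifies k-means initialized via spectral clustering.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def segment_tissues(U, k, random_state=0):
    # U: (K x n) code matrix; each column u_i is the sparse code of one voxel.
    codes = U.T                                     # one row per voxel
    init_labels = SpectralClustering(n_clusters=k,
                                     random_state=random_state).fit_predict(codes)
    centroids = np.vstack([codes[init_labels == j].mean(axis=0) for j in range(k)])
    # k-means refinement of the distortion measure (3.1), started from the spectral centroids
    return KMeans(n_clusters=k, init=centroids, n_init=1,
                  random_state=random_state).fit_predict(codes)
```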

3.4 Experiments

In this section we present the experimental results obtained for the assessment of the method. Before discussing the results obtained on real data, we describe some preliminary validation performed on synthetic data.

In general, the overall dictionary learning scheme stops either upon reaching the maximum number of iterations, 10 in our case, or if the functional value changes by less than a fixed tolerance, here equal to 10^{-5}.

We assess segmentation accuracy by means of two different metrics. In the case of synthetic data we measure accuracy as the percentage of correctly classified voxels, i.e.
\[
\mathrm{Acc.}(\%) = 100 \cdot \frac{TP + TN}{N} \tag{3.2}
\]
where TP is the number of true positives, TN the number of true negatives, and N the total number of voxels.

In the case of real data we compare two different binary segmentations X and Y through the Dice metric [CCH06]
\[
\mathrm{DICE} = \frac{2\,|X \cap Y|}{|X| + |Y|} \tag{3.3}
\]

where | · | denotes the number of voxels in the set.
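Both metrics are one-liners in NumPy; the two helper names below are, of course, only illustrative.

```python
import numpy as np

def accuracy_percent(pred, truth):
    # Equation (3.2): percentage of correctly classified voxels.
    return 100.0 * np.mean(pred == truth)

def dice(X, Y):
    # Equation (3.3): Dice overlap between two binary segmentations.
    X, Y = X.astype(bool), Y.astype(bool)
    return 2.0 * np.logical_and(X, Y).sum() / (X.sum() + Y.sum())
```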


Table 3.1: Segmentation accuracy of the proposed method on synthetic data for different noise levels and dictionary sizes. With the sparsity rate kept fixed at 30%, each entry shows the percentage accuracy - computed as in Equation (3.2) - for additive zero-mean Gaussian noise with σ = 5, 10, 15 and 20 and for a number of atoms K = 7, 14, 28 and 70.

          K = 7   K = 14   K = 28   K = 70
σ = 5     80%     83%      100%     100%
σ = 10    63%     80%      82%      100%
σ = 15    63%     68%      79%      99%
σ = 20    61%     62%      75%      97%

3.4.1 Synthetic data

We generate 2-D images of 300×300 voxels simulating a phantom consisting of five tissues, each characterized by a different uptake of the contrast agent. As detailed in the work by Lavini et al. [LdJvdS+07] for the case of rheumatological pathology, we consider two types of synovial volume (inflamed and healthy), vessels, and two different enhancements, type 3 and type 6. For each of these five tissues we define an enhancement curve template as in Figure 3.3(a). We create four sequences of 200 frames corrupted by zero-mean Gaussian noise with standard deviation σ equal to 5, 10, 15, and 20 respectively.

The choice of the values of K and τ was driven by the quality of the segmentation performance. Consistently with what is reported in the work of Basso [BSVV11], Figure 3.4 and Table 3.1 show that K = 70 and a value of τ corresponding to a sparsity rate around 30% lead to the best segmentation accuracy.

As we will see in the case of real data, a value of K about an order of magnitude greater than the number of tissues to be distinguished effectively captures the intraclass variability of the enhancement curves relative to the same tissue, thereby leading to almost perfect segmentation. From Figure 3.4 it can also be seen that a quite wide range of sparsity rates is sufficient to achieve accurate segmentation.

As shown in similar settings (see the works in [AEA06, SCBS11], for example), denoising is the first effect of sparse coding. In order to assess the denoising performance of our method, we compute the PSNR over the synthetic data as in Equation (2.26). The obtained results are both qualitatively and quantitatively very similar for different amounts of noise. Therefore, in what follows, we only discuss the case of σ = 10. An example of the denoising effect in the obtained reconstruction can be seen in Figure 3.5.

The PSNR score of the noisy enhancement curves is 21.82 dB, while the PSNR of the denoised version is 31.46 dB. As a comparison, the score of the parametric model described by Agner et al. [AXF+09] is 27.28 dB.


Figure 3.3: Synthetically generated enhancement curves. (a): templates of the enhancement curves corresponding to the five different types of simulated tissue. (b): the five atoms most used in the reconstruction of synthetic data corrupted by noise. (c): the five principal components computed through principal component analysis. The vertical axes have been rescaled manually.

48

Page 57: Dipartimento di Informatica, Bioingegneria, Robotica ed ...Universita degli Studi di Genova` Dipartimento di Informatica, Bioingegneria, Robotica ed Ingegneria dei Sistemi Dottorato

0 20 40 60 80 10050

60

70

80

90

100

Figure 3.4: Percentage segmentation accuracy of our method on synthetic data as a function of the percentage of non-null entries in the code matrix U. The accuracy, computed as in Equation (3.2), has been averaged over ten runs obtained by adding zero-mean Gaussian noise with σ = 20 and for K = 70. The grey band shows the standard deviation.


Figure 3.5: Denoising and signal reconstruction on synthetic data. The dashed curve in light grey is the noise-corrupted version of the template curve depicted in black. The dashed-dotted and dashed grey curves are the reconstructions obtained by our method and by the method described in the work by Agner et al. [AXF+09] respectively.


From Figure 3.5 it can also be appreciated that the reconstruction obtained by means of our method does not smooth out the relevant features of the underlying signal.

Figure 3.3(b) shows the atoms that are most frequently used in the reconstruction. It is interesting to observe that these atoms, up to a scale factor, resemble the different types of simulated enhancement curves, see Figure 3.3(a). In contrast, Figure 3.3(c) shows the first 5 principal components of the matrix XX^T, which cannot be readily interpreted in terms of enhancement curves.

3.4.2 Real data

Let us now assess the potential of the proposed method on real data. We consider two different anatomical regions, with different issues depending on the nature of the tissues we want to segment: kidneys and wrists.

Renal dysfunction dataset

DCE-MRI plays an important clinical role in the assessment of kidney damage caused by renal dysfunction. In order to obtain reliable estimates of functional parameters, an accurate segmentation of the parenchymal tissue within each kidney needs to be computed.

A DCE-MRI dataset of 26 pediatric patients with renal dysfunction was acquired using a 1.5T MR system at the Istituto Giannina Gaslini (IGG). The images, in the kidney anatomical region, were coronal 3D fast field echo (FFE), each of 256×256×9 voxels (7 mm slice thickness and 1.33 mm pixel spacing), with between 108 and 132 time points depending on the acquisition. The temporal sampling rate was 5 seconds.

Figure 3.6 shows an example of segmentation of both kidneys as a whole obtained by means of our method.

Using as reference the manual segmentation performed by a trained expert on the entire dataset, we compare the segmentation obtained by our method against the segmentation described in the work by Zoellner and colleagues [ZSR+09]. Table 3.2 reports the results in terms of the Dice metric of Equation 3.3.

It can be seen that the average performance of our method is approximately 5% better, with a considerably smaller standard deviation.

The segmentation of the parenchyma can be obtained by refining the segmentation step. Instead of two clusters - i.e., kidney vs. non-kidney as in Figure 3.6 - in this set of experiments we consider eight clusters. Figure 3.7 shows an example of the obtained results for the parenchyma. The figure clearly shows the very good agreement between the segmentation obtained by the proposed method and the manual segmentation performed by a trained expert.

Figure 3.6: Qualitative assessment of kidney segmentation. The black and white curves denote the manual segmentation and the segmentation obtained by our method on the left, (a), and right, (b), kidney respectively. Exact superposition is shown in grey.

Table 3.2: Segmentation accuracy for the kidneys as a whole. Comparison of the Dice metric computed as in Equation (3.3) for our method (top row) and for the technique based on raw data described by Zoellner [ZSR+09]. From left to right, the columns report the Dice metric plus or minus its standard deviation for the left (L), the right (R), and both kidneys (L+R).

                      L             R             L + R
Our method            0.88 ± 0.06   0.78 ± 0.09   0.83 ± 0.09
Zoellner [ZSR+09]     0.83 ± 0.09   0.75 ± 0.11   0.78 ± 0.11

Figure 3.8(a) and (b) show the histograms of atom usage in the reconstruction of two of the eight tissues: the parenchyma and the collecting duct. Figure 3.8(c) and (d), instead, show the atoms most used in the reconstruction of both tissues. According to the physicians, these atoms resemble the behavior of the enhancement curves of the corresponding tissues. In agreement with the experiments performed on synthetic data, it can be seen that an effective representation is obtained by employing a small number of similar atoms able to capture the intraclass variability among the enhancement curves of the same tissue.

An indirect measure of the parenchyma segmentation accuracy can be obtained by measuring the split renal function (SRF) [RHCT03]. Following the protocol in the works by Rohrschneider et al. [RHCT03] and Vivier et al. [VDTD10], the SRF is computed in terms of the average of all the enhancement curves from the voxels within the parenchyma. Let left-avg-curve and right-avg-curve be the average curves computed for the left and the right parenchyma respectively.


Figure 3.7: Qualitative assessment of parenchyma segmentation. (a): manual segmentation. (b): segmentation obtained by our method.

Table 3.3: Quantitative assessment of the correlation between SRF scores. Top row: Pearson coefficient for our method (OM) vs. the first expert, 1st, our method vs. the second expert, 2nd, and first vs. second expert. Bottom row: same for the Spearman coefficient.

                   OM vs. 1st   OM vs. 2nd   1st vs. 2nd
Pearson coeff.     0.83         0.87         0.86
Spearman coeff.    0.83         0.71         0.78

If we denote with a_l and a_r the areas beneath left-avg-curve and right-avg-curve respectively, computed in a time interval of one minute starting from the wash-out in vessels, the SRF is defined as SRF = 100 × a_l/(a_l + a_r). In our work we compare the SRF obtained from the segmentation results of our method against the SRF computed by two experts employing the MRU plug-in for ImageJ2, a semi-automatic tool specifically designed for evaluating functional parameters of renal dysfunction.
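As an illustration, the following sketch computes the SRF from the two average parenchymal curves. It assumes uniform temporal sampling (5 seconds per time point for this dataset) and that the index of the vessel wash-out is given; the function and variable names are hypothetical.

import numpy as np

def split_renal_function(left_avg_curve, right_avg_curve, washout_index, dt=5.0, window=60.0):
    # Areas beneath the average curves over one minute starting at the wash-out.
    n = int(window / dt) + 1
    t = np.arange(n) * dt
    left = np.asarray(left_avg_curve)[washout_index:washout_index + n]
    right = np.asarray(right_avg_curve)[washout_index:washout_index + n]
    a_l = np.trapz(left, t)
    a_r = np.trapz(right, t)
    return 100.0 * a_l / (a_l + a_r)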

Figure 3.9 shows the Bland-Altman plots of the computed SRF values.

Figure 3.9(a) and (b) depict the comparison between our method and the estimates obtained by the two experts using MRU, whilst Figure 3.9(c) shows the comparison between the two experts' estimates. Notice that, apart from one case, see Figure 3.9(a), all the estimates obtained by our method vs. the experts fall within the corresponding confidence intervals. Interestingly, the same happens when comparing the two experts' estimates. As a further measure of consistency, we report both the Pearson and Spearman coefficients in Table 3.3.

2http://www.univ-rouen.fr/med/MRurography/accueil.htm, Retrieved November 30, 2011


[Figure 3.8 plots, panels (a)-(d): histograms of atom usage (x-axis: Atoms) and the most used atoms (#19 and #45 for the parenchyma, #40 and #47 for the collecting duct) as a function of the Time Step.]

Figure 3.8: Renal dysfunction dataset dictionary. Histogram of atom usage and plot of the two atoms most used in the reconstruction of the enhancement curves of the clusters corresponding to the parenchymal, (a) and (b), and the collecting duct tissue, (c) and (d), respectively.


[Figure 3.9 plots, panels (a)-(c): Bland-Altman plots of the difference between SRF estimates vs. the mean SRF.]

Figure 3.9: Bland-Altman plots of the SRF between our method and the first expert, (a), our method and the second expert, (b), and the first and second expert, (c). In each plot the continuous line is the average of all the differences, whilst the two dashed lines delimit the confidence interval, computed as the average ±1.96 × σ, with σ the corresponding standard deviation.


Table 3.4: Comparison of the signal recovery performance and computational time between our method (top row) and two other approaches (bottom rows), on the wrist dataset. From the leftmost to the rightmost column: PSNR average and standard deviation, avgPSNR and σPSNR, and computational time, avgtime and σtime, over the entire dataset.

Method            avgPSNR   σPSNR      avgtime   σtime
Our method        37 (dB)   4.1 (dB)   14 (s)    2.4 (s)
Agner [AXF+09]    34 (dB)   4.2 (dB)   12 (s)    2.4 (s)
Guo [GR09]        19 (dB)   6.1 (dB)   305 (s)   76 (s)

According to the opinion of medical experts, these results indicate that our method leads to results which are both very accurate and reliable.

Juvenile idiopathic arthritis dataset

In order to verify the versatility of the proposed method we also considered a dataset of pediatric patients affected by juvenile idiopathic arthritis (JIA).

In the current clinical practice JIA diagnosis is based on conventional radiology, but DCE-MRI appears to be very useful for reaching an accurate and reproducible quantitative estimate of the inflamed synovial compartment [DMMT10, MDB+10]. From the computational viewpoint the problem is the separation of synovial volumes from vessels, since the signal intensity of the enhancement curves of both tissues is sustained. As shown in the simulation of Figure 3.3(a), the main difference appears to be in the transient phase, in which the response in vessels peaks before leveling off.

A dataset of 12 pediatric patients with active JIA was acquired using the same MR system described above. The images, in the wrist anatomical region, were coronal 2D FFE, each of 160×160 voxels (10 mm of slice thickness and 1.06 mm of pixel spacing) and with a time resolution between 60 and 120 time-points, depending on the acquisition. The temporal sampling rate was 3 seconds.

Table 3.4 shows the obtained results in terms of the PSNR computed as in Equation 2.26.

From the two leftmost columns of Table 3.4 it can be seen that our method achieves a better average PSNR with a smaller or similar standard deviation with respect to both the models in the works by Agner [AXF+09] and Guo [GR09] respectively. Furthermore, from the two rightmost columns of Table 3.4, it can be seen that from the computational viewpoint our method is almost as good as the fastest of the alternative methods.

Figure 3.10(a) shows an example of the segmentation obtained by means of our method. Overall the computed results have been considered very accurate by the medical doctors involved in the study, and a thorough quantitative analysis on a larger number of patients is under way. Figure 3.10(b) shows the results achieved by the method by Agner [AXF+09]. Notice the visibly larger percentage of false negatives for the synovial volumes and false positives for vessels with respect to the segmentation of Figure 3.10(a).

Figure 3.10: Qualitative comparison of the segmentation obtained by means of our method, (a), and by the method by Agner [AXF+09], (b). Vessels and synovial volume are in white and black respectively.

Even if the ground truth is not readily available, an indirect measure of the effectiveness of our method can be obtained by looking at the shape of the obtained atoms. The black and the grey curves in Figure 3.11 depict the atoms most frequently used in the reconstruction of the enhancement curves corresponding to synovial volumes and vessels respectively.

As confirmed by trained physicians, the two atoms capture the main features of the corresponding tissue response and constitute a very good starting point for a segmentation stage, the quality of which greatly depends on the effectiveness of the obtained representation.

[Figure 3.11 plot: the two most used atoms (Atom #3 and Atom #6) as a function of the Time Step.]

Figure 3.11: JIA dataset dictionary. The black and grey atoms, the ones most used in the reconstruction of the enhancement curves in the clusters corresponding to synovial volume and vessel tissue, resemble the expected behavior of the contrast uptake as a function of the time steps.

3.5 Discussion

In this chapter we proposed and discussed an unsupervised method for tissue segmentation in DCE-MRI which consists of three steps. In the first step the acquired images are spatially registered to compensate for patient movements. Then, the time-varying signals measured at each voxel are represented as a sparse linear combination of adaptive basis signals. Both the coefficients of the combination and the basis signals are learned from the available data by solving a suitable optimization problem of adaptive and sparse dictionary learning. Finally, in the third step the identification of the different tissues is obtained by means of standard clustering algorithms applied to the computed representation.
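As an illustration of this third step only, the sketch below clusters the per-voxel sparse codes with k-means; this is one possible choice of standard clustering algorithm, not necessarily the one adopted here, and the array names and shapes are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def segment_from_codes(codes, n_tissues, volume_shape):
    # codes: (n_voxels, n_atoms) matrix of sparse coefficients, one row per voxel.
    kmeans = KMeans(n_clusters=n_tissues, n_init=10, random_state=0)
    labels = kmeans.fit_predict(codes)
    return labels.reshape(volume_shape)

# e.g. two clusters for kidney vs. non-kidney, eight for the parenchyma refinement:
# label_volume = segment_from_codes(U, n_tissues=8, volume_shape=(256, 256, 9))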

The presented analysis shows that the proposed method has several interesting features. First, the basis signals resemble prototypes of the acquired enhancement curves. This property leads to a naturally sparse representation of the original enhancement curves, which can be easily interpreted in terms of the used atoms. Experiments indicate that the intraclass variability of the enhancement curves corresponding to the same tissue is captured by a dictionary whose size is an order of magnitude larger than the number of different tissues of interest. Second, the obtained reconstructions are denoised versions which retain all the relevant features of the underlying curves, without the need of ad hoc hypotheses on the shape of the acquired signals. Consequently, the segmentation obtained by means of standard clustering techniques is very accurate. Third, the method is computationally efficient and provides results which appear to be independent of the initialization conditions. Finally, the method can be extended to different medical contexts with little effort, leveraging on the adaptiveness of the whole framework and the minimal request of supervision from the physician side.

Future work focuses on the implementation of the proposed scheme on distributed architectures. In conclusion, we believe that the proposed method can constitute the core of a computer aided diagnosis system for DCE-MRI studies, at least in the domains in which motion compensation is not too difficult a problem.


Chapter 4

Background Modeling

In this chapter sparse and adaptive representations are applied to background modeling of a scene captured by a fixed camera. The main idea is to build a robust and dynamic background model able to deal with illumination changes and modifications of the background without yielding false positives. The descriptive power of sparse codes and dictionary learning is thus exploited to learn all the possible configurations that a background can reach over time. This framework can be very useful for video surveillance, but its output can also be used as a preprocessing step for different analyses of the data (usually the foreground is the interesting part of a video sequence).

4.1 Introduction

Modeling the background of a scene viewed by a fixed camera is a classic problem of computer vision. The goal is to build a model of the background of a scene, in order to easily detect the objects moving in the foreground. It can be seen as a pre-processing step for many other applications, from video surveillance to the analysis of the behavior of the people in the scene. Its effective solution is an essential step for addressing subsequent problems, like tracking and recognition of moving objects [WADP97, SG99, ST03, TC11, NDLO09] and dynamic scene understanding [ZJ05, WRB06, NO12, AMP06, JH96, MT11], common in video surveillance [Cuc05, HXF+07, LCST06, ME02, BHH11], human-computer interaction [SFGK02, GFMO12, PSK12, WZS12], and industrial applications [XZTH90, DSAW99, FKSJS08]. Aside from the unavoidable noise affecting image acquisition, several factors contribute to make this problem still challenging: among the most important we mention illumination changes - no matter whether happening smoothly or abruptly over time - intermittent image variation induced by objects belonging to the background, like foliage or poles shaken by the wind in an outdoor scene, and permanent background changes, like a door which may be open or closed in an indoor environment.


In building a background model for a scene viewed by a fixed camera a method can rely on some specificities. First, the background - or a great part of it - rarely changes over time. When it does change, the new background remains stable, at least for a while. A consequence of the background temporal stability is that the observed image sequence can be thought of as a stream of weakly supervised noisy input data. In this work we model the background by making use of dictionary learning techniques inducing sparse representations [LBRN06]. This choice allows us to leverage on the specificities above. The image is divided into (possibly overlapping) patches of equal size and each patch is processed individually, in order to exploit the locality of changes typical of the application domain and to allow for high parallelization. In a bootstrap phase, the dictionary is simply derived from the first few frames of the video sequence. At run time our method approximates a patch with a sparse linear combination of atoms (forming the dictionary) learnt online from the image stream. If the reconstruction error is high, a change, or anomaly, is detected in the image patch. Long lasting anomalies call for a dictionary update. The fact that many consecutive very similar image patches are fed as examples into the dictionary learning module ensures that, for each patch, the dictionary grows adaptively in size and only when the reconstruction accuracy of the current image patch is not sufficient. By looking for sparse reconstructions, not only is the dictionary size kept under control, but the obtained reconstruction can also be computed and maintained effectively. Image variations not leading to stable changes do not trigger an update of the current dictionary and are discarded. The proposed online learning procedure can be viewed as the core of a change detection system able to automatically adapt to the environment evolution, keeping track of the relevant background variations without the need of human intervention.

We assess our method on two benchmark datasets (the SABS dataset and the Change Detection dataset) to evaluate its appropriateness to deal with different types of scenarios and motion. Also, we report the results obtained by the method across long time observations: we process a video stream acquired in-house, covering two long periods of time on two consecutive days, and show how our method is able to learn illumination patterns that periodically occur at the same time of the day. Similar considerations could be made on other types of stable scene changes and intermittent motion patterns.

The remainder of the chapter is organized as follows. In Section 4.2 we review the relevant literature about background modeling, Section 4.3 describes the Background Modeling Through Dictionary Learning (BMTDL) method we propose, while Section 4.4 reports an analysis of the method on a few meaningful study cases. Section 4.5 is about the method assessment on the two benchmark datasets, while Section 4.6 shows the method performances when long time observations are available. Finally, Section 4.7 is left to a final discussion.


4.2 Related work

The literature of background modeling is rather wide and a complete overview is out of the scope of this chapter – the interested reader is directed to [RAAKR05, Bou11] for complete surveys on the subject. Here we just summarize the main contributions to the topic. A common choice is to model each pixel of the background independently. Very popular unimodal approaches, as the (weighted) running average, filter and gaussian-based methods [CGT04, AW09, WADP97], fall into this category. Probabilistic methods are often based on the temporal prediction of the pixel evolution over time by means of filters – e.g. Wiener or Kalman filters [TKBM99, MMSZ05]. Despite their computational efficiency, all these methods leave the spatial consistency of the results in charge of the successive post-processing.

Methods that operate on blocks rather than single pixels have also been proposed. In [SWFS03] the authors assume that neighboring blocks of background pixels follow similar variations over time, and train a Principal Component Analysis model for each block. However, such a technique lacks a mechanism to adapt the models over time. A two-level method has been adopted in [LLC09], that first determines background image blocks by means of classification, and then exploits the obtained outcomes to perform block-wise updates.

The work in [CSD+08] is an attempt to cast the background subtraction task in the framework of compressive sensing. A major drawback is that it needs the foreground object to be of relatively small size with respect to the camera view. In [MV10], instead, the authors are inspired by the biological mechanisms of motion-based perceptual grouping. The solution they propose adapts well to highly dynamic scenes, but is characterized by a very high processing time.

In highly complex and challenging environments, the use of a single value or model for each pixel leads to inaccurate results. In these cases, the problem can be better dealt with by means of the so-called multimodal background models. Perhaps the most popular model belonging to this category is the Gaussian Mixture Model and all its many variants (see e.g. [SG99, PS02]), consisting in modeling the distribution of pixels over time with a weighted mixture of gaussians. The major drawbacks of the method reside in the strong assumption that the background is visible most of the time and is characterized by a low variance. Parameter selection may also be very complex. To avoid the choice of a specific probability density function, non-parametric models have also been proposed, as the ones based on kernel density estimation [EHD00, MP04]. These methods are based on building a histogram of background values accumulated on the recent pixel history. However, they often fail to cope with concomitant events evolving at different speeds, and strongly depend on an appropriate choice of the time interval.

An alternative relies on an approach based on codebooks [KCHD05]. Each pixel is described via a certain number of codewords, each one representing a particular color configuration of that pixel when belonging to the background. Improvements of the general paradigm that incorporate spatial and temporal context of the pixel have been proposed [WP10, BVD11].

More relevant to our work are previous approaches tackling background modeling as a sparse recovery problem [DH08, DTH09, SDB+11, CXWK11]. The common denominator to all these works is the problem formulation, which relies on the definition of the foreground as the error between an image frame and its approximation – obtained as a linear combination of a fixed set of frames – and on the attempt of sparsifying the contributions to such error, leveraging on the assumption that most of the image locations correspond to background. The works in [DH08, DTH09] represent the first attempts in this direction, the former building the linear approximation on the basis of a fixed coding matrix obtained in a training phase as a concatenation of images, the latter comparing different methods to select a dictionary. More recently the sparse coding step has been improved [SDB+11, CXWK11] and a naive procedure for dictionary update at fixed intervals has been proposed [SDB+11].

Our method represents a joining link between block-based approaches, to account for spatial consistency, and multi-modal methods, to deal with arbitrarily complex scenarios and dynamic events.

Although inspired by similar considerations, our work significantly diverges from previous applications of sparse coding and dictionary learning to background modeling and foreground detection [DH08, DTH09, SDB+11, CXWK11]. First, we aim at sparsifying the patch representation rather than the error contributions, leading to a different mathematical formulation of the problem. Second, our method is formulated as a set of many small-size optimization problems – one for each patch – rather than considering a single global problem defined on data of far higher dimensionality (size of a patch as opposed to size of the entire image). Third, the update procedure is designed to be activated only when and where it is necessary. Each dictionary size reflects the amount of dynamics of the corresponding patch, since the model controls the amount of redundancy and is able to adapt to different complexities. As a consequence, the spatial demand is not uniform on all image locations, but is larger when a richer set of configurations needs to be captured, while it is very small on more stable regions.

Related to this aspect is the computation of the pixel level of detail, achieved by a more traditional pixel-based background subtraction and only applied to the image patches that changed with respect to their background model. Thanks to this coarse-to-fine approach we are able to control the computational cost of the method, while achieving accurate results at pixel precision whenever needed.

4.3 The proposed method

Sparse dictionary learning (DL) aims at building data representations by decomposing each datum into a linear combination of a few components selected from a dictionary of basic elements, called atoms. The adaptiveness of the dictionary is achieved through the process of learning the atoms directly from the input data instead of using a fixed, over-complete dictionary derived analytically (as for Wavelets or the Discrete Cosine Transform [Mal99]).


4.3.1 Dictionary learning for background modeling

We now discuss how to apply dictionary learning to the problem of adaptively modeling the background of a scene. In this work we assume our data are image patches of a fixed size acquired by a fixed camera at different time instants. The choice is motivated by different considerations. A pixel-based modeling of the background, as the one proposed in [SG99, KCHD05], does not appear to be informative enough for learning a dictionary. Conversely, methods based on learning a dictionary of the entire image [DH08, DTH09, SDB+11, CXWK11] do not seem to exploit the peculiarities of the application domain, where changes are often local and usually affect well defined portions of the image plane. Also, different portions of the image may need models of different complexity, or may require more frequent updates. In this case, it might be an overkill to update the whole frame-model every time a small local background change occurs. Finally, a patch-based model allows us to choose an appropriate scale for the model (related to the patch size) and, in principle, to exploit prior information by selecting an appropriate size for a given position on the frame.

In what follows, each patch will be named p^t_xy, meaning that the area we are considering is located at position (x, y) at time instant t. A patch evolving in time will participate in forming and updating a specific dictionary D_xy.

A peculiarity of the application domain is that data are provided as a (video) stream, one datum at a time. Therefore, our data are a sequence of examples x_i, i = 1, . . . – each one being a patch unfolded into an n-dimensional vector – where the observation at time t could possibly update the previous solution. In the machine learning terminology this corresponds to an online learning procedure. Notice that the algorithm described in the previous section naturally applies to this setting, as we will describe in detail when presenting the online procedure.

If most training examples are very similar, as in the case of an image sequence obtained from a fixed camera looking at a slowly changing background, the atoms correspond to denoised versions, or prototypes, of the inputs [AEA06, SCBS11]. Over time, a richer dictionary is expected to arise in order to be able to capture greater (and permanent) changes in the background as seen from the given image patch. Sparsity, which has been reported to favor discriminative power in subsequent classification tasks [RHBL07, LBRN06, BSVV11], here ensures a model of the background consisting of a linear combination of a few prototypes. The prototypes actually used in the model provide information which, in subsequent stages, can be usefully employed to reason on the time evolution of the viewed patch. The parameter K controls the number of learnt atoms: in our case it is typically small because relevant background changes are expected to be rare.


4.3.2 The BMTDL method

We now summarize our method for Background Modeling Through Dictionary Learning (henceforth BMTDL).

As a first thing, we divide each processed video frame into a regular grid of (possibly overlapping) patches. The whole processing scheme is carried out on each patch individually, allowing for high parallelization. In what follows, for clarity, unless otherwise stated, we will concentrate on a single patch.
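A minimal sketch of how such a grid of overlapping patches can be built from a grayscale frame is given below, with each patch unfolded into a vector; the patch size and overlap values are only examples (the values actually used are discussed in Sections 4.4 and 4.5).

import numpy as np

def extract_patches(frame, patch_size=40, overlap=10):
    # Returns a dict mapping the top-left corner (x, y) of each patch
    # to the patch unfolded into a vector of length patch_size**2.
    step = patch_size - overlap
    h, w = frame.shape
    patches = {}
    for y in range(0, h - patch_size + 1, step):
        for x in range(0, w - patch_size + 1, step):
            patches[(x, y)] = frame[y:y + patch_size, x:x + patch_size].reshape(-1)
    return patches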

We assume we are given a sequence of patches {p^t_xy}_t which may be seen as noisy versions of the same information if no dynamic events occur within the patch. All these elements may be approximated as linear combinations of the same dictionary D_xy, constructed over time and updated only when necessary. We then assume this dictionary may allow us to reconstruct different realizations of p_xy in a normal configuration (that is, when nothing significant occurs in the patch).

At run time t, the BMTDL method approximates the patch p^t_xy w.r.t. D_xy. If the reconstruction error is small, this means the current patch can be accurately approximated with the available dictionary, therefore the patch is likely to be a background patch. Otherwise an anomaly is detected in the patch, possibly caused by the presence of a foreground object. If such an anomaly continues for some time, it probably means the background has changed permanently, and the dictionary needs to be updated to accommodate the new stable information on the area.

The method can be seen as a state machine: in status 1, Normal, nothing is happening in the patch. In status 2, Anomaly, a foreground object is altering the appearance of the patch. States 3 and 4, Update and Clean respectively, keep the background model up-to-date while controlling its size. A graphic description of the system can be seen in Figure 4.1. In the remainder of the section, each part of the method is described in greater detail.

4.3.3 Run time analysis

In a bootstrap phase, each dictionary D_xy is initialized with the first k patches normalized by their average, in order to constrain the ℓ2 norm of the atoms to be no greater than 1: D_xy = {p^t_xy / p̄ : t = 0, . . . , k−1} with p̄ = (1/k) Σ_{t=0}^{k−1} p^t_xy. Notice it is not important that the initial dictionary is meaningful, since the run time procedure described below will take care of noise or moving objects included in the initial dictionary. We now consider each possible state of the system at a given time instant t, following the bootstrap (see Figure 4.1).

Figure 4.1: The structure of the proposed system.

NORMAL: at run time, at each instant t the input patch p^t_xy is decomposed with respect to the current dictionary D_xy. To do so we compute a feature vector u such that p^t_xy ≈ D_xy u, by minimizing Equation 2.11 with respect to u only. This can be done by applying the iterative gradient descent with the update rule (2.16). At each t, u can be initialized with the output of instant t − 1, thus obtaining a very fast convergence. We then estimate the reconstruction error ||p^t_xy − D_xy u||_2 / ||p^t_xy||_2.

If the reconstruction error is lower than a threshold τerr, the patch remains in a NORMAL status; the system finds the non-zero coefficients of u and increases a usage counter assigned to their corresponding atoms. These counters represent a measure of how many times an atom has been used. They will be needed in a subsequent step to discard low-rated atoms from the dictionary, in order to control its space occupancy.
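Since Equations 2.11, 2.15 and 2.16 are not reproduced here, the following is only a generic ISTA-style sketch of this coding step (a gradient step on the quadratic data term followed by soft-thresholding), including the warm start from the previous frame; the step size choice and the parameter names are assumptions.

import numpy as np

def soft_threshold(v, thr):
    # Component-wise soft-thresholding, the proximal operator of the l1 norm.
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def sparse_code(patch, D, tau, u_init=None, n_iter=50):
    # ISTA-style sparse coding of one unfolded patch w.r.t. the dictionary D (n x K).
    n, K = D.shape
    u = np.zeros(K) if u_init is None else u_init.copy()
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # inverse Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = D.T @ (D @ u - patch)          # gradient of 0.5 * ||patch - D u||^2
        u = soft_threshold(u - step * grad, step * tau)
    rec_err = np.linalg.norm(patch - D @ u) / np.linalg.norm(patch)
    return u, rec_err

# warm start from the previous frame: u_t, err = sparse_code(p_t, D, tau=0.1, u_init=u_prev)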

ANOMALY: If the reconstruction error is greater than the threshold τerr, an anomaly is detected and the system changes status. A counter measuring the persistence of the change is increased.

UPDATE: If the system remains in the state ANOMALY for τtemp frames, it means that a stable background change occurred in the scene. Using the last τtemp frames as training data, we learn a new dictionary D^update_xy of a fixed size Kupdate by minimizing Equation 2.11, and add the newly estimated atoms to the main dictionary: D_xy = D_xy ∪ D^update_xy. Then the system returns to the state NORMAL and begins to process new patches with the updated dictionary. This step provides a good adaptation of the system to new background configurations without losing relevant information about previous configurations of the scene. This allows us to obtain a background model with a memory, so that if some event happens with a certain frequency, with the exception of the first time, it will not be classified as foreground. The value of τtemp is influenced by the stationarity property of the background under analysis and in turn influences its capability of adaptation in time [TKBM99]. It turns out that it is strongly dependent on the specific scene and its dynamics, thus these aspects should be considered in the selection process (similar considerations may be found in [BVD11, Bou11, BHH11, KCHD05]). The size of the update, Kupdate, should be chosen so as to control the overall size of the dictionary while, at the same time, allowing for variations of different complexity.
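A toy sketch of this update step is given below: it learns a handful of new atoms from the buffered frames by a simple alternating minimization (reusing the sparse_code sketch above) and appends them to the patch dictionary. It is purely illustrative and not the exact optimization scheme of Equation 2.11; the least-squares dictionary step and all names are assumptions.

import numpy as np

def learn_update_dictionary(buffered_patches, k_update=5, tau=0.1, n_alt=10):
    # buffered_patches: list of the last tau_temp unfolded patches observed during the anomaly.
    X = np.column_stack(buffered_patches)                  # n x T data matrix
    n, T = X.shape
    rng = np.random.default_rng(0)
    D_new = X[:, rng.choice(T, size=k_update, replace=False)].astype(float)
    D_new /= np.maximum(np.linalg.norm(D_new, axis=0), 1e-12)
    for _ in range(n_alt):
        U = np.column_stack([sparse_code(X[:, t], D_new, tau)[0] for t in range(T)])
        D_new = X @ np.linalg.pinv(U)                       # least-squares dictionary step
        D_new /= np.maximum(np.linalg.norm(D_new, axis=0), 1e-12)  # keep atoms at unit norm
    return D_new

# main dictionary update: D_xy = np.column_stack([D_xy, learn_update_dictionary(buffer)])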

CLEAN: In order to control the size of the model, which may grow significantly as a consequence of different update steps, and to discard atoms which are not useful or even wrong, a cleaning step is required.

Every τB frames, the system checks the usage counters associated with the recently added atoms and discards the ones which have seldom been used (whose counter is below a threshold which may depend on the expected duration of foreground dynamic events).

Doing so, the part of the dictionary that is actually checked is the one that has been updated last, which might contain atoms produced by a leakage of foreground information. In this case it is reasonable to assume the corresponding atoms will be rarely (or never) used in the future.

Once a portion of the dictionary has been cleaned, it can be considered as a consolidated part of the background history.

Algorithm 3 summarizes the run time procedure for background modeling.


Figure 4.2: The three steps needed to obtain a pixel-precision binary map from the patch-precision map. Left: the coarse patch-precision binary map. Center: the absolute difference between the current patch and the most similar background atom. Right: the pixel-precision binary map obtained by thresholding.


Algorithm 3 RUNTIME
Require: p^t_xy ∈ R^n, t = 1, . . . {stream of input patches}
Require: D ∈ R^{n×K} {initial dictionary}
Require: τerr, τtemp, τB {thresholds}
 1: while true do
 2:   p^t_xy ≈ D u^t {sparse coding}
 3:   if ||p^t_xy − D u^t||_2 / ||p^t_xy||_2 > τerr then
 4:     counter++ {ANOMALY}
 5:     if counter > τtemp then
 6:       update dictionary {UPDATE}
 7:       set the cleaning variable to zero
 8:     end if
 9:   else
10:     increase the score of the atoms with nonzero coefficients in u^t {NORMAL}
11:     increase the cleaning variable
12:   end if
13:   if t mod τB == 0 then
14:     check the score of each atom
15:     discard atoms seldom used {CLEAN}
16:   end if
17: end while

4.3.4 Foreground detection

The state ANOMALY corresponds to a variation with respect to the available dictionary which could be caused by a moving foreground object. At a given time t all patches in the state ANOMALY form a change detection map with patch precision (see Figure 4.2 (a)). If a pixel-precision foreground map is needed, a further step is required.

To this purpose, when a patch p^t_xy is marked as anomalous, we look for the nearest neighbor atom in its dictionary with respect to the Euclidean distance, denoted d^bg_xy.

We then compute the absolute difference between the normalized p^t_xy and d^bg_xy. If patches are not overlapping, the union of all absolute differences gives us an overall absolute difference image. In presence of overlap, the different contributions are summed up to compute the final difference image. This leads to image areas which are more likely to be foreground, corresponding to locations where more than one patch presents a high reconstruction error. The difference image is finally binarized with an appropriate threshold.
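A minimal sketch of this refinement is shown below; it assumes the anomalous patches and their per-location dictionaries are stored in Python dictionaries keyed by the patch top-left corner, and the binarization threshold value is illustrative.

import numpy as np

def pixel_foreground_map(anomalous_patches, dictionaries, frame_shape, patch_size=40, threshold=40.0):
    # anomalous_patches: {(x, y): normalized patch vector}; dictionaries: {(x, y): n x K matrix}.
    diff = np.zeros(frame_shape, dtype=float)
    for (x, y), p in anomalous_patches.items():
        D = dictionaries[(x, y)]
        nearest = np.argmin(np.linalg.norm(D - p[:, None], axis=0))  # nearest background atom
        d_bg = D[:, nearest]
        patch_diff = np.abs(p - d_bg).reshape(patch_size, patch_size)
        diff[y:y + patch_size, x:x + patch_size] += patch_diff       # contributions sum on overlaps
    return diff > threshold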


4.3.5 Computational costs

We now briefly discuss the computational load of our method, considering the temporal complexity of processing a single patch. At each time instant, representing the patch p ∈ R^n with respect to the dictionary D ∈ R^{n×K} requires an iterative procedure of p steps. Each step includes two main operations, i.e. the computation of the gradient (Equation 2.16, cost in O(Kn)) and the soft-thresholding, performed according to Equation 2.15 (cost in O(K)). For similar considerations, this is also the cost of the dictionary update. The computation of the reconstruction error comes for free, being included in the iterative procedure. Hence, each step costs O(nK), leading to a temporal complexity of O(nK) × p for the computation of the binary map at a patch level. The additional cost required to reach pixel-based precision – obtained after the comparison of the current patch instance with all the atoms in its dictionary – is proportional to O(nK) as well. Notice, however, that this overhead is paid only for patches marked as anomalous after the coarse procedure, which in the general case represent only a small percentage of the entire image.

For comparison, let us consider the temporal complexity, at a patch level, of one of the most efficient methods for building and maintaining a background model, the running average algorithm [WADP97]. Running average first relies on detecting the locations within the patch in which some change occurred. Then, the method updates the background as a weighted sum of previous background and current image, which in the worst case will affect the entire patch. Finally, the binary map is obtained by thresholding the absolute difference between current image and current background. All these operations have a cost linear in the patch size, O(n). Thus, the difference between the two methods depends on the size of the dictionary and on the number of iterations required. The latter is usually small, since we initialize u^t with u^{t−1}. The former depends on the dynamics of the patch, but it is usually quite small, and always K << n, as it will be experimentally observed in Section 4.6.

On the matter of temporal complexity a final remark is in order. The computational cost our method requires to process the entire image (approximately of order P × O(nK) × p, with P the number of patches and K and p the average number of atoms and iterations over all the patches) is comparable to the cost we would pay for applying our method to the estimation of a single global dictionary, with a single patch of the size of the entire image. In that case, given the dimensionality of the input data equal to the full image size m = Pn, the complexity is of order O(mK) × p, where K is the cardinality of the single global dictionary and p is the number of iterations. From a theoretical standpoint, this is in line with the principle of domain decomposition [FLS09, FLS10].


Table 4.1: F-measure on a validation set taken from the ”Baseline” group of the Change Detection dataset for different values of the regularization parameter. A higher τ corresponds to a sparser solution.

τ                0.001   0.010   0.050   0.100   0.300   0.500
F-measure        0.863   0.864   0.864   0.865   0.863   0.810
used atoms (%)   99.9    98.8    86.9    85.9    79.8    78.7

4.4 The method at work

In this section we analyze a few study cases, focusing in particular on illumination changes, uniform and non uniform with respect to the patch, and stable background changes. As a first thing, we experimentally observe that the choice of the regularization parameter τ is not critical for the overall procedure. We observe the behavior of the F-measure, defined as in Equation 4.1, where precision is the ratio between the true positives and the pixels classified as positives by the system (true positives and false positives together), and recall is the ratio between the true positives and the number of true positives and false negatives. In Table 4.1 we report the F-measure obtained by varying τ while processing a validation set (taken from the Change Detection dataset — see Section 4.5.2). Therefore, in all the experiments presented in the chapter τ will always be fixed (τ = 0.1), chosen as a compromise between sparsity and accuracy.

F-measure = 2 · (precision · recall) / (precision + recall)    (4.1)
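In code, Equation 4.1 reduces to a few lines; tp, fp and fn below denote pixel counts of true positives, false positives and false negatives.

def f_measure(tp, fp, fn):
    # Harmonic mean of precision and recall, as in Equation 4.1.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)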

As for the patch size, in this section we set it to 50 × 50 pixels. The level of overlap has been set to 10 pixels. Figure 4.3 shows our results on a baseline outdoor scenario, with no permanent changes in the background besides physiological illumination changes. The obtained segmentation is very clean, as expected. In the remainder of the section we will discuss in detail how the proposed method behaves in more complex circumstances.

First, we consider common problems related to light variation, showing that our method is able to cope with global illumination changes and the presence of shadows. Next, we analyse the system performance when stable local changes occur in the background.

Figure 4.3: The BMTDL method applied to a simple outdoor sequence from the Change Detection dataset. The changes are detected correctly and the method does not detect any false positives due to illumination changes.

4.4.1 Uniform illumination changes

We discuss here the robustness of the dictionary-based background model with respect to global illumination changes. A global uniform change on a patch can be formulated as the product of the pixel values within the patch by a constant. In our background model, this change will not affect the dictionary but will change the coefficients u of the linear combination. This observation conveys a very powerful concept of our model: the same dictionary can be exploited to model backgrounds characterized by different levels of illumination, simply by selecting the coefficients appropriately.
To support this assertion, it is important to check whether we can achieve, under these circumstances, comparable reconstruction errors.

In Figure 4.4, first row, we show three video frames of the same environment under different illuminations. For the sake of this analysis, we sample five patches from the image – highlighted in the figure – and learn dictionaries, whose size was set a priori (K = 30), from a set of images acquired under a “medium” illumination (similar to the central frame in Figure 4.4). We represent the three frames with respect to the same dictionary and evaluate their reconstruction errors. The rest of Figure 4.4 shows the results obtained on the five patches: each bar plot represents the coefficients needed to reconstruct the patches under the different light conditions. Notice how the coefficients change coherently on the various atoms: in particular, the coefficients used to reconstruct the image with lower illumination are consistently smaller than the ones used for the “medium” image. Similarly, the “lighter” image has been reconstructed with consistently higher coefficients. Table 4.2 summarizes the reconstruction error on the five patches: it is immediate to see that the values are stable as the illumination changes.
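This behavior can also be checked numerically: if a patch is multiplied by a constant and the coefficients scale by the same constant, the relative reconstruction error is unchanged. The snippet below verifies this identity on synthetic data with purely illustrative sizes.

import numpy as np

rng = np.random.default_rng(0)
n, K = 1600, 30                                   # a 40x40 patch and a 30-atom dictionary
D = rng.random((n, K))
u = np.zeros(K)
u[rng.choice(K, size=5, replace=False)] = rng.random(5)    # sparse coefficients
p = D @ u + 0.01 * rng.standard_normal(n)                   # "medium" illumination patch

for c, name in [(0.6, "darker"), (1.0, "medium"), (1.5, "lighter")]:
    rel_err = np.linalg.norm(c * p - D @ (c * u)) / np.linalg.norm(c * p)
    print(name, round(rel_err, 4))                # same relative error for every scaling c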

4.4.2 Background stable changes and local shadows

We have seen that our method does not require an update of the dictionary after a uniform change of the patch intensity. On the other hand, an uneven illumination can generate new apparent


[Figure 4.4: three frames of the same scene (darker, medium, lighter) and, for patches #1-#5, bar plots of the reconstruction coefficients (coefficients vs. atom index) under the three illuminations.]

Figure 4.4: An analysis of the coefficients used for the reconstruction of five image patches, under different levels of illumination and using a fixed dictionary. Above: three frames corresponding to different illumination conditions; the 5 patches we consider are marked in red. Below: bar plots of the u vectors for the 5 patches under the three different illuminations show consistent variations of the coefficients.

Table 4.2: A table reporting the overall reconstruction error of the three images in Figure 4.4 with respect to the fixed dictionary. Notice the stability of the reconstruction error.

             Reconstruction error
patch #    medium    lighter    darker
1          0.133     0.135      0.156
2          0.149     0.139      0.177
3          0.114     0.109      0.116
4          0.096     0.092      0.097
5          0.219     0.207      0.222


Figure 4.5: The effect of illumination changing slowly in the scene, because of a blade of sunlight cast on the floor. Left: a meaningful set of atoms of the dictionary associated with an unstable image patch, accumulated at the end of the phenomenon (to be read column-wise). Right: three samples of the scene as the illumination change occurs — the patch used in the analysis is marked with the black square.

structures in the image (corresponding, for instance, to the edges of a shadow), and in this case we may need to add new atoms to the corresponding dictionary to account for them in the future.

Figure 4.5 shows a scene modified during the day by a blade of sunlight projected onto the floor. On the left we show the atoms of the dictionary of one of the patches affected by the illumination change (highlighted on the frames). The atoms (whose insertion order should be read column-wise on the image) depict the main structural changes in the patch appearance. Notice that uniform changes have not been included in the dictionary, as discussed in the previous section.

From the point of view of a background update, the dynamic event we have just described is equivalent to a real structural change in the scene, caused, for instance, by objects added or moved to a new and permanent position. Figure 4.6 shows instead an example of a video sequence containing short-term dynamic events (people walking in the scene) and a long-term variation (a door left open). In this analysis, for clarity, we consider very small dictionaries with K = Kupdate = 1. We focus on three different patches corresponding to the door region: on the left of the image we show the dictionaries corresponding to the three patches; then the figure contains the sample frames (top) and the change detection maps with patch precision (bottom). The presence of the people who entered the scene is correctly detected at frame #1096. As the video evolves, the door is opening and the corresponding event is still captured by our method (#2153). At frame #3241, the variation in the background has become stable, and the model is again able to correctly reconstruct the corresponding patches. As for the dictionary evolution, at frame #100 the dictionaries are composed of a single atom, computed in the bootstrap phase (the one on the left in each dictionary). While the action is occurring, the atoms are no longer able to provide an appropriate patch reconstruction, thus new atoms are added. In the final sets of atoms different configurations of the same patch can be identified. Notice that the central dictionary also contains an intermediate atom in which the door is not completely open. This is because the door was left in this state for some time and the appearance of this patch triggered a dictionary update. All the other temporary changes have not been included in the model.

Figure 4.6: An example of a stable change to be stored in the background model. On the right we report sample frames and their corresponding patch-based change detection maps: at Frame #100, the initial empty scene; Frame #1096, some people enter the scene; Frame #2153, a door is opened, inducing a stable background change; Frame #3241, the modified empty scene. Black squares mark the patches we use as a reference in our discussion. On the left, the three dictionaries corresponding to the highlighted patches; the numbers below refer to the frame index at which the atoms have been included in the models.

Given the richness of the model we build as time goes by, if the “door opening” event occurs often, then the atoms involved will be often accessed and will always be available — this nicely reduces the amount of false alarms due to structural changes in the scene. This property can be exploited effectively whenever we observe periodic changes, even frequent ones, as in camera jitter. In this case the various states will be stored in the background model and will not cause any change detection in the future. Obviously, the patches affected by the change will see their dictionary size growing accordingly. The idea we pursue has been addressed in the past with


different techniques [SG99, KCHD05, BVD11, LLC09], but the method we propose allows us to incorporate the variations in a clean way, exploiting the power and the theoretical soundness of the adopted dictionary learning algorithm.

4.5 Method assessment

In this section we report a quantitative experimental analysis of our method on two benchmark datasets, SABS and Change Detection. In both cases we follow the experimental protocol proposed by the authors of the benchmark, to compare our results with previously published ones.

We first comment on parameter selection. As anticipated earlier, we set the dictionary learning regularization parameter τ = 0.1. A critical parameter is the threshold on the reconstruction error, τerr. When a training set was available we used the following procedure: we set τerr = 0.3 and processed the training frames. Then we corrected τerr based on the reconstruction errors obtained: we evaluated the average error recerr and its standard deviation σerr, then we took different values for τerr lying in the range [recerr − σerr, recerr + 10σerr]. Usually the standard deviation is very small and the average reconstruction error is a good estimate of the threshold. We chose an asymmetric range to control the false positives. In the absence of a training set, the same procedure was applied to the first part of the video.

For a fair comparison with state-of-the-art methods, we produce binary maps at the pixel level. This also allowed us to adopt the evaluation code provided with the benchmarks. The threshold on the image difference ranges from 30 to 60, depending on the set of videos under analysis.
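The threshold selection described above can be summarized in a short sketch, assuming the reconstruction errors measured on the training (or initial) frames of a patch have been collected in a list; the names are illustrative.

import numpy as np

def error_threshold_range(training_errors, k_sigma=10.0):
    # Candidate range for tau_err after a first pass with the provisional value tau_err = 0.3.
    errors = np.asarray(training_errors)
    mean, sigma = errors.mean(), errors.std()
    # asymmetric range [mean - sigma, mean + k_sigma * sigma] to keep false positives under control
    return mean - sigma, mean + k_sigma * sigma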

4.5.1 The SABS dataset

The SABS (Stuttgart Artificial Background Subtraction) dataset1 [BHH11] is an artificial dataset for pixel-wise evaluation of background models. It consists of video sequences representing different challenges in video surveillance, from dynamic elements in the background to sudden illumination changes. All the video sequences but one (Bootstrap) are provided with training frames that show the scene without any foreground object. The Basic sequence is a basic video-surveillance scenario; on a sub-sequence of it, it is possible to evaluate the system behaviour with respect to tree foliage. The Darkening sequence simulates a gradual illumination change in the scene, while the Light Switch shows several sudden illumination changes. The scene is acquired during the night, with some noise added, in the Noisy Night sequence. In the Camouflage sequence foreground and background have similar colors; as a comparison, the dataset also contains a NoCamouflage sequence. Finally, the dataset offers a compressed version of the Basic sequence, H.264. We have fixed the patch size at 40 × 40 pixels, and then we have followed

1http://www.vis.uni-stuttgart.de/index.php?id=sabs


Table 4.3: Comparison (F-measures) on the SABS dataset as in [BHH11]. (1): basic sequences; (2) Dynamic background; (3) bootstrap; (4) Darkening; (5) Light switch; (6) noisy night; (7) camouflage; (8) no camouflage; (9) H.264 compression (40 kbps).

Method      (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)    Avg
Stauffer    0.800  0.704  0.642  0.404  0.217  0.194  0.802  0.826  0.761  0.594
Oliver      0.635  0.552  -      0.300  0.198  0.213  0.802  0.824  0.669  0.524*
McKenna     0.522  0.415  0.301  0.484  0.306  0.098  0.624  0.656  0.492  0.433
Li          0.766  0.641  0.678  0.704  0.316  0.047  0.768  0.803  0.773  0.610
Kim         0.582  0.341  0.318  0.342  -      -      0.776  0.801  0.551  0.530*
Zivkovic    0.768  0.704  0.632  0.620  0.300  0.321  0.820  0.829  0.748  0.638
Maddalena   0.766  0.715  0.495  0.663  0.213  0.263  0.793  0.811  0.772  0.610
Barnich     0.761  0.711  0.685  0.678  0.268  0.271  0.741  0.799  0.774  0.632
BMTDL       0.789  0.736  0.671  0.672  0.329  0.591  0.780  0.790  0.778  0.681

(*) average computed over fewer results

the experimental setup proposed by [BHH11], and adopted the provided code for evaluating our results. Table 4.3 extends the table reported on the dataset webpage to include our method.

It can be noticed how our BMTDL method is very appropriate in most circumstances and achieves, on average, the best results, even if the video sequences of the SABS dataset are short (about 1000 frames) and do not leave enough time to our method to properly adapt to the complexity and to the peculiarities of the various scenarios.

4.5.2 The Change Detection dataset

The Change Detection dataset2 is a benchmark collection of 31 real-world videos acquired by standard and thermal video cameras and divided in six categories that include diverse motion and change detection challenges, both outdoor and indoor. The reported categories are Baseline (simple video surveillance videos), Dynamic Background (videos with moving tree foliage, water and so on), Camera Jitter (videos acquired with slightly moving cameras), Shadow (videos with dark shadows), Intermittent Object Motion (videos with objects added to or removed from the scene) and Thermal (videos acquired with a thermal camera).

All the videos were provided with training frames and in some of them the evaluation was made not on the whole frame but only in a region of interest. Here again we follow the experimental protocol provided by the authors of the benchmark.

As a first thing, Figure 4.7 shows the effect of changing the patch size and the amount of overlap. We consider all the videos of the Baseline set, one at a time, to analyze the possible relationships between the parameters and the size of the objects in the scene. On the left we show the effect of varying the amount of overlap once the patch size has been fixed to 40 × 40: in all the cases we notice that an increase of overlap improves the results. On the right we report the results

2http://www.changedetection.net/


[Figure 4.7 bar plots: F-measure for the Highway, OFFICE, Pedestrians, and PETS2006 videos. Left: varying the overlap (no overlap, 10 px, 20 px) with a 40 px patch side. Right: varying the patch size (10, 20, 40 px side) without overlap.]

Figure 4.7: The effect of varying the amount of overlap (left) and the patch size (right) on the F-measure obtained on the Baseline videos (see text). Sample frames are reported below as a reference.

In this case it is more difficult to choose the best size, even if we may infer a relationship between an appropriate size and the type of video (or rather, the average size of the moving objects): for instance, the Highway video contains small as well as big objects, thus a medium patch size seems advisable. In the PETS2006 video the objects are quite small and distinctive with respect to the background, thus a small patch brings the highest performance, while in Pedestrians a bigger patch brings better results, because the people moving in the video, who are roughly of the same size, project shadows that are more likely to be captured as foreground by small, and thus more local, patches.

As a final cross check we may observe that a 40 × 40 patch with an overlap of 20 pixels (magenta bars in Figure 4.7, left) is better than a 20 × 20 patch with no overlap (violet bars in Figure 4.7, right). This seems to be due to the fact that, although the obtained resolution is equivalent, in the first case each 20 × 20 sub-patch is evaluated 4 times jointly with different neighbors, leading to a redundant and more precise evaluation.

We now compare our results with the ones reported on the benchmark web page, which refer to a large selection of methods and may thus be considered a reliable account of the state of the art on the topic. The reported set of statistics is rather rich; in this comparison we stick to the F-measure, which summarizes the effect of false positives and false negatives. In these experiments the patch size was set to 40 × 40 or 32 × 32 pixels, depending on the video resolution, with an overlap of half the patch size.


Table 4.4: Detailed results of BMTDL over the whole Change Detection dataset.

Category                  Recall  Specificity  FPR     FNR    PBC    F-Measure  Precision  Ranking
Baseline                  0.894   0.996        0.004   0.106  0.753  0.877      0.864      14.71
Dynamic BG                0.732   0.998        0.002   0.268  0.440  0.753      0.782       9.57
Camera Jitter             0.707   0.988        0.0123  0.293  2.496  0.725      0.748       9.14
Intermitt. Object Motion  0.684   0.982        0.016   0.316  4.421  0.686      0.709       8.42
Shadow                    0.830   0.990        0.010   0.170  1.628  0.810      0.798      10.43
Thermal                   0.862   0.980        0.020   0.138  2.790  0.793      0.737      14.42
OVERALL                   0.785   0.989        0.011   0.215  2.088  0.774      0.773       6.42

Table 4.4 reports all the statistics we obtained on the various sets. Table 4.5 instead compares our results with the most promising methods from the benchmark web site across all the categories. Our overall result (F-measure = 0.77) is very close to that of the best performing algorithm (F-measure = 0.78), granting us the second position in the overall rank. The published rankings are based on performances reported to 5 significant figures, but most of them are very similar up to the 3rd figure. Therefore, Table 4.6 reports the ranking we achieve for each category, both in absolute terms and by clustering the results to the 3rd figure. The overall performance we obtain brings us to the 4th position in the absolute ranking, which is quite satisfactory, considering that no method ranks first on all categories.
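For reference, the F-measure used throughout these comparisons is the harmonic mean of precision and recall. A minimal sketch of its computation from pixel-level counts of true positives, false positives, and false negatives (hypothetical variable names; this is not the benchmark's own evaluation code) could look as follows.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from pixel-level counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```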

4.6 Experiments across long time observations

In this section we analyze the performance of the BMTDL method across long time observations: we process a video stream covering two periods of time (each of 2 hours) on two consecutive days. The days have been selected to maximize their difference: Day 1 is a Sunday morning (weekend) while Day 2 is a Monday morning (working day). During the weekend we do not expect to observe any dynamic event caused by moving objects, but there will be smooth changes due to illumination. The monitored scenario is the one shown in Figure 4.5.

Figure 4.8 reports on two different plots the reconstruction errors obtained during the observation, averaged over all the patches. On Sunday morning the trend of the reconstruction error was quite variable between 9:20 and 10:30 because of a blade of sunlight projected onto the floor (see Fig. 4.5); after that, the model remains quite stable. On Monday morning we may observe spikes due to various dynamic events (people or groups coming into the hall).


Table 4.5: BMTDL compared with the most promising methods (according to http://www.changedetection.net).

Method     Baseline  Dynamic BG  Camera Jitter  Intermittent OM  Shadow  Thermal  Overall
DPGMM      0.93      0.81        0.75           0.54             0.81    0.81     0.78
SGMM-SOD   0.92      0.68        0.70           0.70             0.87    0.71     0.76
PBAS       0.92      0.68        0.72           0.57             0.86    0.76     0.75
PSP MRF    0.93      0.70        0.75           0.56             0.79    0.69     0.74
BMTDL      0.88      0.75        0.72           0.69             0.81    0.79     0.77
SC SOBS    0.93      0.67        0.70           0.59             0.78    0.69     0.73
CDPS       0.92      0.75        0.49           0.74             0.81    0.66     0.73
SOBS       0.93      0.64        0.71           0.56             0.77    0.68     0.72
SGMM       0.86      0.64        0.73           0.54             0.79    0.65     0.70

Table 4.6: BMTDL compared with the most promising methods; ranking obtained to the 5th and to the 3rd figure (see text).

Category         Our F-meas.  Best F-meas. (method)  Diff.  Posit. (5th fig.)  Clust. Posit. (3rd fig.)
Baseline         0.88         0.93 (SC SOBS)         0.05   12                 6
Dynamic BG       0.75         0.81 (DPGMM)           0.06   4                  4
Camera Jitter    0.72         0.75 (PSP MRF)         0.03   4                  4
Intermittent OM  0.69         0.74 (CDPS)            0.05   4                  2
Shadow           0.81         0.87 (GMM12)           0.05   7                  4
Thermal          0.79         0.81 (DPGMM)           0.02   2                  2


[Figure 4.8 plots: average reconstruction error (0-0.25) versus time (9:20-11:20 am) on (a) Sunday morning and (b) Monday morning.]

Figure 4.8: A comparison of the trends of the average reconstruction error over Sunday (left) and Monday (right) morning.

Figure 4.9 shows a detail of the reconstruction error trend on Monday morning, between 9:20 and 9:30, on a meaningful region of interest (marked by the red rectangle). The red patches correspond to reconstruction errors above the threshold τerr, set to 0.2 in these experiments. The height of the spikes is related to the area of the change, while their width is related to the temporal extent of the variation (within the whole region of interest).
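As a rough sketch of the per-patch test implied here (hypothetical names; it assumes a vectorized patch x and its sparse code u on the current per-patch dictionary D), a patch is flagged as foreground when its normalized reconstruction error exceeds τerr:

```python
import numpy as np

def patch_is_foreground(x, D, u, tau_err=0.2):
    """Flag a patch as foreground if its reconstruction error on the
    current background dictionary exceeds the threshold tau_err."""
    error = np.linalg.norm(x - D @ u) / np.linalg.norm(x)
    return error > tau_err
```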

Figure 4.10, instead, compares the details of corresponding time frames acquired on Sunday and Monday morning. The sample frames show that the illumination conditions are similar. On Sunday the reconstruction error fluctuates, since many dictionary updates (the value of τtemp was fixed to 40′′) are required to accommodate new illumination conditions, while on Monday we take advantage of the background model gathered the previous day and we obtain a very smooth reconstruction error, with the exception of new dynamic events.

Figure 4.11 shows the per-patch dictionary sizes reached at different times: on Sunday morning, one hour after the beginning of the experiments, most of the dictionaries are still very small (K < 40), with the exception of the lower part of the image plane where the blade of light is appearing. At the end of Sunday morning, when the whole illumination variation event has been stored in the background model, the dictionaries of the lower part of the image plane are larger (10 patches have 100 ≤ K ≤ 180), although the overall background model is still quite small (on average K = 25.7). On late Monday the dictionary sizes grew on the top right side of the image plane, where people sit still for a long time and end up being incorporated in the background. Dynamic changes occurring elsewhere are not incorporated in the background, as they should not be. By this time we have reached the maximum background size we may expect to obtain; from now on the sequence of update and cleaning steps will keep the overall size quite stable.


Figure 4.9: A zoom in on the reconstruction error in a small time span on a region of interest marked with a red rectangle shows how the peaks correspond to interesting events only.


Figure 4.10: A zoom in on the reconstruction error to compare the average reconstruction errors during Sunday (left) and Monday (right) morning. Notice how, under similar illumination conditions, the Monday morning trend is much more stable and the peaks correspond to actual dynamic events.

(a) Early Sunday (b) Late Sunday (c) Monday

Figure 4.11: Dictionary size after one hour of acquisition on Sunday, at the end of acquisition on Sunday, and at the end of acquisition on Monday.


(a) Sunday (b) Monday

Figure 4.12: Percentage of patches in an anomaly state on Sunday and Monday.

In Figure 4.12 we analyze the percentage of anomalies per patch: on Sunday most of the anomalies occur in the bottom region, on Monday in the top right regions. It is apparent that these regions will also trigger most of the dictionary updates.

We conclude with a remark on the dictionary sizes. The current procedure for dictionary updating considers an injection of Kupdate = 5 atoms in order to incorporate most of the available up-to-date information. A spectral analysis on a sample of dictionaries allowed us to observe that there is still some redundancy in our representation, which led us to the conclusion that there is still some room for improving the compactness of our model. In particular, either the number of atoms inserted at each update, or the cleaning procedure, might be improved by observing the significant singular values of the updated dictionaries, adapting the number of new atoms, or of the atoms to be removed, to the effective information gained by the model. This issue is currently under investigation.
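A minimal sketch of the spectral check mentioned above, assuming a per-patch dictionary stored as a d × K matrix: the number of singular values above a small fraction of the largest one gives a rough estimate of the effective number of atoms, and a large gap with respect to K signals redundancy that the cleaning step could remove.

```python
import numpy as np

def effective_rank(D, rel_tol=0.01):
    """Number of singular values of the dictionary D larger than
    rel_tol times the largest one (rough estimate of its effective size)."""
    s = np.linalg.svd(D, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))
```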

4.7 Discussion

We proposed a space-variant method for background modeling, which we refer to as Background Modeling Through Dictionary Learning (BMTDL). The method is entirely data-driven, integrates an online dictionary learning procedure that allows us to update the model only when and where it is necessary, and is built on the assumption that, in the absence of dynamic events, all the video frames acquired by a fixed camera may be seen as a noisy version of a background prototype. In this respect the video stream produces semi-supervised input data which are fed to a dictionary learning module and produce a background model able to adapt to the working environment. The method, which is intrinsically patch based, leverages on the spatial correlation


between pixels and shows robustness and stability across different types of changes, including illumination, jitter, motion, dynamic background, indoor/outdoor scenarios, and image sensors. A pixel-based analysis of the obtained results reveals that the proposed module is already effective for detecting changes.

The current implementation is not real time, but the proposed framework can be naturally parallelized on multicore systems, thanks to the inherent independence of the computation of each patch. The design of an appropriate architecture, where the dictionary update is (possibly) performed offline and the online processing of each patch is synchronized, is currently under investigation.


Chapter 5

Automatic Emotion Recognition

In this chapter we present a work that introduces a new challenge for data representation: we aim at automatically recognizing emotions from body gestures. The data, in this case, are streams of 3D coordinates recorded by motion capture systems tracking a set of body joints. To describe the gestures performed during the expression of each emotion, we extract a set of emotion-related features and then represent them adaptively, learning a different dictionary for each feature. Once the data are suitably represented, a classification step allows the system to perform automatic emotion recognition. The system works in real time. To provide further testing of the system, we developed two games, where one or two users interact with each other or with the computer and have to understand and express emotions with their body.

5.1 Introduction

Although psychological research indicates that bodily expressions convey important affective information, to date research in emotion recognition has focused mainly on facial expression or voice analysis. This is probably due to the fact that people tend to focus mostly on facial and vocal expressions, and that the relevant data could be easily acquired with simple sensors (video cameras and microphones). Nowadays, motion capture systems allow us to acquire important information about body gestures, information that is useful to understand the emotion a person is expressing. Therefore in this chapter we focus on real-time automatic emotion recognition from the analysis of body movements; our research is based on studies from experimental psychology (de Meijer [dM89], Wallbott [Wal98], Boone & Cunningham [BC98]) and from humanistic theories (Laban effort [LL47, Lab63]). We would like to remark that our goal is not gesture recognition: rather, we analyze the emotions expressed through a gesture. In other words, the same gesture can be performed in different emotional states (for instance, we can hail somebody in a happy or in a sad way), and we want to distinguish between these emotional states. This framework


is part of the FP7-ICT ASC-Inclusion European Project1, which aims at developing ICT solutions to assist children affected by Autism Spectrum Conditions (ASC). It focuses on the development of serious games to help children understand and express emotions. The complete framework will process facial expressions [GBCH06], voice [GAG+10], and full-body movements and gestures. The stand-alone system described in this chapter will be integrated with the other two modalities for the final version of the serious games. In our study we start from detailed 3D motion data recordings of full-body movements obtained by professional grade optical motion capture systems (e.g., Qualisys [Qua13]).

We also assess the appropriateness of our method on 3D data acquired by low-cost RGB-D sensors (e.g., Kinect [Kin13]). The adoption of low-cost measuring devices, which are less precise but easier to acquire and to install, will ensure the usability of our framework in contexts where it is not possible to install sophisticated motion capture systems.

The chapter is organized as follows: in Section 5.2 we review relevant literature related to the analysis of emotions; in Section 5.3 we describe the framework in detail, introducing emotion-related dimensions and describing the data representation and the classification procedure; in Section 5.4 we describe the experimental set-up and the validation process; Section 5.5 presents two applications of the framework; finally, in Section 5.6 conclusions are given.

5.2 Related work

Over the years, research in emotion recognition has mainly focused on facial expression or voice analysis, in accordance with the intuition proposed by Ekman [Ekm65], who pointed out how people focus more on facial expression than body gesture when they try to understand other people's emotions. However, recent research in experimental psychology suggests that body language constitutes a significant source of affective information. Bull [Bul87] found that interest/boredom and agreement/disagreement can be associated with different body postures/movements.

Studies on the recognition of emotions from point-light displays were carried out by Johansson (1973), who inspired further work by Pollick and colleagues [MPP06, HP00], Dittrich [DTLM96], and Atkinson [ADG+04].

Pollick et al. [PPBS01] found that, given point-light arm movements, human observers could distinguish basic emotions with an accuracy significantly above chance level. Coulson [Cou04] highlighted the role of static body postures in the recognition task, where artificially generated emotion-related postures were shown to people. Techniques for automated emotion recognition from full-body movement were proposed by Camurri et al. [CLV03].

Even more so, research in experimental psychology demonstrated how some qualities of

1http://www.asc-inclusion.eu/


movement are related to specific emotions: for example, fear leads us to contract the body in an attempt to be as small as possible, surprise leads us to turn towards the object capturing our attention, and joy may lead to openness and an upward acceleration of the forearms [BC98]. Body turning away is typical of fear and sadness; body turning towards is typical of happiness, anger, and surprise; we tend to open our arms when we are happy, angry, or surprised; we can either move fast (fear, happiness, anger, surprise) or slowly (sadness). In [dM89] de Meijer presents a detailed study of how body movements are related to the emotions. He observes the following dimensions and qualities: Trunk movement: stretching - bowing; Arm movement: opening - closing; Vertical direction: upward - downward; Sagittal direction: forward - backward; Force: strong - light; Velocity: fast - slow; Directness: direct - indirect. de Meijer notices how various combinations of these dimensions and qualities can be found in different emotions. For instance, a joyful feeling could be characterized by a strong Force, a fast Velocity, and a direct Trajectory, but it could have a light Force as well, or be an indirect movement.

Works inspired by these studies and involving the analysis of the entire body movement showed that movement-related dimensions can be used in the inference of distinct emotions [dM89, Wal98, CLV03, GDC+11].

Kapoor et al. [KBP07] demonstrated a correlation between body posture and frustration in a computer-based tutoring environment. In a different study, Kapur et al. [KKVB+05] showed how four basic emotions could be automatically distinguished from simple statistical measures of motion dynamics. Balomenos et al. [BRI+05] combined facial expressions and hand gestures for the recognition of six prototypical emotions.

5.3 The proposed method

In this section we present in detail the method we have developed for automatic emotion recognition from body movements. The set of emotions considered in this work is limited to six, because they are widely accepted and recognized as "basic" (especially from the body gesture point of view) and cross-cultural [Cor96, Cor00].

5.3.1 Movement features for emotion recognition

It is well known that the body is able to convey and communicate emotions and, in general, implicit information. It is important to understand what movement features people use to discriminate between the different emotions. The first part of this study has focused on this topic: a set of relevant low- and mid-level movement-related features has been identified.

We used a motion capture system to retrieve a 3D skeleton of the user moving over time. This skeleton is made up of points in space connected by lines that represent stylised limbs. Each


Figure 5.1: 3D skeleton and associated labels.

of these points is a joint that can be tracked over time, resulting in a trajectory that describes how the user moved that part of the body.

Figure 5.1 shows the labels associated with each body point.

We have implemented different algorithms to extract the features from the available data [PSCO13]. We focus on the tracking of joints positioned on the upper body (head, shoulders, elbows, hands, and torso), according to the study of Glowinski et al. [GDC+11]. The joints are tracked with two different systems: the Qualisys motion capture system [Qua13] and a Microsoft Kinect [Kin13].

Qualisys is a sophisticated motion capture system, composed of several (9 in our setup) infrared cameras that capture the light reflected by passive markers. The data it provides are precise and reliable (its spatial measurement error is below one millimeter) and its frame rate is 100 fps, but it also has many drawbacks: it is very expensive, it requires a complex calibration procedure and a large space (and, for these reasons, it has limited portability). Finally, it requires the user to wear markers, making it inappropriate for users with Autism Spectrum Conditions. Microsoft Kinect, instead, as a commercial product, is very easy to acquire and to install and has very few requirements on the acquisition environment, but it provides noisier and lower resolution data,


since its frame rate is only 30 fps and its spatial measurement error at the optimal distance from the sensor (2 meters) is 3 mm for the x/y axes and 1 cm for the z axis.

The two acquisition systems are very different and are indeed meant for two different purposes: Qualisys has been used at an early stage of development, as a proof of concept of the proposed methodology, while Kinect has been adopted later to test the applicability of the method in real-world applications and games.

In this section we describe the features computed from three-dimensional user tracking data. The following set of features includes indexes computed on the user movement and its qualities, on the body posture, and on the trajectories drawn by the tracked joints.

Kinetic Energy

The Kinetic Energy is the overall energy spent by the user during the movement, estimated as the total amount of displacement of all the tracked points. The amount of movement activity is important for differentiating emotions. The highest values of energy are related to anger, joy and terror, while the lowest values correspond to sadness and boredom. Camurri and colleagues [CLV03] showed that movement activity is a relevant feature in recognizing emotion from full-body movement. For these reasons, we included in the set of expressive features an approximated measure of the overall motion energy at time frame f. Given the 3D user tracking information, let $v_i(f) = \sqrt{x_i^2(f) + y_i^2(f) + z_i^2(f)}$ denote the magnitude of the velocity of the i-th tracked point at time frame f. We then define KE(f), the Kinetic Energy index at frame f, as the weighted sum of the kinetic energy of each joint:

KE(f) = \frac{1}{2} \sum_{i=1}^{n} m_i v_i^2(f)     (5.1)

where m_i is the approximation of the mass of the i-th joint. The m values are computed from anthropometric tables [MCC+80].
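A minimal sketch of Equation 5.1, assuming the joint positions are stored as an array of shape (T, n, 3) sampled at a known frame rate and that a vector of per-joint mass approximations is available (hypothetical names):

```python
import numpy as np

def kinetic_energy(positions, masses, fps):
    """Kinetic Energy index per frame (Eq. 5.1).

    positions: array of shape (T, n, 3) with the 3D joint coordinates,
    masses:    array of shape (n,) with the per-joint mass approximations,
    fps:       frame rate, used to turn frame-to-frame displacements into velocities.
    """
    velocities = np.diff(positions, axis=0) * fps   # shape (T-1, n, 3)
    speed_sq = np.sum(velocities ** 2, axis=2)      # shape (T-1, n)
    return 0.5 * speed_sq @ masses                  # one value per frame transition
```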

Contraction Index

The Contraction Index is a measure, ranging from 0 to 1, of how the body of the user occupies the space surrounding it. It is related to Laban's "personal space" [Lab63, LL47]. The Contraction Index is the normalized Bounding Volume, that is, the volume of the minimum parallelepiped surrounding the body of the user.
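A possible sketch of the Contraction Index on a single frame, under the assumption (not stated in the text) that the minimum parallelepiped is approximated by the axis-aligned bounding box of the tracked joints and normalized by a reference volume, so that the index falls in [0, 1]:

```python
import numpy as np

def contraction_index(joints, reference_volume):
    """Normalized volume of the axis-aligned box enclosing the joints.

    joints: array of shape (n, 3) with the joint coordinates of one frame,
    reference_volume: normalization constant (e.g. the largest volume observed).
    """
    extent = joints.max(axis=0) - joints.min(axis=0)
    return float(min(np.prod(extent) / reference_volume, 1.0))
```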


Smoothness and Fluidity

In general, Smoothness is synonymous with "having small values of high-order derivatives." Wallbott [Wal98], in his analysis of qualitative aspects of hand movements of psychiatric patients, noticed that movements judged as smooth are "characterized by large circumference, long wavelength, high mean velocity, but not abrupt changes in velocity or acceleration (standard deviations of velocity and acceleration). Thus, smooth movements seem to be large in terms of space and exhibit even velocity". We have therefore adapted Wallbott's statements on the qualitative dimensions of underconstrained arm movements and we have computed the curvature of the hand trajectories to identify their smoothness. Curvature (k) measures the rate at which a tangent vector turns as a trajectory bends. It is defined as the reciprocal of the radius (R) of the curve described by the trajectory:

k = \frac{1}{R}     (5.2)

A joint trajectory following the contour of a small circle will bend sharply, and hence will have high curvature, which means low fluidity; by contrast, a trajectory following a straight line will have zero curvature and thus high smoothness.

The curvature is computed for a single point trajectory r_i at time frame f as follows:

k_i = \frac{\dot{r}_i \times \ddot{r}_i}{\dot{r}_i^3}     (5.3)

where \dot{r}_i is the velocity of the trajectory of the i-th point and \ddot{r}_i is its acceleration.

The principle of curvature can be applied to the movement velocity and its variation in time to calculate Fluidity: high curvature of the speed trajectory in time means low fluidity, while low curvature means high fluidity. To calculate the Fluidity we compute the curvature of the tangential velocity of the desired joint as described in Equation 5.3.
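A hedged sketch of the curvature of Equation 5.3 for a single joint trajectory sampled at frame rate fps, with velocity and acceleration estimated by finite differences (an assumption of this sketch); applying the same routine to the tangential velocity of the joint yields the Fluidity measure described above.

```python
import numpy as np

def curvature(trajectory, fps):
    """Curvature of a 3D joint trajectory (Eq. 5.3), one value per sample."""
    dt = 1.0 / fps
    vel = np.gradient(trajectory, dt, axis=0)   # numerical velocity
    acc = np.gradient(vel, dt, axis=0)          # numerical acceleration
    cross = np.linalg.norm(np.cross(vel, acc), axis=1)
    speed = np.linalg.norm(vel, axis=1)
    return cross / np.maximum(speed ** 3, 1e-8)
```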

Symmetry

Lateral symmetry of emotion expression has long been studied in facial expressions, resulting in valuable insights about a general hemisphere dominance in the control of emotional expression. An established example is the expressive advantage of the left hemiface, which has been demonstrated with chimeric face stimuli, static pictures of emotional expressions with one side of the face replaced by the mirror image of the other. A study by Roether et al. on human gait demonstrated pronounced lateral asymmetries also in human emotional full-body movement [ROG08]. Twenty-four actors (with an equal number of right- and left-handed subjects) were recorded by means of a motion capture system during neutral walking and emotionally expressive walking (anger, happiness, sadness). For all three emotions, the left body side moved with significantly higher amplitude and energy. Perceptual validation of the results has been conducted


through the creation of chimeric walkers, using the joint-angle trajectories of a half body to animate completely symmetric puppets. Considering that the literature pointed out the relevance of symmetry as a behavioural and affective feature, we address the symmetry of gestures and its relation with emotional expression. It is measured by evaluating the spatial symmetry of the limbs with respect to the body on each available dimension. Each partial symmetry (SI_x, SI_y, SI_z) is computed from the position of the centre of mass and of the left and right joints (e.g., hands, shoulders, feet, knees) as described below:

SI_x = \frac{(x_{L_i} - x_B) - (x_{R_i} - x_B)}{|x_{L_i} - x_B| - |x_{R_i} - x_B|}, \quad i = 0, 1, \ldots, n     (5.4)

SI_y = \frac{(y_{L_i} - y_B) - (y_{R_i} - y_B)}{|y_{L_i} - y_B| - |y_{R_i} - y_B|}, \quad i = 0, 1, \ldots, n     (5.5)

SI_z = \frac{(z_{L_i} - z_B) - (z_{R_i} - z_B)}{|z_{L_i} - z_B| - |z_{R_i} - z_B|}, \quad i = 0, 1, \ldots, n     (5.6)

where x_B, y_B, z_B are the coordinates of the centre of mass, x_{L_i}, y_{L_i}, z_{L_i} are the coordinates of a left joint (e.g., left hand, left shoulder, left foot, etc.) and x_{R_i}, y_{R_i}, z_{R_i} are the coordinates of the corresponding right joint (e.g., right hand, right shoulder, right foot, etc.). The three partial indexes are then combined into an average index that expresses the overall estimated symmetry.
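A minimal sketch of the partial symmetry indexes of Equations 5.4-5.6 for one left/right joint pair and the centre of mass, written as stated above (hypothetical names; the small constant guarding against a vanishing denominator is an assumption of this sketch):

```python
import numpy as np

def symmetry_index(left, right, barycentre, eps=1e-8):
    """Average of the per-axis symmetry indexes (Eqs. 5.4-5.6) for one joint pair."""
    num = (left - barycentre) - (right - barycentre)
    den = np.abs(left - barycentre) - np.abs(right - barycentre)
    partial = num / np.where(np.abs(den) < eps, eps, den)
    return float(np.mean(partial))
```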

Forward-Backward leaning of the upper body and relative positions

Head and body movements and positions are regarded as important features for distinguishing between various emotional expressions [SH95]. The amount of forward and backward leaning JL_i of the joint i at time frame f is measured by the velocity of the joint displacement along its z component (depth) with respect to the body position and orientation.

JL_i = \frac{(z_B - z_{L_i}) - (z_B - z_{R_i})}{z_{R_i} - z_{L_i}}     (5.7)

Directness

Directness is one of the features that are deemed important in the process of recognizing emotions [dM89]. A direct movement is characterized by almost rectilinear trajectories. We compute the Directness Index of a trajectory drawn in space by a joint as the ratio between the Euclidean distance between the starting and the ending point of the trajectory and the


actual length of the trajectory. The Directness Index tends to assume values close to 1 if a movement is direct and low values (close to 0) otherwise. In the case of three-dimensional trajectories the index is computed as follows:

DI = \frac{\sqrt{(x_E - x_S)^2 + (y_E - y_S)^2 + (z_E - z_S)^2}}{\sum_{k=1}^{N-1} \sqrt{(x_{k+1} - x_k)^2 + (y_{k+1} - y_k)^2 + (z_{k+1} - z_k)^2}}     (5.8)

where x_S, y_S, z_S are the coordinates of the starting point, x_E, y_E, z_E are the coordinates of the ending point, and N is the length of the trajectory.
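A short sketch of the Directness Index of Equation 5.8, for a trajectory given as an (N, 3) array of samples:

```python
import numpy as np

def directness_index(trajectory):
    """Ratio between straight-line distance and actual path length (Eq. 5.8)."""
    straight = np.linalg.norm(trajectory[-1] - trajectory[0])
    path_length = np.linalg.norm(np.diff(trajectory, axis=0), axis=1).sum()
    return float(straight / path_length) if path_length > 0 else 0.0
```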

Periodicity

The Periodicity Index can be calculated using the Periodicity Transform [SS99]. It decomposes sequences into a sum of periodic sequences by projecting onto a set of periodic subspaces. The Periodicity Transform looks for the best periodic characterization of a sequence x of length N. The underlying technique is to project x onto some periodic subspace P_p. This periodicity is then removed from x, leaving the residual r stripped of its p-periodicities. A sequence of real numbers is called p-periodic if there is an integer p with x(k + p) = x(k) for all integers k.

Impulsiveness

In human motion analysis, Impulsiveness can be defined as a temporal perturbation of a regime motion (Heiser et al. [HFS+04]). Impulsiveness refers to the physical concept of impulse as a variation of the momentum. This contributes to defining and reaching a reference measure for impulsiveness. According to psychological studies (Evenden [Eve99]; Nagoshi et al. [NWR06]), an impulsive gesture lacks premeditation, that is, it is performed without a significant preparation phase. We developed an algorithm for impulsiveness detection, derived from the work of Mancini and Mazzarino [MM09], where a gesture is considered an impulse if it is characterized by a short duration and high magnitude. In addition to the observations made by Mancini and Mazzarino, we included the condition that, to be recognized as impulsive, a gesture has to be performed without preparation. This includes sudden changes of the movement direction or intensity. Algorithm 4 describes how impulsiveness is computed, where dt is the gesture duration, KE is the kinetic energy and $\overline{KE}$ is the mean of the kinetic energy computed over the gesture duration. The values of the proposed thresholds have been empirically estimated through perception tests on videos portraying people who performed highly and lowly impulsive gestures.

Table 5.1 reports all the features described so far, along with others that are more intuitive, and the joints from which they are extracted.


Algorithm 4 Impulsiveness computation
  let δt = 0.45 sec
  let energyThreshold = 0.02
  if KE > energyThreshold then
    if 0 ≤ dt ≤ δt then
      evaluate energy peaks and movement direction
      if the energy peak is solitary in the given direction then
        II = KE / dt
      end if
    end if
  end if

Table 5.1: List of all the features and their provenance.

All tracked joints: Kinetic Energy, Contraction Index, Periodicity, Global direction of motion, Overall symmetry.
Head joint: Speed, Acceleration, Jerk, Leaning speed, Leaning position, Symmetry w.r.t. hands, Symmetry w.r.t. shoulders.
Torso joint: Leaning speed, Leaning position.
Hand(s) joint(s): Distance between hands, Smoothness, Speed, Acceleration, Jerk, Periodicity, Direction, Fluidity, Impulsiveness, Curvature, Distance from the torso, Distance from the head.
Shoulders/Elbows: Speed, Acceleration, Jerk, Fluidity, Direction, Periodicity, Curvature, Distance from the torso (elbows only).


Figure 5.2: Kinetic energy-based segmentation of gesture primitives.

5.3.2 Data representation

We now discuss the operations we perform on the extracted features in order to obtain a meaningful representation of each gesture. Raw data are composed of relatively long sequences including repetitions of a certain emotion.

From raw data to histograms

The body joints are tracked along the entire sequence and the resulting trajectories are segmented into a set of single gestures (or primitives, clips, segments) by applying a threshold over some features. In our experiments we observed that we obtain a good segmentation by thresholding the Kinetic Energy. Figure 5.2 shows the kinetic energy of a sequence of seven different gestures. By basing the segmentation on low values of the kinetic energy it is possible to separate each single gesture correctly.

We now have N primitive gestures x_i ∈ R^{f×T_i}, for i = 1, ..., N, where f is the number of features (in particular, f = 86, see Table 5.1) and T_i is the temporal length of the i-th gesture. In general such time series will have variable lengths, depending on the different duration, speed, and complexity of the gestures. At the same time, from the machine learning standpoint, we usually assume that points are feature vectors living in a fixed d-dimensional space. This is an issue common to problems that deal with temporal data, and many solutions have been proposed. It is common practice to map the initial representations onto an intermediate (fixed-length) representation [NSO11, NO12]. There are different ways of doing so, e.g. sub-sampling variable-length series with a fixed number of samples (but in this case we may miss important information), clipping a time series into a subset of fixed-length sub-sequences (and in this case the size of the sub-sequences will be a crucial parameter), or zero-padding the shorter sequences (in this case there is the possibility of altering the information contained in the sequence, and the maximum length of the sequences would be crucial again). Alternatively one may resort to alternative


descriptions, such as first or second order statistics. In this work, after a preliminary experimental analysis, we rely on first order statistics and compute histograms of each time series. Once an appropriate quantization is chosen, each temporal sequence x_i will be represented by a histogram of a fixed length. In this way we lose temporal information, in favour of a compact, easy to compute, fixed-length representation.
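A minimal sketch of the mapping from a variable-length feature time series to a fixed-length, normalized histogram; the number of bins and the value range are assumptions of this sketch, with 30 bins corresponding to the quantization used in the experiments below.

```python
import numpy as np

def feature_histogram(series, n_bins=30, value_range=(0.0, 1.0)):
    """Fixed-length, normalized histogram of one feature's time series."""
    hist, _ = np.histogram(series, bins=n_bins, range=value_range)
    total = hist.sum()
    return hist / total if total > 0 else hist.astype(float)
```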

Figure 5.3 confirms that the histogram-based representation still carries important information on the different types of emotions: the figure reports cumulative histograms of the Contraction Index of 100 gesture primitives for each of the six basic emotions we consider. The histograms show different trends, suggesting that the Contraction Index is expressed differently in the different emotions.

Adaptive data representations

Starting from the histograms, we apply a dictionary learning step to smooth the representations and capture the most representative patterns.

Here again, according to the theory presented in Chapter 2, we assume we have a training set of n d-dimensional signals (n histograms representing a given feature, obtained by processing n different primitive gestures), and we learn a dictionary D ∈ R^{d×K} and the coefficient matrix U ∈ R^{K×n}, where K is the number of atoms that compose the dictionary.

In this chapter we consider both the classical L1 dictionary learning and its variant based on estimating the encoding matrix C (see Section 2.4.3). The latter approach may be very beneficial from the computational point of view, as far as the decomposition of the data with respect to a fixed, previously learnt dictionary is concerned, and is recommended for the real-time performance required by the interactive games. Thus, besides D and U, we will also compute a matrix C ∈ R^{K×d} such that U ≈ CX. In this case we have three parameters to tune: K, τ, and η.

We process the various features independently, learning a different dictionary for each of them; that is, if we have f features, we learn the dictionaries D_1, ..., D_f, the coefficient matrices U_1, ..., U_f, and the encoding matrices C_1, ..., C_f. The primitives belonging to the different emotions are all used together to learn the matrices, in an unsupervised fashion (i.e. we do not make any assumption on the labels at this stage).

The final representation step is to build the complete feature vector by concatenating the sub-vectors obtained by suitably representing each feature. Let v_{ij} be the representation of the j-th feature of the i-th gesture. The gesture x_i will be represented by the feature vector v_i = [v_{i1}|v_{i2}|...|v_{if}].
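As an illustration only (the thesis relies on the dictionary learning formulation of Chapter 2, whereas this sketch uses scikit-learn's generic dictionary learner as a stand-in), the per-feature coding and the final concatenation could be organized as follows:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def learn_feature_dictionaries(histograms_per_feature, n_atoms=20, alpha=0.5):
    """Learn one dictionary per feature from its (n_gestures x d) histogram matrix."""
    return [DictionaryLearning(n_components=n_atoms, alpha=alpha,
                               transform_algorithm='lasso_lars').fit(X)
            for X in histograms_per_feature]

def encode_gesture(histograms, dictionaries):
    """Concatenate the per-feature sparse codes into one feature vector."""
    codes = [d.transform(h.reshape(1, -1)).ravel()
             for h, d in zip(histograms, dictionaries)]
    return np.concatenate(codes)
```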


Figure 5.3: 30-bin cumulative histograms of the Contraction Index of 100 segments per emotion produced from the recordings of 12 people: anger (top left), sadness (top right), happiness (middle left), fear (middle right), surprise (bottom left), and disgust (bottom right).


5.3.3 Classification

Once the data have been appropriately represented, we obtain a training set of pairs (v_i, y_i), where v_i can be a concatenation of histograms or of coefficient vectors, while y_i encodes one of M possible classes (emotions).

Each basic emotion represents one of six classes, and we want to classify the gestures correctly. Assuming we have a labelled training set, we can train a multi-class classifier, building a sample-based model. We need this model to generalize the properties of each single class, so that, by basing its answers on this model, our framework will be able to classify new unlabelled data, performing automatic emotion recognition.

We base our classification step on a one-vs-one Support Vector Machine (SVM) [Vap98], followed by an Error Correcting Output Coding (ECOC) [KJO02], that is, we build a matrix that models the correlation between the different classes. ECOC allows us to attenuate the effect of having a severely non-separable problem, and improves the overall classification results by 3-5%.
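As a hedged illustration with scikit-learn (not the exact implementation used in this work), a multi-class SVM wrapped in an error-correcting output code scheme can be set up as follows; here the code matrix is generated randomly, whereas our matrix is built to model the correlation between the classes.

```python
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import SVC

# Linear SVM wrapped in an Error Correcting Output Code scheme.
classifier = OutputCodeClassifier(SVC(kernel='linear'), code_size=2.0, random_state=0)
# Usage: classifier.fit(V_train, y_train); y_pred = classifier.predict(V_test)
```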

5.4 Method assessment

We collected a dataset of people expressing emotions with their body, using two different acquisition devices: the Qualisys motion capture system [Qua13] and a Microsoft Kinect [Kin13].

In both cases we recorded sequences of 3D coordinates (or 3D skeletons), corresponding to the same body joints.

The two datasets include 12 actors, four female and eight male, aged between 24 and 60, performing the six basic emotions with their body, each of them a number of times ranging from three to seven. We obtained a total of about 100 videos, which have been segmented according to the Kinetic Energy values to separate segments of expressive gestures. Each datum has been associated with a label, obtained by the actor stating what type of emotion he or she was expressing.

5.4.1 Data validation

To evaluate the degree of difficulty of the task, the obtained segments have been validated by humans (60 anonymous people who did not participate in the data acquisition process) through an on-line survey.

Each participant was shown 10 segments displaying the 3D skeletons, and was asked to guess which emotion was being expressed or, alternatively, to choose an "I don't know" item. Figure 5.4 shows an example of the input stimulus, where the limited amount of information conveyed by the data is clear.


Figure 5.4: An example of the stimuli used for data validation with humans.

Table 5.2: Human validation results. First column: label of the emotion. Second column: results obtained with 6 different emotions. Third column: results obtained with 4 different emotions.

Emotion      6 emotions   4 emotions
happiness    81.3%        87.5%
fear         48.5%        81.2%
disgust      37.2%        -
sadness      86.7%        94.9%
anger        73.9%        82.0%
surprise     35.2%        -
average      61.9%        85.2%

Indeed, the goal of this experiment was to understand to what extent a human observer is able to understand emotions simply by looking at body movements. The sole 3D skeleton is a guarantee that the user is not exploiting other information, such as insights from facial expressions or the context. The human validation underlined the fact that three out of the six basic emotions (happiness, sadness, and anger) are clearly recognizable from body movements, while the other three (surprise, disgust, and fear) are easily confused with one another. For this reason a sub-problem of four classes (happiness, sadness, anger, and fear) has also been taken into account. The results of the human validation, with respect to the data ground truth, are reported in Table 5.2 and clearly testify to the difficulty of the problem we are addressing.

5.4.2 Classification results

Two different problems have been considered: the first one considering only four emotions (Happiness, Sadness, Anger, Fear), and the second considering all of the six basic emotions.

The emotion dimensions described in Section 5.3 have been computed from the data and summarized in 30-bin histograms.


Table 5.3: Comparison of the early results obtained with data captured with Qualisys and Kinect, represented as 30-bin histograms.

        Qualisys                Kinect
        4 classes  6 classes    4 classes  6 classes
RS      79.2       62.3         82.0       68.5
LOSO    74.6       54.2         78.7       61.6

Then a dictionary per feature has been learned, with the corresponding coefficients and encoding matrix. Finally, a linear one-vs-one SVM with ECOCs has been used to classify all the different feature vectors.

A first set of experiments, Random Split (RS), has been performed by splitting the dataset randomly into a training and a test set of equal size (the procedure has been repeated 50 times and the results we show are an average over the repetitions): in this case, gestures performed by the same subject might appear both in the training and in the test set. A second set of experiments, Leave One Subject Out (LOSO), has been carried out with a leave-one-subject-out cross validation, training the classifiers over 11 subjects and testing them on the data of the 12th, left-out subject.
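A minimal sketch of the two evaluation protocols using scikit-learn utilities (hypothetical names: V is the matrix of gesture feature vectors, y the emotion labels, subjects the per-gesture subject identifiers):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, train_test_split

def random_split_accuracy(classifier, V, y, n_repetitions=50):
    """RS protocol: average accuracy over repeated 50/50 random splits."""
    scores = []
    for seed in range(n_repetitions):
        V_tr, V_te, y_tr, y_te = train_test_split(V, y, test_size=0.5, random_state=seed)
        scores.append(classifier.fit(V_tr, y_tr).score(V_te, y_te))
    return float(np.mean(scores))

def loso_accuracy(classifier, V, y, subjects):
    """LOSO protocol: train on 11 subjects, test on the left-out one."""
    scores = [classifier.fit(V[tr], y[tr]).score(V[te], y[te])
              for tr, te in LeaveOneGroupOut().split(V, y, groups=subjects)]
    return float(np.mean(scores))
```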

Proof of concept

As a feasibility study, we have used the Qualisys data to perform a first set of experiments. These are reliable data, but they are more difficult to obtain than with Kinect. In fact we have a smaller number of segments acquired with Qualisys (213 segments for four emotions and 310 segments for six emotions) than with Kinect (398 segments for four emotions and 579 segments for six emotions). Table 5.3 shows the preliminary results obtained using 30-bin histograms as feature vectors, on both Qualisys and Kinect data. Comparing the results, we get better results on the Kinect data, although they are less reliable, thanks to the larger size of the dataset: a small dataset is not enough to capture the high variability typical of the problem we are addressing. The results we show in the remainder of the section are obtained on the bigger Kinect dataset.

Including the encoding matrix

The first experiments aim at assessing whether the computation of the encoding matrix C can influence D and U. The first results we show, in Table 5.4, are the classification results obtained considering the 4-class problem in the RS setup. We compare the accuracy obtained computing U with η = 0, thus without considering the C matrix in the functional, and with η = 0.7, with τ ranging between 0.1 and 0.9. In this case, as in all the other experiments, the value assigned to η has been selected as the best over different classification runs.


Table 5.4: Comparison of the accuracy obtained optimizing the functional without or with the C term.

τ     η = 0   η = 0.7
0.1   77.4    83.6
0.3   76.6    82.3
0.5   77.4    82.3
0.7   77.4    81.9
0.9   78.6    82.0

Table 5.5: RS - summary of the best classification rates.

            4 classes   6 classes
Raw Data    82.0        68.5
U           84.9        69.9
CX          84.5        69.8

Figure 5.6 shows how the classification accuracy varies as the parameters η and τ vary: we are in the 4-class RS problem, with K = 20. It is clear that the classification results are positively influenced by the inclusion of C in the functional, even if it is not directly involved in the computation of the representations of the test set.

Comparing different representations

Figure 5.5 considers the RS setup and compares the raw data (histogram) representation with the coding vectors u and Cx. We have an improvement of the classification accuracy if we use the coding vectors, and the u's achieve better results than the Cx's. In this figure we show the results obtained by fixing η = 0.7 and letting τ range between 0.1 and 0.9, and K between 20 and 40.

Figure 5.7 shows a comparison between the classification results obtained in the LOSO setup with the raw data (histograms), with the u's, and with the Cx's. In this case, it is still clear that raw data are outperformed by the adaptive representations. The u's look in general more stable with respect to the different parameters. We are currently analysing this behavior to obtain a clearer understanding of the relationship between the two representations u and Cx.

To summarize the classification results, Table 5.5 and Table 5.6 present the best classification percentages obtained using the different data representations.


Figure 5.5: RS - Comparison of the classification results obtained with histograms (blue bars), Us (red bars), and Cxs (yellow bars), for the 4-class (left) and 6-class (right) problems.

Table 5.6: LOSO - summary of the best classification rates.

            4 classes   6 classes
Raw Data    78.7        61.6
U           81.4        63.4
CX          81.3        63.7


Figure 5.6: The accuracy values obtained varying τ and η. The maximum is achieved for τ = 0.9 and η = 0.7.


Figure 5.7: LOSO - Comparison of the classification results obtained with histograms (blue bars), Us (red bars), and Cxs (yellow bars), for the 4-class (left) and 6-class (right) problems.


Figure 5.8: Two different screens of the GUI.

5.5 Application to the design of serious games

The described system has been used as a building block in the design of serious games developed within the ASC-Inclusion project.

The goal is to implement and evaluate an on-line body expression analyser, showing children with Autism Spectrum Conditions how they can improve their context-dependent body emotional expressiveness. Based on the movements of the child acquired by RGB-D sensors (such as the Microsoft Kinect), the set of emotionally relevant features is extracted and analysed to infer the emotional state of the child.

The body movement analyser is based on the EyesWeb XMI development platform2, which allows us to extract the feature vectors in real time and to give feedback to the user.

We designed two game demonstrations. Both games perform live automatic emotion recognition, and interact with the user by asking her to guess an emotion and to express an emotion with the body. In the remainder of this section the GUIs and the games are described in detail.

5.5.1 Graphical user interface

Both game demonstrations share a very simple GUI that recalls a blackboard with white chalk writings over it. Interaction with the GUI is gestural and is performed with the help of a Kinect sensor: the user may select one of the available options by driving a pointer with a hand movement. There is no sound; all the questions and instructions are given as text. Figure 5.8 shows examples of the GUI.

2 http://www.infomus.org/eyesweb


5.5.2 Body emotion game demonstration

The Body Emotion Game is a short game composed of two parts. In the first part the system shows the player a video of a person expressing an emotion. Since the video shows a 3D skeleton, it carries only the information conveyed by the body movements and discards other stimuli (context, facial expression, voice). Then, the system shows a set of possible emotions (the six basic emotions: sadness, anger, happiness, fear, disgust, and surprise) and the player has to select the emotion that, in her opinion, the person in the video was feeling. If the player selects the correct emotion she gains one point. In the second part the player has to act the same emotion of the video with her body. The system will try to understand what emotion is being expressed by the player. If the recognized emotion is the correct one, the player gains another point. If the recognized emotion is different from the correct one, the system will ask the player whether she wants to try again. If she says no, she will not gain any other point. She has up to three attempts. If the system answers correctly, or the user spends all three attempts, the game ends and the user is given a final score that includes the points acquired during the two phases of the game.

5.5.3 Emotional charades

The Emotional Charades is a two-player game. Player 1 and Player 2 cannot see each other; they both have access to a computer equipped with a Kinect sensor. In the first round, Player 1 has to choose one of the six basic emotions and express it with the body. Player 2 will see only the depth map produced by Player 1 and will have to guess which emotion she chose. The computer will try to do the same. Player 1 will be shown the answers given by the computer and by Player 2, and will say whether they have given the correct answer or not. This is a cooperative game of the humans against the computer, so the points gained by each participant depend on the answers given by the others. Player 1 will gain 2 points if both Player 2 and the computer guess the correct emotion, 1 point if only one of them is correct, and no points if they are both wrong. Player 2 and the computer will gain 1 point each if they are both correct, 2 points if the other guesser is wrong, and no points if their answer is wrong. The game continues with the roles inverted. At the beginning of the game there is an extra score, which counts the number of correct answers given by the human players and the number of correct answers given by the computer. Figure 5.9 shows an example of interaction between two players.

5.6 Discussion

In this chapter we have presented a complete framework to monitor and process body movements to automatically understand emotions. The input data we have processed are acquired by motion capture systems.


Figure 5.9: An example of interaction between two players

We have identified and extracted a set of emotion-related features to describe the gestures belonging to different emotional states, and we have represented them both via histograms and adaptively, with dictionary learning and sparse coding. The whole system relies on a motion capture device, Python libraries, and EyesWeb XMI. The obtained classification results are in general very encouraging, compared to the ones obtained in the human validation. In this chapter we obtained further evidence of the appropriateness of adaptive representations for large dimensional problems. The results obtained so far speak in favour of sparse coding, with and without the estimate of an encoding matrix, although further investigations are needed to obtain a full understanding of the potential of the different adaptive solutions. The next steps will concern the feedback given by the system. From the discrete classification labelling, the system will move to a continuous Valence-Arousal labelling, to be more coherent with the rest of the platform of the ASC-Inclusion project.


Chapter 6

Conclusions

This thesis investigated the problem of learning adaptive representations for analysing large dimensional collections of data. In particular, sparse adaptive data representations have been studied in depth, and they have been applied in different domains. The common thread of the application domains we considered is the nature of the data: in all cases data are abundantly available and they are (very) high dimensional. This poses a problem in the data analysis process, since these types of data are usually affected by noise and redundancy which are difficult to identify. Dictionary learning and sparse coding allow us to find meaningful representations, based on a dictionary of elements (or atoms) which are learnt from data and thus nicely adapt to a specific application domain. The main contribution of this thesis consists in exploring such representations to address three different problems belonging to different domains:

• Medical image analysis: here we tackle the problem of tissue segmentation on contrast-enhanced 3D+T Magnetic Resonance images.

• Video analysis: we devise a method for modeling the background of the scene and adapting it to scene variations as we are provided with new observations coming from a 2D+T video stream.

• Human Machine Interaction: we study the problem of automatic emotion recognition, which we address by classifying a 3D+T stream of 3D coordinates, once they have been appropriately segmented and represented.
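The sketch below illustrates, under simplifying assumptions and with plain numpy, the common scheme shared by the three applications: alternating between a sparse coding step (here, soft-thresholded gradient updates on the codes) and a dictionary update step. It is an illustration of the idea only, not the optimization procedure used in the thesis; data, step size, and penalty weight are placeholders.

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def dictionary_learning(X, n_atoms, lam=0.1, n_iter=50, step=0.1):
    """Alternating minimization for X ~ U D with an l1 penalty on the codes U."""
    n_samples, dim = X.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((n_atoms, dim))
    D /= np.linalg.norm(D, axis=1, keepdims=True)        # unit-norm atoms
    U = np.zeros((n_samples, n_atoms))                    # sparse codes
    for _ in range(n_iter):
        # Sparse coding step (ISTA-like update with D fixed).
        grad = (U @ D - X) @ D.T
        U = soft_threshold(U - step * grad, step * lam)
        # Dictionary update step (least squares with U fixed), then renormalize atoms.
        D = np.linalg.lstsq(U, X, rcond=None)[0]
        D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1e-12)
    return D, U

D, U = dictionary_learning(np.random.default_rng(1).standard_normal((300, 40)), n_atoms=20)
print(D.shape, U.shape)   # (20, 40) (300, 20)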

In each field we exploited all the prior knowledge available, either provided by experts (physicians, engineers, or psychologists, in the three domains) or by the nature of the available data (for instance the spatial stability of the data in the first two cases, or the temporal smoothness typical of all the domains we addressed), and devised solutions based on sparse coding and dictionary learning. In all cases the method we proposed proved to be very competitive with the state of the art.


In more detail:

• In the case of Medical Image Analysis, after a proof of concept on a classical application of dictionary learning (image denoising), we focused on the more challenging problem of tissue segmentation: the data were 3D Magnetic Resonance images varying over time. The goal in this application was to discriminate between the different tissues forming a certain anatomical region, basing the segmentation on the response of a tissue to a contrast agent. We learned dictionaries of the curves described by each voxel over time, and obtained very good segmentation results. The atoms of the dictionaries were denoised prototypes of the behaviours of the voxels belonging to different tissues: this was the relevant role of dictionary learning in this particular application.

• In the case of Video Analysis, in Chapter 4 we have shown the use of adaptive representations to build a background model robust with respect to illumination changes and dynamic background areas. The main assumption of this work is that, in a video shot by a fixed camera, the majority of the pixels belong to the background for the majority of the time, and things change smoothly along the time axis. This is information related to the nature of the data, and it is exploited by our framework, which adapts itself online when and where it is needed. The dictionaries store all the possible configurations of the background of the scene, making foreground detection easy and precise (a minimal sketch of this detection rule is given after this list).

• In the case of Human Machine Interaction, in Chapter 5 we have dealt with streams of 3D coordinates in order to automatically recognize emotions from body gestures. In this case not only do the data have very high dimensionality, but the variability of the problem itself is very large, as can be seen from the results of the data validation performed by humans. The results obtained in that work show that adaptive representations help in capturing the relevant characteristics that allow the discrimination between the different emotions.
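Referring back to the background modeling point above, the sketch below illustrates one plausible way a learned background dictionary can flag foreground: a patch that the background atoms reconstruct poorly is likely foreground. It uses scikit-learn and random placeholder data, and the threshold is arbitrary; it is not the thesis implementation.

import numpy as np
from sklearn.decomposition import DictionaryLearning

def foreground_mask(patches, dictionary, threshold):
    """patches: (n_patches, dim) vectorized image patches of the current frame."""
    codes = dictionary.transform(patches)
    reconstruction = codes @ dictionary.components_
    errors = np.linalg.norm(patches - reconstruction, axis=1)
    return errors > threshold   # True where the background model fails to explain the patch

# Usage with placeholder data: learn the background dictionary on patches drawn
# from frames assumed to contain mostly background.
rng = np.random.default_rng(0)
background_patches = rng.standard_normal((500, 64))      # e.g. vectorized 8x8 patches
dico = DictionaryLearning(n_components=32, alpha=1.0).fit(background_patches)

new_patches = rng.standard_normal((100, 64))
print(foreground_mask(new_patches, dico, threshold=7.0).sum(), "patches flagged as foreground")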

In the case of tissue segmentation and background modeling, we are currently developing parallel implementations of the algorithms, which come naturally from the design of the frameworks. This would save computational time, speeding up the output computation.
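The kind of parallelism we have in mind is sketched below: both frameworks encode many independent signals (voxel time curves or image patches) against a fixed dictionary, so the encoding step can be split across worker processes without changing the algorithm. The snippet is an assumption-laden illustration with placeholder data, not the actual implementation.

import numpy as np
from multiprocessing import Pool
from sklearn.decomposition import sparse_encode

def encode_chunk(args):
    chunk, dictionary = args
    return sparse_encode(chunk, dictionary, algorithm="lasso_lars", alpha=0.1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signals = rng.standard_normal((4000, 50))     # e.g. voxel time curves
    dictionary = rng.standard_normal((20, 50))    # a (fixed) learned dictionary
    chunks = np.array_split(signals, 8)           # one chunk per worker
    with Pool(processes=8) as pool:
        codes = np.vstack(pool.map(encode_chunk, [(c, dictionary) for c in chunks]))
    print(codes.shape)                            # (4000, 20)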

In the case of automatic emotion recognition, we are currently focusing on a different output of the system, based on the values of two emotion-related dimensions, Valence and Arousal, which describe whether an emotion is positive or negative and how much we are involved in expressing it. Switching from a discrete output, based on six classes, to a continuous one, based on points on the Valence-Arousal plane, requires further investigation into the application of adaptive representations.
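As an illustration of the planned change of output space, the sketch below maps each discrete emotion label to a point on the Valence-Arousal plane. The coordinates are rough placements inspired by the circumplex-model literature and serve only as placeholders; they are not the ASC-Inclusion mapping, nor the continuous output the system will actually produce.

VALENCE_AROUSAL = {
    "happiness": ( 0.8,  0.6),
    "surprise":  ( 0.2,  0.8),
    "anger":     (-0.6,  0.8),
    "fear":      (-0.7,  0.7),
    "disgust":   (-0.6,  0.2),
    "sadness":   (-0.7, -0.5),
}

def to_valence_arousal(label):
    """Map a discrete emotion label to a (valence, arousal) pair in [-1, 1] x [-1, 1]."""
    return VALENCE_AROUSAL[label]

print(to_valence_arousal("sadness"))   # (-0.7, -0.5)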

Future work will include a more comprehensive analysis of the role of the encoding matrix C and of its computational benefits.
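The computational appeal of an encoding matrix can be illustrated as follows, assuming (as in dual dictionary learning schemes) that the encoder acts linearly on the signal: once C has been learned, new signals are encoded with a single matrix product, whereas sparse coding against the dictionary alone requires an iterative solver per signal. The snippet uses random placeholder matrices and is only a timing illustration, not the thesis experiments.

import time
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 50))    # new signals to encode (one per row)
D = rng.standard_normal((20, 50))      # dictionary (atoms as rows)
C = rng.standard_normal((20, 50))      # placeholder encoding matrix, codes ~ X @ C.T

t0 = time.perf_counter()
U_iterative = sparse_encode(X, D, algorithm="lasso_lars", alpha=0.1)
t1 = time.perf_counter()
U_feedforward = X @ C.T                # one matrix product, no per-signal optimization
t2 = time.perf_counter()

print(f"iterative sparse coding: {t1 - t0:.3f}s, feed-forward encoding: {t2 - t1:.3f}s")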

