Nonlinear test statistics - Gruppo1-2 INFN...

Nonlinear test statistics

The optimal separation surface may not be a hyperplane → nonlinear test statistic

[Figure: regions labelled "accept H0" and "H1"]

There are many multivariate statistical methods: neural networks, kernel density methods, decision trees, ...

Particle physics benefits here from the progress made in machine learning (for example in research on artificial intelligence).

Glen Cowan, Multivariate Statistical Methods in Particle Physics

Linear decision boundaries

A linear decision boundary is only optimal when both classes follow multivariate Gaussians with equal covariances and different means.

[Figure: axes x1, x2]

For some other cases a linear boundary is almost useless.

[Figure: axes x1, x2]


Nonlinear transformation of inputs

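[Figure: input space with axes x1, x2]

We can try to find a transformation, $x_1, \ldots, x_n \rightarrow \varphi_1(\vec{x}), \ldots, \varphi_m(\vec{x})$,

so that the transformed “feature space” variables can be separated better by a linear boundary:

$\varphi_1 = \tan^{-1}(x_2 / x_1)$

$\varphi_2 = \sqrt{x_1^2 + x_2^2}$

[Figure: feature space with axes φ1, φ2]

Here, guess fixed basis functions (no free parameters).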
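As a minimal sketch of the idea, assuming two toy classes that lie on concentric circles (hypothetical data, not taken from the slides), the polar-coordinate transformation turns a circular boundary into a single linear cut on the radius. `math.atan2` is used in place of a plain arctangent of x2/x1 so that x1 = 0 is handled.

```python
import math

def to_feature_space(x1, x2):
    """Map (x1, x2) to (phi1, phi2) = (polar angle, radius)."""
    phi1 = math.atan2(x2, x1)   # equals atan(x2/x1) for x1 > 0, and also handles x1 <= 0
    phi2 = math.hypot(x1, x2)   # sqrt(x1^2 + x2^2)
    return phi1, phi2

# Hypothetical toy data: class 0 on a circle of radius 1, class 1 on a circle of radius 3.
angles = [2 * math.pi * k / 12 for k in range(12)]
class0 = [(1.0 * math.cos(a), 1.0 * math.sin(a)) for a in angles]
class1 = [(3.0 * math.cos(a), 3.0 * math.sin(a)) for a in angles]

# In feature space one linear cut, phi2 = 2, separates the two classes.
radius_cut = 2.0
all_class0_below = all(to_feature_space(*p)[1] < radius_cut for p in class0)
all_class1_above = all(to_feature_space(*p)[1] > radius_cut for p in class1)
```

No straight line in the (x1, x2) plane separates these two classes, while the cut on the radius does so exactly.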

Introduction to neural networks

They are used in neurobiology, pattern recognition, financial mathematics, ...; here they are just one type of test statistic.

Suppose t(x) has the form

$t(\vec{x}) = s\left(a_0 + \sum_{i=1}^{n} a_i x_i\right)$, with the sigmoid activation function $s(u) = \frac{1}{1 + e^{-u}}$.

This is a neural network with a single layer of nodes (single-layer perceptron).

If s(u) is monotonic → it is equivalent to a linear t(x).

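Multilayer neural networks

The outputs of the first layer serve as the input values of a subsequent layer.

The value of the nodes in the intermediate (hidden) layer is given by

$h_i(\vec{x}) = s\left(w_{i0} + \sum_{j=1}^{n} w_{ij} x_j\right)$

and the output of the network is

$t(\vec{x}) = s\left(a_0 + \sum_{i} a_i h_i(\vec{x})\right)$

where the $w_{ij}$ and $a_i$ are the weights (connection strengths).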
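The forward pass of such a network can be sketched as follows, assuming the usual form in which each hidden node applies a sigmoid to a weighted sum of the inputs and the output node applies a sigmoid to a weighted sum of the hidden nodes. The weight names (`w`, `w0`, `a`, `a0`) and their values are hypothetical.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def mlp_output(x, w, w0, a, a0):
    """One hidden layer: h_i = s(w0[i] + sum_j w[i][j] x[j]); t = s(a0 + sum_i a[i] h_i)."""
    hidden = [sigmoid(w0_i + sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)))
              for w_i, w0_i in zip(w, w0)]
    return sigmoid(a0 + sum(a_i * h_i for a_i, h_i in zip(a, hidden)))

# Hypothetical weights for 2 inputs and 3 hidden nodes (not from the slides).
w = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
w0 = [0.0, -0.5, 1.0]
a = [1.5, -1.0, 0.7]
a0 = 0.1

t = mlp_output([0.3, -1.2], w, w0, a, a0)
```

Because the output passes through a sigmoid, t always lies between 0 and 1.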

Discussion of neural networks

It is easy to generalize to an arbitrary number of layers of nodes.

Feed-forward network: the value of a node depends only on the previous layer.

More nodes → t(x) gets closer to the optimum, but a larger number of parameters must be determined.

The parameters are determined by minimizing an error function,

$E(\vec{w}) = E_0\left[(t - t^{(0)})^2\right] + E_1\left[(t - t^{(1)})^2\right]$

where $t^{(0)}$, $t^{(1)}$ are preassigned target values, for example 0 and 1 for the sigmoid, and $E_0$, $E_1$ denote expectation values under $H_0$ and $H_1$.

The expectation values are computed on a Monte Carlo sample (training sample).

The procedure is complicated, and standard software is used.
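A minimal sketch of this minimization for the simplest case, a single-layer perceptron with a sigmoid output, is given below. It assumes the error function is the sum of the mean squared deviations of t from the targets 0 (H0 events) and 1 (H1 events), evaluated on a toy Gaussian sample standing in for the MC training sample. Plain gradient descent with a hypothetical learning rate is used only for illustration; real analyses rely on standard software.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def error_function(a0, a, sample0, sample1, t0=0.0, t1=1.0):
    """E = <(t - t0)^2> over H0 events + <(t - t1)^2> over H1 events."""
    def mean_sq(sample, target):
        return sum((sigmoid(a0 + sum(ai * xi for ai, xi in zip(a, x))) - target) ** 2
                   for x in sample) / len(sample)
    return mean_sq(sample0, t0) + mean_sq(sample1, t1)

rng = random.Random(1)
sample0 = [(rng.gauss(-1, 1), rng.gauss(-1, 1)) for _ in range(200)]  # toy "MC" for H0
sample1 = [(rng.gauss(+1, 1), rng.gauss(+1, 1)) for _ in range(200)]  # toy "MC" for H1

def gradient_step(a0, a, rate=0.5):
    """One gradient-descent update of (a0, a) on the error function."""
    g0, g = 0.0, [0.0] * len(a)
    for sample, target in ((sample0, 0.0), (sample1, 1.0)):
        for x in sample:
            t = sigmoid(a0 + sum(ai * xi for ai, xi in zip(a, x)))
            # d/du of (s(u) - target)^2 is 2 (t - target) t (1 - t)
            common = 2.0 * (t - target) * t * (1.0 - t) / len(sample)
            g0 += common
            g = [gi + common * xi for gi, xi in zip(g, x)]
    return a0 - rate * g0, [ai - rate * gi for ai, gi in zip(a, g)]

a0, a = 0.0, [0.0, 0.0]
e_start = error_function(a0, a, sample0, sample1)
for _ in range(100):
    a0, a = gradient_step(a0, a)
e_end = error_function(a0, a, sample0, sample1)
```

With all weights at zero the output is 0.5 for every event, so the starting error is 0.25 + 0.25 = 0.5, and the minimization lowers it from there.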


Network architecture: one hidden layer

Theorem: An MLP with a single hidden layer having a sufficiently large number of nodes can approximate arbitrarily well the Bayes optimal decision boundary.

Holds for any continuous non-polynomial activation function: Leshno, Lin, Pinkus and Schocken (1993), Neural Networks 6, 861-867.

In practice one often chooses a single hidden layer and tries increasing the number of nodes until no further improvement in performance is found.


More than one hidden layer

“Relatively little is known concerning the advantages and disadvantages of using a single hidden layer with many units (neurons) over many hidden layers with fewer units. The mathematics and approximation theory of the MLP model with more than one hidden layer is not well understood.”

“Nonetheless there seems to be reason to conjecture that the two hidden layer model may be significantly more promising than the single hidden layer model, ...”

A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numerica (1999), pp. 143-195.


Overtraining

If the network has too many nodes, after training it will tend to conform too closely to the training data:

[Figure: overtraining]

The classification error rate on the training sample may be very low, but it would be much higher on an independent data sample.

Therefore it is important to evaluate the error rate with a statistically independent validation sample.


Monitoring overtraining

If we monitor the value of the error function E(w) at every cycle of the minimization, for the training sample it will continue to decrease.

But for the validation sample it may initially decrease and then at some point increase, indicating overtraining.

[Figure: error versus training cycle for the training sample and the validation sample]
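One way to act on this behaviour is to stop at the cycle where the validation error is lowest. The sketch below assumes the validation error has been recorded at every cycle; the `patience` parameter and the error values are hypothetical, chosen only to reproduce the shape described above.

```python
def best_stopping_cycle(validation_errors, patience=3):
    """Return the index of the cycle with the lowest validation error,
    ending the scan once `patience` cycles in a row fail to improve on it."""
    best_cycle, best_error = 0, validation_errors[0]
    for cycle, err in enumerate(validation_errors):
        if err < best_error:
            best_cycle, best_error = cycle, err
        elif cycle - best_cycle >= patience:
            break
    return best_cycle

# Hypothetical error curves (illustrative numbers, not measured results).
training_error = [0.50, 0.40, 0.33, 0.28, 0.24, 0.21, 0.19, 0.17]
validation_error = [0.52, 0.43, 0.37, 0.34, 0.35, 0.37, 0.40, 0.44]

stop_at = best_stopping_cycle(validation_error)
```

Here the training error keeps falling, while the validation error reaches its minimum at cycle 3 and then rises.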


Validation and testing

The validation sample can be used to make various choices about the network architecture, e.g., adjust the number of hidden nodes so as to obtain good “generalization performance” (ability to correctly classify unseen data).

If the validation stage is iterated many times, the estimated error rate based on the validation sample has a bias, so strictly speaking one should finally estimate the error rate with an independent test sample.

Rule of thumb if data are not too expensive (Narsky): train : validate : test = 50 : 25 : 25

But this depends on the type of classifier. Often the bias in the error rate from the validation sample is small and one can omit the test step.
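The 50 : 25 : 25 rule of thumb can be applied with a simple random split. This is a minimal sketch; the function name and the fixed seed are illustrative choices.

```python
import random

def split_sample(events, fractions=(0.50, 0.25, 0.25), seed=0):
    """Shuffle and split events into (train, validate, test) lists."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_valid = int(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # the remainder goes to the test sample
    return train, validate, test

train, validate, test = split_sample(range(1000))
```

Shuffling before splitting keeps the three samples statistically independent of any ordering in the input.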

Example of a neural network at LEP II

Signal: e+e− → W+W− (4 well-separated jets)

Background: e+e− → qqgg (4 less well-separated jets)

Input variables are based on jet structure and event shape; none of them alone allows signal and background to be separated.

The neural network gives a better separation.

(Garrido, Juste and Martinez, ALEPH 96-144)

Probability Density Estimation (PDE) techniques

See e.g. K. Cranmer, Kernel Estimation in High Energy Physics, CPC 136 (2001) 198; hep-ex/0011057; T. Carli and B. Koblitz, A multi-variate discrimination technique based on range-searching, NIM A 501 (2003) 576; hep-ex/0211019

Construct non-parametric estimators of the pdfs $p(\vec{x}|H_0)$ and $p(\vec{x}|H_1)$, and use these to construct the likelihood ratio.

(An n-dimensional histogram is a brute-force example of this.)

More clever estimation techniques can get this to work for (somewhat) higher dimension.

Product of one-dimensional pdfs

First rotate to uncorrelated variables, i.e., find a matrix A such that for $\vec{y} = A\vec{x}$ we have $\mathrm{cov}[y_i, y_j] = 0$ for $i \neq j$.

Estimate the d-dimensional joint pdf as the product of 1-d pdfs,

$p(\vec{x}) = \prod_{i=1}^{d} p_i(x_i)$ (here x decorrelated)

This does not exploit non-linear features of the joint pdf, but it is simple and may be a good approximation in practical examples.


Correlation vs. independence

In general a multivariate distribution p(x) does not factorize into a product of the marginal distributions for the individual variables:

$p(\vec{x}) = \prod_{i=1}^{n} p_i(x_i)$ holds only if the components of x are independent.

Most importantly, the components of x will generally have nonzero covariances (i.e. they are correlated):

$V_{ij} = \mathrm{cov}[x_i, x_j] = E[x_i x_j] - E[x_i]E[x_j] \neq 0$


Decorrelation of input variables

But we can define a set of uncorrelated input variables by a linear transformation, i.e., find the matrix A such that for $\vec{y} = A\vec{x}$ the covariances $\mathrm{cov}[y_i, y_j] = 0$ for $i \neq j$.

For the following suppose that the variables are “decorrelated” in this way for each of $p(\vec{x}|H_0)$ and $p(\vec{x}|H_1)$ separately (since in general their correlations are different).
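For two input variables the decorrelating matrix A can be taken to be a rotation, with an angle fixed by the sample covariances. The sketch below assumes this two-dimensional special case and a toy correlated sample; in more dimensions one would instead diagonalize the full covariance matrix.

```python
import math
import random

def covariance(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def decorrelating_rotation(x1, x2):
    """Return the 2x2 rotation A such that cov[y1, y2] = 0 for y = A x."""
    v11, v22, v12 = covariance(x1, x1), covariance(x2, x2), covariance(x1, x2)
    theta = 0.5 * math.atan2(2.0 * v12, v11 - v22)
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

# Toy correlated sample: x2 = 0.8 * x1 + noise.
rng = random.Random(3)
x1 = [rng.gauss(0, 1) for _ in range(2000)]
x2 = [0.8 * a + rng.gauss(0, 0.5) for a in x1]

A = decorrelating_rotation(x1, x2)
y1 = [A[0][0] * a + A[0][1] * b for a, b in zip(x1, x2)]
y2 = [A[1][0] * a + A[1][1] * b for a, b in zip(x1, x2)]
```

The covariance of (y1, y2) vanishes up to rounding, whereas that of (x1, x2) is close to 0.8. In a classification problem this would be done separately for the signal and background samples.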


Decorrelation is not enough

But even with zero correlation, a multivariate pdf p(x) will in general have nonlinearities and thus the decorrelated variables are still not independent.

[Figure: pdf in the (x1, x2) plane with zero covariance but components still not independent]

The components are not independent, since clearly

$p(x_2 | x_1) \equiv \frac{p(x_1, x_2)}{p_1(x_1)} \neq p_2(x_2)$

and therefore

$p(x_1, x_2) \neq p_1(x_1)\, p_2(x_2)$
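A quick numerical check of this point, using a hypothetical toy distribution (x1 uniform on [-1, 1] and x2 = x1^2, not the distribution shown in the slide): the covariance of x1 and x2 is compatible with zero, yet x2 is completely determined by x1.

```python
import random

def covariance(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

rng = random.Random(7)
x1 = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
x2 = [a * a for a in x1]                  # fully dependent on x1

cov_linear = covariance(x1, x2)           # compatible with zero: E[x1^3] = 0 by symmetry
cov_nonlinear = covariance([a * a for a in x1], x2)   # clearly nonzero
```

The dependence only shows up through a nonlinear function of x1, which is exactly what a linear decorrelation cannot remove.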


Naive Bayes

But if the nonlinearities are not too great, it is reasonable to first decorrelate the inputs and take as our estimator for each pdf

$p(\vec{x}) = \prod_{i=1}^{n} p_i(x_i)$

So this at least reduces the problem to one of finding estimates of one-dimensional pdfs.

The resulting estimated likelihood ratio gives the Naive Bayes classifier (in HEP sometimes called the “likelihood method”).
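A minimal sketch of such a classifier follows. Two assumptions are made purely for illustration: each one-dimensional pdf is estimated by a Gaussian with the sample mean and variance (a stand-in for any 1-d estimator), and the likelihood ratio is taken as the H1 estimate divided by the H0 estimate. The toy inputs are generated already uncorrelated.

```python
import math
import random

def fit_gaussian_pdf(values):
    """Stand-in 1-d pdf estimate: a Gaussian with the sample mean and variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return lambda x: math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def product_pdf(x, pdfs_1d):
    """p(x) estimated as the product of the one-dimensional pdfs p_i(x_i)."""
    result = 1.0
    for xi, p in zip(x, pdfs_1d):
        result *= p(xi)
    return result

rng = random.Random(11)
# Toy training samples with two (already uncorrelated) input variables.
sample_h0 = [(rng.gauss(-1, 1), rng.gauss(-1, 1)) for _ in range(2000)]
sample_h1 = [(rng.gauss(+1, 1), rng.gauss(+1, 1)) for _ in range(2000)]

pdfs_h0 = [fit_gaussian_pdf([e[i] for e in sample_h0]) for i in range(2)]
pdfs_h1 = [fit_gaussian_pdf([e[i] for e in sample_h1]) for i in range(2)]

def likelihood_ratio(x):
    # Convention assumed here: H1 estimate divided by H0 estimate.
    return product_pdf(x, pdfs_h1) / product_pdf(x, pdfs_h0)
```

An event near the H1 mean gets a ratio above 1 and one near the H0 mean a ratio below 1, so a cut on the ratio acts as the test statistic.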


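Histograms

Start by considering the one-dimensional case; the goal is to estimate the pdf p(x) of a continuous r.v. x.

The simplest non-parametric estimate of p(x) is a histogram:

$p(x) = \frac{n_i}{N \Delta x_i}$ for x in bin i,

where $n_i$ is the number of entries in bin i, $\Delta x_i$ is the bin width and N is the total number of entries.

Bishop Section 2.5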
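The histogram estimate, the number of entries in a bin divided by the total number of entries times the bin width, can be written directly. This sketch uses a toy uniform sample and hypothetical bin edges of unequal width.

```python
import random

def histogram_estimate(sample, edges):
    """Return p = n_i / (N * dx_i) for each bin i defined by consecutive edges."""
    n_total = len(sample)
    densities = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        n_i = sum(1 for x in sample if lo <= x < hi)
        densities.append(n_i / (n_total * (hi - lo)))
    return densities

rng = random.Random(5)
sample = [rng.uniform(0.0, 1.0) for _ in range(10000)]
edges = [0.0, 0.1, 0.3, 0.6, 1.0]          # bins may have different widths
p_hat = histogram_estimate(sample, edges)
integral = sum(p * (hi - lo) for p, lo, hi in zip(p_hat, edges[:-1], edges[1:]))
```

Dividing by the bin width is what makes the estimate a density: it integrates to 1 over the binned range, and for this uniform sample each bin value is close to 1.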


Histograms (2)

Small bin width: estimate is very spiky, structure not really part of underlying distribution.

Medium bin width: best

Large bin width: too smooth and thus fails to capture e.g. bimodal character of parent distribution

Bishop Section 2.5


Counting events in a local volume

Consider a small volume V centred about $\vec{x} = (x_1, \ldots, x_D)$.

This is in contrast to the histogram where the bin edges were fixed.

Suppose from N total events we find K in V.

Take as estimate for p(x):

$p(\vec{x}) = \frac{K}{N V}$

Two approaches:

Fix V and determine K from the data

Fix K and determine V from the data
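Both approaches can be sketched in one dimension, where the "volume" is simply the length of an interval centred on x. The sample, the interval width and the value of K are hypothetical choices for illustration.

```python
import random

def density_fixed_volume(sample, x, width):
    """Fix V (an interval of the given width about x) and count K."""
    k = sum(1 for xi in sample if abs(xi - x) < width / 2)
    return k / (len(sample) * width)

def density_fixed_count(sample, x, k):
    """Fix K and take V as the smallest interval about x containing K events."""
    distances = sorted(abs(xi - x) for xi in sample)
    volume = 2.0 * distances[k - 1]
    return k / (len(sample) * volume)

rng = random.Random(2)
sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # toy standard Gaussian sample

p_fixed_v = density_fixed_volume(sample, 0.0, width=0.2)
p_fixed_k = density_fixed_count(sample, 0.0, k=500)
```

Both estimates should come out near the true standard Gaussian density at zero, about 0.399, each with its own statistical fluctuation.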


Kernels

E.g. take V to be hypercube centered at the x where we want p(x).

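Define $k(\vec{u}) = 1$ if $|u_i| \le 1/2$ for all $i = 1, \ldots, D$, and 0 otherwise,

i.e., the function is nonzero inside a unit hypercube centred about the origin of $\vec{u}$ (a hypercube centred about x when $\vec{u}$ is measured from x) and zero outside.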

k(u) is an example of a kernel function (here called a Parzen window).

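Kernel-based PDE (KDE, Parzen window)

Consider d dimensions and N training events, $\vec{x}_1, \ldots, \vec{x}_N$; estimate $f(\vec{x})$ with

$\hat{f}(\vec{x}) = \frac{1}{N h^d} \sum_{i=1}^{N} K\left(\frac{\vec{x} - \vec{x}_i}{h}\right)$

where h is the kernel bandwidth (smoothing parameter).

Use e.g. a Gaussian kernel:

$K(\vec{x}) = \frac{1}{(2\pi)^{d/2}} e^{-|\vec{x}|^2 / 2}$

Need to sum N terms to evaluate the function (slow); faster algorithms only count events in the vicinity of x (k-nearest neighbor, range search).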
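A one-dimensional sketch of a Gaussian kernel estimate, assuming the standard form in which the estimate at x is the average over training events of a Gaussian of width h centred on each event. The bandwidth and the toy sample are hypothetical.

```python
import math
import random

def gaussian_kde(x, training, h):
    """f(x) = 1/(N h) * sum_i K((x - x_i)/h) with a Gaussian kernel K, for d = 1."""
    norm = 1.0 / (len(training) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in training)

rng = random.Random(4)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # toy standard Gaussian sample
f_at_zero = gaussian_kde(0.0, training, h=0.2)
```

Every evaluation loops over all N training events, which is the cost mentioned above. A larger h gives a smoother but more biased estimate.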

Decision trees

A training sample of signal and background data is repeatedly split by successive cuts on its input variables.

The order in which the variables are used is based on the best separation between signal and background.

Example by MiniBooNE, B. Roe et al., NIM A 543 (2005) 577

Iterate until a stop criterion is reached, based e.g. on purity or minimum number of events in a node.

The resulting set of cuts is a ‘decision tree’.
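A single splitting step can be sketched as a scan over candidate cut values on one variable. The separation measure used here, the event-weighted Gini index p(1 - p) of the two daughter nodes, is an assumption for illustration (the text above only asks for the "best separation"), and the event values are hypothetical.

```python
def gini(n_sig, n_bkg):
    """Gini index p(1 - p) of a node with purity p = s / (s + b)."""
    total = n_sig + n_bkg
    if total == 0:
        return 0.0
    p = n_sig / total
    return p * (1.0 - p)

def best_cut(signal, background):
    """Scan cut values on one variable; return the cut minimising the
    event-weighted Gini index of the two daughter nodes, and that score."""
    best_value, best_score = None, float("inf")
    for cut in sorted(set(signal) | set(background)):
        s_lo = sum(1 for x in signal if x < cut)
        b_lo = sum(1 for x in background if x < cut)
        s_hi, b_hi = len(signal) - s_lo, len(background) - b_lo
        score = (s_lo + b_lo) * gini(s_lo, b_lo) + (s_hi + b_hi) * gini(s_hi, b_hi)
        if score < best_score:
            best_value, best_score = cut, score
    return best_value, best_score

# Hypothetical values of one input variable for signal and background events.
signal = [1.3, 2.1, 2.5, 3.0, 3.3, 3.8]
background = [0.2, 0.7, 0.9, 1.1, 1.6]
cut, score = best_cut(signal, background)
```

A full tree would repeat this scan over every input variable at each node, keep the best variable and cut, and recurse on the two daughter samples until the stop criterion is met.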


Decision tree size and stability

Usually one grows the tree first to a very large (e.g. maximum) size and then applies pruning.

For example one can recombine leaves based on some measure of generalization performance (e.g. using statistical error of purity estimates).

Decision trees tend to be very sensitive to statistical fluctuations in the training sample.

Methods such as boosting can be used to stabilize the tree.

Boosted decision trees

Boosting combines a number of classifiers into a stronger one; it improves stability with respect to fluctuations in the input data.

To use with decision trees, increase the weights of misclassified events and reconstruct the tree. Iterate → forest of trees (perhaps > 1000).

For the mth tree, with output $T_m(\vec{x})$, define a score $\alpha_m$ based on the error rate of the mth tree.

Boosted tree = weighted sum of the trees:

$T(\vec{x}) = \sum_{m} \alpha_m T_m(\vec{x})$

Algorithms: AdaBoost (Freund & Schapire), ε-boost (Friedman).
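The reweighting loop can be sketched with discrete AdaBoost. Several simplifications are assumed for illustration: each "tree" is a single cut on one variable (a decision stump), outputs are +1 for signal and -1 for background, and the score is the standard AdaBoost choice, half the logarithm of (1 - err)/err. The toy sample is hypothetical.

```python
import math

def train_stump(xs, ys, weights):
    """Best single cut on a 1-d input: returns (weighted error, cut, sign).
    The stump predicts `sign` for x > cut and -sign otherwise."""
    best = None
    for cut in sorted(set(xs)):
        for sign in (+1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (sign if x > cut else -sign) != y)
            if best is None or err < best[0]:
                best = (err, cut, sign)
    return best

def adaboost(xs, ys, n_trees):
    """Discrete AdaBoost; labels ys are +1 (signal) or -1 (background)."""
    n = len(xs)
    weights = [1.0 / n] * n
    forest = []
    for _ in range(n_trees):
        err, cut, sign = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)       # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)    # score of this tree
        forest.append((alpha, cut, sign))
        # Increase the weights of misclassified events, decrease the others.
        weights = [w * math.exp(-alpha * y * (sign if x > cut else -sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return forest

def boosted_output(x, forest):
    """Weighted sum of the individual tree outputs."""
    return sum(alpha * (sign if x > cut else -sign) for alpha, cut, sign in forest)

# Toy 1-d sample that no single cut can classify perfectly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [+1, +1, +1, -1, -1, -1, +1, +1, +1]
forest = adaboost(xs, ys, n_trees=3)
n_correct = sum(1 for x, y in zip(xs, ys) if boosted_output(x, forest) * y > 0)
```

On this sample the best single stump misclassifies three of the nine events, while the weighted sum of three stumps classifies all nine correctly.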

Comparison of multivariate methods (TMVA)

Choose the one that gives the best result.

Discussion of multivariate analyses

Given a test statistic, the next steps are, for example, to select n events and estimate the signal cross section.
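The cross-section estimate from a selected event count is illustrated here with a commonly used estimator, assumed because the formula itself is not shown: subtract the expected background b from the n selected events and divide by the signal efficiency times the integrated luminosity. All numbers are hypothetical.

```python
def cross_section(n_selected, background, efficiency, luminosity):
    """sigma = (n - b) / (efficiency * L); units follow those of 1 / luminosity."""
    return (n_selected - background) / (efficiency * luminosity)

# Hypothetical numbers for illustration only (luminosity in pb^-1 gives sigma in pb).
sigma = cross_section(n_selected=150, background=30.0, efficiency=0.4, luminosity=100.0)
```

Both b and the efficiency are normally taken from simulation, so any mismodelling in the simulation feeds directly into the estimate.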

But we must also estimate the systematic error.

If the training sample (MC) ≠ Nature, our estimates of backgrounds and efficiencies may be wrong (true also for simple cuts).

It is advisable to start with only 1-2 variables (those with the greatest discriminating power) and add others only if the improvements are significant:

- With fewer variables there is no ‘over-training’ problem
- Correlations often make adding another variable pointless and are potentially dangerous for the systematics
