CAPITOLO 1 - unina.it


Acknowledgements

There are so many people I should thank that entire pages would not suffice; I have therefore decided not to name anyone, hoping in this way not to offend anyone. I am infinitely grateful to all of you who, with your knowledge, your passion, your dedication, your patience, your trust and above all your affection, contributed to the writing of this thesis. Heartfelt thanks, Ida


TABLE OF CONTENTS


Introduction

Chapter 1

Multicollinearity and Partial least squares regression

1.1. Regression analysis

1.2. Multicollinearity

1.2.1. Theoretical and practical [...]

Methods for combating multicollinearity

NIPALS approach

SIMPLS approach

1.3.3. Comparison of SIMPLS with NIPAL-PLS

Model Calibration and Validation

Chapter 2


3.6. A robust method for PLS regression based on SSVD

3.7. Model Calibration and Validation

3.8. Simulation study

3.9.

3.10. Robust properties of robust PLS

Chapter 4

An application in environmental field.

4.1 A look at data

4.2 Multicollinearity diagnostics

4.3 Results by PLS regression

4.4 Results by RSIMPLS

4.5

4.5.1. SSVD and FAST-MCD on data

4.5.2. Multicollinearity diagnostics on the subsample without multivariate outliers

4.5.3. PLS regression on the subsample without multivariate outliers

4.5.4. Squared robust residual distance

4.5.5. PLS regression on the subsample without regression outliers

4.6. Conclusions

Conclusions and Perspectives


CHAPTER 1

MULTICOLLINEARITY AND PARTIAL LEAST SQUARES REGRESSION

1.1. Regression analysis

The term regression was introduced by Galton (1886). In a famous paper, Galton found that, although there was a tendency for tall parents to have

variables to the response variable. The model that is used most extensively is the Multiple Linear Regression Model. It ca[...]

The first task of the researcher is to estimate the vector of the unknown parameters [...]

$\hat{\boldsymbol{\beta}}_{RR} = (\mathbf{X}^T\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$
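The ridge regression estimator can be illustrated numerically. The following is a minimal sketch, assuming NumPy; the function name `ridge_estimate`, the ridge constant written as `k`, and the simulated nearly collinear data are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def ridge_estimate(X, y, k):
    """Ridge estimate (X'X + kI)^{-1} X'y for a ridge constant k >= 0."""
    p = X.shape[1]
    # Solve the regularised normal equations instead of forming an explicit inverse.
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Simulated example with two nearly collinear predictors (assumed data).
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

beta_ols = ridge_estimate(X, y, k=0.0)    # k = 0 reduces to ordinary least squares
beta_ridge = ridge_estimate(X, y, k=1.0)  # k > 0 shrinks the coefficient vector
print("OLS  :", beta_ols)
print("ridge:", beta_ridge)
```

With collinear predictors the OLS coefficients tend to be unstable, while the ridge coefficients have a smaller norm; the price is some bias, which depends on the choice of `k`.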

Table 1.1 NIPALS-PLS algorithm

1. $\mathbf{X}_0 = \mathbf{X}$; $\mathbf{Y}_0 = \mathbf{Y}$
2. For $h = 1, 2, \dots, a$:
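Since the individual steps of the table are not reproduced here, the sketch below follows the standard textbook form of NIPALS-PLS and should be read as an assumption about the usual algorithm, not as the exact steps of Table 1.1. The function name `nipals_pls`, the convergence tolerance and the simulated data are illustrative.

```python
import numpy as np

def nipals_pls(X, Y, n_components, tol=1e-10, max_iter=500):
    """Textbook NIPALS-PLS on centred X (n x p) and Y (n x q)."""
    X = np.array(X, dtype=float)   # working copies that are deflated in place
    Y = np.array(Y, dtype=float)
    n, p = X.shape
    q = Y.shape[1]
    W = np.zeros((p, n_components))   # X weights
    P = np.zeros((p, n_components))   # X loadings
    C = np.zeros((q, n_components))   # Y loadings
    T = np.zeros((n, n_components))   # X scores
    for h in range(n_components):
        u = Y[:, 0].copy()            # start from the first column of Y
        t_old = np.zeros(n)
        for _ in range(max_iter):
            w = X.T @ u
            w /= np.linalg.norm(w)    # normalised weight vector
            t = X @ w                 # X score
            c = Y.T @ t / (t @ t)     # Y loading
            u = Y @ c / (c @ c)       # Y score
            if np.linalg.norm(t - t_old) <= tol * np.linalg.norm(t):
                break
            t_old = t
        p_h = X.T @ t / (t @ t)       # X loading
        X = X - np.outer(t, p_h)      # deflate X
        Y = Y - np.outer(t, c)        # deflate Y
        W[:, h], P[:, h], C[:, h], T[:, h] = w, p_h, c, t
    R = W @ np.linalg.inv(P.T @ W)    # weights expressed in terms of the original X
    B = R @ C.T                       # regression coefficients in the original model
    return T, W, P, C, R, B

# Simulated example (assumed data): 30 observations, 4 predictors, 2 responses.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(30, 4))
Y_raw = X_raw @ rng.normal(size=(4, 2)) + 0.1 * rng.normal(size=(30, 2))
Xc = X_raw - X_raw.mean(axis=0)
Yc = Y_raw - Y_raw.mean(axis=0)
T, W, P, C, R, B = nipals_pls(Xc, Yc, n_components=4)
```

When all components are extracted and X has full column rank, the coefficients coincide with ordinary least squares; with fewer components the fit is restricted to the space spanned by the extracted scores.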

1.3.1.2. PLS regression model

We saw that the PLS algorithm is an iterative process; i.e. after the extraction of one component the algorithm starts again using the deflated matrices computed in steps 2.4 and 2.5. Thus we can achieve the sequence

$\mathbf{R} = \mathbf{W}(\mathbf{P}^T\mathbf{W})^{-1}$ (1.22)

Due to property c., the matrix $\mathbf{P}^T\mathbf{W}$ is upper triangular and thus invertible. It follows that $\mathbf{R}$ and $\mathbf{W}$ share the same column space and that $\mathbf{P}^T\mathbf{R}$ should be equal to the identity matrix. Finally, [...]

1.3.2.

If the number of y variables is smaller than the number of x variables, it will be

1.3.3. Comparison of SIMPLS with NIPAL-PLS

We can now compare the NIPALS-PLS and the SIMPLS algorithms. Both algorithms [...] estimators. At first sight, these conditions would seem to guarantee equal results obtained with the two algorithms. However, [...]

A test set should be independent from the training set, which is used to estimate the regression parameters in the model, but should still be representative of the population.
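The role of an independent test set can be sketched as follows. The root mean squared error of prediction used here is the common definition and is an assumption, since the formula is not reproduced in this passage; the helper name `rmsep`, the simulated data, the 30/10 split and the use of OLS as the fitted model are also illustrative.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction over a test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Simulated data (assumed): 40 observations, 3 predictors plus an intercept.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.2 * rng.normal(size=40)

# The first 30 observations form the training set, the last 10 the test set.
X_train, y_train = X[:30], y[:30]
X_test, y_test = X[30:], y[30:]

beta = np.linalg.lstsq(X_train, y_train, rcond=None)[0]   # fitted on training data only
print("RMSEP on the test set:", rmsep(y_test, X_test @ beta))
```

The test observations play no part in the estimation, so the reported error is not flattered by the fit to the training data.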


CHAPTER 2

OUTLIERS AND ROBUST STATISTICS


This point is called an outlier in the y-direction and it has a rather large influence on the OLS line, which is quite different from the OLS line in


Figure 2.5 [data columns: x, y, L1(y)]

The next step in this direction was the use of M-estimators (Huber, 1973). They are based on the idea of r[...]

Therefore 2.9 is really a system of p equations, the solution of which is not always easy to find. In practice, one uses iteration schemes [...]
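One widely used iteration scheme for M-estimation is iteratively reweighted least squares. The sketch below is an assumption about how such a scheme can look, not the scheme described in the thesis: it uses Huber weights with the conventional tuning constant 1.345 and a scale re-estimated at each step from the median absolute deviation. The function name `huber_irls` and the simulated data are illustrative.

```python
import numpy as np

def huber_irls(X, y, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of regression coefficients by iteratively reweighted least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # start from the OLS fit
    for _ in range(max_iter):
        r = y - X @ beta
        # Robust residual scale: normalised median absolute deviation.
        scale = 1.4826 * np.median(np.abs(r - np.median(r)))
        if scale == 0.0:
            break
        u = np.abs(r) / scale
        # Huber weights: 1 for small residuals, c/|u| for large ones.
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-300))
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.linalg.norm(beta_new - beta) <= tol * (1.0 + np.linalg.norm(beta))
        beta = beta_new
        if converged:
            break
    return beta

# Simulated line y = 1 + 2x with one outlier in the y-direction (assumed data).
rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=20)
y[10] += 50.0

beta_true = np.array([1.0, 2.0])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_huber = huber_irls(X, y)
err_ols = np.linalg.norm(beta_ols - beta_true)
err_huber = np.linalg.norm(beta_huber - beta_true)
print("OLS error:", err_ols, "Huber error:", err_huber)
```

Each pass solves a weighted least squares problem, so the system of p equations is never attacked directly. Note that Huber weights depend only on the residuals, so this sketch offers no protection against leverage points.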

Subsequently, various other estimators have been proposed [...]

too harshly nor too lightly. The ideal estimator would penalize outliers without penalizing non-outliers. If it fa

and delete one observation at a time. The size of the subset of observations used in fitting decreases as the method proceeds. Another alternative is the forward procedure (Atkinson, 1987) [...]

However, it is well known that $\bar{\mathbf{x}}$ is not robust, because even a single outlier in the sample can move $\bar{\mathbf{x}}$ arbitrarily far away. To quantify such effects, we slightly adapt the definition of breakdown point introduced in 2.4 to the [...]

Let us focus on Classical Outlier Rejection. The squared Mahalanobis distance

$MD^2(\mathbf{x}_i) = (\mathbf{x}_i - T(\mathbf{X}))\,\mathbf{C}(\mathbf{X})^{-1}\,(\mathbf{x}_i - T(\mathbf{X}))^T$
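The classical squared Mahalanobis distance can be computed as in the following minimal sketch. It assumes that the location and scatter estimates are the arithmetic mean and the sample covariance matrix; the function name `squared_mahalanobis` and the simulated data with one planted outlier are illustrative.

```python
import numpy as np

def squared_mahalanobis(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)                 # classical location estimate
    cov = np.cov(X, rowvar=False)           # classical scatter estimate
    diff = X - center
    # Solve cov * z = diff' rather than inverting cov explicitly.
    return np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)

# Simulated bivariate sample (assumed data) with one outlier planted in row 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
X[0] = [10.0, 10.0]
d2 = squared_mahalanobis(X)
print("largest distance at row", int(np.argmax(d2)))
```

Because the mean and the covariance matrix are themselves affected by outliers, these distances can mask groups of outlying points, which motivates the robust alternatives discussed in this chapter.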


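where $\mathbf{C}_J$ is nonsingular whenever $\mathbf{x}_{i_1}, \dots, \mathbf{x}_{i_{p+1}}$ are in general position. Let us compute

$m_J^2 = \operatorname{med}_{i=1,\dots,n}\,(\mathbf{x}_i - \bar{\mathbf{x}}_J)\,\mathbf{C}_J^{-1}\,(\mathbf{x}_i - \bar{\mathbf{x}}_J)^T$ (2.38)

The ellipsoid corresponding to $m_J^2\,\mathbf{C}_J$ will contain $h = [n/2] + 1$ [...]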


of observations without replacement, as we did for the estimation of MVE. For each subgroup (indexed by $J$) the arithmetic [...]

2.6. Robust Estimation of Singular Value Decomposition

A numerically stable way to perform ma


thousands), so we cannot apply MCD as well as the other affine equivariant estimators with high breakdown point. A second problem is the computation of these robust estimators in high dimensions. Indeed, the FAST-MCD algorithm, as implemented in S-PLUS, cannot handle more than 50 variables. Moreover, when the number of variables is larger than, say, 10, the computation of the MCD estimator becomes less precise.

We propose the Single-case Singular Value Decomposition (SSVD) method, which combines “leave-one-out” methods and Singular Value Decomposition (SVD).

The SVD is motivated by the following geometric fact: the image of the unit sphere under any $n \times p$ matrix is a hyperellipse. We may define a hyperellipse in $\mathbb{R}^n$ as the surface obtained by stretching the unit sphere [...]

correspond to the space on the right and the right singular vectors correspond to the space on the left.

The preceding geometric observations allow us to understand the statistical meaning of singular values and vectors. As the singular values of $\mathbf{X}$ are the lengths of the $p$ principal semiaxes of the hyperellipsoid $\mathbf{X}S$, from a statistical point of view [...]
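The geometric interpretation of the singular values can be checked numerically. The following minimal sketch, assuming NumPy and an arbitrary 3 x 2 example matrix, maps points of the unit circle through the matrix and compares the longest and shortest image vectors with the singular values.

```python
import numpy as np

# An arbitrary n x p example matrix with n = 3 and p = 2 (assumed values).
A = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Points on the unit circle in R^p and the lengths of their images in R^n.
theta = np.linspace(0.0, 2.0 * np.pi, 2001)
circle = np.vstack([np.cos(theta), np.sin(theta)])
lengths = np.linalg.norm(A @ circle, axis=0)

print("singular values        :", s)
print("longest / shortest axis:", lengths.max(), lengths.min())
```

The longest and shortest image vectors approach the largest and smallest singular values, and each right singular vector is mapped onto the corresponding left singular vector scaled by its singular value.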


of the study suggest limits on the breakdown point of the procedure. A simulation study should be performed as follows:

1.


where the number of observations can be very large and the number of variables relatively small.

Table 2.2

                                            SSVD             FAST-MCD
n     p     # N_p(0, I_p)   # N_p(µ, I_p)   Robust   Time    Robust   Time
10    2     6               4               yes      <1      yes      <1
10    [...]

2.10. The deletion of Multivariate Outliers and

It is very im

from the linear pattern set by the majority of the data. Therefore, a combination of high-breakdown [...]

CHAPTER 3

ROBUST METHODS FOR PARTIAL LEAST SQUARES REGRESSION

3.1. Outliers in PLS Regression

cludes methods which use a robust cross-covariance matrix and a robust regression method. In th

Form the loading vector $\mathbf{c}$ for the $\mathbf{Y}$-block and normalize it

$\mathbf{c} = (c_1, c_2, \dots, c_k, \dots, c_q)$ (3.11)

$\mathbf{c} = \mathbf{c} / \lVert\mathbf{c}\rVert$ (3.12)

Calculate the score vector $\mathbf{u}$ [...]

is maximised. The use of $\mathrm{mad}_n^2(\cdot)$ (see footnote 5) instead of the common $\mathrm{var}(\cdot)$ modifies the covariance estimates. It protects the derived components from abnormal observations. A full description of the PLAD algorithm is given below.

1. Center or standardise both $\mathbf{X}$ and $\mathbf{y}$
2. For $h = 1, 2, \dots, a$ [...]

$\mathbf{W}_h^M = (\mathbf{w}_1^M, \dots, \mathbf{w}_h^M)$

3.1.3. Partial Reweighted Least Squares regression

Partial Reweighted Least Squares (PRLS) regression uses a weight vector in order to classify, at each iteration, the observations into outliers and non-outliers. The latter is obtained while regressing the predictors and the h[...]

5%. Alternative rules could involve the relative change in residuals or weights. They found that convergence occurs very quickly. In step 3 of IRPLS the authors suggest passing the predicted residuals rather than the ordinary residuals into the weight function. This has intuitive appeal: OLS maximizes the fit of the model as measured by the $R^2$; PLS has the additional step of using the cross-validated $R^2$ to choose the optimal number of components, thereby optimizing its predictive ability. It would seem appropriate to use the [...]

Let us assume that the data $\mathbf{Z} = [\mathbf{X}, \mathbf{y}]$ come from a joint distribution with mean equal to zero and population covariance matrix $\boldsymbol{\Sigma}$ consisting of the elements

[...] (3.19)

necessary. If we replace these population values by the sample values, they can be influenced by the presence of outliers. A global robust version can come from using again the robust covariance matrix

[...] (3.28)

depends on the estimation of the mean and the variance–covariance matrix $\boldsymbol{\Sigma}_Z$ of the data $\mathbf{Z}_{n,m} = [\mathbf{X}_{n,p}, \mathbf{Y}_{n,q}]$

Applying the MCD or the RMCD method once to the $\mathbf{z}_i = (\mathbf{x}_i, \mathbf{y}_i)$ yields an estimate of the variance–covariance matrix $\boldsymbol{\Sigma}_Z$ in 3.30. From this we deduce [...], which are robust estimates of [...].

Robust regression estimates are obtained by replacing the classical mean and covariance matrix of ( )


3.5. A robust method for PLS regression based on SSVD

In this paragraph we suggest a robust method for PLS regression. We proceed in the following way.

Stage 1. We apply SSVD on $\mathbf{Z}_{n,m} = [\mathbf{X}_{n,p}, \mathbf{Y}_{n,q}]$. This yields a “clean sample” (a sample without multivariate outliers).

Stage 2. We apply the PLS method (NIPALS or SIMPLS approach) on the “clean sample” to obtain robust scores $\mathbf{t}_h$.

Stage 3. Once the scores are derived, a linear regression is performed, obtaining a robust estimate of its coefficients. The regression model is the same as in 1.23, but now based on the robust scores. Finally, a robust estimate of the PLSR coefficients is obtained by re-expressing these estimates in terms of the parameters of the original model.

We obtain a robust estimate of [...]

Stage 5. We calculate the robust scores and the final regression estimates by applying PLS regression on those observations with weight $c_{ih}$ equal to 1. This stage has the advantage that it might again include observations which are “good” leverage points.

We call this method SPLS.

servations. To avoid such small calibration sets, the authors alternatively define

Before defining the criteria of comparison, let us define what the true values of PLS are. There are two possibilities: the first consists of choosing the value resulting from applying PLS to the non-contaminated data;

To compare the vector $\boldsymbol{\beta}$, we can follow the method and notation used with the vectors $\mathbf{w}_h$ to define the discrepancy measures

To measure the influence that one observation $\mathbf{z} = [\mathbf{x}, y]$ exerts on an estimator, we calculate its influence function. Moreover, a robust method should also be resistant to groups of outliers, which is usually measured by means of the breakdown value. No theoretical results have yet been proven about the robust PLS method described above. Therefore Vanden Branden and Hubert suggest studying the empirical influence function and the empirical breakdown value on a specific data set. The empirical influence function is defined as [...]

CHAPTER 4

A ROBUST MODEL FOR THE EVALUATION OF VEHICLE EMISSIONS.

4.1. A look at data

Several epidemiological studies demonstrated short-[...] between high levels of pollution and [...]ity. Vehicle emissions are an import[...]tion, so it is necessary to estimat[...] of vehicles in different situations (tra[...] environmental pollution.

The analysis is based on a research [...] Research Council (CNR), Isti[...] between the pollutants produced by auto[...] parameters, considering different traffic [...]les).

Before the IM research the evaluation [...] use was generally obtained by utilizi[...] vehicle performance by either road/traffi[...] NOX, CO2) was considered as an individual response.

However, the IM showed that the way vehicle performance affects exhaust pollutant emissions is complex and requires a larger number of parameters than the simple mean speed to be effectively described. Moreover, the different species of emissions are related to each other [...]

als are large or small. For this reason it is interesting to look at the standardized residual plot.

Figure 4.7 [Y against Predicted for ln CO (g-test), Comp 3 (Cum); RMSEE = 1,41435; point labels: Urban 1 to Urban 5, Rural 1 to Rural 5, Mtw1, Mtw2; Simca-P 8.0 by Umetrics AB, 2005-11-21 23:48]

4.4.2. Multicollinearity diagnostics on the subsample without multivariate outliers

Multicollinearity diagnostics are not immune to the presence of contamination. Therefore, the identification of linear dependencies in a factor space, combined with the detection of outliers, is an important problem of regression analysis. For this reason we calculate again the multicollinearity diagnostics [...]
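Two common multicollinearity diagnostics can be computed as in the sketch below. The choice of diagnostics is an assumption, since the ones used in the thesis are not legible in this passage: the variance inflation factors are taken as the diagonal of the inverse correlation matrix, and the condition number as the square root of the ratio between the largest and smallest eigenvalues of that matrix. The function name `collinearity_diagnostics` and the simulated data are illustrative.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Variance inflation factors and condition number of the predictor correlation matrix."""
    X = np.asarray(X, dtype=float)
    R = np.corrcoef(X, rowvar=False)           # correlation matrix of the predictors
    vif = np.diag(np.linalg.inv(R))            # VIF_j is the j-th diagonal element of R^{-1}
    eigvals = np.linalg.eigvalsh(R)            # eigenvalues in ascending order
    cond_number = float(np.sqrt(eigvals[-1] / eigvals[0]))
    return vif, cond_number

# Simulated predictors (assumed data): the second is nearly a copy of the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)
x3 = rng.normal(size=100)
vif, cond_number = collinearity_diagnostics(np.column_stack([x1, x2, x3]))
print("VIF:", vif, "condition number:", cond_number)
```

Both quantities are built from the classical correlation matrix, so a few outlying observations can inflate or hide a linear dependency; recomputing them on the subsample without multivariate outliers shows how much the diagnosis depends on those points.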


4.4.3. PLS regression on the subsample without multivariate outliers

Let us apply [...] on data in which the outliers have been [...]. The model fits [...] response very [...]. Instead, the [...] explained by the first two components is different for the two [...]

Figure 4.20

Figure 4.21

APPENDIX


Robust methods for Partial Least Squares Regression in environmental field


de Jong S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18: 251-263.

[...]le, Dunod, Paris.

[...]ratory Systems, 2: 283-290.

[...]ood asthma in Mexico City.
