15 Model Selection
15.1 KLIC
Suppose a random sample $y = (y_1, \ldots, y_n)$ has unknown density $f(y) = \prod_{i=1}^n f(y_i)$. A model density is $g(y) = \prod_{i=1}^n g(y_i)$. How can we assess the "fit" of $g$ as an approximation to $f$?

One useful measure is the Kullback-Leibler information criterion (KLIC)
$$
\mathrm{KLIC}(f, g) = \int f(y) \log\left(\frac{f(y)}{g(y)}\right) dy
$$
You can decompose the KLIC as
$$
\mathrm{KLIC}(f, g) = \int f(y)\log f(y)\,dy - \int f(y)\log g(y)\,dy = C_f - E\log g(y)
$$
The constant $C_f = \int f(y)\log f(y)\,dy$ is independent of the model $g$.

Notice that $\mathrm{KLIC}(f, g) \geq 0$, and $\mathrm{KLIC}(f, g) = 0$ iff $g = f$. Thus a "good" approximating model $g$ is one with a low KLIC.
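As a quick numerical illustration (not from the notes), the KLIC between two normal densities has a closed form, and a direct quadrature of the defining integral reproduces it; `kl_gauss` and `kl_numeric` are illustrative helper names:

```python
import math

def kl_gauss(m1, s1, m2, s2):
    # Closed-form KLIC between true f = N(m1, s1^2) and model g = N(m2, s2^2)
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def kl_numeric(m1, s1, m2, s2, lo=-20.0, hi=20.0, n=40000):
    # Midpoint-rule approximation of  integral f(y) log(f(y)/g(y)) dy
    def pdf(y, m, s):
        return math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        f, g = pdf(y, m1, s1), pdf(y, m2, s2)
        total += f * math.log(f / g) * h
    return total

print(kl_gauss(0, 1, 1, 2))    # closed form, positive since g != f
print(kl_numeric(0, 1, 1, 2))  # quadrature agrees closely
print(kl_gauss(0, 1, 0, 1))    # zero iff g = f
```

The check that the KLIC is zero exactly when $g = f$, and positive otherwise, is the property that makes it a sensible fit measure.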
15.2 Estimation
Let the model density $g(y, \theta)$ depend on a parameter vector $\theta$. The negative log-likelihood function is
$$
L(\theta) = -\sum_{i=1}^n \log g(y_i, \theta) = -\log g(y, \theta)
$$
and the MLE is $\hat\theta = \mathrm{argmin}_\theta\, L(\theta)$. Sometimes this is called a "quasi-MLE" when $g(y, \theta)$ is acknowledged to be an approximation, rather than the truth.

Let the minimizer of $-E\log g(y, \theta)$ be written $\theta_0$ and called the pseudo-true value. This value also minimizes $\mathrm{KLIC}(f, g(\cdot, \theta))$. As the negative log-likelihood divided by $n$ is an estimator of $-E\log g(y, \theta)$, the MLE $\hat\theta$ converges in probability to $\theta_0$. That is,
$$
\hat\theta \to_p \theta_0 = \mathrm{argmin}_\theta\, \mathrm{KLIC}(f, g(\cdot, \theta))
$$
Thus the QMLE estimates the best-fitting density, where "best" is measured in terms of the KLIC.

From conventional asymptotic theory, we know
$$
\sqrt{n}\left(\hat\theta_{QMLE} - \theta_0\right) \to_d N(0, V)
$$
$$
V = Q^{-1}\Omega Q^{-1}
$$
$$
Q = -E\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y, \theta_0)
$$
$$
\Omega = E\left[\frac{\partial}{\partial\theta}\log g(y, \theta_0)\,\frac{\partial}{\partial\theta}\log g(y, \theta_0)'\right]
$$
If the model is correctly specified ($g(y, \theta_0) = f(y)$), then $Q = \Omega$ (the information matrix equality). Otherwise $Q \neq \Omega$.
15.3 Expected KLIC
The MLE $\hat\theta = \hat\theta(y)$ is a function of the data vector $y$. The fitted model at any $\tilde y$ is $\hat g(\tilde y) = g(\tilde y, \hat\theta(y))$. The fitted likelihood is $L(\hat\theta) = -\log g(y, \hat\theta(y))$ (the model evaluated at the observed data). The KLIC of the fitted model is
$$
\mathrm{KLIC}(f, \hat g) = C_f - \int f(\tilde y)\log g(\tilde y, \hat\theta(y))\,d\tilde y = C_f - E_{\tilde y}\log g(\tilde y, \hat\theta(y))
$$
where $\tilde y$ has density $f$, independent of $y$.

The expected KLIC is the expectation over the observed values $y$:
$$
E\left(\mathrm{KLIC}(f, \hat g)\right) = C_f - E_y E_{\tilde y}\log g(\tilde y, \hat\theta(y)) = C_f - E_{\tilde y}E_y\log g(y, \hat\theta(\tilde y))
$$
the second equality by symmetry. In this expression, $\tilde y$ and $y$ are independent vectors, each with density $f$. Letting $\tilde\theta = \hat\theta(\tilde y)$, the estimator of $\theta$ when the data is $\tilde y$, we can write this more compactly as
$$
E\left(\mathrm{KLIC}(f, \hat g)\right) = C_f - E\log g(y, \tilde\theta)
$$
where $y$ and $\tilde\theta$ are independent.

An alternative interpretation is in terms of predicted likelihood. The expected KLIC is the expected likelihood when the sample $\tilde y$ is used to construct the estimate $\tilde\theta$, and an independent sample $y$ is used for evaluation. In linear regression, the quasi-likelihood is Gaussian, and the expected KLIC is the expected squared prediction error.
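The gap between in-sample and out-of-sample likelihood can be seen in a small Monte Carlo sketch (illustrative, not from the notes): fit a Gaussian mean with known variance ($k = 1$) on one sample and evaluate the negative log-likelihood on an independent sample. The average gap is approximately $k$, the optimism that motivates the penalties below.

```python
import math, random

random.seed(2)
n, reps = 50, 4000
gap = 0.0
for _ in range(reps):
    ytil = [random.gauss(0, 1) for _ in range(n)]   # "training" sample
    y    = [random.gauss(0, 1) for _ in range(n)]   # independent sample
    mu = sum(ytil) / n                              # MLE from ytil (k = 1)
    # negative log-likelihood (variance known): in-sample vs out-of-sample
    nll_in  = sum(0.5 * math.log(2 * math.pi) + 0.5 * (v - mu) ** 2 for v in ytil)
    nll_out = sum(0.5 * math.log(2 * math.pi) + 0.5 * (v - mu) ** 2 for v in y)
    gap += (nll_out - nll_in) / reps

print(gap)  # approximately 1 = k, the number of estimated parameters
```

Exact calculation gives $E(\mathrm{nll}_{out} - \mathrm{nll}_{in}) = \frac{n+1}{2} - \frac{n-1}{2} = 1 = k$, matching the simulation.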
15.4 Estimating KLIC
We want an estimate of the expected KLIC. As $C_f$ is constant across models, it is ignored. We want to estimate
$$
T = -E\log g(y, \tilde\theta)
$$
Make a second-order Taylor expansion of $-\log g(y, \tilde\theta)$ about $\hat\theta$:
$$
-\log g(y, \tilde\theta) \simeq -\log g(y, \hat\theta) - \frac{\partial}{\partial\theta}\log g(y, \hat\theta)'\left(\tilde\theta - \hat\theta\right) - \frac{1}{2}\left(\tilde\theta - \hat\theta\right)'\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y, \hat\theta)\left(\tilde\theta - \hat\theta\right)
$$
The first term on the RHS is $L(\hat\theta)$; the second is linear in the FOC and so vanishes; only the third term remains. Writing
$$
\hat Q = -\frac{1}{n}\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y, \hat\theta), \qquad
\tilde\theta - \hat\theta = \left(\tilde\theta - \theta_0\right) - \left(\hat\theta - \theta_0\right)
$$
and expanding the quadratic, we find
$$
-\log g(y, \tilde\theta) \simeq L(\hat\theta) + \frac{n}{2}\left(\tilde\theta - \theta_0\right)'\hat Q\left(\tilde\theta - \theta_0\right) + \frac{n}{2}\left(\hat\theta - \theta_0\right)'\hat Q\left(\hat\theta - \theta_0\right) - n\left(\tilde\theta - \theta_0\right)'\hat Q\left(\hat\theta - \theta_0\right).
$$
Now
$$
\sqrt{n}\left(\hat\theta - \theta_0\right) \to_d Z_1 \sim N(0, V), \qquad
\sqrt{n}\left(\tilde\theta - \theta_0\right) \to_d Z_2 \sim N(0, V),
$$
which are independent, and $\hat Q \to_p Q$. Thus for large $n$,
$$
-\log g(y, \tilde\theta) \simeq L(\hat\theta) + \frac{1}{2}Z_2'QZ_2 + \frac{1}{2}Z_1'QZ_1 - Z_2'QZ_1.
$$
Taking expectations (the cross term has zero mean since $Z_1$ and $Z_2$ are independent),
$$
T = -E\log g(y, \tilde\theta) \simeq EL(\hat\theta) + E\left(\frac{1}{2}Z_2'QZ_2 + \frac{1}{2}Z_1'QZ_1 - Z_2'QZ_1\right) = EL(\hat\theta) + \mathrm{tr}(QV) = EL(\hat\theta) + \mathrm{tr}\left(Q^{-1}\Omega\right)
$$
An (asymptotically) unbiased estimate of $T$ is then
$$
\hat T = L(\hat\theta) + \widehat{\mathrm{tr}\left(Q^{-1}\Omega\right)}
$$
where $\widehat{\mathrm{tr}\left(Q^{-1}\Omega\right)}$ is an estimate of $\mathrm{tr}\left(Q^{-1}\Omega\right)$.
15.5 AIC
When $g(y, \theta_0) = f(y)$ (the model is correctly specified) then $Q = \Omega$ (the information matrix equality). Hence
$$
\mathrm{tr}\left(Q^{-1}\Omega\right) = k = \dim(\theta)
$$
so
$$
\hat T = L(\hat\theta) + k
$$
This is the Akaike Information Criterion (AIC). It is typically written as $2\hat T$, e.g.
$$
\mathrm{AIC} = 2L(\hat\theta) + 2k
$$
AIC is an estimate of the expected KLIC, based on the approximation that $g$ includes the correct model.

Picking the model with the smallest AIC is picking the model with the smallest estimated KLIC. In this sense it is picking the best-fitting model.
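A minimal sketch of AIC model comparison (illustrative, with assumed data and models): two Gaussian quasi-likelihoods for the same data, one with the variance free ($k = 2$) and one with the variance fixed at 1 ($k = 1$).

```python
import math, random

random.seed(0)
n = 200
y = [random.gauss(2.0, 1.0) for _ in range(n)]  # simulated data from N(2, 1)

ybar = sum(y) / n
s2 = sum((v - ybar) ** 2 for v in y) / n        # MLE of sigma^2

def neg_loglik(mu, sig2):
    # L(theta) = -sum_i log g(y_i; theta) for a Gaussian model density
    return sum(0.5 * math.log(2 * math.pi * sig2)
               + (v - mu) ** 2 / (2 * sig2) for v in y)

# Model A: mean and variance free (k = 2); Model B: variance fixed at 1 (k = 1)
aic_a = 2 * neg_loglik(ybar, s2) + 2 * 2
aic_b = 2 * neg_loglik(ybar, 1.0) + 2 * 1

print(aic_a, aic_b)  # smaller AIC = smaller estimated expected KLIC
```

Model A always attains a weakly smaller fitted negative log-likelihood (it nests Model B), so any preference for Model B comes entirely from the penalty term.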
15.6 TIC
Takeuchi (1976) proposed a robust version of the AIC, known as the Takeuchi Information Criterion (TIC):
$$
\mathrm{TIC} = 2L(\hat\theta) + 2\,\mathrm{tr}\left(\hat Q^{-1}\hat\Omega\right)
$$
where
$$
\hat Q = -\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y_i, \hat\theta)
$$
$$
\hat\Omega = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log g(y_i, \hat\theta)\,\frac{\partial}{\partial\theta}\log g(y_i, \hat\theta)'
$$
The TIC does not require that $g$ is correctly specified.
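A sketch of how the TIC penalty $\mathrm{tr}(\hat Q^{-1}\hat\Omega)$ departs from the AIC penalty $k$ when the model is wrong. For a Gaussian location-scale model ($k = 2$), the penalty has the closed form $1 + (\hat m_4/\hat\sigma^4 - 1)/2$, which equals 2 exactly when the sample kurtosis is 3; the heavy-tailed data here are an assumed example.

```python
import math, random

random.seed(1)
n = 20000
# Heavy-tailed data: a scale mixture of normals (kurtosis > 3), so the
# Gaussian model g(y; mu, sigma^2) is only a quasi-likelihood approximation
y = [random.gauss(0, 1) * (2.0 if random.random() < 0.1 else 1.0)
     for _ in range(n)]

mu = sum(y) / n
s2 = sum((v - mu) ** 2 for v in y) / n
m4 = sum((v - mu) ** 4 for v in y) / n

# For the Gaussian location-scale model, Q-hat = diag(1/s2, 1/(2*s2^2)) and
# tr(Q^{-1} Omega) reduces to 1 + (m4/s2^2 - 1)/2  (= k = 2 iff kurtosis = 3)
tic_penalty = 1.0 + (m4 / s2**2 - 1.0) / 2.0
aic_penalty = 2.0  # = k = dim(theta)

print(aic_penalty, tic_penalty)  # TIC penalty exceeds 2 under heavy tails
```

The larger TIC penalty reflects that, under misspecification, each estimated parameter inflates the expected KLIC by more than the information-matrix-equality accounting assumes.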
15.7 Comments on AIC and TIC
The AIC and TIC are designed for the likelihood (or quasi-likelihood) context. For proper application, the "model" needs to be a conditional density, not just a conditional mean or a set of moment conditions. This is both a strength and a limitation.

The benefit of AIC/TIC is that they select fitted models whose densities are close to the true density. This is a broad and useful feature.

The relation of the TIC to the AIC is very similar to the relationship between the conventional and "White" covariance matrix estimators for the MLE/QMLE or LS. The TIC does not appear to be widely appreciated or used.
The AIC is known to be asymptotically optimal in linear regression (we discuss this below), but in the general context I do not know of an optimality result. The desired optimality would be that if a model is selected by minimizing AIC (or TIC), then the fitted KLIC of this model is asymptotically equivalent to the KLIC of the infeasible best-fitting model.
15.8 AIC and TIC in Linear Regression
In linear regression or projection
$$
y_i = X_i'\beta + e_i, \qquad E(X_i e_i) = 0
$$
AIC or TIC cannot be directly applied, as the density of $e_i$ is unspecified. However, the LS estimator is the same as the Gaussian MLE, so it is natural to calculate the AIC or TIC for the Gaussian quasi-MLE.

The Gaussian quasi-likelihood is
$$
\log g_i(\theta) = -\frac{1}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\left(y_i - X_i'\beta\right)^2
$$
where $\theta = (\beta, \sigma^2)$ and $\sigma^2 = Ee_i^2$. The MLE $\hat\theta = (\hat\beta, \hat\sigma^2)$ is LS. The pseudo-true value $\beta_0$ is the projection coefficient $\beta = E\left(X_iX_i'\right)^{-1}E\left(X_iy_i\right)$. If $\beta$ is $k \times 1$ then the number of parameters is $k + 1$.

The sample log-likelihood is
$$
2L(\hat\theta) = n\log\left(\hat\sigma^2\right) + n\log(2\pi) + n
$$
The second and third terms can be ignored. The AIC is
$$
\mathrm{AIC} = n\log\left(\hat\sigma^2\right) + 2(k + 1).
$$
Often this is written
$$
\mathrm{AIC} = n\log\left(\hat\sigma^2\right) + 2k
$$
as adding or subtracting constants does not matter for model selection, or sometimes
$$
\mathrm{AIC} = \log\left(\hat\sigma^2\right) + 2\frac{k}{n}
$$
as scaling doesn't matter.
Also
$$
\frac{\partial}{\partial\beta}\log g(y_i, \theta) = \frac{1}{\sigma^2}X_i\left(y_i - X_i'\beta\right)
$$
$$
\frac{\partial}{\partial\sigma^2}\log g(y_i, \theta) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}\left(y_i - X_i'\beta\right)^2,
$$
and
$$
-\frac{\partial^2}{\partial\beta\,\partial\beta'}\log g(y_i, \theta) = \frac{1}{\sigma^2}X_iX_i'
$$
$$
-\frac{\partial^2}{\partial\beta\,\partial\sigma^2}\log g(y_i, \theta) = \frac{1}{\sigma^4}X_i\left(y_i - X_i'\beta\right)
$$
$$
-\frac{\partial^2}{\partial(\sigma^2)^2}\log g(y_i, \theta) = -\frac{1}{2\sigma^4} + \frac{1}{\sigma^6}\left(y_i - X_i'\beta\right)^2
$$
Evaluated at the pseudo-true values,
$$
\frac{\partial}{\partial\beta}\log g(y_i, \theta_0) = \frac{1}{\sigma^2}X_ie_i
$$
$$
\frac{\partial}{\partial\sigma^2}\log g(y_i, \theta_0) = \frac{1}{2\sigma^4}\left(e_i^2 - \sigma^2\right),
$$
and
$$
-\frac{\partial^2}{\partial\beta\,\partial\beta'}\log g(y_i, \theta_0) = \frac{1}{\sigma^2}X_iX_i'
$$
$$
-\frac{\partial^2}{\partial\beta\,\partial\sigma^2}\log g(y_i, \theta_0) = \frac{1}{\sigma^4}X_ie_i
$$
$$
-\frac{\partial^2}{\partial(\sigma^2)^2}\log g(y_i, \theta_0) = \frac{1}{2\sigma^6}\left(2e_i^2 - \sigma^2\right)
$$
Thus
$$
Q = -E\begin{bmatrix}
\dfrac{\partial^2}{\partial\beta\,\partial\beta'}\log g(y_i, \theta_0) & \dfrac{\partial^2}{\partial\beta\,\partial\sigma^2}\log g(y_i, \theta_0)\\[2ex]
\dfrac{\partial^2}{\partial\sigma^2\,\partial\beta'}\log g(y_i, \theta_0) & \dfrac{\partial^2}{\partial(\sigma^2)^2}\log g(y_i, \theta_0)
\end{bmatrix}
= \sigma^{-2}\begin{bmatrix}
E\left(X_iX_i'\right) & 0\\[1ex]
0 & \dfrac{1}{2\sigma^2}
\end{bmatrix}
$$
and
$$
\Omega = E\begin{bmatrix}
\dfrac{\partial\log g_i}{\partial\beta}\dfrac{\partial\log g_i}{\partial\beta}' & \dfrac{\partial\log g_i}{\partial\beta}\dfrac{\partial\log g_i}{\partial\sigma^2}\\[2ex]
\dfrac{\partial\log g_i}{\partial\sigma^2}\dfrac{\partial\log g_i}{\partial\beta}' & \left(\dfrac{\partial\log g_i}{\partial\sigma^2}\right)^2
\end{bmatrix}
= \sigma^{-2}\begin{bmatrix}
E\left(X_iX_i'\dfrac{e_i^2}{\sigma^2}\right) & \dfrac{1}{2\sigma^4}E\left(X_ie_i^3\right)\\[2ex]
\dfrac{1}{2\sigma^4}E\left(X_i'e_i^3\right) & \dfrac{\kappa_4}{4\sigma^2}
\end{bmatrix}
$$
where
$$
\kappa_4 = \mathrm{var}\left(\frac{e_i^2}{\sigma^2}\right) = \frac{E\left(e_i^2 - \sigma^2\right)^2}{\sigma^4} = \frac{E\left(e_i^4\right) - \sigma^4}{\sigma^4}
$$
We see that $\Omega = Q$ if
$$
E\left(\frac{e_i^2}{\sigma^2} \mid X_i\right) = 1, \qquad
E\left(X_ie_i^3\right) = 0, \qquad
\kappa_4 = 2
$$
Essentially, this requires that $e_i \sim N(0, \sigma^2)$. Otherwise $\Omega \neq Q$. Thus the AIC is appropriate in Gaussian regression. It is an "approximation" in non-Gaussian regression, heteroskedastic regression, or projection.

To calculate the TIC, note that since $Q$ is block diagonal, you do not need to estimate the off-diagonal component of $\Omega$. Note that
$$
\mathrm{tr}\left(Q^{-1}\Omega\right) = \mathrm{tr}\left(E\left(X_iX_i'\right)^{-1}E\left(X_iX_i'\frac{e_i^2}{\sigma^2}\right)\right) + \left(\frac{1}{2\sigma^2}\right)^{-1}\frac{\kappa_4}{4\sigma^2}
= \mathrm{tr}\left(E\left(X_iX_i'\right)^{-1}E\left(X_iX_i'\frac{e_i^2}{\sigma^2}\right)\right) + \frac{\kappa_4}{2}
$$
Let
$$
\hat\kappa_4 = \frac{1}{n\hat\sigma^4}\sum_{i=1}^n\left(\hat e_i^2 - \hat\sigma^2\right)^2
$$
The TIC is then
$$
\mathrm{TIC} = n\log\left(\hat\sigma^2\right) + 2\,\mathrm{tr}\left(\hat Q^{-1}\hat\Omega\right)
= n\log\left(\hat\sigma^2\right) + 2\left[\mathrm{tr}\left(\left(\sum_{i=1}^n X_iX_i'\right)^{-1}\left(\sum_{i=1}^n X_iX_i'\frac{\hat e_i^2}{\hat\sigma^2}\right)\right) + \frac{\hat\kappa_4}{2}\right]
= n\log\left(\hat\sigma^2\right) + \frac{2}{\hat\sigma^2}\sum_{i=1}^n h_i\hat e_i^2 + \hat\kappa_4
$$
where $h_i = X_i'\left(X'X\right)^{-1}X_i$.

When the errors are close to homoskedastic and Gaussian, then $h_i$ and $\hat e_i^2$ will be approximately uncorrelated and $\hat\kappa_4$ will be close to 2, so the penalty will be close to
$$
2\sum_{i=1}^n h_i + 2 = 2(k + 1)
$$
as for AIC. In this case TIC will be close to AIC. In applications, the differences arise under heteroskedasticity and non-Gaussianity.

The primary use of AIC and TIC is to compare models. As we change models, the residuals $\hat e_i$ typically do not change too much, so my guess is that the estimate $\hat\kappa_4$ will not change much. In this event, the TIC correction for estimation of $\sigma^2$ will not matter much.
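The regression formulas above can be sketched numerically (an illustrative simulation, not from the notes): under heteroskedasticity tied to the regressors, the TIC penalty $\frac{2}{\hat\sigma^2}\sum_i h_i\hat e_i^2 + \hat\kappa_4$ exceeds the AIC penalty $2(k+1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 400, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta = np.array([1.0, 2.0, -1.0])
# Heteroskedastic errors: conditional variance depends on a regressor, so the
# Gaussian quasi-likelihood is misspecified in exactly the TIC-relevant way
e = rng.standard_normal(n) * np.sqrt(0.5 + X[:, 1] ** 2)
y = X @ beta + e

XtX_inv = np.linalg.inv(X.T @ X)
bhat = XtX_inv @ (X.T @ y)                 # least squares = Gaussian QMLE
ehat = y - X @ bhat
s2 = ehat @ ehat / n                       # MLE of sigma^2
h = np.sum((X @ XtX_inv) * X, axis=1)      # leverage values h_i
kappa4 = np.mean((ehat ** 2 - s2) ** 2) / s2 ** 2

aic = n * np.log(s2) + 2 * (k + 1)
tic = n * np.log(s2) + (2 / s2) * np.sum(h * ehat ** 2) + kappa4
print(aic, tic)   # TIC penalty exceeds 2(k+1) here
```

With homoskedastic Gaussian errors the two criteria would nearly coincide, since $\sum_i h_i = k$ and $\hat\kappa_4 \approx 2$.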
15.9 Asymptotic Equivalence
Let $\tilde\sigma^2$ be a preliminary (model-free) estimate of $\sigma^2$. The AIC is equivalent to
$$
\tilde\sigma^2\left(\mathrm{AIC} - n\log\tilde\sigma^2 + n\right)
= n\tilde\sigma^2\left(\log\left(\frac{\hat\sigma^2}{\tilde\sigma^2}\right) + 1\right) + 2\tilde\sigma^2 k
\simeq n\tilde\sigma^2\left(\frac{\hat\sigma^2}{\tilde\sigma^2}\right) + 2\tilde\sigma^2 k
= \hat e'\hat e + 2\tilde\sigma^2 k = C_k
$$
The approximation is $\log(1 + a) \simeq a$ for small $a$. This is the Mallows criterion. Thus AIC is approximately equal to Mallows, and the approximation is close when $k/n$ is small.

Furthermore, this expression approximately equals
$$
\hat e'\hat e\left(1 + \frac{2k}{n}\right) = S_k
$$
which is known as Shibata's criterion (Annals of Statistics, 1980; Biometrika, 1981).

The TIC (ignoring the correction for estimation of $\sigma^2$) is equivalent to
$$
\tilde\sigma^2\left(\mathrm{TIC} - n\log\tilde\sigma^2 + n\right)
= n\tilde\sigma^2\left(\log\left(\frac{\hat\sigma^2}{\tilde\sigma^2}\right) + 1\right) + \frac{2\tilde\sigma^2}{\hat\sigma^2}\sum_{i=1}^n h_i\hat e_i^2
\simeq \hat e'\hat e + 2\sum_{i=1}^n h_i\hat e_i^2
\simeq \sum_{i=1}^n \frac{\hat e_i^2}{(1 - h_i)^2} = CV,
$$
the cross-validation criterion. Thus $\mathrm{TIC} \simeq CV$. They are both asymptotically equivalent to a "heteroskedasticity-robust Mallows criterion"
$$
C_k^* = \hat e'\hat e + 2\sum_{i=1}^n h_i\hat e_i^2
$$
which, strangely enough, I have not seen in the literature.
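The near-equivalence of CV and the robust Mallows expression can be checked directly (an illustrative simulation with assumed data-generating choices): for low-order polynomial fits, $\sum_i \hat e_i^2/(1-h_i)^2$ and $\hat e'\hat e + 2\sum_i h_i\hat e_i^2$ agree to well within a percent.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.uniform(-1, 1, n)
y = np.sin(2 * x) + 0.3 * rng.standard_normal(n)

def fit(order):
    # Polynomial regression of the given order: residuals and leverage values
    X = np.vander(x, order + 1, increasing=True)
    P = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - P @ y
    return e, np.diag(P)

results = []
for order in (1, 2, 3):
    e, h = fit(order)
    cv = np.sum(e ** 2 / (1 - h) ** 2)        # cross-validation criterion
    cstar = e @ e + 2 * np.sum(h * e ** 2)    # robust Mallows C*_k
    results.append((cv, cstar))
    print(order, round(cv, 3), round(cstar, 3))
```

The agreement follows from $(1-h_i)^{-2} \approx 1 + 2h_i$ when the leverage values $h_i$ are small, which is exactly the $k/n \to 0$ regime of the approximations above.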
15.10 Mallows Criterion
Ker-Chau Li (1987, Annals of Statistics) provided an important treatment of the optimality of model selection methods for homoskedastic linear regression. Andrews (1991, JoE) extended his results to allow conditional heteroskedasticity.

Take the regression model
$$
y_i = g(X_i) + e_i = g_i + e_i
$$
$$
E(e_i \mid X_i) = 0, \qquad E\left(e_i^2 \mid X_i\right) = \sigma^2
$$
Written as an $n \times 1$ vector, $y = g + e$. Li assumed that the $X_i$ are non-random, but his analysis can be re-interpreted by treating everything as conditional on $X_i$.

Li considered estimators of the $n \times 1$ vector $g$ which are linear in $y$ and thus take the form
$$
\hat g(h) = M(h)y
$$
where $M(h)$ is $n \times n$, a function of the $X$ matrix, indexed by $h \in H$, and $H$ is a discrete set. For example, a series estimator sets $M(h) = X_h\left(X_h'X_h\right)^{-1}X_h'$ where $X_h$ is an $n \times k_h$ set of basis functions of the regressors, and $H = \{1, \ldots, \bar h\}$. The goal is to pick $h$ to minimize the average squared error
$$
L(h) = \frac{1}{n}\left(g - \hat g(h)\right)'\left(g - \hat g(h)\right).
$$
The index $h$ is selected by minimizing the Mallows, generalized CV, or CV criterion. We discuss Mallows in detail, as it is the easiest to analyze. Andrews showed that only CV is optimal under heteroskedasticity.

The Mallows criterion is
$$
C(h) = \frac{1}{n}\left(y - \hat g(h)\right)'\left(y - \hat g(h)\right) + 2\frac{\sigma^2}{n}\mathrm{tr}\,M(h)
$$
The first term is the residual variance from model $h$; the second is the penalty. For series estimators, $\mathrm{tr}\,M(h) = k_h$. The Mallows-selected index $\hat h$ minimizes $C(h)$.
Since $y = g + e$, then $y - \hat g(h) = e + g - \hat g(h)$, so
$$
C(h) = \frac{1}{n}\left(y - \hat g(h)\right)'\left(y - \hat g(h)\right) + 2\frac{\sigma^2}{n}\mathrm{tr}\,M(h)
= \frac{1}{n}e'e + L(h) + \frac{2}{n}e'\left(g - \hat g(h)\right) + 2\frac{\sigma^2}{n}\mathrm{tr}\,M(h)
$$
And since
$$
\hat g(h) = M(h)y = M(h)g + M(h)e
$$
then
$$
g - \hat g(h) = (I - M(h))g - M(h)e = b(h) - M(h)e
$$
where $b(h) = (I - M(h))g$, and $C(h)$ equals
$$
\frac{1}{n}e'e + L(h) + \frac{2}{n}e'b(h) + \frac{2}{n}\left(\sigma^2\,\mathrm{tr}\,M(h) - e'M(h)e\right)
$$
As the first term doesn't involve $h$, it follows that $\hat h$ minimizes
$$
C^*(h) = L(h) + \frac{2}{n}e'b(h) + \frac{2}{n}\left(\sigma^2\,\mathrm{tr}\,M(h) - e'M(h)e\right)
$$
over $h \in H$. The idea is that the empirical criterion $C^*(h)$ equals the desired criterion $L(h)$ plus a stochastically small error.

We calculate that
$$
L(h) = \frac{1}{n}\left(g - \hat g(h)\right)'\left(g - \hat g(h)\right)
= \frac{1}{n}b(h)'b(h) - \frac{2}{n}b(h)'M(h)e + \frac{1}{n}e'M(h)'M(h)e
$$
and
$$
R(h) = E\left(L(h) \mid X\right)
= \frac{1}{n}b(h)'b(h) + E\left(\frac{1}{n}e'M(h)'M(h)e \mid X\right)
= \frac{1}{n}b(h)'b(h) + \frac{\sigma^2}{n}\mathrm{tr}\left(M(h)'M(h)\right)
$$
The optimality result is:
Theorem 1. Let $\lambda_{\max}(A)$ denote the maximum eigenvalue of $A$. If for some positive integer $m$,
$$
\lim_{n\to\infty}\sup_{h\in H}\lambda_{\max}\left(M(h)\right) < \infty
$$
$$
E\left(e_i^{4m} \mid X_i\right) \leq \kappa < \infty
$$
$$
\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0 \tag{1}
$$
then
$$
\frac{L(\hat h)}{\inf_{h\in H}L(h)} \to_p 1.
$$
15.11 Whittle's Inequalities
To prove Theorem 1, Li (1987) used two key inequalities from Whittle (1960, Theory of Probability and Its Applications).

Theorem. Suppose the observations are independent. Let $b$ be any $n \times 1$ vector and $A$ any $n \times n$ matrix, functions of $X$. If for some $s \geq 2$
$$
\max_i E\left(|e_i|^s \mid X_i\right) \leq \kappa_s < \infty
$$
then
$$
E\left(\left|b'e\right|^s \mid X\right) \leq K_{1s}\left(b'b\right)^{s/2} \tag{2}
$$
and
$$
E\left(\left|e'Ae - E\left(e'Ae \mid X\right)\right|^s \mid X\right) \leq K_{2s}\left(\mathrm{tr}\,A'A\right)^{s/2} \tag{3}
$$
where $K_{1s}$ and $K_{2s}$ are finite constants depending only on $s$ and the moment bounds (explicit expressions are given by Whittle).
15.12 Proof of Theorem 1
The main idea is similar to that of consistent estimation. Recall that if $S_n(\theta) \to_p S(\theta)$ uniformly in $\theta$, then the minimizer of $S_n(\theta)$ converges to the minimizer of $S(\theta)$. We can write the uniform convergence as
$$
\sup_\theta\left|\frac{S_n(\theta)}{S(\theta)} - 1\right| \to_p 0
$$
In the present case, we will show (below) that
$$
\sup_h\left|\frac{C^*(h) - L(h)}{L(h)}\right| \to_p 0 \tag{4}
$$
Let $h_0$ denote the minimizer of $L(h)$. Then
$$
\begin{aligned}
0 \leq \frac{L(\hat h) - L(h_0)}{L(\hat h)}
&= \frac{C^*(\hat h) - L(h_0)}{L(\hat h)} - \frac{C^*(\hat h) - L(\hat h)}{L(\hat h)}\\
&= \frac{C^*(\hat h) - L(h_0)}{L(\hat h)} + o_p(1)\\
&\leq \frac{C^*(h_0) - L(h_0)}{L(\hat h)} + o_p(1)\\
&\leq \frac{C^*(h_0) - L(h_0)}{L(h_0)} + o_p(1)\\
&= o_p(1)
\end{aligned}
$$
This uses (4) twice, and the facts $L(h_0) \leq L(\hat h)$ and $C^*(\hat h) \leq C^*(h_0)$. This shows that
$$
\frac{L(h_0)}{L(\hat h)} \to_p 1
$$
which is equivalent to the Theorem.
The key is thus (4). We show below that
$$
\sup_h\left|\frac{L(h)}{R(h)} - 1\right| \to_p 0 \tag{5}
$$
which says that $L(h)$ and $R(h)$ are asymptotically equivalent, and thus (4) is equivalent to
$$
\sup_h\left|\frac{C^*(h) - L(h)}{R(h)}\right| \to_p 0. \tag{6}
$$
From our earlier equation for $C^*(h)$, we have
$$
\sup_h\left|\frac{C^*(h) - L(h)}{R(h)}\right|
\leq 2\sup_h\frac{\left|e'b(h)\right|}{nR(h)} + 2\sup_h\frac{\left|\sigma^2\,\mathrm{tr}\,M(h) - e'M(h)e\right|}{nR(h)}. \tag{7}
$$
Take the first term on the right-hand side. By Whittle's first inequality,
$$
E\left(\left|e'b(h)\right|^{2m} \mid X\right) \leq K\left(b(h)'b(h)\right)^m
$$
Now recall
$$
nR(h) = b(h)'b(h) + \sigma^2\,\mathrm{tr}\left(M(h)'M(h)\right) \tag{8}
$$
Thus
$$
nR(h) \geq b(h)'b(h)
$$
Hence
$$
E\left(\left|e'b(h)\right|^{2m} \mid X\right) \leq K\left(b(h)'b(h)\right)^m \leq K\left(nR(h)\right)^m
$$
Then, since $H$ is discrete, by applying Markov's inequality and this bound,
$$
\begin{aligned}
P\left(\sup_h\frac{\left|e'b(h)\right|}{nR(h)} > \epsilon \mid X\right)
&\leq \sum_{h\in H} P\left(\frac{\left|e'b(h)\right|}{nR(h)} > \epsilon \mid X\right)\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}E\left(\left|e'b(h)\right|^{2m} \mid X\right)}{\left(nR(h)\right)^{2m}}\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}K\left(nR(h)\right)^m}{\left(nR(h)\right)^{2m}}\\
&= \frac{K}{\epsilon^{2m}}\sum_{h\in H}\left(nR(h)\right)^{-m}\\
&\to 0
\end{aligned}
$$
by assumption (1). This shows
$$
\sup_h\frac{\left|e'b(h)\right|}{nR(h)} \to_p 0
$$
Now take the second term in (7). By Whittle's second inequality, since
$$
E\left(e'M(h)e \mid X\right) = \sigma^2\,\mathrm{tr}\,M(h),
$$
then
$$
E\left(\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|^{2m} \mid X\right)
\leq K\left(\mathrm{tr}\left(M(h)'M(h)\right)\right)^m
\leq \sigma^{-2m}K\left(nR(h)\right)^m
$$
the second inequality since (8) implies
$$
\mathrm{tr}\left(M(h)'M(h)\right) \leq \sigma^{-2}nR(h)
$$
Applying Markov's inequality,
$$
\begin{aligned}
P\left(\sup_h\frac{\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|}{nR(h)} > \epsilon \mid X\right)
&\leq \sum_{h\in H} P\left(\frac{\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|}{nR(h)} > \epsilon \mid X\right)\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}E\left(\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|^{2m} \mid X\right)}{\left(nR(h)\right)^{2m}}\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}\sigma^{-2m}K\left(nR(h)\right)^m}{\left(nR(h)\right)^{2m}}\\
&= \frac{K}{\left(\sigma^2\epsilon^2\right)^m}\sum_{h\in H}\left(nR(h)\right)^{-m}\\
&\to 0
\end{aligned}
$$
For completeness, let us show (5). The demonstration is essentially the same as the above. We calculate
$$
\begin{aligned}
L(h) - R(h) &= -\frac{2}{n}b(h)'M(h)e + \frac{1}{n}e'M(h)'M(h)e - \frac{\sigma^2}{n}\mathrm{tr}\left(M(h)'M(h)\right)\\
&= -\frac{2}{n}b(h)'M(h)e + \frac{1}{n}\left(e'M(h)'M(h)e - E\left(e'M(h)'M(h)e \mid X\right)\right)
\end{aligned}
$$
Thus
$$
\sup_h\left|\frac{L(h) - R(h)}{R(h)}\right|
\leq 2\sup_h\frac{\left|e'M(h)'b(h)\right|}{nR(h)} + \sup_h\frac{\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e \mid X\right)\right|}{nR(h)}.
$$
By Whittle's first inequality,
$$
E\left(\left|e'M(h)'b(h)\right|^{2m} \mid X\right) \leq K\left(b(h)'M(h)M(h)'b(h)\right)^m
$$
Use the matrix inequality
$$
\mathrm{tr}(AB) \leq \lambda_{\max}(A)\,\mathrm{tr}(B)
$$
and, letting
$$
\bar M = \lim_{n\to\infty}\sup_{h\in H}\lambda_{\max}\left(M(h)\right) < \infty,
$$
then
$$
b(h)'M(h)M(h)'b(h) = \mathrm{tr}\left(M(h)M(h)'b(h)b(h)'\right)
\leq \bar M^2\,\mathrm{tr}\left(b(h)b(h)'\right)
= \bar M^2 b(h)'b(h)
\leq \bar M^2 nR(h)
$$
Thus
$$
E\left(\left|e'M(h)'b(h)\right|^{2m} \mid X\right)
\leq K\left(b(h)'M(h)M(h)'b(h)\right)^m
\leq K\bar M^{2m}\left(nR(h)\right)^m
$$
Thus
$$
\begin{aligned}
P\left(\sup_h\frac{\left|e'M(h)'b(h)\right|}{nR(h)} > \epsilon \mid X\right)
&\leq \sum_{h\in H} P\left(\frac{\left|e'M(h)'b(h)\right|}{nR(h)} > \epsilon \mid X\right)\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}E\left(\left|e'M(h)'b(h)\right|^{2m} \mid X\right)}{\left(nR(h)\right)^{2m}}\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}K\bar M^{2m}\left(nR(h)\right)^m}{\left(nR(h)\right)^{2m}}\\
&= \frac{K\bar M^{2m}}{\epsilon^{2m}}\sum_{h\in H}\left(nR(h)\right)^{-m}\\
&\to 0
\end{aligned}
$$
Similarly,
$$
E\left(\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e \mid X\right)\right|^{2m} \mid X\right)
\leq K\left(\mathrm{tr}\left(M(h)'M(h)M(h)'M(h)\right)\right)^m
\leq K\bar M^{2m}\left(\mathrm{tr}\left(M(h)'M(h)\right)\right)^m
\leq \sigma^{-2m}K\bar M^{2m}\left(nR(h)\right)^m
$$
and thus
$$
\begin{aligned}
P\left(\sup_h\frac{\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e \mid X\right)\right|}{nR(h)} > \epsilon \mid X\right)
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}E\left(\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e \mid X\right)\right|^{2m} \mid X\right)}{\left(nR(h)\right)^{2m}}\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}\sigma^{-2m}K\bar M^{2m}\left(nR(h)\right)^m}{\left(nR(h)\right)^{2m}}\\
&= K\left(\frac{\bar M^2}{\sigma^2\epsilon^2}\right)^m\sum_{h\in H}\left(nR(h)\right)^{-m}\\
&\to 0
\end{aligned}
$$
We have shown
$$
\sup_h\left|\frac{L(h) - R(h)}{R(h)}\right| \to_p 0
$$
which is (5).
15.13 Mallows Model Selection
Li's Theorem 1 applies to a variety of linear estimators. Of particular interest is model selection (e.g., series estimation).

Let's verify Li's conditions, which were
$$
\lim_{n\to\infty}\sup_{h\in H}\lambda_{\max}\left(M(h)\right) < \infty
$$
$$
E\left(e_i^{4m} \mid X_i\right) \leq \kappa < \infty
$$
$$
\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0 \tag{9}
$$
In linear estimation, $M(h)$ is a projection matrix, so $\lambda_{\max}\left(M(h)\right) = 1$ and the first condition is automatically satisfied. The key is equation (9).

Suppose that for sample size $n$ there are $N_n$ models. Let
$$
\xi_n = \inf_{h\in H}nR(h)
$$
and assume $\xi_n \to \infty$. A crude bound is
$$
\sum_{h\in H}\left(nR(h)\right)^{-m} \leq N_n\xi_n^{-m}
$$
If $N_n\xi_n^{-m} \to 0$ then (9) holds. Notice that by increasing $m$, we can allow for a larger $N_n$ (more models), at the cost of a stronger moment bound.

The condition $\xi_n \to \infty$ says that for all finite models $h$, there is non-zero approximation error, so that $R(h)$ is non-zero. In contrast, if there is a finite-dimensional model $h_0$ for which $b(h_0) = 0$, then $nR(h_0) = k_{h_0}\sigma^2$ does not diverge. In this case, Mallows (and AIC) are asymptotically sub-optimal.

We can improve this condition if we consider the case of selection among models of increasing size. Suppose that model $h$ has $k_h$ regressors, with $k_1 < k_2 < \cdots$, and for some $m \geq 2$,
$$
\sum_{h=1}^\infty k_h^{-m} < \infty
$$
This includes nested model selection, where $k_h = h$ and $m = 2$. Note that
$$
nR(h) = b(h)'b(h) + k_h\sigma^2 \geq k_h\sigma^2
$$
Now pick $B_n \to \infty$ so that $B_n\xi_n^{-m} \to 0$ (which is possible since $\xi_n \to \infty$). Then
$$
\sum_{h=1}^\infty\left(nR(h)\right)^{-m}
= \sum_{h=1}^{B_n}\left(nR(h)\right)^{-m} + \sum_{h=B_n+1}^\infty\left(nR(h)\right)^{-m}
\leq B_n\xi_n^{-m} + \sigma^{-2m}\sum_{h=B_n+1}^\infty k_h^{-m}
\to 0
$$
as required.
15.14 GMM Model Selection
This is an underdeveloped area. I list a few papers.

Andrews (1999, Econometrica) considers selecting moment conditions to be used for GMM estimation. Let $p$ be the number of parameters, $c$ represent a list of "selected" moment conditions, $|c|$ denote the cardinality (number) of these moments, and $J_n(c)$ the GMM criterion computed using these $c$ moments. Andrews proposes criteria of the form
$$
IC(c) = J_n(c) - r_n\left(|c| - p\right)
$$
where $|c| - p$ is the number of overidentifying restrictions and $r_n$ is a sequence. For an AIC-like criterion, he sets $r_n = 2$; for a BIC-like criterion, he sets $r_n = \log n$.

The model selection rule picks the moment conditions $c$ which minimize $IC(c)$. Assuming that a subset of the moments is incorrect, Andrews shows that the BIC-like rule asymptotically selects the correct subset.

Andrews and Lu (2001, JoE) extend the above analysis to the case of jointly picking the moments and the parameter vector (that is, imposing zero restrictions on the parameters). They show that the same criterion has similar properties: it can asymptotically select the "correct" moments and the "correct" zero restrictions.

Hong, Preston and Shum (ET, 2003) extend the analysis of the above papers to empirical likelihood. They show that this criterion has the same interpretation when $J_n(c)$ is replaced by the empirical likelihood.

These papers are an interesting first step, but they do not address the issue of GMM selection when the true model is potentially infinite-dimensional and/or misspecified. That is, the analysis is not analogous to that of Li (1987) for the regression model.

In order to properly understand GMM selection, I believe we need to understand the behavior of GMM under misspecification.

Hall and Inoue (2003, JoE) is one of the few contributions on GMM under misspecification. They did not investigate model selection.
Suppose that the model is
$$
\bar m(\theta) = Em_i(\theta) = 0
$$
where $m_i$ is $\ell \times 1$ and $\theta$ is $k \times 1$. Assume $\ell > k$ (overidentification). The model is misspecified if there is no $\theta$ such that this moment condition holds; that is, for all $\theta$,
$$
\bar m(\theta) \neq 0
$$
Suppose we apply GMM. What happens?

The first question is: what is the pseudo-true value? The GMM criterion is
$$
J_n(\theta) = n\,\bar m_n(\theta)'W_n\bar m_n(\theta)
$$
If $W_n \to_p W$, then
$$
n^{-1}J_n(\theta) \to_p \bar m(\theta)'W\bar m(\theta).
$$
Thus the GMM estimator $\hat\theta$ is consistent for the pseudo-true value
$$
\theta_0(W) = \mathrm{argmin}_\theta\, \bar m(\theta)'W\bar m(\theta).
$$
Interestingly, the pseudo-true value $\theta_0(W)$ is a function of $W$. This is a fundamental difference from the correctly specified case, where the weight matrix only affects efficiency. In the misspecified case, it affects what is being estimated.

This means that when we apply "iterated GMM", the pseudo-true value changes with each step of the iteration!
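The dependence of the pseudo-true value on $W$ can be seen in a toy numerical sketch (the moment functions and values are hypothetical, chosen only for illustration): with two linear population moments that no single $\theta$ can satisfy, different weight matrices yield different minimizers.

```python
import numpy as np

# Misspecified overidentified model: no theta sets both moments to zero
def mbar(theta):
    # Hypothetical population moments, linear in a scalar theta
    return np.array([1.0 - theta, 3.0 - 2.0 * theta])

def pseudo_true(W):
    # theta_0(W) = argmin over theta of mbar(theta)' W mbar(theta), by grid
    grid = np.linspace(0.0, 3.0, 30001)
    vals = [mbar(t) @ W @ mbar(t) for t in grid]
    return grid[int(np.argmin(vals))]

W1 = np.eye(2)                 # equal weights
W2 = np.diag([100.0, 1.0])     # heavily weight the first moment
print(pseudo_true(W1), pseudo_true(W2))  # two different pseudo-true values
```

For diagonal $W = \mathrm{diag}(w_1, w_2)$ the minimizer is $(w_1 + 6w_2)/(w_1 + 4w_2)$ in closed form, so $W_1$ gives $1.4$ and $W_2$ gives about $1.019$: the weight matrix changes what is being estimated.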
Hall and Inoue also derive the distribution of the GMM estimator. They find that the distribution depends not only on the randomness in the moment conditions, but also on the randomness in the weight matrix. Specifically, they assume that $n^{1/2}\left(W_n - W\right) \to_d$ Normal, and find that this affects the asymptotic distributions.

Furthermore, the distribution of test statistics is non-standard (a mixture of chi-squares), so inference on the pseudo-true values is troubling.

This subject deserves more study.
15.15 KLIC for Moment Condition Models Under Misspecification
Suppose that the true density is $f(y)$, and we have an over-identified moment condition model: for some function $m(y)$, the model is
$$
Em(y) = 0
$$
However, we want to allow for misspecification, namely that
$$
Em(y) \neq 0
$$
To explore misspecification, we have to ask: what is a desirable pseudo-true model?

Temporarily ignoring parameter estimation, we can ask: which density $g(y)$ satisfying this moment condition is closest to $f(y)$ in the sense of minimizing the KLIC? We can call this $g_0(y)$ the pseudo-true density.

The solution is nicely explained in Appendix A of Chen, Hong, and Shum (JoE, 2007). Recall
$$
\mathrm{KLIC}(f, g) = \int f(y)\log\left(\frac{f(y)}{g(y)}\right) dy
$$
The problem is
$$
\min_g \mathrm{KLIC}(f, g)
$$
subject to
$$
\int g(y)\,dy = 1, \qquad \int m(y)g(y)\,dy = 0
$$
The Lagrangian is
$$
\int f(y)\log\left(\frac{f(y)}{g(y)}\right) dy + \mu\left(\int g(y)\,dy - 1\right) + \lambda'\int m(y)g(y)\,dy
$$
The FOC with respect to $g(y)$ at some $y$ is
$$
0 = -\frac{f(y)}{g(y)} + \mu + \lambda'm(y)
$$
Multiplying by $g(y)$ and integrating,
$$
0 = -\int f(y)\,dy + \mu\int g(y)\,dy + \lambda'\int m(y)g(y)\,dy = -1 + \mu
$$
so $\mu = 1$. Solving for $g(y)$ we find
$$
g(y) = \frac{f(y)}{1 + \lambda'm(y)},
$$
a tilted version of the true density $f(y)$. Inserting this solution we find
$$
\mathrm{KLIC}(f, g) = \int f(y)\log\left(1 + \lambda'm(y)\right) dy
$$
By duality, the optimal Lagrange multiplier $\lambda_0$ maximizes this expression:
$$
\lambda_0 = \mathrm{argmax}_\lambda \int f(y)\log\left(1 + \lambda'm(y)\right) dy.
$$
The pseudo-true density is
$$
g_0(y) = \frac{f(y)}{1 + \lambda_0'm(y)},
$$
with associated minimized KLIC
$$
\mathrm{KLIC}(f, g_0) = \int f(y)\log\left(1 + \lambda_0'm(y)\right) dy = E\log\left(1 + \lambda_0'm(y)\right)
$$
This is the smallest possible $\mathrm{KLIC}(f, g)$ for moment condition models.

This solution looks like empirical likelihood. Indeed, EL minimizes the empirical KLIC, and this connection is widely used to motivate EL.

When the moment $m(y, \theta)$ depends on a parameter $\theta$, then the pseudo-true values $(\theta_0, \lambda_0)$ are the joint solution to the problem
$$
\min_\theta\max_\lambda E\log\left(1 + \lambda'm(y, \theta)\right)
$$
Theorem (Chen, Hong and Shum, JoE, 2007). If $|m(y, \theta)|$ is bounded, then the EL estimates $(\hat\theta, \hat\lambda)$ are $n^{-1/2}$-consistent for the pseudo-true values $(\theta_0, \lambda_0)$.

This gives a simple interpretation to the definition of the KLIC under misspecification.
15.16 Schennach's Impossibility Result
Schennach (Annals of Statistics, 2007) claims a fundamental flaw in the application of the KLIC to moment condition models. She shows that the assumption of bounded $|m(y, \theta)|$ is not merely a technical condition; it is binding.

[Notice: in the linear model, $m(y, \theta) = z\left(y - x'\theta\right)$ is unbounded if the data has unbounded support. Thus the assumption is highly relevant.]

The key problem is that for any $\lambda \neq 0$, if $m(y, \theta)$ is unbounded, so is $1 + \lambda'm(y, \theta)$. In particular, it can take on negative values. Thus $\log\left(1 + \lambda'm(y, \theta)\right)$ is ill-defined, and so there is no pseudo-true value of $\lambda$. (It must be non-zero, but it cannot be non-zero!) Without a non-zero $\lambda$, there is no way to define a pseudo-true $\theta_0$ which satisfies the moment condition.

Technically, Schennach shows that when there is no $\theta$ such that $Em(y, \theta) = 0$ and $m(y, \theta)$ is unbounded, then there is no $\theta_0$ such that
$$
\sqrt{n}\left(\hat\theta - \theta_0\right) = O_p(1).
$$
Her paper leaves open the question: for what is $\hat\theta$ consistent? Is there a pseudo-true value? One possibility is that the pseudo-true value $\theta_n$ needs to be indexed by sample size. (This idea is used in Hal White's work.)

Nevertheless, Schennach's theorem suggests that empirical likelihood is non-robust to misspecification.
15.17 Exponential Tilting
Instead of
$$
\mathrm{KLIC}(f, g) = \int f(y)\log\left(\frac{f(y)}{g(y)}\right) dy,
$$
consider the reverse distance
$$
\mathrm{KLIC}(g, f) = \int g(y)\log\left(\frac{g(y)}{f(y)}\right) dy.
$$
The pseudo-true $g$ which minimizes this criterion solves
$$
\min_g \int g(y)\log\left(\frac{g(y)}{f(y)}\right) dy
$$
subject to
$$
\int g(y)\,dy = 1, \qquad \int m(y)g(y)\,dy = 0
$$
The Lagrangian is
$$
\int g(y)\log\left(\frac{g(y)}{f(y)}\right) dy - \mu\left(\int g(y)\,dy - 1\right) - \lambda'\int m(y)g(y)\,dy
$$
with FOC
$$
0 = \log\left(\frac{g(y)}{f(y)}\right) + 1 - \mu - \lambda'm(y).
$$
Solving,
$$
g(y) = f(y)\exp\left(-1 + \mu\right)\exp\left(\lambda'm(y)\right).
$$
Imposing $\int g(y)\,dy = 1$ we find
$$
g(y) = \frac{f(y)\exp\left(\lambda'm(y)\right)}{\int f(y)\exp\left(\lambda'm(y)\right) dy}. \tag{10}
$$
Hence the name "exponential tilting" or ET.
Inserting this into $\mathrm{KLIC}(g, f)$ we find
$$
\begin{aligned}
\mathrm{KLIC}(g, f) &= \int g(y)\log\left(\frac{\exp\left(\lambda'm(y)\right)}{\int f(y)\exp\left(\lambda'm(y)\right) dy}\right) dy\\
&= \lambda'\int m(y)g(y)\,dy - \int g(y)\,dy\,\log\left(\int f(y)\exp\left(\lambda'm(y)\right) dy\right)\\
&= -\log\left(\int f(y)\exp\left(\lambda'm(y)\right) dy\right) \tag{11}\\
&= -\log E\exp\left(\lambda'm(y)\right) \tag{12}
\end{aligned}
$$
By duality, the optimal Lagrange multiplier $\lambda_0$ maximizes this expression, or equivalently
$$
\lambda_0 = \mathrm{argmin}_\lambda\, E\exp\left(\lambda'm(y)\right) \tag{13}
$$
The pseudo-true density $g_0(y)$ is (10) with this $\lambda_0$, with associated minimized KLIC (11). This is the smallest possible $\mathrm{KLIC}(g, f)$ for moment condition models.

Notice: the $g_0$ which minimizes $\mathrm{KLIC}(g, f)$ and the $g_0$ which minimizes $\mathrm{KLIC}(f, g)$ are different.

In contrast to the EL case, the ET problem (13) does not restrict $\lambda$, and there are no "trouble spots". Thus ET is more robust than EL. The pseudo-true $\lambda_0$ and $g_0$ are well defined under misspecification, unlike EL.

When the moment $m(y, \theta)$ depends on a parameter $\theta$, then the pseudo-true values $(\theta_0, \lambda_0)$ are the joint solution to the problem
$$
\max_\theta\min_\lambda E\exp\left(\lambda'm(y, \theta)\right).
$$
15.18 Exponential Tilting: Estimation
The ET or exponential tilting estimator solves the problem
$$
\min_{\theta, p_1, \ldots, p_n}\sum_{i=1}^n p_i\log p_i
$$
subject to
$$
\sum_{i=1}^n p_i = 1, \qquad \sum_{i=1}^n p_i\,m(y_i, \theta) = 0
$$
First, we concentrate out the probabilities. For any $\theta$, the Lagrangian is
$$
\sum_{i=1}^n p_i\log p_i - \mu\left(\sum_{i=1}^n p_i - 1\right) - \lambda'\sum_{i=1}^n p_i\,m(y_i, \theta)
$$
with FOC
$$
0 = \log p_i + 1 - \mu - \lambda'm(y_i, \theta).
$$
Solving for $p_i$ and imposing the summability constraint,
$$
p_i(\lambda) = \frac{\exp\left(\lambda'm(y_i, \theta)\right)}{\sum_{j=1}^n \exp\left(\lambda'm(y_j, \theta)\right)}
$$
When $\lambda = 0$ then $p_i = n^{-1}$, the same as EL. The concentrated "entropy" criterion is then
$$
\sum_{i=1}^n p_i(\lambda)\log p_i(\lambda)
= \sum_{i=1}^n p_i(\lambda)\left[\lambda'm(y_i, \theta) - \log\left(\sum_{j=1}^n \exp\left(\lambda'm(y_j, \theta)\right)\right)\right]
= -\log\left(\sum_{j=1}^n \exp\left(\lambda'm(y_j, \theta)\right)\right)
$$
the last equality holding at the dual solution, where $\sum_i p_i(\lambda)m(y_i, \theta) = 0$. By duality, the Lagrange multiplier maximizes this criterion, or equivalently
$$
\lambda(\theta) = \mathrm{argmin}_\lambda\sum_{i=1}^n \exp\left(\lambda'm(y_i, \theta)\right)
$$
The ET estimator $\hat\theta$ minimizes this concentrated entropy over $\theta$, or equivalently
$$
\hat\theta = \mathrm{argmax}_\theta\sum_{i=1}^n \exp\left(\lambda(\theta)'m(y_i, \theta)\right)
$$
The ET probabilities are $\hat p_i = p_i(\hat\theta)$.
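The two-layer ET problem can be sketched numerically (an illustrative toy, not from the notes: a scalar $\theta$ with two moments, misspecified because the data variance is not 1, an inner safeguarded Newton solver for the convex dual, and an outer grid search; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.standard_normal(500) * 1.5 + 2.0   # variance 2.25, not 1: misspecified

def moments(theta):
    # Overidentified: mean theta AND unit variance are both asserted
    return np.column_stack([y - theta, (y - theta) ** 2 - 1.0])

def lam_of_theta(theta, iters=200):
    # Inner problem: lambda(theta) = argmin sum_i exp(lambda' m_i)
    m = moments(theta)
    lam = np.zeros(2)
    f = np.sum(np.exp(m @ lam))
    for _ in range(iters):
        w = np.exp(m @ lam)
        grad = m.T @ w
        hess = (m * w[:, None]).T @ m
        step = np.linalg.solve(hess, grad)
        t = 1.0
        while True:   # backtrack until the convex dual objective decreases
            lam_new = lam - t * step
            with np.errstate(over='ignore'):
                f_new = np.sum(np.exp(m @ lam_new))
            if np.isfinite(f_new) and f_new < f:
                break
            t *= 0.5
            if t < 1e-12:
                return lam          # no further decrease: at the minimum
        lam, f = lam_new, f_new
    return lam

def et_criterion(theta):
    lam = lam_of_theta(theta)
    return np.sum(np.exp(moments(theta) @ lam))  # ET maximizes this in theta

grid = np.linspace(1.0, 3.0, 81)
theta_hat = grid[int(np.argmax([et_criterion(t) for t in grid]))]
print(theta_hat)   # pseudo-true value trades off the two misspecified moments
```

The inner objective is convex in $\lambda$, so the safeguarded Newton iteration is reliable; the outer grid stands in for a proper optimizer and locates the estimator near the sample mean, where the variance moment's misspecification is least severe.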
15.19 Schennach's Estimator
Schennach (2007) observed that while the ET probabilities have desirable properties, the EL estimator for $\theta$ has better bias properties. She suggested a hybrid estimator which achieves the best of both worlds, called exponentially tilted empirical likelihood (ETEL). This is
$$
\hat\theta = \mathrm{argmax}_\theta\, ETEL(\theta)
$$
$$
ETEL(\theta) = \sum_{i=1}^n \log\left(p_i(\theta)\right)
= \lambda(\theta)'\sum_{i=1}^n m(y_i, \theta) - n\log\left(\sum_{i=1}^n \exp\left(\lambda(\theta)'m(y_i, \theta)\right)\right)
$$
$$
p_i(\theta) = \frac{\exp\left(\lambda(\theta)'m(y_i, \theta)\right)}{\sum_{j=1}^n \exp\left(\lambda(\theta)'m(y_j, \theta)\right)}
$$
$$
\lambda(\theta) = \mathrm{argmin}_\lambda\sum_{i=1}^n \exp\left(\lambda'm(y_i, \theta)\right)
$$
She claims the following advantages for the ETEL estimator $\hat\theta$:

- Under correct specification, $\hat\theta$ is asymptotically second-order equivalent to EL.
- Under misspecification, the pseudo-true values $\theta_0, \lambda_0$ are generically well defined, and minimize a KLIC analog.
- $\sqrt{n}\left(\hat\theta - \theta_0\right) \to_d N(0, V)$, where $V$ has a sandwich form built from $\Delta = E\frac{\partial}{\partial\theta'}m(y, \theta_0)$ and $\Omega = E\left[m(y, \theta_0)m(y, \theta_0)'\right]$.