15 Model Selection
15.1 KLIC
Suppose a random sample $y = (y_1, \ldots, y_n)$ has unknown density $f(y) = \prod_{i=1}^n f(y_i)$. A model density is $g(y) = \prod_{i=1}^n g(y_i)$. How can we assess the "fit" of $g$ as an approximation to $f$?

One useful measure is the Kullback-Leibler information criterion (KLIC)
$$
\mathrm{KLIC}(f, g) = \int f(y) \log\left(\frac{f(y)}{g(y)}\right) dy
$$
You can decompose the KLIC as
$$
\mathrm{KLIC}(f, g) = \int f(y)\log f(y)\,dy - \int f(y)\log g(y)\,dy = C_f - E\log g(y)
$$
The constant $C_f = \int f(y)\log f(y)\,dy$ is independent of the model $g$.

Notice that $\mathrm{KLIC}(f, g) \geq 0$, and $\mathrm{KLIC}(f, g) = 0$ iff $g = f$. Thus a "good" approximating model $g$ is one with a low KLIC.
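As a quick numerical illustration (not from the notes), the KLIC between two normal densities has a closed form, and a direct quadrature of the defining integral reproduces it; `kl_gauss` and `kl_numeric` are illustrative helper names:

```python
import math

def kl_gauss(m1, s1, m2, s2):
    # Closed-form KLIC between true f = N(m1, s1^2) and model g = N(m2, s2^2)
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def kl_numeric(m1, s1, m2, s2, lo=-20.0, hi=20.0, n=40000):
    # Midpoint-rule approximation of  integral f(y) log(f(y)/g(y)) dy
    def pdf(y, m, s):
        return math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        y = lo + (i + 0.5) * h
        f, g = pdf(y, m1, s1), pdf(y, m2, s2)
        total += f * math.log(f / g) * h
    return total

print(kl_gauss(0, 1, 1, 2))    # closed form, positive since g != f
print(kl_numeric(0, 1, 1, 2))  # quadrature agrees closely
print(kl_gauss(0, 1, 0, 1))    # zero iff g = f
```

The check that the KLIC is zero exactly when $g = f$, and positive otherwise, is the property that makes it a sensible fit measure.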
15.2 Estimation
Let the model density $g(y, \theta)$ depend on a parameter vector $\theta$. The negative log-likelihood function is
$$
L(\theta) = -\sum_{i=1}^n \log g(y_i, \theta) = -\log g(y, \theta)
$$
and the MLE is $\hat\theta = \mathrm{argmin}_\theta\, L(\theta)$. Sometimes this is called a "quasi-MLE" when $g(y, \theta)$ is acknowledged to be an approximation, rather than the truth.

Let the minimizer of $-E\log g(y, \theta)$ be written $\theta_0$ and called the pseudo-true value. This value also minimizes $\mathrm{KLIC}(f, g(\cdot, \theta))$. As the negative log-likelihood divided by $n$ is an estimator of $-E\log g(y, \theta)$, the MLE $\hat\theta$ converges in probability to $\theta_0$. That is,
$$
\hat\theta \to_p \theta_0 = \mathrm{argmin}_\theta\, \mathrm{KLIC}(f, g(\cdot, \theta))
$$
Thus the QMLE estimates the best-fitting density, where "best" is measured in terms of the KLIC.

From conventional asymptotic theory, we know
$$
\sqrt{n}\left(\hat\theta_{QMLE} - \theta_0\right) \to_d N(0, V)
$$
$$
V = Q^{-1}\Omega Q^{-1}
$$
$$
Q = -E\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y, \theta_0)
$$
$$
\Omega = E\left[\frac{\partial}{\partial\theta}\log g(y, \theta_0)\,\frac{\partial}{\partial\theta}\log g(y, \theta_0)'\right]
$$
If the model is correctly specified ($g(y, \theta_0) = f(y)$), then $Q = \Omega$ (the information matrix equality). Otherwise $Q \neq \Omega$.
15.3 Expected KLIC
The MLE $\hat\theta = \hat\theta(y)$ is a function of the data vector $y$. The fitted model at any $\tilde y$ is $\hat g(\tilde y) = g(\tilde y, \hat\theta(y))$. The fitted likelihood is $L(\hat\theta) = -\log g(y, \hat\theta(y))$ (the model evaluated at the observed data). The KLIC of the fitted model is
$$
\mathrm{KLIC}(f, \hat g) = C_f - \int f(\tilde y)\log g(\tilde y, \hat\theta(y))\,d\tilde y = C_f - E_{\tilde y}\log g(\tilde y, \hat\theta(y))
$$
where $\tilde y$ has density $f$, independent of $y$.

The expected KLIC is the expectation over the observed values $y$:
$$
E\left(\mathrm{KLIC}(f, \hat g)\right) = C_f - E_y E_{\tilde y}\log g(\tilde y, \hat\theta(y)) = C_f - E_{\tilde y}E_y\log g(y, \hat\theta(\tilde y))
$$
the second equality by symmetry. In this expression, $\tilde y$ and $y$ are independent vectors, each with density $f$. Letting $\tilde\theta = \hat\theta(\tilde y)$, the estimator of $\theta$ when the data is $\tilde y$, we can write this more compactly as
$$
E\left(\mathrm{KLIC}(f, \hat g)\right) = C_f - E\log g(y, \tilde\theta)
$$
where $y$ and $\tilde\theta$ are independent.

An alternative interpretation is in terms of predicted likelihood. The expected KLIC is the expected likelihood when the sample $\tilde y$ is used to construct the estimate $\tilde\theta$, and an independent sample $y$ is used for evaluation. In linear regression, the quasi-likelihood is Gaussian, and the expected KLIC is the expected squared prediction error.
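The gap between in-sample and out-of-sample likelihood can be seen in a small Monte Carlo sketch (illustrative, not from the notes): fit a Gaussian mean with known variance ($k = 1$) on one sample and evaluate the negative log-likelihood on an independent sample. The average gap is approximately $k$, the optimism that motivates the penalties below.

```python
import math, random

random.seed(2)
n, reps = 50, 4000
gap = 0.0
for _ in range(reps):
    ytil = [random.gauss(0, 1) for _ in range(n)]   # "training" sample
    y    = [random.gauss(0, 1) for _ in range(n)]   # independent sample
    mu = sum(ytil) / n                              # MLE from ytil (k = 1)
    # negative log-likelihood (variance known): in-sample vs out-of-sample
    nll_in  = sum(0.5 * math.log(2 * math.pi) + 0.5 * (v - mu) ** 2 for v in ytil)
    nll_out = sum(0.5 * math.log(2 * math.pi) + 0.5 * (v - mu) ** 2 for v in y)
    gap += (nll_out - nll_in) / reps

print(gap)  # approximately 1 = k, the number of estimated parameters
```

Exact calculation gives $E(\mathrm{nll}_{out} - \mathrm{nll}_{in}) = \frac{n+1}{2} - \frac{n-1}{2} = 1 = k$, matching the simulation.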
15.4 Estimating KLIC
We want an estimate of the expected KLIC. As $C_f$ is constant across models, it is ignored. We want to estimate
$$
T = -E\log g(y, \tilde\theta)
$$
Make a second-order Taylor expansion of $-\log g(y, \tilde\theta)$ about $\hat\theta$:
$$
-\log g(y, \tilde\theta) \simeq -\log g(y, \hat\theta) - \frac{\partial}{\partial\theta}\log g(y, \hat\theta)'\left(\tilde\theta - \hat\theta\right) - \frac{1}{2}\left(\tilde\theta - \hat\theta\right)'\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y, \hat\theta)\left(\tilde\theta - \hat\theta\right)
$$
The first term on the RHS is $L(\hat\theta)$; the second is linear in the FOC and so vanishes; only the third term remains. Writing
$$
\hat Q = -\frac{1}{n}\frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y, \hat\theta), \qquad
\tilde\theta - \hat\theta = \left(\tilde\theta - \theta_0\right) - \left(\hat\theta - \theta_0\right)
$$
and expanding the quadratic, we find
$$
-\log g(y, \tilde\theta) \simeq L(\hat\theta) + \frac{n}{2}\left(\tilde\theta - \theta_0\right)'\hat Q\left(\tilde\theta - \theta_0\right) + \frac{n}{2}\left(\hat\theta - \theta_0\right)'\hat Q\left(\hat\theta - \theta_0\right) - n\left(\tilde\theta - \theta_0\right)'\hat Q\left(\hat\theta - \theta_0\right).
$$
Now
$$
\sqrt{n}\left(\hat\theta - \theta_0\right) \to_d Z_1 \sim N(0, V), \qquad
\sqrt{n}\left(\tilde\theta - \theta_0\right) \to_d Z_2 \sim N(0, V),
$$
which are independent, and $\hat Q \to_p Q$. Thus for large $n$,
$$
-\log g(y, \tilde\theta) \simeq L(\hat\theta) + \frac{1}{2}Z_2'QZ_2 + \frac{1}{2}Z_1'QZ_1 - Z_2'QZ_1.
$$
Taking expectations (the cross term has zero mean since $Z_1$ and $Z_2$ are independent),
$$
T = -E\log g(y, \tilde\theta) \simeq EL(\hat\theta) + E\left(\frac{1}{2}Z_2'QZ_2 + \frac{1}{2}Z_1'QZ_1 - Z_2'QZ_1\right) = EL(\hat\theta) + \mathrm{tr}(QV) = EL(\hat\theta) + \mathrm{tr}\left(Q^{-1}\Omega\right)
$$
An (asymptotically) unbiased estimate of $T$ is then
$$
\hat T = L(\hat\theta) + \widehat{\mathrm{tr}\left(Q^{-1}\Omega\right)}
$$
where $\widehat{\mathrm{tr}\left(Q^{-1}\Omega\right)}$ is an estimate of $\mathrm{tr}\left(Q^{-1}\Omega\right)$.
15.5 AIC
When $g(y, \theta_0) = f(y)$ (the model is correctly specified) then $Q = \Omega$ (the information matrix equality). Hence
$$
\mathrm{tr}\left(Q^{-1}\Omega\right) = k = \dim(\theta)
$$
so
$$
\hat T = L(\hat\theta) + k
$$
This is the Akaike Information Criterion (AIC). It is typically written as $2\hat T$, e.g.
$$
\mathrm{AIC} = 2L(\hat\theta) + 2k
$$
AIC is an estimate of the expected KLIC, based on the approximation that $g$ includes the correct model.

Picking the model with the smallest AIC is picking the model with the smallest estimated KLIC. In this sense it is picking the best-fitting model.
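A minimal sketch of AIC model comparison (illustrative, with assumed data and models): two Gaussian quasi-likelihoods for the same data, one with the variance free ($k = 2$) and one with the variance fixed at 1 ($k = 1$).

```python
import math, random

random.seed(0)
n = 200
y = [random.gauss(2.0, 1.0) for _ in range(n)]  # simulated data from N(2, 1)

ybar = sum(y) / n
s2 = sum((v - ybar) ** 2 for v in y) / n        # MLE of sigma^2

def neg_loglik(mu, sig2):
    # L(theta) = -sum_i log g(y_i; theta) for a Gaussian model density
    return sum(0.5 * math.log(2 * math.pi * sig2)
               + (v - mu) ** 2 / (2 * sig2) for v in y)

# Model A: mean and variance free (k = 2); Model B: variance fixed at 1 (k = 1)
aic_a = 2 * neg_loglik(ybar, s2) + 2 * 2
aic_b = 2 * neg_loglik(ybar, 1.0) + 2 * 1

print(aic_a, aic_b)  # smaller AIC = smaller estimated expected KLIC
```

Model A always attains a weakly smaller fitted negative log-likelihood (it nests Model B), so any preference for Model B comes entirely from the penalty term.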
15.6 TIC
Takeuchi (1976) proposed a robust version of the AIC, known as the Takeuchi Information Criterion (TIC):
$$
\mathrm{TIC} = 2L(\hat\theta) + 2\,\mathrm{tr}\left(\hat Q^{-1}\hat\Omega\right)
$$
where
$$
\hat Q = -\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta'}\log g(y_i, \hat\theta)
$$
$$
\hat\Omega = \frac{1}{n}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log g(y_i, \hat\theta)\,\frac{\partial}{\partial\theta}\log g(y_i, \hat\theta)'
$$
The TIC does not require that $g$ is correctly specified.
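A sketch of how the TIC penalty $\mathrm{tr}(\hat Q^{-1}\hat\Omega)$ departs from the AIC penalty $k$ when the model is wrong. For a Gaussian location-scale model ($k = 2$), the penalty has the closed form $1 + (\hat m_4/\hat\sigma^4 - 1)/2$, which equals 2 exactly when the sample kurtosis is 3; the heavy-tailed data here are an assumed example.

```python
import math, random

random.seed(1)
n = 20000
# Heavy-tailed data: a scale mixture of normals (kurtosis > 3), so the
# Gaussian model g(y; mu, sigma^2) is only a quasi-likelihood approximation
y = [random.gauss(0, 1) * (2.0 if random.random() < 0.1 else 1.0)
     for _ in range(n)]

mu = sum(y) / n
s2 = sum((v - mu) ** 2 for v in y) / n
m4 = sum((v - mu) ** 4 for v in y) / n

# For the Gaussian location-scale model, Q-hat = diag(1/s2, 1/(2*s2^2)) and
# tr(Q^{-1} Omega) reduces to 1 + (m4/s2^2 - 1)/2  (= k = 2 iff kurtosis = 3)
tic_penalty = 1.0 + (m4 / s2**2 - 1.0) / 2.0
aic_penalty = 2.0  # = k = dim(theta)

print(aic_penalty, tic_penalty)  # TIC penalty exceeds 2 under heavy tails
```

The larger TIC penalty reflects that, under misspecification, each estimated parameter inflates the expected KLIC by more than the information-matrix-equality accounting assumes.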
15.7 Comments on AIC and TIC
The AIC and TIC are designed for the likelihood (or quasi-likelihood) context. For proper application, the "model" needs to be a conditional density, not just a conditional mean or a set of moment conditions. This is both a strength and a limitation.

The benefit of AIC/TIC is that they select fitted models whose densities are close to the true density. This is a broad and useful feature.

The relation of the TIC to the AIC is very similar to the relationship between the conventional and "White" covariance matrix estimators for the MLE/QMLE or LS. The TIC does not appear to be widely appreciated or used.
The AIC is known to be asymptotically optimal in linear regression (we discuss this below), but in the general context I do not know of an optimality result. The desired optimality would be that if a model is selected by minimizing AIC (or TIC), then the fitted KLIC of this model is asymptotically equivalent to the KLIC of the infeasible best-fitting model.
15.8 AIC and TIC in Linear Regression
In linear regression or projection
$$
y_i = X_i'\beta + e_i, \qquad E(X_i e_i) = 0
$$
AIC or TIC cannot be directly applied, as the density of $e_i$ is unspecified. However, the LS estimator is the same as the Gaussian MLE, so it is natural to calculate the AIC or TIC for the Gaussian quasi-MLE.

The Gaussian quasi-likelihood is
$$
\log g_i(\theta) = -\frac{1}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\left(y_i - X_i'\beta\right)^2
$$
where $\theta = (\beta, \sigma^2)$ and $\sigma^2 = Ee_i^2$. The MLE $\hat\theta = (\hat\beta, \hat\sigma^2)$ is LS. The pseudo-true value $\beta_0$ is the projection coefficient $\beta = E\left(X_iX_i'\right)^{-1}E\left(X_iy_i\right)$. If $\beta$ is $k \times 1$ then the number of parameters is $k + 1$.

The sample log-likelihood is
$$
2L(\hat\theta) = n\log\left(\hat\sigma^2\right) + n\log(2\pi) + n
$$
The second and third terms can be ignored. The AIC is
$$
\mathrm{AIC} = n\log\left(\hat\sigma^2\right) + 2(k + 1).
$$
Often this is written
$$
\mathrm{AIC} = n\log\left(\hat\sigma^2\right) + 2k
$$
as adding or subtracting constants does not matter for model selection, or sometimes
$$
\mathrm{AIC} = \log\left(\hat\sigma^2\right) + 2\frac{k}{n}
$$
as scaling doesn't matter.
Also
$$
\frac{\partial}{\partial\beta}\log g(y_i, \theta) = \frac{1}{\sigma^2}X_i\left(y_i - X_i'\beta\right)
$$
$$
\frac{\partial}{\partial\sigma^2}\log g(y_i, \theta) = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4}\left(y_i - X_i'\beta\right)^2,
$$
and
$$
-\frac{\partial^2}{\partial\beta\,\partial\beta'}\log g(y_i, \theta) = \frac{1}{\sigma^2}X_iX_i'
$$
$$
-\frac{\partial^2}{\partial\beta\,\partial\sigma^2}\log g(y_i, \theta) = \frac{1}{\sigma^4}X_i\left(y_i - X_i'\beta\right)
$$
$$
-\frac{\partial^2}{\partial(\sigma^2)^2}\log g(y_i, \theta) = -\frac{1}{2\sigma^4} + \frac{1}{\sigma^6}\left(y_i - X_i'\beta\right)^2
$$
Evaluated at the pseudo-true values,
$$
\frac{\partial}{\partial\beta}\log g(y_i, \theta_0) = \frac{1}{\sigma^2}X_ie_i
$$
$$
\frac{\partial}{\partial\sigma^2}\log g(y_i, \theta_0) = \frac{1}{2\sigma^4}\left(e_i^2 - \sigma^2\right),
$$
and
$$
-\frac{\partial^2}{\partial\beta\,\partial\beta'}\log g(y_i, \theta_0) = \frac{1}{\sigma^2}X_iX_i'
$$
$$
-\frac{\partial^2}{\partial\beta\,\partial\sigma^2}\log g(y_i, \theta_0) = \frac{1}{\sigma^4}X_ie_i
$$
$$
-\frac{\partial^2}{\partial(\sigma^2)^2}\log g(y_i, \theta_0) = \frac{1}{2\sigma^6}\left(2e_i^2 - \sigma^2\right)
$$
Thus
$$
Q = -E\begin{bmatrix}
\dfrac{\partial^2}{\partial\beta\,\partial\beta'}\log g(y_i, \theta_0) & \dfrac{\partial^2}{\partial\beta\,\partial\sigma^2}\log g(y_i, \theta_0)\\[2ex]
\dfrac{\partial^2}{\partial\sigma^2\,\partial\beta'}\log g(y_i, \theta_0) & \dfrac{\partial^2}{\partial(\sigma^2)^2}\log g(y_i, \theta_0)
\end{bmatrix}
= \sigma^{-2}\begin{bmatrix}
E\left(X_iX_i'\right) & 0\\[1ex]
0 & \dfrac{1}{2\sigma^2}
\end{bmatrix}
$$
and
$$
\Omega = E\begin{bmatrix}
\dfrac{\partial\log g_i}{\partial\beta}\dfrac{\partial\log g_i}{\partial\beta}' & \dfrac{\partial\log g_i}{\partial\beta}\dfrac{\partial\log g_i}{\partial\sigma^2}\\[2ex]
\dfrac{\partial\log g_i}{\partial\sigma^2}\dfrac{\partial\log g_i}{\partial\beta}' & \left(\dfrac{\partial\log g_i}{\partial\sigma^2}\right)^2
\end{bmatrix}
= \sigma^{-2}\begin{bmatrix}
E\left(X_iX_i'\dfrac{e_i^2}{\sigma^2}\right) & \dfrac{1}{2\sigma^4}E\left(X_ie_i^3\right)\\[2ex]
\dfrac{1}{2\sigma^4}E\left(X_i'e_i^3\right) & \dfrac{\kappa_4}{4\sigma^2}
\end{bmatrix}
$$
where
$$
\kappa_4 = \mathrm{var}\left(\frac{e_i^2}{\sigma^2}\right) = \frac{E\left(e_i^2 - \sigma^2\right)^2}{\sigma^4} = \frac{E\left(e_i^4\right) - \sigma^4}{\sigma^4}
$$
We see that $\Omega = Q$ if
$$
E\left(\frac{e_i^2}{\sigma^2} \mid X_i\right) = 1, \qquad
E\left(X_ie_i^3\right) = 0, \qquad
\kappa_4 = 2
$$
Essentially, this requires that $e_i \sim N(0, \sigma^2)$. Otherwise $\Omega \neq Q$. Thus the AIC is appropriate in Gaussian regression. It is an "approximation" in non-Gaussian regression, heteroskedastic regression, or projection.

To calculate the TIC, note that since $Q$ is block diagonal, you do not need to estimate the off-diagonal component of $\Omega$. Note that
$$
\mathrm{tr}\left(Q^{-1}\Omega\right) = \mathrm{tr}\left(E\left(X_iX_i'\right)^{-1}E\left(X_iX_i'\frac{e_i^2}{\sigma^2}\right)\right) + \left(\frac{1}{2\sigma^2}\right)^{-1}\frac{\kappa_4}{4\sigma^2}
= \mathrm{tr}\left(E\left(X_iX_i'\right)^{-1}E\left(X_iX_i'\frac{e_i^2}{\sigma^2}\right)\right) + \frac{\kappa_4}{2}
$$
Let
$$
\hat\kappa_4 = \frac{1}{n\hat\sigma^4}\sum_{i=1}^n\left(\hat e_i^2 - \hat\sigma^2\right)^2
$$
The TIC is then
$$
\mathrm{TIC} = n\log\left(\hat\sigma^2\right) + 2\,\mathrm{tr}\left(\hat Q^{-1}\hat\Omega\right)
= n\log\left(\hat\sigma^2\right) + 2\left[\mathrm{tr}\left(\left(\sum_{i=1}^n X_iX_i'\right)^{-1}\left(\sum_{i=1}^n X_iX_i'\frac{\hat e_i^2}{\hat\sigma^2}\right)\right) + \frac{\hat\kappa_4}{2}\right]
= n\log\left(\hat\sigma^2\right) + \frac{2}{\hat\sigma^2}\sum_{i=1}^n h_i\hat e_i^2 + \hat\kappa_4
$$
where $h_i = X_i'\left(X'X\right)^{-1}X_i$.

When the errors are close to homoskedastic and Gaussian, then $h_i$ and $\hat e_i^2$ will be approximately uncorrelated and $\hat\kappa_4$ will be close to 2, so the penalty will be close to
$$
2\sum_{i=1}^n h_i + 2 = 2(k + 1)
$$
as for AIC. In this case TIC will be close to AIC. In applications, the differences arise under heteroskedasticity and non-Gaussianity.

The primary use of AIC and TIC is to compare models. As we change models, the residuals $\hat e_i$ typically do not change too much, so my guess is that the estimate $\hat\kappa_4$ will not change much. In this event, the TIC correction for estimation of $\sigma^2$ will not matter much.
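The regression formulas above can be sketched numerically (an illustrative simulation, not from the notes): under heteroskedasticity tied to the regressors, the TIC penalty $\frac{2}{\hat\sigma^2}\sum_i h_i\hat e_i^2 + \hat\kappa_4$ exceeds the AIC penalty $2(k+1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 400, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta = np.array([1.0, 2.0, -1.0])
# Heteroskedastic errors: conditional variance depends on a regressor, so the
# Gaussian quasi-likelihood is misspecified in exactly the TIC-relevant way
e = rng.standard_normal(n) * np.sqrt(0.5 + X[:, 1] ** 2)
y = X @ beta + e

XtX_inv = np.linalg.inv(X.T @ X)
bhat = XtX_inv @ (X.T @ y)                 # least squares = Gaussian QMLE
ehat = y - X @ bhat
s2 = ehat @ ehat / n                       # MLE of sigma^2
h = np.sum((X @ XtX_inv) * X, axis=1)      # leverage values h_i
kappa4 = np.mean((ehat ** 2 - s2) ** 2) / s2 ** 2

aic = n * np.log(s2) + 2 * (k + 1)
tic = n * np.log(s2) + (2 / s2) * np.sum(h * ehat ** 2) + kappa4
print(aic, tic)   # TIC penalty exceeds 2(k+1) here
```

With homoskedastic Gaussian errors the two criteria would nearly coincide, since $\sum_i h_i = k$ and $\hat\kappa_4 \approx 2$.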
15.9 Asymptotic Equivalence
Let $\tilde\sigma^2$ be a preliminary (model-free) estimate of $\sigma^2$. The AIC is equivalent to
$$
\tilde\sigma^2\left(\mathrm{AIC} - n\log\tilde\sigma^2 + n\right)
= n\tilde\sigma^2\left(\log\left(\frac{\hat\sigma^2}{\tilde\sigma^2}\right) + 1\right) + 2\tilde\sigma^2 k
\simeq n\tilde\sigma^2\left(\frac{\hat\sigma^2}{\tilde\sigma^2}\right) + 2\tilde\sigma^2 k
= \hat e'\hat e + 2\tilde\sigma^2 k = C_k
$$
The approximation is $\log(1 + a) \simeq a$ for small $a$. This is the Mallows criterion. Thus AIC is approximately equal to Mallows, and the approximation is close when $k/n$ is small.

Furthermore, this expression approximately equals
$$
\hat e'\hat e\left(1 + \frac{2k}{n}\right) = S_k
$$
which is known as Shibata's criterion (Annals of Statistics, 1980; Biometrika, 1981).

The TIC (ignoring the correction for estimation of $\sigma^2$) is equivalent to
$$
\tilde\sigma^2\left(\mathrm{TIC} - n\log\tilde\sigma^2 + n\right)
= n\tilde\sigma^2\left(\log\left(\frac{\hat\sigma^2}{\tilde\sigma^2}\right) + 1\right) + \frac{2\tilde\sigma^2}{\hat\sigma^2}\sum_{i=1}^n h_i\hat e_i^2
\simeq \hat e'\hat e + 2\sum_{i=1}^n h_i\hat e_i^2
\simeq \sum_{i=1}^n \frac{\hat e_i^2}{(1 - h_i)^2} = CV,
$$
the cross-validation criterion. Thus $\mathrm{TIC} \simeq CV$. They are both asymptotically equivalent to a "heteroskedasticity-robust Mallows criterion"
$$
C_k^* = \hat e'\hat e + 2\sum_{i=1}^n h_i\hat e_i^2
$$
which, strangely enough, I have not seen in the literature.
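The near-equivalence of CV and the robust Mallows expression can be checked directly (an illustrative simulation with assumed data-generating choices): for low-order polynomial fits, $\sum_i \hat e_i^2/(1-h_i)^2$ and $\hat e'\hat e + 2\sum_i h_i\hat e_i^2$ agree to well within a percent.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.uniform(-1, 1, n)
y = np.sin(2 * x) + 0.3 * rng.standard_normal(n)

def fit(order):
    # Polynomial regression of the given order: residuals and leverage values
    X = np.vander(x, order + 1, increasing=True)
    P = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - P @ y
    return e, np.diag(P)

results = []
for order in (1, 2, 3):
    e, h = fit(order)
    cv = np.sum(e ** 2 / (1 - h) ** 2)        # cross-validation criterion
    cstar = e @ e + 2 * np.sum(h * e ** 2)    # robust Mallows C*_k
    results.append((cv, cstar))
    print(order, round(cv, 3), round(cstar, 3))
```

The agreement follows from $(1-h_i)^{-2} \approx 1 + 2h_i$ when the leverage values $h_i$ are small, which is exactly the $k/n \to 0$ regime of the approximations above.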
15.10 Mallows Criterion
Ker-Chau Li (1987, Annals of Statistics) provided an important treatment of the optimality of model selection methods for homoskedastic linear regression. Andrews (1991, JoE) extended his results to allow conditional heteroskedasticity.

Take the regression model
$$
y_i = g(X_i) + e_i = g_i + e_i
$$
$$
E(e_i \mid X_i) = 0, \qquad E\left(e_i^2 \mid X_i\right) = \sigma^2
$$
Written as an $n \times 1$ vector, $y = g + e$. Li assumed that the $X_i$ are non-random, but his analysis can be re-interpreted by treating everything as conditional on $X_i$.

Li considered estimators of the $n \times 1$ vector $g$ which are linear in $y$ and thus take the form
$$
\hat g(h) = M(h)y
$$
where $M(h)$ is $n \times n$, a function of the $X$ matrix, indexed by $h \in H$, and $H$ is a discrete set. For example, a series estimator sets $M(h) = X_h\left(X_h'X_h\right)^{-1}X_h'$ where $X_h$ is an $n \times k_h$ set of basis functions of the regressors, and $H = \{1, \ldots, \bar h\}$. The goal is to pick $h$ to minimize the average squared error
$$
L(h) = \frac{1}{n}\left(g - \hat g(h)\right)'\left(g - \hat g(h)\right).
$$
The index $h$ is selected by minimizing the Mallows, generalized CV, or CV criterion. We discuss Mallows in detail, as it is the easiest to analyze. Andrews showed that only CV is optimal under heteroskedasticity.

The Mallows criterion is
$$
C(h) = \frac{1}{n}\left(y - \hat g(h)\right)'\left(y - \hat g(h)\right) + 2\frac{\sigma^2}{n}\mathrm{tr}\,M(h)
$$
The first term is the residual variance from model $h$; the second is the penalty. For series estimators, $\mathrm{tr}\,M(h) = k_h$. The Mallows-selected index $\hat h$ minimizes $C(h)$.
Since $y = g + e$, then $y - \hat g(h) = e + g - \hat g(h)$, so
$$
C(h) = \frac{1}{n}\left(y - \hat g(h)\right)'\left(y - \hat g(h)\right) + 2\frac{\sigma^2}{n}\mathrm{tr}\,M(h)
= \frac{1}{n}e'e + L(h) + \frac{2}{n}e'\left(g - \hat g(h)\right) + 2\frac{\sigma^2}{n}\mathrm{tr}\,M(h)
$$
And since
$$
\hat g(h) = M(h)y = M(h)g + M(h)e
$$
then
$$
g - \hat g(h) = (I - M(h))g - M(h)e = b(h) - M(h)e
$$
where $b(h) = (I - M(h))g$, and $C(h)$ equals
$$
\frac{1}{n}e'e + L(h) + \frac{2}{n}e'b(h) + \frac{2}{n}\left(\sigma^2\,\mathrm{tr}\,M(h) - e'M(h)e\right)
$$
As the first term doesn't involve $h$, it follows that $\hat h$ minimizes
$$
C^*(h) = L(h) + \frac{2}{n}e'b(h) + \frac{2}{n}\left(\sigma^2\,\mathrm{tr}\,M(h) - e'M(h)e\right)
$$
over $h \in H$. The idea is that the empirical criterion $C^*(h)$ equals the desired criterion $L(h)$ plus a stochastically small error.

We calculate that
$$
L(h) = \frac{1}{n}\left(g - \hat g(h)\right)'\left(g - \hat g(h)\right)
= \frac{1}{n}b(h)'b(h) - \frac{2}{n}b(h)'M(h)e + \frac{1}{n}e'M(h)'M(h)e
$$
and
$$
R(h) = E\left(L(h) \mid X\right)
= \frac{1}{n}b(h)'b(h) + E\left(\frac{1}{n}e'M(h)'M(h)e \mid X\right)
= \frac{1}{n}b(h)'b(h) + \frac{\sigma^2}{n}\mathrm{tr}\left(M(h)'M(h)\right)
$$
The optimality result is:
Theorem 1. Let $\lambda_{\max}(A)$ denote the maximum eigenvalue of $A$. If for some positive integer $m$,
$$
\lim_{n\to\infty}\sup_{h\in H}\lambda_{\max}\left(M(h)\right) < \infty
$$
$$
E\left(e_i^{4m} \mid X_i\right) \leq \kappa < \infty
$$
$$
\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0 \tag{1}
$$
then
$$
\frac{L(\hat h)}{\inf_{h\in H}L(h)} \to_p 1.
$$
15.11 Whittle's Inequalities
To prove Theorem 1, Li (1987) used two key inequalities from Whittle (1960, Theory of Probability and Its Applications).

Theorem. Suppose the observations are independent. Let $b$ be any $n \times 1$ vector and $A$ any $n \times n$ matrix, functions of $X$. If for some $s \geq 2$
$$
\max_i E\left(|e_i|^s \mid X_i\right) \leq \kappa_s < \infty
$$
then
$$
E\left(\left|b'e\right|^s \mid X\right) \leq K_{1s}\left(b'b\right)^{s/2} \tag{2}
$$
and
$$
E\left(\left|e'Ae - E\left(e'Ae \mid X\right)\right|^s \mid X\right) \leq K_{2s}\left(\mathrm{tr}\,A'A\right)^{s/2} \tag{3}
$$
where $K_{1s}$ and $K_{2s}$ are finite constants depending only on $s$ and the moment bounds (explicit expressions are given by Whittle).
15.12 Proof of Theorem 1
The main idea is similar to that of consistent estimation. Recall that if $S_n(\theta) \to_p S(\theta)$ uniformly in $\theta$, then the minimizer of $S_n(\theta)$ converges to the minimizer of $S(\theta)$. We can write the uniform convergence as
$$
\sup_\theta\left|\frac{S_n(\theta)}{S(\theta)} - 1\right| \to_p 0
$$
In the present case, we will show (below) that
$$
\sup_h\left|\frac{C^*(h) - L(h)}{L(h)}\right| \to_p 0 \tag{4}
$$
Let $h_0$ denote the minimizer of $L(h)$. Then
$$
\begin{aligned}
0 \leq \frac{L(\hat h) - L(h_0)}{L(\hat h)}
&= \frac{C^*(\hat h) - L(h_0)}{L(\hat h)} - \frac{C^*(\hat h) - L(\hat h)}{L(\hat h)}\\
&= \frac{C^*(\hat h) - L(h_0)}{L(\hat h)} + o_p(1)\\
&\leq \frac{C^*(h_0) - L(h_0)}{L(\hat h)} + o_p(1)\\
&\leq \frac{C^*(h_0) - L(h_0)}{L(h_0)} + o_p(1)\\
&= o_p(1)
\end{aligned}
$$
This uses (4) twice, and the facts $L(h_0) \leq L(\hat h)$ and $C^*(\hat h) \leq C^*(h_0)$. This shows that
$$
\frac{L(h_0)}{L(\hat h)} \to_p 1
$$
which is equivalent to the Theorem.
The key is thus (4). We show below that
$$
\sup_h\left|\frac{L(h)}{R(h)} - 1\right| \to_p 0 \tag{5}
$$
which says that $L(h)$ and $R(h)$ are asymptotically equivalent, and thus (4) is equivalent to
$$
\sup_h\left|\frac{C^*(h) - L(h)}{R(h)}\right| \to_p 0. \tag{6}
$$
From our earlier equation for $C^*(h)$, we have
$$
\sup_h\left|\frac{C^*(h) - L(h)}{R(h)}\right|
\leq 2\sup_h\frac{\left|e'b(h)\right|}{nR(h)} + 2\sup_h\frac{\left|\sigma^2\,\mathrm{tr}\,M(h) - e'M(h)e\right|}{nR(h)}. \tag{7}
$$
Take the first term on the right-hand side. By Whittle's first inequality,
$$
E\left(\left|e'b(h)\right|^{2m} \mid X\right) \leq K\left(b(h)'b(h)\right)^m
$$
Now recall
$$
nR(h) = b(h)'b(h) + \sigma^2\,\mathrm{tr}\left(M(h)'M(h)\right) \tag{8}
$$
Thus
$$
nR(h) \geq b(h)'b(h)
$$
Hence
$$
E\left(\left|e'b(h)\right|^{2m} \mid X\right) \leq K\left(b(h)'b(h)\right)^m \leq K\left(nR(h)\right)^m
$$
Then, since $H$ is discrete, by applying Markov's inequality and this bound,
$$
\begin{aligned}
P\left(\sup_h\frac{\left|e'b(h)\right|}{nR(h)} > \epsilon \mid X\right)
&\leq \sum_{h\in H} P\left(\frac{\left|e'b(h)\right|}{nR(h)} > \epsilon \mid X\right)\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}E\left(\left|e'b(h)\right|^{2m} \mid X\right)}{\left(nR(h)\right)^{2m}}\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}K\left(nR(h)\right)^m}{\left(nR(h)\right)^{2m}}\\
&= \frac{K}{\epsilon^{2m}}\sum_{h\in H}\left(nR(h)\right)^{-m}\\
&\to 0
\end{aligned}
$$
by assumption (1). This shows
$$
\sup_h\frac{\left|e'b(h)\right|}{nR(h)} \to_p 0
$$
Now take the second term in (7). By Whittle's second inequality, since
$$
E\left(e'M(h)e \mid X\right) = \sigma^2\,\mathrm{tr}\,M(h),
$$
then
$$
E\left(\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|^{2m} \mid X\right)
\leq K\left(\mathrm{tr}\left(M(h)'M(h)\right)\right)^m
\leq \sigma^{-2m}K\left(nR(h)\right)^m
$$
the second inequality since (8) implies
$$
\mathrm{tr}\left(M(h)'M(h)\right) \leq \sigma^{-2}nR(h)
$$
Applying Markov's inequality,
$$
\begin{aligned}
P\left(\sup_h\frac{\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|}{nR(h)} > \epsilon \mid X\right)
&\leq \sum_{h\in H} P\left(\frac{\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|}{nR(h)} > \epsilon \mid X\right)\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}E\left(\left|e'M(h)e - \sigma^2\,\mathrm{tr}\,M(h)\right|^{2m} \mid X\right)}{\left(nR(h)\right)^{2m}}\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}\sigma^{-2m}K\left(nR(h)\right)^m}{\left(nR(h)\right)^{2m}}\\
&= \frac{K}{\left(\sigma^2\epsilon^2\right)^m}\sum_{h\in H}\left(nR(h)\right)^{-m}\\
&\to 0
\end{aligned}
$$
For completeness, let us show (5). The demonstration is essentially the same as the above. We calculate
$$
\begin{aligned}
L(h) - R(h) &= -\frac{2}{n}b(h)'M(h)e + \frac{1}{n}e'M(h)'M(h)e - \frac{\sigma^2}{n}\mathrm{tr}\left(M(h)'M(h)\right)\\
&= -\frac{2}{n}b(h)'M(h)e + \frac{1}{n}\left(e'M(h)'M(h)e - E\left(e'M(h)'M(h)e \mid X\right)\right)
\end{aligned}
$$
Thus
$$
\sup_h\left|\frac{L(h) - R(h)}{R(h)}\right|
\leq 2\sup_h\frac{\left|e'M(h)'b(h)\right|}{nR(h)} + \sup_h\frac{\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e \mid X\right)\right|}{nR(h)}.
$$
By Whittle's first inequality,
$$
E\left(\left|e'M(h)'b(h)\right|^{2m} \mid X\right) \leq K\left(b(h)'M(h)M(h)'b(h)\right)^m
$$
Use the matrix inequality
$$
\mathrm{tr}(AB) \leq \lambda_{\max}(A)\,\mathrm{tr}(B)
$$
and, letting
$$
\bar M = \lim_{n\to\infty}\sup_{h\in H}\lambda_{\max}\left(M(h)\right) < \infty,
$$
then
$$
b(h)'M(h)M(h)'b(h) = \mathrm{tr}\left(M(h)M(h)'b(h)b(h)'\right)
\leq \bar M^2\,\mathrm{tr}\left(b(h)b(h)'\right)
= \bar M^2 b(h)'b(h)
\leq \bar M^2 nR(h)
$$
Thus
$$
E\left(\left|e'M(h)'b(h)\right|^{2m} \mid X\right)
\leq K\left(b(h)'M(h)M(h)'b(h)\right)^m
\leq K\bar M^{2m}\left(nR(h)\right)^m
$$
Thus
$$
\begin{aligned}
P\left(\sup_h\frac{\left|e'M(h)'b(h)\right|}{nR(h)} > \epsilon \mid X\right)
&\leq \sum_{h\in H} P\left(\frac{\left|e'M(h)'b(h)\right|}{nR(h)} > \epsilon \mid X\right)\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}E\left(\left|e'M(h)'b(h)\right|^{2m} \mid X\right)}{\left(nR(h)\right)^{2m}}\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}K\bar M^{2m}\left(nR(h)\right)^m}{\left(nR(h)\right)^{2m}}\\
&= \frac{K\bar M^{2m}}{\epsilon^{2m}}\sum_{h\in H}\left(nR(h)\right)^{-m}\\
&\to 0
\end{aligned}
$$
Similarly,
$$
E\left(\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e \mid X\right)\right|^{2m} \mid X\right)
\leq K\left(\mathrm{tr}\left(M(h)'M(h)M(h)'M(h)\right)\right)^m
\leq K\bar M^{2m}\left(\mathrm{tr}\left(M(h)'M(h)\right)\right)^m
\leq \sigma^{-2m}K\bar M^{2m}\left(nR(h)\right)^m
$$
and thus
$$
\begin{aligned}
P\left(\sup_h\frac{\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e \mid X\right)\right|}{nR(h)} > \epsilon \mid X\right)
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}E\left(\left|e'M(h)'M(h)e - E\left(e'M(h)'M(h)e \mid X\right)\right|^{2m} \mid X\right)}{\left(nR(h)\right)^{2m}}\\
&\leq \sum_{h\in H}\frac{\epsilon^{-2m}\sigma^{-2m}K\bar M^{2m}\left(nR(h)\right)^m}{\left(nR(h)\right)^{2m}}\\
&= K\left(\frac{\bar M^2}{\sigma^2\epsilon^2}\right)^m\sum_{h\in H}\left(nR(h)\right)^{-m}\\
&\to 0
\end{aligned}
$$
We have shown
$$
\sup_h\left|\frac{L(h) - R(h)}{R(h)}\right| \to_p 0
$$
which is (5).
15.13 Mallows Model Selection
Li's Theorem 1 applies to a variety of linear estimators. Of particular interest is model selection (e.g., series estimation).

Let's verify Li's conditions, which were
$$
\lim_{n\to\infty}\sup_{h\in H}\lambda_{\max}\left(M(h)\right) < \infty
$$
$$
E\left(e_i^{4m} \mid X_i\right) \leq \kappa < \infty
$$
$$
\sum_{h\in H}\left(nR(h)\right)^{-m} \to 0 \tag{9}
$$
In linear estimation, $M(h)$ is a projection matrix, so $\lambda_{\max}\left(M(h)\right) = 1$ and the first condition is automatically satisfied. The key is equation (9).

Suppose that for sample size $n$ there are $N_n$ models. Let
$$
\xi_n = \inf_{h\in H}nR(h)
$$
and assume $\xi_n \to \infty$. A crude bound is
$$
\sum_{h\in H}\left(nR(h)\right)^{-m} \leq N_n\xi_n^{-m}
$$
If $N_n\xi_n^{-m} \to 0$ then (9) holds. Notice that by increasing $m$, we can allow for a larger $N_n$ (more models), at the cost of a stronger moment bound.

The condition $\xi_n \to \infty$ says that for all finite models $h$, there is non-zero approximation error, so that $R(h)$ is non-zero. In contrast, if there is a finite-dimensional model $h_0$ for which $b(h_0) = 0$, then $nR(h_0) = k_{h_0}\sigma^2$ does not diverge. In this case, Mallows (and AIC) are asymptotically sub-optimal.

We can improve this condition if we consider the case of selection among models of increasing size. Suppose that model $h$ has $k_h$ regressors, with $k_1 < k_2 < \cdots$, and for some $m \geq 2$,
$$
\sum_{h=1}^\infty k_h^{-m} < \infty
$$
This includes nested model selection, where $k_h = h$ and $m = 2$. Note that
$$
nR(h) = b(h)'b(h) + k_h\sigma^2 \geq k_h\sigma^2
$$
Now pick $B_n \to \infty$ so that $B_n\xi_n^{-m} \to 0$ (which is possible since $\xi_n \to \infty$). Then
$$
\sum_{h=1}^\infty\left(nR(h)\right)^{-m}
= \sum_{h=1}^{B_n}\left(nR(h)\right)^{-m} + \sum_{h=B_n+1}^\infty\left(nR(h)\right)^{-m}
\leq B_n\xi_n^{-m} + \sigma^{-2m}\sum_{h=B_n+1}^\infty k_h^{-m}
\to 0
$$
as required.
15.14 GMM Model Selection
This is an underdeveloped area. I list a few papers.

Andrews (1999, Econometrica) considers selecting moment conditions to be used for GMM estimation. Let $p$ be the number of parameters, $c$ represent a list of "selected" moment conditions, $|c|$ denote the cardinality (number) of these moments, and $J_n(c)$ the GMM criterion computed using these $c$ moments. Andrews proposes criteria of the form
$$
IC(c) = J_n(c) - r_n\left(|c| - p\right)
$$
where $|c| - p$ is the number of overidentifying restrictions and $r_n$ is a sequence. For an AIC-like criterion, he sets $r_n = 2$; for a BIC-like criterion, he sets $r_n = \log n$.

The model selection rule picks the moment conditions $c$ which minimize $IC(c)$. Assuming that a subset of the moments is incorrect, Andrews shows that the BIC-like rule asymptotically selects the correct subset.

Andrews and Lu (2001, JoE) extend the above analysis to the case of jointly picking the moments and the parameter vector (that is, imposing zero restrictions on the parameters). They show that the same criterion has similar properties: it can asymptotically select the "correct" moments and the "correct" zero restrictions.

Hong, Preston and Shum (ET, 2003) extend the analysis of the above papers to empirical likelihood. They show that this criterion has the same interpretation when $J_n(c)$ is replaced by the empirical likelihood.

These papers are an interesting first step, but they do not address the issue of GMM selection when the true model is potentially infinite-dimensional and/or misspecified. That is, the analysis is not analogous to that of Li (1987) for the regression model.

In order to properly understand GMM selection, I believe we need to understand the behavior of GMM under misspecification.

Hall and Inoue (2003, JoE) is one of the few contributions on GMM under misspecification. They did not investigate model selection.
Suppose that the model is
$$
\bar m(\theta) = Em_i(\theta) = 0
$$
where $m_i$ is $\ell \times 1$ and $\theta$ is $k \times 1$. Assume $\ell > k$ (overidentification). The model is misspecified if there is no $\theta$ such that this moment condition holds; that is, for all $\theta$,
$$
\bar m(\theta) \neq 0
$$
Suppose we apply GMM. What happens?

The first question is: what is the pseudo-true value? The GMM criterion is
$$
J_n(\theta) = n\,\bar m_n(\theta)'W_n\bar m_n(\theta)
$$
If $W_n \to_p W$, then
$$
n^{-1}J_n(\theta) \to_p \bar m(\theta)'W\bar m(\theta).
$$
Thus the GMM estimator $\hat\theta$ is consistent for the pseudo-true value
$$
\theta_0(W) = \mathrm{argmin}_\theta\, \bar m(\theta)'W\bar m(\theta).
$$
Interestingly, the pseudo-true value $\theta_0(W)$ is a function of $W$. This is a fundamental difference from the correctly specified case, where the weight matrix only affects efficiency. In the misspecified case, it affects what is being estimated.

This means that when we apply "iterated GMM", the pseudo-true value changes with each step of the iteration!
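The dependence of the pseudo-true value on $W$ can be seen in a toy numerical sketch (the moment functions and values are hypothetical, chosen only for illustration): with two linear population moments that no single $\theta$ can satisfy, different weight matrices yield different minimizers.

```python
import numpy as np

# Misspecified overidentified model: no theta sets both moments to zero
def mbar(theta):
    # Hypothetical population moments, linear in a scalar theta
    return np.array([1.0 - theta, 3.0 - 2.0 * theta])

def pseudo_true(W):
    # theta_0(W) = argmin over theta of mbar(theta)' W mbar(theta), by grid
    grid = np.linspace(0.0, 3.0, 30001)
    vals = [mbar(t) @ W @ mbar(t) for t in grid]
    return grid[int(np.argmin(vals))]

W1 = np.eye(2)                 # equal weights
W2 = np.diag([100.0, 1.0])     # heavily weight the first moment
print(pseudo_true(W1), pseudo_true(W2))  # two different pseudo-true values
```

For diagonal $W = \mathrm{diag}(w_1, w_2)$ the minimizer is $(w_1 + 6w_2)/(w_1 + 4w_2)$ in closed form, so $W_1$ gives $1.4$ and $W_2$ gives about $1.019$: the weight matrix changes what is being estimated.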
Hall and Inoue also derive the distribution of the GMM estimator. They find that the distribution depends not only on the randomness in the moment conditions, but also on the randomness in the weight matrix. Specifically, they assume that $n^{1/2}\left(W_n - W\right) \to_d$ Normal, and find that this affects the asymptotic distributions.

Furthermore, the distribution of test statistics is non-standard (a mixture of chi-squares), so inference on the pseudo-true values is troubling.

This subject deserves more study.
15.15 KLIC for Moment Condition Models Under Misspecification
Suppose that the true density is $f(y)$, and we have an over-identified moment condition model: for some function $m(y)$, the model is
$$
Em(y) = 0
$$
However, we want to allow for misspecification, namely that
$$
Em(y) \neq 0
$$
To explore misspecification, we have to ask: what is a desirable pseudo-true model?

Temporarily ignoring parameter estimation, we can ask: which density $g(y)$ satisfying this moment condition is closest to $f(y)$ in the sense of minimizing the KLIC? We can call this $g_0(y)$ the pseudo-true density.

The solution is nicely explained in Appendix A of Chen, Hong, and Shum (JoE, 2007). Recall
$$
\mathrm{KLIC}(f, g) = \int f(y)\log\left(\frac{f(y)}{g(y)}\right) dy
$$
The problem is
$$
\min_g \mathrm{KLIC}(f, g)
$$
subject to
$$
\int g(y)\,dy = 1, \qquad \int m(y)g(y)\,dy = 0
$$
The Lagrangian is
$$
\int f(y)\log\left(\frac{f(y)}{g(y)}\right) dy + \mu\left(\int g(y)\,dy - 1\right) + \lambda'\int m(y)g(y)\,dy
$$
The FOC with respect to $g(y)$ at some $y$ is
$$
0 = -\frac{f(y)}{g(y)} + \mu + \lambda'm(y)
$$
Multiplying by $g(y)$ and integrating,
$$
0 = -\int f(y)\,dy + \mu\int g(y)\,dy + \lambda'\int m(y)g(y)\,dy = -1 + \mu
$$
so $\mu = 1$. Solving for $g(y)$ we find
$$
g(y) = \frac{f(y)}{1 + \lambda'm(y)},
$$
a tilted version of the true density $f(y)$. Inserting this solution we find
$$
\mathrm{KLIC}(f, g) = \int f(y)\log\left(1 + \lambda'm(y)\right) dy
$$
By duality, the optimal Lagrange multiplier $\lambda_0$ maximizes this expression:
$$
\lambda_0 = \mathrm{argmax}_\lambda \int f(y)\log\left(1 + \lambda'm(y)\right) dy.
$$
The pseudo-true density is
$$
g_0(y) = \frac{f(y)}{1 + \lambda_0'm(y)},
$$
with associated minimized KLIC
$$
\mathrm{KLIC}(f, g_0) = \int f(y)\log\left(1 + \lambda_0'm(y)\right) dy = E\log\left(1 + \lambda_0'm(y)\right)
$$
This is the smallest possible $\mathrm{KLIC}(f, g)$ for moment condition models.

This solution looks like empirical likelihood. Indeed, EL minimizes the empirical KLIC, and this connection is widely used to motivate EL.

When the moment $m(y, \theta)$ depends on a parameter $\theta$, then the pseudo-true values $(\theta_0, \lambda_0)$ are the joint solution to the problem
$$
\min_\theta\max_\lambda E\log\left(1 + \lambda'm(y, \theta)\right)
$$
Theorem (Chen, Hong and Shum, JoE, 2007). If $|m(y, \theta)|$ is bounded, then the EL estimates $(\hat\theta, \hat\lambda)$ are $n^{-1/2}$-consistent for the pseudo-true values $(\theta_0, \lambda_0)$.

This gives a simple interpretation to the definition of the KLIC under misspecification.
15.16 Schennach's Impossibility Result
Schennach (Annals of Statistics, 2007) claims a fundamental flaw in the application of the KLIC to moment condition models. She shows that the assumption of bounded $|m(y, \theta)|$ is not merely a technical condition; it is binding.

[Notice: in the linear model, $m(y, \theta) = z\left(y - x'\theta\right)$ is unbounded if the data has unbounded support. Thus the assumption is highly relevant.]

The key problem is that for any $\lambda \neq 0$, if $m(y, \theta)$ is unbounded, so is $1 + \lambda'm(y, \theta)$. In particular, it can take on negative values. Thus $\log\left(1 + \lambda'm(y, \theta)\right)$ is ill-defined, and so there is no pseudo-true value of $\lambda$. (It must be non-zero, but it cannot be non-zero!) Without a non-zero $\lambda$, there is no way to define a pseudo-true $\theta_0$ which satisfies the moment condition.

Technically, Schennach shows that when there is no $\theta$ such that $Em(y, \theta) = 0$ and $m(y, \theta)$ is unbounded, then there is no $\theta_0$ such that
$$
\sqrt{n}\left(\hat\theta - \theta_0\right) = O_p(1).
$$
Her paper leaves open the question: for what is $\hat\theta$ consistent? Is there a pseudo-true value? One possibility is that the pseudo-true value $\theta_n$ needs to be indexed by sample size. (This idea is used in Hal White's work.)

Nevertheless, Schennach's theorem suggests that empirical likelihood is non-robust to misspecification.
15.17 Exponential Tilting
Instead of
$$
\mathrm{KLIC}(f, g) = \int f(y)\log\left(\frac{f(y)}{g(y)}\right) dy,
$$
consider the reverse distance
$$
\mathrm{KLIC}(g, f) = \int g(y)\log\left(\frac{g(y)}{f(y)}\right) dy.
$$
The pseudo-true $g$ which minimizes this criterion solves
$$
\min_g \int g(y)\log\left(\frac{g(y)}{f(y)}\right) dy
$$
subject to
$$
\int g(y)\,dy = 1, \qquad \int m(y)g(y)\,dy = 0
$$
The Lagrangian is
$$
\int g(y)\log\left(\frac{g(y)}{f(y)}\right) dy - \mu\left(\int g(y)\,dy - 1\right) - \lambda'\int m(y)g(y)\,dy
$$
with FOC
$$
0 = \log\left(\frac{g(y)}{f(y)}\right) + 1 - \mu - \lambda'm(y).
$$
Solving,
$$
g(y) = f(y)\exp\left(-1 + \mu\right)\exp\left(\lambda'm(y)\right).
$$
Imposing $\int g(y)\,dy = 1$ we find
$$
g(y) = \frac{f(y)\exp\left(\lambda'm(y)\right)}{\int f(y)\exp\left(\lambda'm(y)\right) dy}. \tag{10}
$$
Hence the name "exponential tilting" or ET.
Inserting this into $\mathrm{KLIC}(g, f)$ we find
$$
\begin{aligned}
\mathrm{KLIC}(g, f) &= \int g(y)\log\left(\frac{\exp\left(\lambda'm(y)\right)}{\int f(y)\exp\left(\lambda'm(y)\right) dy}\right) dy\\
&= \lambda'\int m(y)g(y)\,dy - \int g(y)\,dy\,\log\left(\int f(y)\exp\left(\lambda'm(y)\right) dy\right)\\
&= -\log\left(\int f(y)\exp\left(\lambda'm(y)\right) dy\right) \tag{11}\\
&= -\log E\exp\left(\lambda'm(y)\right) \tag{12}
\end{aligned}
$$
By duality, the optimal Lagrange multiplier $\lambda_0$ maximizes this expression, or equivalently
$$
\lambda_0 = \mathrm{argmin}_\lambda\, E\exp\left(\lambda'm(y)\right) \tag{13}
$$
The pseudo-true density $g_0(y)$ is (10) with this $\lambda_0$, with associated minimized KLIC (11). This is the smallest possible $\mathrm{KLIC}(g, f)$ for moment condition models.

Notice: the $g_0$ which minimizes $\mathrm{KLIC}(g, f)$ and the $g_0$ which minimizes $\mathrm{KLIC}(f, g)$ are different.

In contrast to the EL case, the ET problem (13) does not restrict $\lambda$, and there are no "trouble spots". Thus ET is more robust than EL. The pseudo-true $\lambda_0$ and $g_0$ are well defined under misspecification, unlike EL.

When the moment $m(y, \theta)$ depends on a parameter $\theta$, then the pseudo-true values $(\theta_0, \lambda_0)$ are the joint solution to the problem
$$
\max_\theta\min_\lambda E\exp\left(\lambda'm(y, \theta)\right).
$$
15.18 Exponential Tilting: Estimation
The ET or exponential tilting estimator solves the problem
$$
\min_{\theta, p_1, \ldots, p_n}\sum_{i=1}^n p_i\log p_i
$$
subject to
$$
\sum_{i=1}^n p_i = 1, \qquad \sum_{i=1}^n p_i\,m(y_i, \theta) = 0
$$
First, we concentrate out the probabilities. For any $\theta$, the Lagrangian is
$$
\sum_{i=1}^n p_i\log p_i - \mu\left(\sum_{i=1}^n p_i - 1\right) - \lambda'\sum_{i=1}^n p_i\,m(y_i, \theta)
$$
with FOC
$$
0 = \log p_i + 1 - \mu - \lambda'm(y_i, \theta).
$$
Solving for $p_i$ and imposing the summability constraint,
$$
p_i(\lambda) = \frac{\exp\left(\lambda'm(y_i, \theta)\right)}{\sum_{j=1}^n \exp\left(\lambda'm(y_j, \theta)\right)}
$$
When $\lambda = 0$ then $p_i = n^{-1}$, the same as EL. The concentrated "entropy" criterion is then
$$
\sum_{i=1}^n p_i(\lambda)\log p_i(\lambda)
= \sum_{i=1}^n p_i(\lambda)\left[\lambda'm(y_i, \theta) - \log\left(\sum_{j=1}^n \exp\left(\lambda'm(y_j, \theta)\right)\right)\right]
= -\log\left(\sum_{j=1}^n \exp\left(\lambda'm(y_j, \theta)\right)\right)
$$
the last equality holding at the dual solution, where $\sum_i p_i(\lambda)m(y_i, \theta) = 0$. By duality, the Lagrange multiplier maximizes this criterion, or equivalently
$$
\lambda(\theta) = \mathrm{argmin}_\lambda\sum_{i=1}^n \exp\left(\lambda'm(y_i, \theta)\right)
$$
The ET estimator $\hat\theta$ minimizes this concentrated entropy over $\theta$, or equivalently
$$
\hat\theta = \mathrm{argmax}_\theta\sum_{i=1}^n \exp\left(\lambda(\theta)'m(y_i, \theta)\right)
$$
The ET probabilities are $\hat p_i = p_i(\hat\theta)$.
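The two-layer ET problem can be sketched numerically (an illustrative toy, not from the notes: a scalar $\theta$ with two moments, misspecified because the data variance is not 1, an inner safeguarded Newton solver for the convex dual, and an outer grid search; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.standard_normal(500) * 1.5 + 2.0   # variance 2.25, not 1: misspecified

def moments(theta):
    # Overidentified: mean theta AND unit variance are both asserted
    return np.column_stack([y - theta, (y - theta) ** 2 - 1.0])

def lam_of_theta(theta, iters=200):
    # Inner problem: lambda(theta) = argmin sum_i exp(lambda' m_i)
    m = moments(theta)
    lam = np.zeros(2)
    f = np.sum(np.exp(m @ lam))
    for _ in range(iters):
        w = np.exp(m @ lam)
        grad = m.T @ w
        hess = (m * w[:, None]).T @ m
        step = np.linalg.solve(hess, grad)
        t = 1.0
        while True:   # backtrack until the convex dual objective decreases
            lam_new = lam - t * step
            with np.errstate(over='ignore'):
                f_new = np.sum(np.exp(m @ lam_new))
            if np.isfinite(f_new) and f_new < f:
                break
            t *= 0.5
            if t < 1e-12:
                return lam          # no further decrease: at the minimum
        lam, f = lam_new, f_new
    return lam

def et_criterion(theta):
    lam = lam_of_theta(theta)
    return np.sum(np.exp(moments(theta) @ lam))  # ET maximizes this in theta

grid = np.linspace(1.0, 3.0, 81)
theta_hat = grid[int(np.argmax([et_criterion(t) for t in grid]))]
print(theta_hat)   # pseudo-true value trades off the two misspecified moments
```

The inner objective is convex in $\lambda$, so the safeguarded Newton iteration is reliable; the outer grid stands in for a proper optimizer and locates the estimator near the sample mean, where the variance moment's misspecification is least severe.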
15.19 Schennach's Estimator
Schennach (2007) observed that while the ET probabilities have desirable properties, the EL estimator for $\theta$ has better bias properties. She suggested a hybrid estimator which achieves the best of both worlds, called exponentially tilted empirical likelihood (ETEL). This is
$$
\hat\theta = \mathrm{argmax}_\theta\, ETEL(\theta)
$$
$$
ETEL(\theta) = \sum_{i=1}^n \log\left(p_i(\theta)\right)
= \lambda(\theta)'\sum_{i=1}^n m(y_i, \theta) - n\log\left(\sum_{i=1}^n \exp\left(\lambda(\theta)'m(y_i, \theta)\right)\right)
$$
$$
p_i(\theta) = \frac{\exp\left(\lambda(\theta)'m(y_i, \theta)\right)}{\sum_{j=1}^n \exp\left(\lambda(\theta)'m(y_j, \theta)\right)}
$$
$$
\lambda(\theta) = \mathrm{argmin}_\lambda\sum_{i=1}^n \exp\left(\lambda'm(y_i, \theta)\right)
$$
She claims the following advantages for the ETEL estimator $\hat\theta$:

- Under correct specification, $\hat\theta$ is asymptotically second-order equivalent to EL.
- Under misspecification, the pseudo-true values $\theta_0, \lambda_0$ are generically well defined, and minimize a KLIC analog.
- $\sqrt{n}\left(\hat\theta - \theta_0\right) \to_d N(0, V)$, where $V$ has a sandwich form built from $\Delta = E\frac{\partial}{\partial\theta'}m(y, \theta_0)$ and $\Omega = E\left[m(y, \theta_0)m(y, \theta_0)'\right]$.