hard copy ISSN 1974-3041
on-line ISSN 1974-305X
La Matematica e le sue Applicazioni n. 17, 2008
Statistical models for tourism flows estimation
R. Fontana, G. Pistone
Quaderni del
Dipartimento di Matematica
Politecnico di Torino
Corso Duca degli Abruzzi, 24 – 10129 Torino – Italia
Edizioni C.L.U.T. - Torino
Corso Duca degli Abruzzi, 24
10129 Torino
Tel. 011 564 79 80 - Fax. 011 54 21 92
La Matematica e le sue Applicazioni
Editor: Claudio Canuto
Editorial board: N. Bellomo, C. Canuto, G. Casnati, M. Gasparini, R. Monaco,
G. Monegato, L. Pandolfi, G. Pistone, S. Salamon, E. Serra, A. Tabacco
Copy not for sale
Accepted in December 2008
STATISTICAL MODELS FOR TOURISM
FLOWS ESTIMATION
Roberto Fontana and Giovanni Pistone¹
Abstract Tourism is a complex and highly competitive sector. In this scenario incoming tourism
flows represent one of the key indicators for public institutions that wish to adopt an informed
decision-making process for resource allocation. Accurate and timely knowledge of both the
interregional and the foreign components of such flows, at a sufficiently detailed geographical
level, allows better strategic, tactical and operational planning of marketing activities.
The paper describes a methodology to complete the database of the official statistical data on
tourism flows with an estimate of the missing data, which originate from non-respondent
accommodation structures. The methodology has been applied to the inbound tourism flows into
Piedmont.
Keywords Generalized Linear Models, Tourism statistics, Bednights
TOURISM FLOWS IN PIEDMONT
INTRODUCTION
The “statistical models for tourism flows prediction” project, funded under the
Interreg III A ALCOTRA program, considered two different case studies, the first
proposed by the Assessorato al Turismo della Regione Piemonte and the second
by the Comité Régional du Tourisme Riviera Côte d’Azur.
The paper describes the objective of the research project, the available data, the
methodology that has been developed and some of the results that have been
achieved concerning inbound tourism flows into Piedmont. The first part, mainly
related to the process of data collection, reproduces part of the presentation given
at the S.Co.07 conference, Venice, 06 September 2007 (Fontana et al 2007). It
has been included here to give the reader a self-contained presentation.
¹ Dipartimento di Matematica, Politecnico di Torino, Corso Duca degli Abruzzi, 24,
10129 Torino
CONTEXT
The aim of the work is to use Piedmont's regional database of tourist arrivals
and bednights to measure the performance of commercial accommodations in
Piedmont. The database is the official statistical data source concerning
the use of commercial accommodation (hotels, campsites, bed & breakfasts, ...) in
Piedmont and is part of the national official data bank maintained by the National
Institute of Statistics (Istat).
THE AVAILABLE DATA
The data that have been analysed refer to the years between 2000 and 2008. For
every accommodation structure the following information is available:
• name, type of accommodation, number of beds/rooms, address,
opening/closing period in the year, etc;
• tourism flows (bednights and arrivals) classified according to the country
of origin of the tourists, on a monthly basis.
The databases concerning the years between 2000 and 2007 are final, in the sense
that they have been officially consolidated by the Regional Authority. The 2008
database is provisional and contains tourism flows up to April. The number of
observations is almost 350,000 and the number of variables that have been
considered is around 30.
Figure 1. A view of a part of the database
The major concerns about these data include:
• the presence of missing values for tourism flows due to non-response (a
non-negligible percentage of accommodation structures do not provide
their data);
• the delay with which data become available (mainly due to the rather
complex data collection process).
Indeed the collection process, which is carried out under the supervision of the
Assessorato al Turismo della Regione Piemonte, involves all the provincial
statistics offices (there are 8 in Piedmont) and all the accommodation structures
(there were 4,719 structures at the end of 2007). This census, although it is
organised on a monthly basis, takes 15 months to be fully completed: it starts in
January and finishes at the end of March of the following year, when the provincial
statistics offices certify and finalise the data that they have collected. It follows
that, for example, the data concerning the tourism flows of April 2006 are not
officially known before March 2007.
Indeed, the analysis of the database variable that records the date on which each
record was inserted gives an estimate of around 6 months for the time needed to
finalise a given reference month.
This delay is often considered too long by the operators working in this sector.
The other issue concerns missing data, due to the fact that some accommodation
structures do not transmit their data. Their presence implies that the global
values underestimate the size of this component of the tourism industry. Besides
that, the difference between the response rates of different years, even if it is not
very large, makes it difficult to interpret the data, especially for small geographical
areas.
The main objective of the work is to use the data, which become available on a day
by day basis, to monitor the performance of this part of the tourism sector in order
to improve mainly the short-term decision process.
The statistical modeling
The availability of the individual data has made it possible to develop the following basic
idea. Let us denote by $\mathcal{S}$ the set of all the accommodation structures that are
open in a certain month (even for a single day). For every accommodation
structure in $\mathcal{S}$:
• the bednights (B) can be considered as the number of “products” that have
been sold;
• the ratio between the bednights (B) and the total capacity of an
accommodation structure (C), defined as the product of the number
of days in which the structure is open and the number of beds, can be
viewed as the rate of success of the structure (R, the net occupancy rate):
$$R = \frac{B}{C}$$
For a given structure, R can be computed as soon as the structure transmits its
own data (in particular the bednights B that it has registered in that month).
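As a small illustration of the definitions above, the following Python sketch computes the
capacity C and the net occupancy rate R for one structure and one month. The function names
and the figures (30 beds, 31 open days, 620 bednights) are hypothetical and are not taken
from the Piedmont database.

```python
def capacity(open_days: int, beds: int) -> int:
    """Total capacity C: open days in the month times number of beds."""
    return open_days * beds


def net_occupancy_rate(bednights: int, open_days: int, beds: int) -> float:
    """Net occupancy rate R = B / C for one structure in one month."""
    c = capacity(open_days, beds)
    if c <= 0:
        raise ValueError("the structure must be open and have at least one bed")
    return bednights / c


# Hypothetical structure: 30 beds, open 31 days, 620 bednights registered.
example_rate = net_occupancy_rate(bednights=620, open_days=31, beds=30)
print(f"C = {capacity(31, 30)}, R = {example_rate:.3f}")
```

A structure that has not yet transmitted its bednights still has a known capacity, which is
what makes the later imputation of B from an estimated rate possible.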
R has been modelled (McCullagh and Nelder 1989) in the framework of linear models
(LMs) and generalized linear models (GLMs), as a function of time (T), of the
geographical area (G) and of the type of the accommodation structure (A):
$$\hat{R} = f(T, G, A)$$
In particular binomial GLMs have been used to predict missing data, that is the
net occupancy rate of the accommodation structures that have not yet provided
their data. The number of bednights for accommodation structures that have
missing flows has been estimated using
$$\hat{B} = \hat{R} \cdot C$$
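Finally the total number of bednights for a certain area and a certain month ($B_T$)
is estimated by adding the bednights of the accommodation structures that have
provided their data ($\mathcal{S}_{\mathrm{obs}} \subseteq \mathcal{S}$, where $\mathcal{S}$ is the set of all the open structures) to the
estimated bednights of the accommodation structures that have not provided
their data ($\mathcal{S}_{\mathrm{mis}} \subseteq \mathcal{S}$, $\mathcal{S} = \mathcal{S}_{\mathrm{obs}} \cup \mathcal{S}_{\mathrm{mis}}$):
$$B_T = \sum_{i \in \mathcal{S}_{\mathrm{obs}}} B_i + \sum_{i \in \mathcal{S}_{\mathrm{mis}}} \hat{B}_i$$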
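The imputation step can be sketched in Python as follows. This is only an illustration under
stated assumptions: the data are invented, a single indicator (hotel versus bed & breakfast)
stands in for the explanatory variables T, G and A, and the binomial logit model is fitted
with a plain Newton-Raphson iteration written for this example, not with the software used
in the project.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def inv_logit(eta):
    """Inverse of the logit link: maps the linear predictor to a rate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_binomial_logit(X, B, C, iterations=25):
    """Maximum likelihood fit of logit(R) = x . beta, with B bednights out of capacity C."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iterations):
        score = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for x, b, c in zip(X, B, C):
            r = inv_logit(sum(xk * bk for xk, bk in zip(x, beta)))
            for k in range(p):
                score[k] += x[k] * (b - c * r)
                for m in range(p):
                    info[k][m] += c * r * (1.0 - r) * x[k] * x[m]
        step = solve(info, score)
        beta = [bk + sk for bk, sk in zip(beta, step)]
    return beta


# Respondent structures (invented): design row [intercept, is_hotel], bednights, capacity.
X_obs = [[1, 1], [1, 1], [1, 0], [1, 0]]
B_obs = [600, 300, 40, 50]
C_obs = [900, 600, 180, 120]

# Non-respondent structures (invented): only the design row and the capacity are known.
X_mis = [[1, 1], [1, 0]]
C_mis = [1200, 150]

beta_hat = fit_binomial_logit(X_obs, B_obs, C_obs)
R_hat = [inv_logit(sum(xk * bk for xk, bk in zip(x, beta_hat))) for x in X_mis]
B_hat = [r * c for r, c in zip(R_hat, C_mis)]   # estimated bednights, B_hat = R_hat * C
B_T = sum(B_obs) + sum(B_hat)                   # observed plus estimated bednights

print("estimated occupancy rates:", [round(r, 3) for r in R_hat])
print("estimated total bednights:", round(B_T, 1))
```

Because the toy model has one parameter per group, the fitted rates coincide with the pooled
occupancy of the respondents in each group; a model with area, type and time terms would
instead share information across groups.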
In more detail the following explanatory variables have been used:
• YEAR and MONTH to represent the time (quantitative variables);
• the geographical AREA, a qualitative variable whose values are the Local
Tourist Agencies (ATL) of Piedmont (every accommodation structure has been
assigned, according to its address, to the ATL to which it belongs).
The eleven ATLs of Piedmont provide a first division of the
regional territory into homogeneous areas, but more accurate segmentations
can be considered;
• TYPE and BEDS: the accommodation structures have been classified
according to the TYPE (“1 or 2 star hotel”, “3-star hotel”, “4 or 5 star
hotel”, “bed & breakfast”, etc) and the number of beds (BEDS).
The following link functions have been tested:
• logit
• probit
• complementary log-log
The presence of main effects and interactions between the explanatory variables
has been evaluated using the Analysis of Deviance. The best results in terms of
residuals have been obtained using the logit function. The most significant
predictors are the geographical area and the type of accommodation structure. The
interaction between MONTH and ATL was found to be slightly significant.
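For reference, the three link functions that were tested can be written down directly. The
Python sketch below uses the standard textbook definitions and only the standard library; the
dictionary name and the sample rates are illustrative choices, not part of the project code.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and standard deviation 1

# Each entry holds the link (rate R in (0, 1) -> linear predictor eta) and its inverse.
LINKS = {
    "logit": (
        lambda r: math.log(r / (1.0 - r)),
        lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    ),
    "probit": (
        _STD_NORMAL.inv_cdf,
        _STD_NORMAL.cdf,
    ),
    "complementary log-log": (
        lambda r: math.log(-math.log(1.0 - r)),
        lambda eta: 1.0 - math.exp(-math.exp(eta)),
    ),
}

# Print the linear predictor for a low, a central and a high occupancy rate.
for name, (link, inverse) in LINKS.items():
    etas = [link(r) for r in (0.1, 0.5, 0.9)]
    print(f"{name:>22}: " + ", ".join(f"{eta:+.3f}" for eta in etas))
```

The logit and probit links are symmetric around R = 0.5, while the complementary log-log link
is not, which is one reason why the three can fit occupancy rates differently.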
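The variance of $B_T$
As explained in the previous section, we obtained $B_T$ by adding to the known bednights
values of the respondent structures, $\sum_{i \in \mathcal{S}_{\mathrm{obs}}} B_i$, an estimate of the bednights values corresponding to the
accommodation structures that have not sent their data, $\sum_{i \in \mathcal{S}_{\mathrm{mis}}} \hat{B}_i$. It follows that the
variance of $B_T$ is the variance of $\sum_{i \in \mathcal{S}_{\mathrm{mis}}} \hat{B}_i$.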
Let $\eta$ be the value of the logit link function corresponding to $R = B/C$:
$$\eta = \log\left(\frac{R}{1 - R}\right)$$
It follows that
$$R = \frac{\exp(\eta)}{1 + \exp(\eta)}$$
Using a standard notation for GLMs we write $\eta = X \cdot \beta + \varepsilon$, where
• $X$ is the design matrix whose rows correspond to the accommodation
structures that provided their data ($\mathcal{S}_{\mathrm{obs}} \subseteq \mathcal{S}$) and whose columns
correspond to the explanatory variables T, G, A and their interactions;
• $\varepsilon$ represents the error term, $\varepsilon \sim \mathrm{IIND}(0, \sigma^2)$.
The estimate of the bednights for an accommodation structure that has not
communicated its data ($\hat{B}$) has been obtained as
$$\hat{B} = C \cdot \hat{R} = C \cdot \frac{\exp(\hat{\eta})}{1 + \exp(\hat{\eta})}$$
where C is the capacity of the structure.
Let us now introduce subscripts to distinguish between different accommodation
structures. In particular we denote by $\hat{\eta}_i = x_i \cdot \hat{\beta}$ the estimate of $\eta$ for the
structure $i$ of $\mathcal{S}_{\mathrm{mis}}$, the set of non-respondent structures.
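We have
$$\mathrm{var}[B_T] = \mathrm{var}\Big[\sum_{i \in \mathcal{S}_{\mathrm{mis}}} \hat{B}_i\Big] = \sum_{i \in \mathcal{S}_{\mathrm{mis}}} \sum_{j \in \mathcal{S}_{\mathrm{mis}}} \mathrm{cov}[\hat{B}_i, \hat{B}_j] = \sum_{i \in \mathcal{S}_{\mathrm{mis}}} \sum_{j \in \mathcal{S}_{\mathrm{mis}}} C_i C_j \, \mathrm{cov}[\hat{R}_i, \hat{R}_j]$$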
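We use first-order Taylor polynomials to approximate the $\hat{R}_i$ in the
neighbourhood of $E[\hat{R}_i]$. The first-order approximation of $\hat{R}$ (we omit the suffix $i$
to keep the notation simple) in a neighbourhood of $\hat{\eta}_0$ is
$$\hat{R}(\hat{\eta}) \approx \hat{R}(\hat{\eta}_0) + \left.\frac{\partial \hat{R}}{\partial \hat{\eta}}\right|_{\hat{\eta} = \hat{\eta}_0} (\hat{\eta} - \hat{\eta}_0) = \hat{R}(\hat{\eta}_0) + \hat{R}(\hat{\eta}_0) \cdot (1 - \hat{R}(\hat{\eta}_0)) \cdot (\hat{\eta} - \hat{\eta}_0)$$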
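If we choose the expected value of $\hat{\eta}$ as $\hat{\eta}_0$ we get
$$E[\hat{R}(\hat{\eta})] \approx \hat{R}(E[\hat{\eta}])$$
and therefore, restoring the subscripts,
$$\hat{R}_i(\hat{\eta}_i) - E[\hat{R}_i] \approx \hat{R}_i(\hat{\eta}_i) \cdot (1 - \hat{R}_i(\hat{\eta}_i)) \cdot (\hat{\eta}_i - E[\hat{\eta}_i])$$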
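Using these approximations, and the fact that $\mathrm{cov}[\hat{\eta}_i, \hat{\eta}_j] = x_i \Gamma x_j^T$, we get
$$\mathrm{var}[B_T] = \mathrm{var}\Big[\sum_{i \in \mathcal{S}_{\mathrm{mis}}} \hat{B}_i\Big] \approx \sum_{i \in \mathcal{S}_{\mathrm{mis}}} \sum_{j \in \mathcal{S}_{\mathrm{mis}}} C_i C_j \, \hat{R}_i (1 - \hat{R}_i) \, \hat{R}_j (1 - \hat{R}_j) \, x_i \Gamma x_j^T$$
where $\Gamma$ is the variance-covariance matrix of $\hat{\beta}$ and $x_j^T$ is the transposed vector
of $x_j$. Confidence intervals are derived using such approximations.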
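The approximate variance can be evaluated numerically as in the following Python sketch. All
the inputs are invented, including the covariance matrix of the coefficients; the 95% interval
uses a normal approximation, which is an assumption of this example since the text does not
say how the confidence intervals are built.

```python
import math


def gradient_rows(C, R_hat, X):
    """Rows g_i = C_i * R_i * (1 - R_i) * x_i, one per non-respondent structure."""
    return [[c * r * (1.0 - r) * xk for xk in x] for c, r, x in zip(C, R_hat, X)]


def quadratic_form(u, Gamma, v):
    """Compute u * Gamma * v^T for two row vectors u and v."""
    return sum(u[k] * Gamma[k][m] * v[m] for k in range(len(u)) for m in range(len(v)))


def variance_double_sum(C, R_hat, X, Gamma):
    """Approximate var[B_T] as the double sum over all pairs (i, j)."""
    G = gradient_rows(C, R_hat, X)
    return sum(quadratic_form(gi, Gamma, gj) for gi in G for gj in G)


def variance_collapsed(C, R_hat, X, Gamma):
    """Same quantity, computed from the column sums g = sum_i g_i as g * Gamma * g^T."""
    G = gradient_rows(C, R_hat, X)
    g = [sum(column) for column in zip(*G)]
    return quadratic_form(g, Gamma, g)


# Invented inputs: two non-respondent structures, design rows [intercept, is_hotel].
C_mis = [1200, 150]
R_hat = [0.6, 0.3]
X_mis = [[1, 1], [1, 0]]
Gamma = [[0.0004, -0.0001], [-0.0001, 0.0009]]   # hypothetical covariance of beta_hat

var_BT = variance_double_sum(C_mis, R_hat, X_mis, Gamma)
std_BT = math.sqrt(var_BT)
B_hat_total = sum(c * r for c, r in zip(C_mis, R_hat))
# Normal-approximation 95% interval for the estimated part of B_T (an assumption).
interval = (B_hat_total - 1.96 * std_BT, B_hat_total + 1.96 * std_BT)
print(f"var = {var_BT:.4f}, interval = ({interval[0]:.1f}, {interval[1]:.1f})")
```

Since the double sum is bilinear in the rows g_i, it equals the quadratic form built from their
column sums, which is what `variance_collapsed` computes. This is a property of the formula and
not a description of how the project implemented the computation.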
Software
During the project the statistical analysis was carried out using R (http://www.R-project.org)
for early prototypes and SAS (http://www.sas.com) for the production
versions.
In particular the core of the module is Proc GENMOD (SAS, 2004), which is part
of SAS/STAT® 9.1.3, while the computation of the variance has been
implemented in the Proc IML environment. It should be noted that the
computation of the variance becomes a rather high-dimensional problem if we consider all
the available months (between January 2000 and April 2008 there are 100
months) all over the region. If we still use $\mathcal{S}_{\mathrm{mis}}$ to denote all the missing values
(non-respondent structures for all the considered months), $\mathcal{S}_{\mathrm{mis}}$ has more
than 15,000 observations. To limit both the computation time and the use of
memory it has been necessary to split the computation of the variance into steps in
the following way. The set $\mathcal{S}_{\mathrm{mis}}$ of all the non-respondent structures has been
partitioned into the union of N subsets (N=50 in the current application):
$$\mathcal{S}_{\mathrm{mis}} = \mathcal{S}_1 + \dots + \mathcal{S}_N$$
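and the following matrices $X_i$, $i = 1, \dots, N$, have been defined
$$X_i = \left[\, C_j \hat{R}_j (1 - \hat{R}_j) \, x_j, \; j \in \mathcal{S}_i \,\right]$$
It follows that
$$\mathrm{var}[B_T] \approx \sum_{i \in \mathcal{S}_{\mathrm{mis}}} \sum_{j \in \mathcal{S}_{\mathrm{mis}}} C_i C_j \, \hat{R}_i (1 - \hat{R}_i) \, \hat{R}_j (1 - \hat{R}_j) \, x_i \Gamma x_j^T = \sum_{r=1}^{N} \sum_{c=1}^{N} \mathrm{sum}(X_r \Gamma X_c^T)$$
where sum(A) is the sum of all the elements of a matrix A.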
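The block decomposition can be checked on a toy case with the Python sketch below. It assumes
that each matrix X_i stacks one row per structure of its subset, which is what the product
with the covariance matrix of the coefficients requires dimensionally; the data, the
covariance matrix and the choice of two blocks are invented for the example, and the code is
not the Proc IML implementation.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def total(A):
    """sum(A): the sum of all the elements of a matrix A."""
    return sum(sum(row) for row in A)


def scaled_rows(C, R_hat, X):
    """Rows C_j * R_j * (1 - R_j) * x_j, one per non-respondent structure."""
    return [[c * r * (1.0 - r) * xk for xk in x] for c, r, x in zip(C, R_hat, X)]


def blockwise_variance(blocks, Gamma):
    """Sum of sum(X_r * Gamma * X_c^T) over all pairs of blocks (r, c)."""
    acc = 0.0
    for X_r in blocks:
        left = matmul(X_r, Gamma)          # computed once per row block
        for X_c in blocks:
            acc += total(matmul(left, transpose(X_c)))
    return acc


# Invented inputs: five non-respondent structures, design rows [intercept, is_hotel].
C_mis = [1200, 150, 300, 900, 60]
R_hat = [0.6, 0.3, 0.3, 0.6, 0.3]
X_mis = [[1, 1], [1, 0], [1, 0], [1, 1], [1, 0]]
Gamma = [[0.0004, -0.0001], [-0.0001, 0.0009]]   # hypothetical covariance of beta_hat

rows = scaled_rows(C_mis, R_hat, X_mis)
N = 2                                             # number of subsets (50 in the paper)
blocks = [rows[k::N] for k in range(N)]           # any partition of the rows works

var_blockwise = blockwise_variance(blocks, Gamma)
var_direct = blockwise_variance([rows], Gamma)    # a single block: the full double sum
print(f"blockwise = {var_blockwise:.4f}, direct = {var_direct:.4f}")
```

Only one pair of blocks is held in memory at a time, so the largest intermediate matrix has
the size of one block by another rather than of the whole set of non-respondents.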
Discussion of the statistical methodology
Before proceeding, we discuss some points concerning the approach that has
been studied and developed.
The model that has been adopted does not take time dependence into account, in the sense
that it does not use, as regressors, response values that refer to
previous months. The idea was to develop a model able to react quickly
to events (like Olympic Games or exceptional weather conditions) without
being influenced by the recent past. As our main aim is the quick imputation of
missing values, we do not perform a time series analysis, and exceptional events
are detected, not modelled.
Another key point concerns the use of a self-determined sample (also known as a
natural sample), i.e. the data that are available at a certain moment, to predict late
(or even permanently missing) responses. It is known (Gismondi, 2007) that this could
lead to biased estimates. However, in the proposed approach, the theory
concerning general linear models identification can be used. More precisely the
observations that have a large influence on the regression can be identified and
• if the value of the response is anomalous, the observation can be excluded
until it has been checked (contacting the accommodation
structure if necessary);
• if it is the position in the regressor space that makes the observation
influential, it is again possible to verify it before using it for the model
identification.
Besides that, we have suggested augmenting the existing panel with some
carefully selected points to improve model accuracy. Indeed it is mandatory for a
structure to communicate its data, and so a structure cannot
ignore a formal request from the Regional Authority.
Finally we point out that not all the a priori relevant variables are
considered, because they are not in the database. Among these variables we
mention those related to the commercial policy of an accommodation structure.
Indeed the model considers as equivalent two structures that are geographically close
and of the same type and size (e.g. 3-star hotels with
30 rooms). In the real world, however, these two structures can perform differently
because of their different commercial strategies (e.g. special agreements with
companies or special offers able to attract a large number of tourists).
Some results
The first part of the analysis has focused on the years 2000 – 2007. The results
that have been obtained are quite similar to those presented at the S.Co.07
conference and so we invite the interested reader to consult that paper (Fontana et
al, 2007). For these years the values estimated using the models described so far
represent a “correction” of the observed data. The main advantages are therefore
to facilitate the comparison between different years and to provide an estimate of
the real size of the phenomenon, useful, for example, for properly designing services
like water supply or waste collection.
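We observe that, for the Piedmont region as a whole, the total value of
bednights in one year, $B_T = \sum_{i \in \mathcal{S}_{\mathrm{obs}}} B_i + \sum_{i \in \mathcal{S}_{\mathrm{mis}}} \hat{B}_i$, is approximately 30% higher than
the value of the officially registered bednights, $\sum_{i \in \mathcal{S}_{\mathrm{obs}}} B_i$.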
As we said above, the Assessorato al Turismo della Regione Piemonte has made
available the data up to March 2008 (the data referring to April 2008 covered only a
few units and have not been considered in this work). This gave us the possibility
to test the system in a real application. Indeed the main goal of the methodology
was to provide some indications by exploiting the data that become available over the
year.
For the first quarter of the year we know that mountains (with the full range of
winter sports) and Turin, together with its surroundings, represent the most
important areas. Therefore we focused on the Piedmont region as a whole and
then we considered Turin and the mountains, focusing on the Olympic resort in
particular.
A first analysis identified some influential observations (e.g. observations with net
occupancy rate close to 1 or to 0). We excluded these observations from the
sample (they will be checked by the provincial authorities).
Then, using the methodology described in the previous sections, we built the
estimated total bednights for the first quarter all over the Piedmont region.
Figure 2 summarizes the results.
[Plot: predicted bednights, Regione Piemonte (anomalous values <> .05 excluded). Vertical
axis: predicted bednights, 2,000,000 to 4,000,000; horizontal axis: year, 2000 to 2008;
annotated values: 2.7 and 12.7.]
Figure 2 Piedmont Region – First Quarters Estimated Bednights 2000 - 2008
An increasing trend and the impact of the Olympic Games, which were held in February
2006, can be observed. The error predicted using the approximated variance is
about 1% for this and for the other analyses that we present in this section.
The next two figures (Figure 3 and Figure 4) represent the results for Turin and
the Olympic mountains.
[Plot: predicted bednights, area = city (anomalous values <> .05 excluded). Vertical
axis: predicted bednights, 500,000 to 1,200,000; horizontal axis: year, 2000 to 2008;
annotated values: 0.3 and 7.1.]
Figure 3 Turin and metropolitan area – First Quarters Estimated Bednights 2000 - 2008
[Plot: predicted bednights, area = Olympic mountains (anomalous values <> .05 excluded).
Vertical axis: predicted bednights, 500,000 to 1,200,000; horizontal axis: year, 2000 to
2008; annotated values: 6.4 and 26.7.]
Figure 4 Olympic Mountains – First Quarters Estimated Bednights 2000 - 2008
Both areas show positive trends. In particular the latter seems to have benefited
lastingly from the effect of the Olympic Games.
Discussion
It is well known that commercial accommodations represent only a part of
tourism. Second-home tourism around the world has exploded in recent decades,
as has the so-called visiting friends and relatives phenomenon. Besides that,
other information is extremely relevant for proper planning, like the reason for and
the duration of the trip, the impact of commercial promotion and so on.
Nevertheless statistical methodologies for empirical model building can lead to a
better use of the data concerning accommodation structures that are collected on a
monthly basis.
It should be pointed out that the suggested methodology eliminates the effect of
non-respondent structures. In this sense the data become more homogeneous over
time, and this opens the way to the use of time series analysis. Some preliminary
work to explore the possibility of building short-term forecasts has been done using
the SAS Forecast Server module.
The model has been implemented by CSI Piemonte and it is one of the modules of
the software used by the OTRP (Osservatorio Turistico della Regione Piemonte) in its
day-to-day activity. It was recently presented to other regional tourism bodies at
the RIST workshop, Sestri Levante, 20 May 2008. The plots shown in Figures 2, 3 and 4
are available to the users of the system.
Acknowledgments
The authors wish to thank all the working group of the “statistical models for tourism flows
prediction” project, funded under the Interreg III A ALCOTRA program, in particular Cristina
Bergonzo, Valentina Carbone and Livia Falomo (Osservatorio Turistico della Regione Piemonte),
Patrick Vece (Comité Régional du Tourisme Riviera Côte d’Azur), Serena Chiarle (CSI
Piemonte), Giulia Cernicchiaro, Mauro Gasparini, Gianfranco Genta and Daniela Ichim
(Politecnico di Torino).
The authors also thank Gaudenzio De Paoli, director of the Direzione Turismo Sport e Parchi della
Regione Piemonte, and Marzia Baracchino, responsible for the Coordinamento della Promozione
Turistica della Regione Piemonte, for having given the Politecnico research group direct
access to the Piedmont database (individual data).
The use of SAS has been facilitated by Simone Crucchi and Elena Fabbris (SAS Italy), mainly
as regards some experiments using SAS® Forecast Server.
Finally, special thanks go to Silvia Polettini (Università di Napoli), discussant at the S.Co.2007
Conference, and to the Scientific Committee of the SAS Business Analytics Gallery, Rome 2008,
co-chaired by Professor Giancarlo Diana (Università di Padova), for having awarded the poster
describing this application as the best in the Universities and Research Institutes category.
References
McCULLAGH, P., NELDER, J.A. (1989): Generalized Linear Models. Chapman and Hall, 2nd
edition.
FONTANA, R., PISTONE, G. (2007): Statistical models for tourism flows
prediction. Paper presented at S.Co.2007, Università Ca' Foscari, Venice, Italy, December 2007.
Available at http://venus.unive.it/sco2007/ocs/viewabstract.php?id=86
GISMONDI, R. (2007): Quick estimation of tourist nights spent in Italy. Statistical Methods &
Applications, 16, 141-168.
VENABLES, W.N., RIPLEY, B.D. (2003): Modern Applied Statistics with S-Plus. Springer, 4th
edition, Heidelberg.
SAS INSTITUTE INC. (2004): SAS/STAT® 9.1 User’s Guide. SAS Institute Inc., Cary, NC.