
hard copy ISSN 1974-3041

on-line ISSN 1974-305X

La Matematica e le sue Applicazioni n. 17, 2008

Statistical models for tourism flows estimation

R. Fontana, G. Pistone

Quaderni del

Dipartimento di Matematica

Politecnico di Torino

Corso Duca degli Abruzzi, 24 – 10129 Torino – Italia

Edizioni C.L.U.T. - Torino

Corso Duca degli Abruzzi, 24

10129 Torino

Tel. 011 564 79 80 - Fax. 011 54 21 92


Editor: Claudio Canuto

Editorial board: N. Bellomo, C. Canuto, G. Casnati, M. Gasparini, R. Monaco, G. Monegato, L. Pandolfi, G. Pistone, S. Salamon, E. Serra, A. Tabacco

Copy not for sale

Accepted in December 2008


STATISTICAL MODELS FOR TOURISM

FLOWS ESTIMATION

Roberto Fontana and Giovanni Pistone¹

Abstract Tourism is a complex and highly competitive sector. In this scenario, incoming tourism flows are one of the key indicators for public institutions that wish to base resource allocation on an informed decision-making process. Accurate and timely knowledge of both the interregional and the foreign components of such flows, at a sufficiently detailed geographical level, allows better strategic, tactical and operational planning of marketing activities.

The paper describes a methodology for completing the database of official statistical data on tourism flows with an estimate of the missing data, which originate from non-respondent accommodation structures. The methodology has been applied to the inbound tourism flows into Piedmont.

Keywords Generalized Linear Models, Tourism statistics, Bednights

TOURISM FLOWS IN PIEDMONT

INTRODUCTION

The “statistical models for tourism flows prediction” project, funded under the

Interreg III A ALCOTRA program, considered two different case studies, the first

proposed by the Assessorato al Turismo della Regione Piemonte and the second

by the Comité Régional du Tourisme Riviera Côte d’Azur.

The paper describes the objective of the research project, the available data, the

methodology that has been developed and some of the results that have been

achieved concerning inbound tourism flows into Piedmont. The first part, mainly

related to the process of data collection, reproduces part of the presentation given at the S.Co.07 conference, Venice, 06 September 2007 (Fontana et al 2007). It has been included here to give the reader a self-contained presentation.

¹ Dipartimento di Matematica, Politecnico di Torino, Corso Duca degli Abruzzi, 24, 10129 Torino. [email protected], [email protected]

CONTEXT

The aim of the work is to use Piedmont's regional database of tourist arrivals and bednights to measure the performance of commercial accommodation in Piedmont. The database is the official statistical data source on the use of commercial accommodation (hotels, campsites, bed & breakfasts, ...) in Piedmont and is part of the national official data bank maintained by the National Institute of Statistics (Istat).

THE AVAILABLE DATA

The data that have been analysed refer to the years between 2000 and 2008. For every accommodation structure the following information is available:

• name, type of accommodation, number of beds/rooms, address, opening/closing period in the year, etc.;

• tourism flows (bednights and arrivals), classified by the tourists' country of origin, on a monthly basis.

The databases for the years 2000 to 2007 are final, in the sense that they have been officially consolidated by the Regional Authority. The 2008 database is provisional and contains tourism flows up to April. The number of observations is almost 350,000 and the number of variables considered is around 30.


Figure 1. A view of a part of the database

The major concerns with these data include:

• the presence of missing values for tourism flows due to non-response (a non-negligible percentage of accommodation structures do not provide their data);

• the delay with which data become available (mainly due to the rather complex data collection process).

Indeed the collection process, which is carried out under the supervision of the Assessorato al Turismo della Regione Piemonte, involves all the provincial statistics offices (there are 8 in Piedmont) and all the accommodation structures (there were 4,719 at the end of 2007). This census, although organised on a monthly basis, takes 15 months to be fully completed: it starts in January and finishes at the end of March of the following year, when the provincial statistics offices certify and finalise the data they have collected. It follows that, for example, the data on the tourism flows of April 2006 will not be officially known before March 2007.

Indeed, an analysis based on the database variable that records the date on which each record was inserted into the database gives an estimate of around 6 months for the time needed to finalise a given reference month.

This delay is often considered too large by the operators working in this sector.


The other issue concerns missing data, which arise because some accommodation structures do not transmit their data. Their presence implies that the global values underestimate the size of this component of the tourism industry. Moreover, the difference between the response rates of different years, even if it is not very large, makes the data difficult to interpret, especially for small geographical areas.

The main objective of the work is to use the data, which become available on a day-by-day basis, to monitor the performance of this part of the tourism sector, mainly in order to improve the short-term decision process.

The statistical modeling

The availability of individual data made it possible to develop the following basic idea. Let us denote by $\mathcal{S}$ the set of all the accommodation structures that are open in a certain month (even for a single day). For every accommodation structure in $\mathcal{S}$:

• the bednights (B) can be considered as the number of “products” that have been sold;

• the ratio between the bednights (B) and the total capacity of the accommodation structure (C), defined as the product of the number of days in which the structure is open and the number of beds, can be viewed as the rate of success of the structure (R, the net occupancy rate):

$$R = \frac{B}{C}$$
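As a minimal sketch of the definition above, the net occupancy rate of a single structure can be computed as follows. The function name and the example figures are hypothetical and serve only as an illustration.

```python
def net_occupancy_rate(bednights, beds, days_open):
    """Return R = B / C, where C = days_open * beds is the monthly capacity."""
    capacity = days_open * beds
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return bednights / capacity


# Hypothetical structure: 30 beds, open on all 31 days, 465 bednights sold.
rate = net_occupancy_rate(bednights=465, beds=30, days_open=31)
print(rate)  # 0.5
```

A structure that is open for only part of the month has a smaller capacity, so the same number of bednights corresponds to a higher rate.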

For a given structure, R can be computed as soon as the structure transmits its data (in particular the bednights B registered in that month).

R has been modelled (McCullagh and Nelder, 1989) in the framework of linear models (LMs) and generalized linear models (GLMs), as a function of time (T), of the geographical area (G) and of the type of accommodation structure (A):

$$\hat{R} = f(T, G, A)$$


In particular, binomial GLMs have been used to predict the missing data, that is, the net occupancy rate of the accommodation structures that have not yet provided their data. The number of bednights for accommodation structures with missing flows has been estimated using

$$\hat{B} = \hat{R} \cdot C$$

Finally, the total number of bednights for a certain area and a certain month ($B_T$) is estimated by adding the bednights of the accommodation structures that have provided their data (the set $\mathcal{O}$, a subset of the set $\mathcal{S}$ of all open structures) to the estimated bednights of the accommodation structures that have not provided their data (the set $\mathcal{M} \subseteq \mathcal{S}$, with $\mathcal{S} = \mathcal{O} \cup \mathcal{M}$):

$$B_T = \sum_{i \in \mathcal{O}} B_i + \sum_{i \in \mathcal{M}} \hat{B}_i$$
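The combination of reported and imputed bednights can be sketched as follows. This is only an illustration under assumed inputs: the function name and all the figures are hypothetical, and the predicted rates are taken as given rather than fitted.

```python
def estimated_total_bednights(observed, missing):
    """Estimate B_T for one area and one month.

    observed: reported bednights B_i of the respondent structures.
    missing:  (capacity C_i, predicted rate R_hat_i) pairs for the
              non-respondent structures, so that B_hat_i = R_hat_i * C_i.
    """
    reported = sum(observed)
    imputed = sum(c * r for c, r in missing)
    return reported + imputed


# Hypothetical data: three respondents and two non-respondents.
observed = [465, 120, 800]
missing = [(930, 0.5), (600, 0.25)]
total = estimated_total_bednights(observed, missing)
print(total)  # 2000.0
```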

In more detail, the following explanatory variables have been used:

• YEAR and MONTH to represent time (quantitative variables);

• the geographical AREA, a qualitative variable whose values are the Local Tourist Agencies (ATL) of Piedmont (each accommodation structure has been assigned, on the basis of its address, to the ATL to which it belongs). The eleven ATLs of Piedmont provide a first division of the regional territory into homogeneous areas, but more accurate segmentations can be considered;

• TYPE and BEDS: the accommodation structures have been classified according to the TYPE (“1 or 2 star hotel”, “3-star hotel”, “4 or 5 star hotel”, “bed & breakfast”, etc.) and the number of beds (BEDS).

The following link functions have been tested:

• logit

• probit

• complementary log-log
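For reference, the three link functions map the linear predictor to a rate in (0, 1) through the inverse links sketched below. The function names are illustrative and are not taken from the software used in the project.

```python
import math


def inv_logit(eta):
    """Inverse logit: R = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_probit(eta):
    """Inverse probit: standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def inv_cloglog(eta):
    """Inverse complementary log-log: R = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


# Compare the three rates on a few values of the linear predictor.
for eta in (-2.0, 0.0, 2.0):
    print(eta, inv_logit(eta), inv_probit(eta), inv_cloglog(eta))
```

The logit and probit curves are symmetric around a rate of 0.5, while the complementary log-log curve is not, which is one reason the three links can fit occupancy data differently.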

The presence of main effects and interactions between the explanatory variables has been evaluated using the Analysis of Deviance. The best results in terms of residuals have been obtained using the logit function. The most significant predictors are the geographical area and the type of accommodation structure. The interaction between MONTH and ATL was found to be slightly significant.
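To make the binomial model with logit link concrete, the following is a minimal pure-Python sketch of a Fisher scoring fit, treating the bednights of each respondent as binomial counts out of its capacity. It is an illustration under stated assumptions, not the production implementation (which uses Proc GENMOD): the design matrix, with an intercept and a single hypothetical type indicator, and all the figures are invented for the example.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def fit_binomial_logit(xs, bednights, capacity, n_iter=25):
    """Fisher scoring for B_i ~ Binomial(C_i, R_i) with logit(R_i) = x_i . beta."""
    p = len(xs[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        score = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for x, b, c in zip(xs, bednights, capacity):
            mu = inv_logit(sum(xj * bj for xj, bj in zip(x, beta)))
            w = c * mu * (1.0 - mu)  # binomial variance weight
            for j in range(p):
                score[j] += x[j] * (b - c * mu)
                for k in range(p):
                    info[j][k] += w * x[j] * x[k]
        step = solve(info, score)
        beta = [bj + sj for bj, sj in zip(beta, step)]
    return beta


# Hypothetical respondents: columns are (intercept, indicator of one TYPE level).
xs = [[1, 0], [1, 0], [1, 1], [1, 1]]
bednights = [300, 200, 600, 900]
capacity = [1000, 1000, 1000, 2000]
beta = fit_binomial_logit(xs, bednights, capacity)

# Predicted rate and bednights for a hypothetical non-respondent of that type
# with a capacity of 1500.
r_hat = inv_logit(beta[0] + beta[1])
b_hat = 1500 * r_hat
print(beta, r_hat, b_hat)
```

With a single categorical regressor the fitted rates coincide with the pooled occupancy rates of each group, which gives a simple check of the fit.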

The variance of BT

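As explained in the previous section, we obtain $B_T$ by adding to the known bednights values $\sum_{i \in \mathcal{O}} B_i$ (where $\mathcal{O}$ is the set of respondent structures) an estimate of the bednights values corresponding to the accommodation structures that have not sent their data, $\sum_{i \in \mathcal{M}} \hat{B}_i$ (where $\mathcal{M}$ is the set of non-respondent structures). It follows that the variance of $B_T$ is the variance of $\sum_{i \in \mathcal{M}} \hat{B}_i$.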

Let $\eta$ be the value of the logit link function corresponding to $R = B/C$:

$$\eta = \log\left(\frac{R}{1-R}\right)$$

It follows that

$$R = \frac{\exp(\eta)}{1+\exp(\eta)}$$

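Using a standard notation for GLMs we write $\eta = X \cdot \beta + \varepsilon$, where

• $X$ is the design matrix whose rows correspond to the accommodation structures that provided their data ($\mathcal{O} \subseteq \mathcal{S}$, the respondents among all the open structures) and whose columns correspond to the explanatory variables T, G, A and their interactions;

• $\varepsilon$ represents the error term, $\varepsilon \sim \mathrm{IIND}(0, \sigma^2)$.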

The estimate of the bednights for an accommodation structure that has not communicated its data ($\hat{B}$) has been obtained as

$$\hat{B} = C \cdot \hat{R} = C \cdot \frac{\exp(\hat{\eta})}{1+\exp(\hat{\eta})}$$

where C is the capacity of the structure.


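Let us now introduce subscripts to distinguish between different accommodation structures. In particular we denote by $\hat{\eta}_i = x_i \cdot \hat{\beta}$ the estimate of $\eta$ for the structure $i$ of $\mathcal{M}$, the set of non-respondent structures.

We have

$$\mathrm{var}[B_T] = \mathrm{var}\Big[\sum_{i \in \mathcal{M}} \hat{B}_i\Big] = \sum_{i \in \mathcal{M}} \sum_{j \in \mathcal{M}} \mathrm{cov}[\hat{B}_i, \hat{B}_j] = \sum_{i \in \mathcal{M}} \sum_{j \in \mathcal{M}} C_i C_j \, \mathrm{cov}[\hat{R}_i, \hat{R}_j]$$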

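We use first-order Taylor polynomials to approximate the $\hat{R}_i$ in the neighbourhood of $\mathbb{E}[\hat{R}_i]$. The first-order approximation of $\hat{R}$ (we omit the suffix $i$ to keep the notation simple) in a neighbourhood of $\hat{\eta}_0$ is

$$\hat{R}(\hat{\eta}) \approx \hat{R}(\hat{\eta}_0) + \left.\frac{\partial \hat{R}}{\partial \hat{\eta}}\right|_{\hat{\eta} = \hat{\eta}_0} (\hat{\eta} - \hat{\eta}_0) = \hat{R}(\hat{\eta}_0) + \hat{R}(\hat{\eta}_0) \cdot (1 - \hat{R}(\hat{\eta}_0)) \cdot (\hat{\eta} - \hat{\eta}_0)$$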
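The last equality of the expansion relies on the derivative of the inverse logit. As a short check, assuming only the relation $\hat{R} = \exp(\hat{\eta}) / (1 + \exp(\hat{\eta}))$ introduced earlier, the quotient rule gives:

```latex
\frac{\partial \hat{R}}{\partial \hat{\eta}}
  = \frac{\exp(\hat{\eta})\,(1+\exp(\hat{\eta})) - \exp(\hat{\eta})^{2}}
         {(1+\exp(\hat{\eta}))^{2}}
  = \frac{\exp(\hat{\eta})}{1+\exp(\hat{\eta})} \cdot \frac{1}{1+\exp(\hat{\eta})}
  = \hat{R}\,(1-\hat{R})
```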

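If we choose the expected value of $\hat{\eta}$ as $\hat{\eta}_0$ we get

$$\mathbb{E}[\hat{R}(\hat{\eta})] \approx \hat{R}(\mathbb{E}[\hat{\eta}])$$

and therefore, restoring the subscripts,

$$\hat{R}_i(\hat{\eta}_i) - \mathbb{E}[\hat{R}_i] \approx \hat{R}_i(\hat{\eta}_i) \cdot (1 - \hat{R}_i(\hat{\eta}_i)) \cdot (\hat{\eta}_i - \mathbb{E}[\hat{\eta}_i])$$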
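The step from these linearised deviations to the covariance of two estimated rates can be written out as follows. This is a sketch of the intermediate reasoning, where $\Gamma$ denotes the variance-covariance matrix of $\hat{\beta}$ and $\hat{\eta}_i = x_i \cdot \hat{\beta}$:

```latex
\mathrm{cov}[\hat{R}_i, \hat{R}_j]
  \approx \hat{R}_i (1-\hat{R}_i)\, \hat{R}_j (1-\hat{R}_j)\,
          \mathrm{cov}[\hat{\eta}_i, \hat{\eta}_j],
\qquad
\mathrm{cov}[\hat{\eta}_i, \hat{\eta}_j]
  = \mathrm{cov}[x_i \hat{\beta},\, x_j \hat{\beta}]
  = x_i \,\Gamma\, x_j^{T}
```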

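Using these approximations we get

$$\mathrm{var}[B_T] = \mathrm{var}\Big[\sum_{i \in \mathcal{M}} \hat{B}_i\Big] \approx \sum_{i \in \mathcal{M}} \sum_{j \in \mathcal{M}} C_i C_j \hat{R}_i (1 - \hat{R}_i) \hat{R}_j (1 - \hat{R}_j) \, x_i \Gamma x_j^T$$

where $\Gamma$ is the variance-covariance matrix of $\hat{\beta}$ and $x_j^T$ is the transpose of the vector $x_j$. Confidence intervals are derived using these approximations.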
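The approximate variance can be evaluated directly as a double sum, as in the sketch below. The capacities, predicted rates, regressor rows and the matrix `gamma` are hypothetical values chosen for illustration. The interval at the end assumes a normal approximation with the usual 1.96 quantile, which is an assumption of this example and not a detail stated in the text.

```python
import math


def weighted_rows(caps, rates, xs):
    """Rows C_i * R_i * (1 - R_i) * x_i, one per non-respondent structure."""
    return [[c * r * (1.0 - r) * v for v in x] for c, r, x in zip(caps, rates, xs)]


def delta_variance(caps, rates, xs, gamma):
    """Double sum over i, j of C_i C_j R_i(1-R_i) R_j(1-R_j) x_i Gamma x_j^T."""
    rows = weighted_rows(caps, rates, xs)
    p = len(gamma)
    total = 0.0
    for gi in rows:
        for gj in rows:
            total += sum(gi[a] * gamma[a][b] * gj[b]
                         for a in range(p) for b in range(p))
    return total


# Hypothetical non-respondents and a hypothetical covariance matrix of beta_hat.
caps = [1000, 2000]
rates = [0.5, 0.25]
xs = [[1, 0], [1, 1]]
gamma = [[0.0004, -0.0001], [-0.0001, 0.0009]]

var_bt = delta_variance(caps, rates, xs, gamma)
imputed = sum(c * r for c, r in zip(caps, rates))
half_width = 1.96 * math.sqrt(var_bt)  # assumed normal approximation
print(var_bt, imputed - half_width, imputed + half_width)
```

Only the imputed part of the total contributes to the variance, so the interval is centred on the estimated total but its width depends on the non-respondents alone.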

Software

During the project, statistical analysis was carried out using R (http://www.R-project.org) for early prototypes and SAS (http://www.sas.com) for the production versions.

In particular, the core of the module is Proc GENMOD (SAS, 2004), which is part of SAS/STAT® 9.1.3, while the computation of the variance has been implemented in the Proc IML environment. It should be noted that the computation of the variance becomes a rather high-dimensional problem if we consider all the available months (between January 2000 and April 2008 there are 100 months) over the whole region. If we still use $\mathcal{M}$ to denote all the missing values (non-respondent structures for all the months considered), $\mathcal{M}$ has more than 15,000 observations. To limit both the computational time and the use of memory it has been necessary to split the computation of the variance into steps in the following way. The set $\mathcal{M}$ of all the non-respondent structures has been partitioned into the union of N subsets (N=50 in the current application):

$$\mathcal{M} = \mathcal{M}_1 + \dots + \mathcal{M}_N$$

and the following matrices $X_i$, $i = 1, \dots, N$, have been defined:

$$X_i = \left[\, C_j \hat{R}_j (1 - \hat{R}_j) \, x_j, \; j \in \mathcal{M}_i \,\right]$$

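It follows that

$$\mathrm{var}[B_T] \approx \sum_{i \in \mathcal{M}} \sum_{j \in \mathcal{M}} C_i C_j \hat{R}_i (1 - \hat{R}_i) \hat{R}_j (1 - \hat{R}_j) \, x_i \Gamma x_j^T = \sum_{r=1}^{N} \sum_{c=1}^{N} \mathrm{sum}(X_r \Gamma X_c^T)$$

where sum(A) is the sum of all the elements of a matrix A.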
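The block-wise evaluation can be sketched in plain Python as follows. The production code runs in Proc IML, so this is only an illustration of the identity: the weighted rows and the matrix `gamma` are hypothetical, and the helper names are invented for the example.

```python
def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def total(a):
    """sum(A): the sum of all the elements of a matrix."""
    return sum(sum(row) for row in a)


def blockwise_variance(rows, gamma, n_blocks):
    """Sum over block pairs (r, c) of sum(X_r Gamma X_c^T).

    rows holds the weighted rows C_j R_j (1 - R_j) x_j of all non-respondents;
    only one pair of blocks is multiplied at a time, which limits memory use.
    """
    size = -(-len(rows) // n_blocks)  # ceiling division
    blocks = [rows[k:k + size] for k in range(0, len(rows), size)]
    var = 0.0
    for x_r in blocks:
        x_r_gamma = matmul(x_r, gamma)
        for x_c in blocks:
            var += total(matmul(x_r_gamma, transpose(x_c)))
    return var


# Hypothetical weighted rows for three non-respondents.
rows = [[250.0, 0.0], [375.0, 375.0], [80.0, 0.0]]
gamma = [[0.0004, -0.0001], [-0.0001, 0.0009]]
full = blockwise_variance(rows, gamma, 1)
split = blockwise_variance(rows, gamma, 3)
print(full, split)
```

The result does not depend on the number of blocks, up to floating-point rounding, so N can be chosen purely on the basis of the available memory.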

Discussion of the statistical methodology

Before proceeding, we discuss some points concerning the approach that has been studied and developed.

The model that has been adopted does not take time dependence into account, in the sense that it does not include, as regressors, response values from previous months. The idea was to develop a model able to react quickly to events (like the Olympic Games or exceptional weather conditions) without being influenced by the recent past. As our main aim is the quick imputation of missing values, we do not perform a time series analysis, and exceptional events are detected, not modelled.


Another key point concerns the use of a self-determined sample (also known as a natural sample), i.e. the data that are available at a certain moment, to predict late (or even permanently missing) responses. It is known (Gismondi, 2007) that this could lead to biased estimates. However, in the proposed approach, the theory of general linear model identification can be used. More precisely, the observations that have a large influence on the regression can be identified and

• if the value of the response is anomalous, the observation can be excluded until it has been checked (contacting the accommodation structure if necessary);

• if it is the position in the regressor space that makes the observation influential, it is again possible to verify it before using it for the model identification.

Besides that, we have suggested augmenting the existing panel with some carefully selected points to improve model accuracy. Indeed, it is mandatory for a structure to communicate its data, and so a structure cannot ignore a formal request from the Regional Authority.

Finally we point out that not all the a priori relevant variables are considered, because they are not in the database. Among these variables we mention those related to the commercial policy of an accommodation structure. Indeed the model treats as equivalent two structures that are geographically close and of the same type and size (e.g. 3-star hotels with 30 rooms). In the real world, however, these two structures can perform differently because of their different commercial strategies (e.g. special agreements with companies or special offers able to attract a large number of tourists).

Some results

The first part of the analysis focused on the years 2000 – 2007. The results obtained are quite similar to those presented at the S.Co.07 conference, and so we invite the interested reader to consult that paper (Fontana et al, 2007). For these years the values estimated using the models described so far represent a “correction” of the observed data. The main advantages are therefore to facilitate the comparison between different years and to provide an estimate of the real size of the phenomenon, which is useful, for example, for properly designing services like water supply or waste collection.

We observe that, for the Piedmont region as a whole, the total number of bednights in one year, $B_T = \sum_{i \in \mathcal{O}} B_i + \sum_{i \in \mathcal{M}} \hat{B}_i$, is approximately 30% higher than the number of officially registered bednights, $\sum_{i \in \mathcal{O}} B_i$.

As we said above, the Assessorato al Turismo della Regione Piemonte has made available the data up to March 2008 (the data referring to April 2008 amounted to just a few units and have not been considered in this work). This gave us the possibility of testing the system in a real application. Indeed the main goal of the methodology was to provide some indications by exploiting the data that become available over the year.

For the first quarter of the year we know that mountains (with the full range of

winter sports) and Turin, together with its surroundings, represent the most

important areas. Therefore we focused on the Piedmont region as a whole and

then we considered Turin and the mountains, focusing on the Olympic resort in

particular.

A first analysis identified some influential observations (e.g. observations with net occupancy rate close to 1 or to 0). We excluded these observations from the sample (they will be checked by the provincial authorities).
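A simple screening of this kind can be sketched as follows. The threshold `eps`, the record layout and the structure names are hypothetical choices made for this example, and the actual identification of influential observations relies on regression diagnostics rather than on a fixed cut-off.

```python
def split_by_rate(records, eps=0.05):
    """Separate records whose net occupancy rate is close to 0 or to 1.

    records: (name, bednights, capacity) tuples.
    Returns (kept, flagged), each a list of (name, rate) pairs; flagged
    records are set aside for checking rather than discarded for good.
    """
    kept, flagged = [], []
    for name, bednights, capacity in records:
        rate = bednights / capacity
        target = kept if eps <= rate <= 1.0 - eps else flagged
        target.append((name, rate))
    return kept, flagged


# Hypothetical records for three structures with the same capacity.
records = [("A", 465, 930), ("B", 2, 930), ("C", 925, 930)]
kept, flagged = split_by_rate(records)
print([name for name, _ in kept])     # ['A']
print([name for name, _ in flagged])  # ['B', 'C']
```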

Then, using the methodology described in the previous section, we computed the estimated total bednights for the first quarter over the whole Piedmont region. Figure 2 summarizes the results.

[Plot: predicted bednights ("presenze predette"), Regione Piemonte; anomalous values excluded ("Valori anomali <> .05 esclusi"). Vertical axis: predicted bednights, 2,000,000 to 4,000,000; horizontal axis: year ("anno"), 2000 to 2008. Chart annotations: 2.7 and 12.7.]

Figure 2 Piedmont Region – First Quarters Estimated Bednights 2000 - 2008

An increasing trend and the impact of the Olympic Games, held in February 2006, can be observed. The error predicted using the approximated variance is about 1% for this analysis and for the others presented in this section.

The next two figures (Figure 3 and Figure 4) show the results for Turin and the Olympic mountains.

[Plot: predicted bednights ("presenze predette"), area = città (city). Vertical axis: predicted bednights, 500,000 to 1,200,000; horizontal axis: year ("anno"), 2000 to 2008. Chart annotations: 0.3 and 7.1.]

Figure 3 Turin and metropolitan area – First Quarters Estimated Bednights 2000 - 2008

[Plot: predicted bednights ("presenze predette"), area = mont. olim. (Olympic mountains). Vertical axis: predicted bednights, 500,000 to 1,200,000; horizontal axis: year ("anno"), 2000 to 2008. Chart annotations: 6.4 and 26.7.]

Figure 4 Olympic Mountains – First Quarters Estimated Bednights 2000 - 2008

Both areas show positive trends. In particular the latter seems to have benefited in a lasting way from the effect of the Olympic Games.

Discussion

It is well known that commercial accommodation represents only a part of tourism. Second-home tourism around the world has exploded in recent decades, as has the so-called visiting friends and relatives phenomenon. Besides that, other information is extremely relevant for proper planning, such as the reason for and the duration of the trip, the impact of commercial promotion and so on.

Nevertheless, statistical methodologies for empirical model building can lead to a better use of the data on accommodation structures that are collected on a monthly basis.

It should be pointed out that the suggested methodology eliminates the effect of non-respondent structures. In this sense the data become more homogeneous over time, and this opens the way to the use of time series analysis. Some preliminary work to explore the possibility of building short-term forecasts has been done using the SAS Forecast Server module.

The model has been implemented by CSI Piemonte and it is one of the modules of the software used by the OTRP (Osservatorio Turistico della Regione Piemonte) in its day-to-day activity. It has recently been presented to other regional tourism bodies at the RIST workshop, Sestri Levante, 20 May 2008. The plots shown in Figures 2, 3 and 4 are available to the users of the system.

Acknowledgments

The authors wish to thank the whole working group of the “statistical models for tourism flows prediction” project, funded under the Interreg III A ALCOTRA program, in particular Cristina Bergonzo, Valentina Carbone and Livia Falomo (Osservatorio Turistico della Regione Piemonte), Patrick Vece (Comité Régional du Tourisme Riviera Côte d’Azur), Serena Chiarle (CSI Piemonte), Giulia Cernicchiaro, Mauro Gasparini, Gianfranco Genta and Daniela Ichim (Politecnico di Torino).

The authors also thank Gaudenzio De Paoli, director of the Direzione Turismo Sport e Parchi della Regione Piemonte, and Marzia Baracchino, in charge of the Coordinamento della Promozione Turistica della Regione Piemonte, for having given the Politecnico research group direct access to the Piedmont database (individual data).

The use of SAS has been facilitated by Simone Crucchi and Elena Fabbris (SAS Italy), mainly as regards some experiments using SAS® Forecast Server.

Finally, special thanks go to Silvia Polettini (Università di Napoli), discussant at the S.Co.2007 Conference, and to the Scientific Committee of the SAS Business Analytics Gallery, Rome 2008, co-chaired by Professor Giancarlo Diana (Università di Padova), for having awarded the poster describing this application as the best in the Universities and Research Institutes category.

References

MCCULLAGH, P., NELDER, J.A. (1989): Generalized Linear Models. Chapman and Hall, 2nd edition.

FONTANA, R., PISTONE, G. (December 2007): Statistical models for tourism flows prediction. Paper presented at S.Co.2007, Università Ca' Foscari, Venice, Italy. Available at http://venus.unive.it/sco2007/ocs/viewabstract.php?id=86

GISMONDI, R. (2007): Quick estimation of tourist nights spent in Italy. Statistical Methods & Applications, 16, 141-168.

VENABLES, W.N., RIPLEY, B.D. (2003): Modern Applied Statistics with S-Plus. Springer, 4th edition, Heidelberg.

SAS INSTITUTE INC. (2004): SAS/STAT® 9.1 User’s Guide. SAS Institute Inc., Cary, NC.
