An introduction to data integration

76
1 Corso di Rappresentazione della Informazione e della Conoscenza Anno Accademico 2007-2008 Matteo Palmonari Query Processing in Data Integration Materiale organizzato da presentazioni di: L. Tanca, S. Costantini, J. Sun & L. Zhao, M. Van Der Wielen, Ir. R. Vdovjak, Kambhampati & Knoblock, S. Bergamaschi

description

Corso di Rappresentazione della Informazione e della Conoscenza Anno Accademico 2007-2008 Matteo Palmonari Query Processing in Data Integration . - PowerPoint PPT Presentation

Transcript of An introduction to data integration

Page 1: An introduction to data integration

1

Corso di Rappresentazione della Informazione e della Conoscenza

Anno Accademico 2007-2008Matteo Palmonari

Query Processing in Data Integration

Materiale organizzato da presentazioni di: L. Tanca, S. Costantini, J. Sun & L. Zhao, M. Van

Der Wielen, Ir. R. Vdovjak, Kambhampati & Knoblock, S. Bergamaschi

Page 2: An introduction to data integration

2

An introduction to data integration

Prof. Letizia TancaPolitecnico di Milano

Technologies for Information Systems

Page 3: An introduction to data integration

3

An Online Shopper’s Information Integration ProblemEl Cheapo: “Where can I get the cheapest copy (including shipping cost) of

Wittgenstein’s Tractatus Logicus-Philosophicus within a week?”

?Information Integration

addall.com

““One-World”One-World”MediationMediation

amazon.comamazon.com A1books.comA1books.comhalf.comhalf.combarnes&noble.combarnes&noble.com

Page 4: An introduction to data integration

4

What is a Data Integration System?

– Uniform (same query interface to all sources)– Access to (queries; eventually updates too)– Multiple (we want many, but 2 is hard too)– Autonomous (DBA doesn’t report to you)– Heterogeneous (data models are different)– Structured (or at least semi-structured)– Data Sources (not only databases).

A system providing:

Page 5: An introduction to data integration

5

Virtual Integration Architecture

Datasource

wrapper

Datasource

wrapper

Datasource

wrapper

Sources can be: relational, hierarchical (IMS), structured files, web sites.

Mediator:

User queriesMediated schema

Data sourcecatalog

Reformulator

Optimizer

Execution engine

Page 6: An introduction to data integration

6

Query Model in Virtual Integration

• User formulates query in terms of his/her ontology on the mediated (or “global”) schema

• System reformulates queries in terms of sub-queries for each source (“local” schema)

• Structure of the query model should be more intuitive for the user

Page 7: An introduction to data integration

7

Reformulation Problem• Given:

– A query Q posed over the mediated schema– Descriptions of the data sources

• Find:– A query Q’ over the data source relations,

such that:• Q’ provides only correct answers to Q, and• Q’ provides all possible answers from to Q

given the sources.

Page 8: An introduction to data integration

8

Mapping between the global logical schema and the single source schemata

(logical view definition)

• Two basic approaches– GAV (Global As View)– LAV (Local As View)

• Can be used also in case of different data models

• In that case a model transformation is required

Page 9: An introduction to data integration

9

GAV (Global As View)GAV (Global As View)

• Up to now we supposed that the global schema be derived from the integration process of the data source schemata

• Thus the global schema is expressed in terms of the data source schemata

• Such approach is called the Global As View approach

Page 10: An introduction to data integration

10

The other possible ways…The other possible ways…

LAV (Local As View)LAV (Local As View)• The global schema has been designed

independently of the data source schemata• The relationship (mapping) between sources

and global schema is obtained by defining each data source as a view over the global schema GLAV (Global and Local As View)GLAV (Global and Local As View)

• The relationship (mapping) between sources and global schema is obtained by defining a set of views, some over the global schema and some over the data sources

Page 11: An introduction to data integration

11

• Global schema GG• Source schemata schemata SS • Mapping Mapping MM between sources and global between sources and global

schema: a set of assertions schema: a set of assertions qqS S q qGG

qqG G q qSS

Intuitively, the first assertion specifies that the Intuitively, the first assertion specifies that the concept represented by a view (query) qconcept represented by a view (query) qSS over a over a

source schema S corresponds to the concept source schema S corresponds to the concept specified by qspecified by qG G over the global schema. Viceversa over the global schema. Viceversa

for the second assertion.for the second assertion.

Mapping between data sources and global schema

Page 12: An introduction to data integration

12

• A data integration system is a triple A data integration system is a triple (G, S, M)(G, S, M)• The query to the integrated system are posed The query to the integrated system are posed

in terms of G and specify which data of the in terms of G and specify which data of the virtual database we are interested invirtual database we are interested in

• The problem is understanding which real data The problem is understanding which real data (in the data sources) correspond to those (in the data sources) correspond to those virtual datavirtual data

Mapping between data sources and mediated schema

Page 13: An introduction to data integration

13

GAVGAV• A GAV mapping is a set of assertions, one

for each element g of Gg g q qSS

That is, the mapping specifies g as a query qS over the data sources. This means that the mapping tells us exactly how the element g is computed.

OK for stable data sourcesOK for stable data sources Difficult to extend with a new data sourceDifficult to extend with a new data source

Page 14: An introduction to data integration

14

GAV exampleSOURCE 1

Product(Code, Name, Description, Warnings, Notes, CatID)Category(ID, Name, Description)Version(ProductCode, VersionCode, Size, Color, Name,

Description, Stock, Price)

SOURCE 2

Product(Code, Name, Size, Color, Description, Type, Price, Q.ty)Tipe(TypeCode, Name, Description)

n.b.: WE DO NOT CARE ABOUT DATA TYPES…

Page 15: An introduction to data integration

15

SOURCE 1

Product(Code, Name, Description, Warnings, Notes, CatID)Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock,

Price)

SOURCE 2

Product(Code, Name, Size, Color, Description, Type, Price, Q.ty)

GLOBAL SCHEMA

CREATE VIEW GLOB-PROD ASSELECT Code AS PCode, VersionCode as VCode, Version.NAme AS Name,

Size, Color, Version.Description as Description, CatID, Version.Price, Stock

FROM SOURCE1.Product, SOURCE1.VersionWHERE Code = ProductCodeUNIONSELECT Code AS PCode, null as VCode, Name, Size, Color, Description,Type

as CatID, Price, Q.ty AS StockFROM SOURCE2.Product

Page 16: An introduction to data integration

16

GAVGAV• Suppose now we introduce a new sourceSuppose now we introduce a new source• The simple view we have just created is The simple view we have just created is

to be modified to be modified • In the simplest case we only need to add In the simplest case we only need to add

a union with a new SELECT-FROM-a union with a new SELECT-FROM-WHERE clauseWHERE clause

• This is not true in general, view This is not true in general, view definitions may be much more complexdefinitions may be much more complex

Page 17: An introduction to data integration

17

GAV• Quality depends on how well we have

compiled the sources into the global schema through the mapping

• Whenever a source changes or a new one is added, the global schema needs to be reconsidered

• Query processing can be based on some sort of unfolding

• Example: one already seen

Page 18: An introduction to data integration

18

LAVThe Local-as-View approach, see Figure, describes local sources as a view defined in terms of the global schema. The global schema is predefined and for each local source is described how it delivers information to the global schema.Each mapping associates the entities in the source schemas by way of a query over the global schema. Thereby, each source schema is defined as a view over the global schema, hence the name “Local-as-View”.

Page 19: An introduction to data integration

19

LAVLAVA mapping LAV is a set of assertions, one for

each element s of each source Ss s q qGG

Thus the content of each source is characterized in terms of a view qG over the global schema

• OK if the global schema is stable, e.g. based on a OK if the global schema is stable, e.g. based on a domain ontology or an enterprise model domain ontology or an enterprise model

• It favours extensibility It favours extensibility • Query processing much more complexQuery processing much more complex

Page 20: An introduction to data integration

20

LAV• Quality depends on how well we have

characterized the sources• High modularity and extensibility (if

the global schema is well designed, when a source changes or is added, only its definition is to be updated)

• Query processing needs reasoning

Page 21: An introduction to data integration

21

SOURCE 1

Product(Code, Name, Description, Warnings, Notes, CatID)Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock,

Price)

SOURCE 2

Product(Code, Name, Size, Color, Description, Type, Price, Q.ty)

GLOBAL SCHEMA

GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock)

In this case we have to express the sources as views over the global schema

Page 22: An introduction to data integration

22

GLOBAL SCHEMA

GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock)

SOURCE 2

Product (Code, Name, Size, Color, Description, Type, Price, Q.ty)

CREATE VIEW SOURCE2.Product ASSELECT PCode AS Code, Name, Size, Color,

Description, CatID as Type, Price, Stock AS Q.ty

FROM GLOB-PROD

Page 23: An introduction to data integration

23

GLOBAL SCHEMA

GLOB-PROD (PCode, VCode, Name, Size, Color, Description, CatID, Price, Stock)

SOURCE 1

Product(Code, Name, Description, Warnings, Notes, CatID)Version(ProductCode, VersionCode, Size, Color, Name, Description, Stock,

Price)

CREATE VIEW SOURCE1.Product ASSELECT Pcode AS Code, ?Name?, ?Description?, ? Warnings ?, ?Notes?, CatIDFROM GLOB-PROD

CREATE VIEW SOURCE1. Version ASSELECT Pcode AS ProductCode, VCode as VersionCode, Size, Color, Name, ?

Description ? , Stock, Price)FROM GLOB-PROD

N.B.: Some information is lacking: either we don’t know where description has to be taken from, or there is no correspondent value (e.g.Warnings, Notes, etc..). A difficult job

Page 24: An introduction to data integration

24

The GLAV approach• Is a combination of LAV and GAV. Part

of the mappings are of LAV type, part of GAV type.

• It combines the advantages of the two approaches.

Page 25: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 25

Approaches for relating source & Mediator Schemas

• Global-as-view (GAV): express the mediated schema relations as a set of views over the data source relations

• Local-as-view (LAV): express the source relations as views over the mediated schema.

• Can be combined…?

CREATE VIEW Seattle-view AS

SELECT buyer, seller, product, store FROM Person, Purchase WHERE Person.city = “Seattle” AND Person.name = Purchase.buyer

We can later use the views: SELECT name, store FROM Seattle-view, Product WHERE Seattle-view.product = Product.name AND Product.category = “shoes”

Virtual vsMaterialized

“View” Refresher

Let’s compare them in a movie

Database integration scenario..

Differences minor for data aggregation…

Page 26: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 26

Global-as-ViewMediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union [S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title

Express mediator schemarelations as views oversource relations

Page 27: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 27

Global-as-ViewMediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).Create View Movie AS select * from S1 [S1(title,dir,year,genre)] union select * from S2 [S2(title, dir,year,genre)] union [S3(title,dir), S4(title,year,genre)] select S3.title, S3.dir, S4.year, S4.genre from S3, S4 where S3.title=S4.title

Express mediator schemarelations as views oversource relations

Mediator schema relations are Virtual views on source relations

Page 28: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 28

Global-as-View: Example 2Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).

Create View Movie AS [S1(title,dir,year)] select title, dir, year, NULL from S1 union [S2(title, dir,genre)] select title, dir, NULL, genre from S2

Null values

Express mediator schemarelations as views oversource relations

Page 29: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 29

Global-as-View: Example 2Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).Source S4: S4(cinema, genre)

Create View Movie AS

select NULL, NULL, NULL, genre

from S4

Create View Schedule AS

select cinema, NULL, NULL

from S4.

But what if we want to find which cinemas are playing comedies?

“Lossy Mediation”

Express mediator schemarelations as views oversource relations

Page 30: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 30

Local-as-View: example 1Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).

S1(title,dir,year,genre)

S3(title,dir)

S5(title,dir,year), year >1960

Create Source S1 AS

select * from Movie

Create Source S3 AS

select title, dir from Movie

Create Source S5 AS

select title, dir, year

from Movie

where year > 1960 AND genre=“Comedy”Sources are “materialized views” ofmediator schema

Express source schemarelations as views overmediator relations

Page 31: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 31

Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).

S1(title,dir,year,genre)

S3(title,dir)

S5(title,dir,year), year >1960

Create Source S1 AS

select * from Movie

Create Source S3 AS

select title, dir from Movie

Create Source S5 AS

select title, dir, year

from Movie

where year > 1960 AND genre=“Comedy”

Creat Source S4 AS

select cinema, genre

from Movie m, Schedule s

where m.title=s.title

S4(Cinema,Genre)

Express source schemarelations as views overmediator relations

Page 32: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 32

Local-as-View: Example 2Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).Source S4: S4(cinema, genre)

Create Source S4

select cinema, genre

from Movie m, Schedule s

where m.title=s.title

Now if we want to find which cinemas are playing comedies, there is hope!

Express source schemarelations as views overmediator relations

Page 33: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 33

GAV vs. LAVMediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).

Create View Movie AS

select NULL, NULL, NULL, genre

from S4

Create View Schedule AS

select cinema, NULL, NULL

from S4.

But what if we want to find which cinemas are playing comedies?

Create Source S4

select cinema, genre

from Movie m, Schedule s

where m.title=s.title

Now if we want to find which cinemas are playing comedies, there is hope!

Source S4: S4(cinema, genre)

Lossy mediation

Page 34: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 34

GAV vs. LAV• Not modular

– Addition of new sources changes the mediated schema

• Can be awkward to write mediated schema without loss of information

• Query reformulation easy– reduces to view unfolding

(polynomial)– Can build hierarchies of mediated

schemas

• Best when– Few, stable, data sources– well-known to the mediator (e.g.

corporate integration)• Garlic, TSIMMIS, HERMES,

MOMIS

• Modular--adding new sources is easy

• Very flexible--power of the entire query language available to describe sources

• Reformulation is hard– Involves answering queries only

using views (can be intractable—see below)

• Best when– Many, relatively unknown data

sources– possibility of addition/deletion of

sources• Information Manifold,

InfoMaster, Emerac, Havasu

Page 35: An introduction to data integration

35

Views: Sound & Completeness• The terms sound, exact and complete are used to express to what degree the

extent of a view corresponds to its definition. • A view defined over some data source is sound if it provides a subset of the

available data in the data source that corresponds to the definition. – It delivers only, but not necessarily all answers to its definition. The answers it does

deliver might be incomplete, but they are correct (sound).• If a view is complete, it provides a superset of the available data in the data

source that corresponds to the definition. It delivers all answers to its definition, and maybe more.

– Since the set of answers might contain more than the answers corresponding to the views definition, but does contain all answers corresponding to the views definition, the term complete is being used.

• A view is exact if it provides all and only data corresponding to the definition.

Page 36: An introduction to data integration

36

Query answering in GAVQuery answering in the Global-as-View happens through query unfolding.

Unfolding queries to the local sources is relatively easy compared to the Local as-View approach in which queries posted to the global schema have to be rewritten before being able to answer the query.

Page 37: An introduction to data integration

37

Query answering in GAV• Because each global entity is

defined as a view over the local schemas, the total number of GAV-rules necessary to define the mappings in a data-integration system corresponds to the number of entities in the global schema.

• the extent of a global schema defined by GAV-rules is assumed to be exact.

Page 38: An introduction to data integration

38

Query answering in LAVThe Local-as-View approach, see Figure, describes local sources as a view defined in terms of the global schema. The global schema is predefined and for each local source is described how it delivers information to the global schema.Each mapping associates the entities in the source schemas by way of a query over the global schema. Thereby, each source schema is defined as a view over the global schema, hence the name “Local-as-View”.

Page 39: An introduction to data integration

39

Query answering in LAV• Each entity in the local schemas is defined as a view over the

global schema. Therefore, the total number of LAV-rules necessary to define the mappings in a data-integration system corresponds to the number of entities in the local schemas.

• Processing queries in the Local-as-View approach is difficult because the only knowledge of the global-schema available is through the views representing the sources. Such view only provides partial information about the data. Since the mapping associated to each source as a view over the global schema it is not clear how to use the sources in order to answer queries expressed over the global schema. Therefore extracting information from an integration system using the Local-as-View approach is a complex task because one has to answer queries with incomplete information.

• Views defined by LAV-rules might be sound or complete, in terms of 3.1.

Page 40: An introduction to data integration

40

IntroductionIntroduction Traditional Integration Formalisms

Global as View (GAV)

Mediated schema as views over data sources Local as View (LAV)

Data sources as views of mediated schema

Med. Schema TGAV:

T :- S1, S2, S3

LAV:

S1

S1 S2 S3

Page 41: An introduction to data integration

41

Some DB Theory Schema Conjunctive queries

For example:Give me students together

with their advisors who took courses given by their advisors after the 1st trimester 2003.

namearea name

Faculty StudentAdvises

* *

CourseRegisteredTeaches

* *

trimester trimestertitlec-number* *

Page 42: An introduction to data integration

42

DB Theory SQLSELECT Advises.prof, Advises.studentFROM Registered, Teaches, AdvisesWHERE Registered.c-number = Teaches.c-number and

Registered.trimester=Teaches.trimester and Advises.prof=Teaches.prof and Advises.student=Registered.student and Registered.trimester > “2003\1”

DATALOGQ(prof, student) :-

Registered(student,course, trimester), Teaches(prof,course, trimester) , Advises(prof, student), trimester > “2003\1”

head

body

Page 43: An introduction to data integration

43

Problem DefinitionProblem Definition GAV-like Definition – Definition in Datalog

9DC : SkilledPerson(PID, “Doctor”) : -

H : Doctor(SID, h, l, s, e)

9DC : SkilledPerson(PID, “EMT”) : -

H : EMT(SID, h, vid, s, e)

9DC : SkilledPerson(PID, “EMT”) : -

FS : Schedule(PID, vid),

FS : FirstResponse(vid, s, l, d),

FS : Skills(PID, “medical”)

Page 44: An introduction to data integration

44

Problem DefinitionProblem Definition LAV-like Definition – Inclusion Definition

LH : CritBed(bed, hosp, room, PID, status)

H : CritBed(bed, hosp, room),

H : Patient(PID, bed, status)

LH : EmergBed(bed, hosp, room, PID, status)

H : EmergBed(bed, hosp, room),

H : Patient(PID, bed, status)

Page 45: An introduction to data integration

45

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

Deductive Databases» Tables viewed as predicates. » Ops on tables expressed as “datalog” rules

› (Horn clauses, without function symbols) Enames(Name) :- Employe(Name, SSN) [Projection]

Wealthy-Employee(Name):- Employee(Name,SSN), Salary(SSN,Money),Money> 10 [Selection]Ed(Name, Dname):- Employee(Name, SSN), E_Dependents(SSN, Dname) [Join]

ERelated(Name,Dname) :- Ed(Name,Dname)ERelated(Name,Dname) :- Ed(Name,D1), ERelated(D1,D2) [Recursion]

Page 46: An introduction to data integration

46

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

More datalog terminology

A datalog program is a set of datalog rules.

A program with one rule is a conjunctive query.

We distinguish EDB predicates and IDB predicates

» EDB’s are stored in the database, › appear only in rule bodies

» IDB’s are intensionally defined, › appear in both bodies and heads.

Page 47: An introduction to data integration

47

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

Approaches to Specifying Schema Descriptions

» Global-as-View: express the mediated schema relations as a set of views over the data source relations

» Local-as-View: express the source relations as views over the mediated schema.

Page 48: An introduction to data integration

48

DB Theory, query containment A query Q’ is contained in Q if for all databases D, the

set of tuples computed for Q’ is a subset of those computed for Q, i.e., Q’ Q iff D Q’(D) Q(D)

A query Q’ is equivalent to Q iffQ’ Q and Q Q’

Example:Q’(prof, student) :-

Registered(student,course, trimester), Teaches(prof,course, trimester) , Advises(prof, student), trimester > “2003\2”

is contained in Q(prof, student) :-

Registered(student,course, trimester), Teaches(prof,course, trimester) , Advises(prof, student), trimester > “2003\1”

Page 49: An introduction to data integration

49

Query containment check Create a canonical database D that is the “frozen” body of Q’ Compute Q(D) If Q(D) contains the frozen head of Q’ then Q’ Q otherwise not

Example:Q’: p3(x,y) :- arc(x,z),arc(z,w),arc(w,y) Q: path(x,y) :- arc(x,y) path(x,y) :- path(x,z), path(z,y)

1. Freeze Q’ with say 0,1,2,3 for x,z,w,y respectively:Q’(0,3) :- arc(0,1), arc(1,2), arc(2,3)D={arc(0,1), arc(1,2), arc(2,3)}

2. Compute Q(D): Q(D)={(0,1), (1,2), (2,3), (0,2), (1,3), (0,3)}

3. The frozen head of Q’ = (0,3) is in the Q(D) Q’Q

p3xz w

y

arc

arcarc

Page 50: An introduction to data integration

50

Query reformulation using views

Problem: Given a user query Q and view definitions V={V1..Vn} Q’ is an Equivalent Rewriting of Q using V iff

Q’ refers only to views in V, Q’ is equivalent to Q

Q’ is a Maximally-contained Rewriting of Q using V iff

Q’ refers only to views in V, Q’ Q, there is no such rewriting Q1 that Q’ Q1 Q, Q1 Q

Page 51: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 51

Reformulation in LAV• Given a set of views V1,…,Vn, and a query

Q, can we answer Q using only the answers to V1,…,Vn?– Notice the use of materialized (pre-computed)

views• ONLY materialized views should be used…

• Approaches– Bucket algorithm [Levy; 96]– Inverse rules algorithm [Duschka, 99]– Hybrid versions

• SV-Bucket [2001], MiniCon [2001]

Page 52: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 52

Mediated schema:Movie(title, dir, year, genre), Schedule(cinema, title, time).

S1(title,dir,year,genre)

S3(title,dir)

S5(title,dir,year), year >1960

S1(title,dir,year,genre)

S3(title,dir)

S5(title,dir,year), year >1960

Create Source S1 AS

select * from Movie

Create Source S3 AS

select title, dir from Movie

Create Source S5 AS

select title, dir, year

from Movie

where year > 1960 AND genre=“Comedy”Sources are “materialized views” ofVirtual schema

Create Source S1 AS

select * from Movie

Create Source S3 AS

select title, dir from Movie

Create Source S5 AS

select title, dir, year

from Movie

where year > 1960 AND genre=“Comedy”Sources are “materialized views” ofVirtual schema

Reformulation in LAV: The issues

Query: Find all the years in which Zhang Yimou released movies.

Select year from movie M where M.dir=yimou

Q(y) :- movie(T,D,Y,G),D=yimou

Not executable

Q(y) :- S1(T,D,Y,G) , D=yimou (1)

Q(y) :- S1(T,D,Y,G) , D=yimou Q(y) :- S5(T,D,Y) , D=yimou (2)

Which is the better plan? What are we looking for? --equivalence? --containment? --Maximal Containment --Smallest plan?

Page 53: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 53

Maximal Containment

• Query plan should be sound and complete– Sound implies that Query should be contained in the

Plan (I.e., tuples satisfying the query are subset of the tuples satisfying the plan

– Completeness?– Traditional DBs aim for equivalence

• Query contains the plan; Plan contains the query– Impossible

• We want all query tuples that can be extracted given the sources we have

– Maximal containment (no other query plan, which “contains” this plan is available)

P contains Q if P |= Q

(exponential even for

“conjunctive”

(SPJ) query plans)

Page 54: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 54

The world of Containment• Consider Q1(.) :- B1(.) Q2(.) :- B2(.)

• Q1 Q2 (“contained in”) if the answer to Q1 is a subset of Q2

– Basically holds if B1(x) |= B2(x)

• Given a query Q, and a query plan Q1,

– Q1 is a sound query plan if Q1 is contained in Q

– Q1 is a complete query plan if Q is contained in Q1

– Q1 is a maximally contained query plan if there is no Q2 which is a sound query plan for Q1, such that Q1 is contained in Q2

Page 55: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 55

Computing Containment Checks• Consider Q1(.) :- B1(.) Q2(.) :- B2(.)

• Q1 Q2 (“contained in”) if the answer to Q1 is a subset of Q2

– Basically holds if B1(x) |= B2(x)– (but entailment is undecidable in general… boo hoo)

• Containment can be checked through containment mappings, if the queries are “conjunctive” (select/project/join queries, without constraints) [ONLY EXPONENTIAL!!!--aren’t you relieved?]

• m is a containment mapping from Vars(Q2) to Vars(Q1) if m maps every subgoal in the body of Q2 to a subgoal in the body of Q1

m maps the head of Q2 to the head of Q1

Eg: Q1(x,y) :- R(x), S(y), T(x,y) Q2(u,v) :- R(u), S(v) Containment mapping: [u/x ; v/y]

Page 56: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 56

Reformulation Algorithms

• Bucket algorithm– Cartesian product of buckets – Followed by “containment” check

• Inverse Rules– plan fragments for mediator relations

V1 V2

Q(.) :- V1() & V2() S11() :- V1() S12 :- V1()S21() :- V2() S22 :- V2()S00() :- V1(), V2()

S11S12S00

S21S22S00

V1() :- S11()V1() :- S12()V1() :- S00()V2() :- S21()V2() :- S22()V2() :- S00()

Q(.) :- V1() & V2()

Bucket Algorithm Inverse Rules

[Levy] [Duschka]P1 contains P2 ifP2 |= P1

Page 57: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 57

Movie(T,D,Y,G) :- S1(T,D,Y,G)Movie(T,D,?y,?g ) :- S3(T,D) Movie(T,D,Y,?g) :- S5(T,D,Y), y>1960

Movie(?t,?d,?y,G) :- S4(C,G)Schedule(C,?t,?ti) :- S4(C,G)

Movie(T,D,Y,G)

S1(T,D,Y,G)S3(T,D)S5(T,D,Y)S4(C,G)

Mediated schema: Movie(title, dir, year, genre), Schedule(cinema, title, time).

S1(title,dir,year,genre)

S3(title,dir)

S5(title,dir,year), year >1960

Create Source S1 AS

select * from Movie

Create Source S3 AS

select title, dir from Movie

Create Source S5 AS

select title, dir, year

from Movie

where year > 1960 AND genre=“Comedy”

Creat Source S4 AS

select cinema, genre

from Movie m, Schedule s

where m.title=s.title

S4(Cinema,Genre)

S4(C,G)

SkolemizeSfs5-1(T,D,Y)

Sched(C,T,Ti)

Q(C,G) :- Movie(T,D,Y,G) , Schedule(C,T,Ti)

SkolemizeSfs4-1(G,C)

Page 58: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 62

Complexity of finding maximallycontained plans in LAV

• Complexity does change if the sources are not “conjunctive queries”– Sources as unions of conjunctive queries (NP-hard)

• Disjunctive descriptions– Sources as recursive queries (Undecidable)

• Comparison predicates• Complexity is less dependent on the query

– Recursion okay; but inequality constraints lead to NP-hardness• Complexity also changes based on Open vs. Closed world assumption

True source contents (of Big Two)

Advertised description “All cars”

[Abiteboul & Duschka, 98]

You can “reduce” the complexity by taking a conjunctive query that is an upperbound. This just pushes the complexity to minimization phase

Advertised description “Toyota” U “Olds”

Big Two

Page 59: An introduction to data integration

Kambhampati & Knoblock Information Integration on the Web (MA-1) 64

Local Completeness Information• If sources are incomplete, we need to look at each one of them.• Often, sources are locally complete.• Movie(title, director, year) complete for years after 1960, or for

American directors.• Question: given a set of local completeness statements, is a query Q’ a

complete answer to Q?

True source contentsAdvertised description

Guarantees (LCW; Inter-source comparisons)

Problems: 1. Sources may not be interested in giving these! Need to learn hard to learn!

2. Even if sources are willing to give, there may not be any “big enough” LCWs Saying “I definitely have the car with vehicle ID XXX is useless

Page 60: An introduction to data integration

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

Data Integration: Global-as-View in Answer Set ProgrammingS. CostantiniUniversità degli Studi di L’[email protected]

Page 61: An introduction to data integration

66

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

Computing Answers in Data Integration Systems» Answer set programming (ASP) gives a

semantics to datalog (logic) programs with negation (and possibly disjunction in the heads)

» ASP-based specification of a data integration system

» ASP-based specification of extensions (plausible answers, repairs)

Page 62: An introduction to data integration

67

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

Computing Answers in Data Integration Systems

Methodology works for first-order queries (and Datalog extensions), and many ICs

› LAV: Bravo and Bertossi, IJCAI’03› GAV: Costantini and Formisano,

ASP’03 (experiments and a working

program)

Page 63: An introduction to data integration

68

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

GaV in ASP» Instead of unfolding w.r.t. each query,

helper model so as to get a general inference engine

» Easily express integrity constraints» Cope with incomplete information (also

resulting form adding new sources)

Page 64: An introduction to data integration

69

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

GaV: an example

Page 65: An introduction to data integration

70

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

Front-end: Global Schema through a Helper Modelperson(X):- entity(1,X). % meta-concepts +

namingorganization(X):- entity(2,X).member(X,Y):- relat(1,X,Y).student(X):- entity(3,X).university(X):- entity(4,X).enrolled(X,Y):- relat(2,X,Y).age(X,Y):- attr(1,X,Y).

% Constraints on student and enrolledenrolledE(X):- student(X),university(Y),enrolled(X,Y).:- not student(X),enrolled(X,Y).:- student(X),not enrolledE(X).

Page 66: An introduction to data integration

71

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

Back-end: Mapping to the Source Schemasentity(1,X):- s(1,X).entity(2,X):- s(2,X).entity(4,X):- s(5,X).

% student is described implicitly as the domain of an unknown relationship

entity(3,X):- s(3,X,Y). % student is described implicitly as the

domain of one of the sources of enrolledentity(3,X):- s(4,X,Z).

Page 67: An introduction to data integration

72

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

Back-end: Mapping to the Source Schemas% Relationships

relat(1,X,Y):- s(7,X,Z),s(8,Z,Y).relat(2,X,Y):- s(4,X,Y).relat(2,X,Y):-s(9,X,Y).attr(1,X,Y):- s(3,X,Y).attr(1,X,Y):- s(6,X,Y,Z).

Page 68: An introduction to data integration

73

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

Helper Model: Relational Inference Engine» Meta-rules for is_a chaining:

relat(M,X,Y):- schema(N,E1,E2), entity(E1,X),

entity(E2,Y), is_a_r(N,M), relat(N,X,Y).

entity(M,X):- is_a_e(N,M), entity(N,X).

Page 69: An introduction to data integration

74

Dipartimento di InformaticaUniversità degli Studi di L’Aquilahttp://www.di.univaq.it/

Dealing with Incomplete Information» Assume to add a new source for Enrolled,

that in the back-end becomes, e.g.,:

relat(2,X,Y):- s(4,X,Y).

relat(2,X,Y):- s(9,X,Y).

» No connection to student: but, one who is enrolled to a University should be a student

Page 70: An introduction to data integration

MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 7575

DBGroupDATABASE GROUP

MOMIS MOMIS (Mediator envirOnment for Multiple Information (Mediator envirOnment for Multiple Information

Sources)Sources)

The MOMIS Query ManagerThe MOMIS Query ManagerA query execution exampleA query execution example

http://www.dbgroup.unimo.it/Momishttp://www.dbgroup.unimo.it/Momis

Page 71: An introduction to data integration

MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 7676

DBGroupDATABASE GROUP

The MOMIS Query Manager reformulates the query taking into account the mappings defined among the local classes and the global classes of the GVV (Global Virtual View).The mappings are defined by using a GAV (Global as View) approach: each global class of the GVV is expressed by means of the full-disjunction operator over the local classes.

Query ExecutionQuery Execution

Query rewritingQuery rewriting GAV approachGAV approach: : the query is processed by means of

unfolding

Fusion and ReconciliationFusion and Reconciliation of the local answers into the global answer Object IdentificationObject Identification : Join conditions among local

classes Inconsistencies:Inconsistencies: Resolution functions to deal with

conflits

Page 72: An introduction to data integration

MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 7777

DBGroupDATABASE GROUP Query rewriting and execution Query rewriting and execution

Global Virtual View

(GVV)

query q0 = scqG1 scqG2

Hotels

Services

Hotels

Services

single class query

scqG1

single class queryscqG2

L3scqG1

L1scqG2

L2scqG1

L1scqG1

L2scqG2

VENERE

hotels

facilities

map_hotelsLocal

Schema

SAPERVIAGGIARE

hotelsLocal

Schema

GUIDACAMPEGGI

facilities

map_hotelshotels

facilities

hotels

facilitiesLocal

Schema

Query execution on the local sourcesQuery execution on the local sources

Page 73: An introduction to data integration

MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 7878

DBGroupDATABASE GROUP Query rewriting Query rewriting

SELECT H.name, H.address, H.city, H.price, S.facility, S.structure_name, S.structure_cityFROM hotels as H, services as SWHERE H.city = S.structure_city and H.name = S.structure_nameand H.city = 'rimini‘ and H.price > 50 and H.price < 80 and S.facility = 'air conditioning'order by H.price

q0 =

SELECT H.name , H.address , H.city , H.price FROM hotels as H WHERE (H.city = 'rimini' ) and (H.price > 50) and (H.price < 80)

scqG1 =

SELECT S.facility , S.structure_name , S.structure_city FROM services as S WHERE (S.facility = 'air conditioning')

scqG2 =

SELECT hotels.name2, hotels.address, hotels.price, hotels.city FROM hotels WHERE ((city) = ('rimini') and ((price) > (50) and (price) < (80)))

L3scqG1 =

SELECT maps_hotels.hotels_name2, maps_hotels.hotels_city FROM maps_hotels WHERE (hotels_city) = ('rimini')

L2scqG1 =

SELECT hotels.name, hotels.address, hotels.city FROM hotels WHERE (city) = ('rimini')

L1scqG1 =

SELECT facilities_hotels.hotel_name2, facilities_hotels.hotels_city, facilities_hotels.facility FROM facilities_hotels WHERE (facility) = ('air conditioning')

L1scqG2 =

SELECT facilities_campings.campings_name, facilities_campings.campings_city, facilities_campings.name FROM facilities_campings WHERE (name) = ('air conditioning')

L2scqG2 =

SINGLE CLASS QUERIES

UNFOLDING

UNFOLDING

Page 74: An introduction to data integration

MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 7979

DBGroupDATABASE GROUP Fusion and ReconciliationFusion and Reconciliation

L3scqG1

result set

L1scqG2

result set

L2scqG1

result set

L1scqG1result

set

L2scqG2

result set

VENERESAPERVIAGGIARE GUIDACAMPEGGI

partial

results

partial

results

partial

results

L1scqG1 full join full join L2scqG1 full joinfull join

L3scqG1

L1scqG2 full joinfull join L2scqG2

scqG1result

set

scqG2result

set

q0 result set

scqG1 joinjoin scqG2

Page 75: An introduction to data integration

MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 8080

DBGroupDATABASE GROUP Fusion and ReconciliationFusion and Reconciliation

scqG2 result set= L1scqG2 full join L2scqG2 guidacampeggi.facilities full outer join venere.facilities on ((venere.facilities.facility) = (guidacampeggi.facilities.name) AND (venere.facilities.hotels_city) = (guidacampeggi.facilities.campings_city) AND (venere.facilities.hotel_name2) = (guidacampeggi.facilities.campings_name))

scqG1 result set = L1scqG1 full join L2scqG1 full join L3scqG1saperviaggiare.hotels full outer join venereEn.hotels on (

((venereEn.hotels.name2) = (saperviaggiare.hotels.name) AND (venereEn.hotels.city) = (saperviaggiare.hotels.city))) full outer join venereEn.maps_hotels on (((venereEn.maps_hotels.hotels_name2) = (saperviaggiare.hotels.name) AND (venereEn.maps_hotels.hotels_city) = (saperviaggiare.hotels.city)) OR ((venereEn.maps_hotels.hotels_name2) = (venereEn.hotels.name2) AND (venereEn.maps_hotels.hotels_city) = (venereEn.hotels.city)))

SELECT H.name , H.address , H.city , H.price , S.facility , S.structure_name , S.structure_city FROM hotels as H , facilities as S WHERE (H.city = S.structure_city ) AND (H.name = S.structure_name )ORDER BY H.price ASC

q0 = scqG1 result set join scqG2 result set

Page 76: An introduction to data integration

MOMISMOMIS – www.dbgroup.unimo.it – www.dbgroup.unimo.it 8181

DBGroupDATABASE GROUP Query ExecutionQuery Execution

The MOMIS Query Manager at work