close

Enter

Log in using OpenID

embedDownload
+
Advanced Database
Topics
(ΗΥ562)
Query Rewritting
(contd.)
Haridimos Kondylakis
kondylak@csd.uoc.gr
Spring 2014
http://www.csd.uoc.gr/~hy562/
+ Assignment 1

Due to March 13


23:59.00
What’s in there?

Query Containment

LAV & GAV mappings

Bucket & Minicon
2
+
Languages for Schema Mapping
Global
Schema
Q
Mediator Schema
Mediated
GAV
GLAV
LAV
Q’
Q’
Source
Local
Schema
Q’
Source
Local
Schema
Q’
Source
Local
Schema
Q’
Source
Local
Schema
Source
Local
Schema
+ GAV: Example revisited


Global Schema:

Movie(title, dir, year, genre)

Schedule(cinema, title, time)
Global
Schema
Integrating View

Create View Movie AS
SELECT title, dir, year, NULL
S2(title,dir, genre)
FROM S1
Union
S1(title,dir,year)
SELECT title, dir, NULL, genre
FROM S2

Query answering via unfolding
Select title,dir from Movie  select title, dir from
(SELECT title, dir, year, NULL FROM S1
Union
SELECT
4 title, dir, NULL, genre FROM S2)
+ Global-as-View Summary

Query reformulation boils
down to view unfolding

Very easy conceptually

Can build hierarchies of global
schemas

You sometimes loose
information


Mediator Relations
Source Relations
Not always natural
Adding sources is hard

Need to consider all other
sources that are available
5
+ LAV: Example revisited
Global
Schema

Global Schema:

Movie(title, dir, year, genre)

Schedule(cinema, title, time)

Source Views

Create View S1 AS
SELECT * FROM Movie

Create View S3 AS
S5(title,dir,year)
S1(title,dir,year,genre)
SELECT title, dir FROM Movie

Create View S5 AS
S3(title,dir)
SELECT title, dir, year FROM Movie
WHERE year > 1960 AND genre=‘Comedy’
6
+
Example revisited
V1(Std,Crs,Sem,Title) :- reg(Std,Crs,Sem), course(Crs,Title),
Crs ≥ 500, Sem ≥ Aut2010
V2(Std,Prof,Crs,Sem) :- reg(Std,Crs,Sem),teaches(Prof,Crs,Sem)
V3(Std,Crs) :- reg(Std,Crs,Sem), Sem ≤ Aut2005
V4(Prof,Crs,Title,Sem) :- reg(Std,Crs,Sem), course(Crs,Title),
teaches(Prof,Crs,Sem), Sem ≤ Aut2008
Query answering??
q(S,C,P) :- teaches(P,C,Q), reg(S,C,Q), course(C,T),
C ≥ 300, Q ≥ Aut2006
+ Query answering using views: a naïve
algorithm


Given:

conjunctive queries V1, . . . , Vk over schema σ

a query Q over σ
…if there is a rewriting of Q using V1, .
. . , Vk, then there is a rewriting with no
more conjuncts than in Q…
Algorithm:

guess a query Q′ over the views

Unfold Q′ in terms of the views

Check if the unfolding is contained in Q

If one unfolding is equivalent to Q, then Q′ is a rewriting

Otherwise take the union of all unfoldings contained in Q

it is a maximally contained rewriting
8
+
The State of Affairs

Bucket algorithm:
 deals well with predicates, Cartesian product can be large
(containment check required for every candidate
rewriting)

Inverse rules:
 modular (extensible to binding patterns, FD’s)
 no treatment of predicates
 resulting rewritings need significant further optimization
Neither scales up

The MiniCon algorithm:
 change perspective: look at query variables
+ The MiniCon Algorithm

A “much smarter” bucket algorithm:
 In many cases, we don’t need to perform the cross-product of all
items in all buckets
 Eliminates the need for the containment check

Basically, a modification to the bucket approach concentrating on
variables rather than subgoals (MiniCon Descriptions – MCD)
 “head homomorphism” – defines what variables must be equated
 variable-substituted version of the subgoals (mapping of variable
names)
 info about what’s covered
+ The MiniCon Algorithm

Property 1:
 If a variable occurs in the head of a query, then there must be a
corresponding variable in the head of the MCD view
 If a variable participates in a join predicate in the query, then it must
be in the head of the MCD view

For each subgoal of the query
 For each subgoal of each view
 Choose the least restrictive head homomorphism to match the
subgoal of the query
 If we can find a way of mapping the variables, then add MCD for
each possible “maximal” extension of the mapping that satisfies
Property 1
+
MiniCon Example
distinguished
variable
existential variable
Query:
q(X) :- cites(X,Y), cites(Y,X), sameTopic(X,Y)
Views:
V4(A) :- cites(A,B), cites(B,A)
V5(C,D) :- sameTopic(C,D)
V6(F,H) :- cites(F,G), cites(G,H), sameTopic(F,G)

Form buckets (aka MCDs) more intelligently:
Ask what is the minimal set of query subgoals that must be covered (via
mappings) by each view
First, look at join conditions in q
Then, follow joins on existential variables in views
MiniCon Example
q(X) :- cites(X,Y), cites(Y,X), sameTopic(X,Y)
Consider V4(A) :- cites(A,B), cites(B,A)

Can cover query subgoal cites(X,Y)

To do this, we map: XA, YB

Y is an existential join variable

So V4 needs to cover the query subgoals that contain Y, which are:
cites(Y,X) and sameTopic(X,Y)

V4 can cover the cites subgoal (XA, YB), but not the sameTopic

Hence, no MCD is created for V4
+
MiniCon Example
q(X) :- cites(X,Y), cites(Y,X), sameTopic(X,Y)
Consider V5(C,D) :- sameTopic(C,D)
• Can cover sameTopic(X,Y) (and nothing else)
• To do this, we map: XC, YD
• The MCD says how the query subgoals may be covered by V5
MiniCon Descriptions (MCDs)
View
V5
Mappings
XC, YD
Query Subgoals Covered
3
MiniCon Example
q(X) :- cites(X,Y), cites(Y,X), sameTopic(X,Y)
Consider V6(F,H) :- cites(F,G), cites(G,H), sameTopic(F,G)

Can cover query subgoal sameTopic(X,Y)

To do this, we map: XF, YG

Y is an existential join variable

So V6 needs to cover the query subgoals that contain Y, which are:
cites(X,Y) and cites(Y,X)

To do this, we map: XF, YG and XH
+
MiniCon Example

We get the following MCD for
V6(F,H) :- cites(F,G), cites(G,H), sameTopic(F,G)
MiniCon Descriptions (MCDs)
View
V5
Mappings
XC, YD
Query Subgoals Covered
3
V6
XF, YG, XH 1,2,3
Problem: Mappings do not define a function (homomorphism)

X is mapped to both F and H

Can we fix this?


Yes, if both F and H are distinguished in V6

No, otherwise
We fix the problem by making F and H equal when producing
the rewriting:

Query rewriting: q’(F) :- V6(F,F)

Since V6 covers all query subgoals, no other view is needed !!
+


MiniCon Algorithm
When forming MCDs:

Join obligations in query via existential variables cannot be fulfilled by
another view

Unless those variables are mapped to the view’s distinguished variables!

Current view should fulfill them all
When combining MCDs:


Combined MCDs must cover pairwise disjoint sets of query subgoals

Greatly reduces the number of candidates!

Also proven to be correct without the use of a containment check!
For every possible way of covering all query subgoals, “chain” the
corresponding MCDs of views in the covering, and form a rewriting

All query subgoals must be covered

The union of all such rewritings is the rewriting of the query w.r.t. the
given views
+
MiniCon Second Example
Query:
q(X) :- cites(X,Y), cites(Z,X), inSIGMOD(X)
Views:
V7(A) :- cites(A,B), inSIGMOD(A)
V8(C) :- cites(D,C), inSIGMOD(C)
MiniCon Descriptions (MCDs)
Step 1:
View
V7
Mappings
XA, YB
Query Subgoals Covered
1
V7
XA
3
V8
ZD, XC
2
V8
XC
3
MiniCon Second Example
MiniCon Descriptions (MCDs)
View
V7
Mappings
XA, YB
Query Subgoals Covered
1
V7
XA
3
V8
ZD, XC
2
V8
XC
3
Step 2:
Query rewriting 1: q1(X) :- V7(X), V8(X), V7(X)
Query rewriting 2: q2(X) :- V7(X), V8(X), V8(X)
Final rewriting: q’(X) :- V7(X), V8(X)
+ MiniCon and LAV Summary

Variations need to be made for:
 Constants in general (I sneaked those in)
 “Semi-interval” predicates (x <= c)

Note that full-blown inequality predicates are co-NP-hard in the size
of the data, so they don’t work

State-of-the-art algorithm for answering queries using views (AQUV) in the
relational world of data integration
 It’s been extended to support “conjunctive XQuery” as well

Scales to large numbers of views, which we need in LAV data integration
 Empirically shown to scale to 1000s of views, i.e., sources

A similar approach: Chase & Backchase by Tannen et al.
 Slightly more general in some ways – but:


Produces equivalent rewritings, not maximally contained ones
Not always polynomial in the size of the data
+ MiniCon Performance, Many Rewritings
Chain queries with 5 subgoals and all variables
distinguished
MiniCon
Inverse
Bucket
Time (sec)
10
8
6
4
2
0
1
2
3
4
5
6
7
Number of Views
8
9
10
11
+ Larger Query, Fewer Rewritings
Chain queries; 2 variables distinguished,
query of length 12, views of lengths 2, 3, and 4
Time (sec)
2
Minicon
Inverse
1.5
1
0.5
0
0
50
100
Number of Views
150
+
View-Based Query Answering

Maximally-contained rewritings are parameterized by query language.

More general question:


Given a set of view definitions, view instances and a query, what are all
the answers we can find?
We introduce certain answers as a mechanism for providing a formal
answer.
+
View Instances = Possible DB’s
Consider the two views:
V8 (dir ) : -Movie(ID, dir, actor)
V9 (actor) : -Movie(ID, dir, actor)
And suppose the extensions of the views
are:
V8: {Allen, Copolla}
V9: {Keaton, Pacino}
+
Possible Databases
There are multiple databases that satisfy the
above view definitions: (we ignore the first
argument of Movie below)
DB1. {(Allen, Keaton), (Coppola, Pacino)}
DB2. {(Allen, Pacino), (Coppola, Keaton)}
If we ask whether Allen directed a movie in
which Keaton acted, we can’t be sure.
Certain answers are those true in all databases that are
consistent with the views and their extensions.
+Certain Answers: Formal Definition
[Open-world Assumption]

Given:
 Views: V1,…,Vn
 View extensions v1,…vn
 A query Q

A tuple t is a certain answer to Q under the open-world assumption if t
 Q(D) for all databases D such that:
 Vi(D)  vi for all i.
+Certain Answers
[Closed-world Assumption]

Given:
 Views: V1,…,Vn
 View extensions v1,…vn
 A query Q

A tuple t is a certain answer to Q under the open-world assumption if t
 Q(D) for all databases D such that:
 Vi(D) = vi for all i.
+
Certain Answers: Example
V8 (dir) : -Director(ID,dir)
V9 (actor) : -Actor(ID,actor)
V8: {Allen}
V9: {Keaton}
Q(dir,actor) : -Director(ID,dir), Actor(ID,actor)
Under closed-world assumption:
single DB possible  (Allen, Keaton)
Under open-world assumption:
no certain answers.
+
The Good News

The MiniCon algorithm produce all certain answers

Assuming no interpreted predicates in the query (ok to have them in
the views)

Under open-world assumption

Corollary: they produce a maximally-contained rewriting
+
Summary
+ Paradigms of Data Integration
Integration
global defined
from local
CWA
global-schema-as-view
Database Schema Integration
global “independent”
of local
OWA
global-as-viewof-local
local-as-viewof-global
Data Warehousing
Mediation
+
Ontology Based Data Integration
+
Knowledge Representation


Knowledge representation (KR) focuses on more expressive languages
that database schemata and integrity constraints:

Designed for artificial intelligence applications (e.g., natural language
understanding, planning) where complex relationships exist between
objects.

KR uses ontologies to represent relationships between elements in a
knowledge base.
KR is relevant to data integration because relationships between data
sources can be complex.

The use of KR in data integration was investigated since the early days
of the field.
+KR in Data Integration: Example
Mediated
schema:
ontology with
classes and
relationships
Data sources: S1 has comedies and S2 documentaries
S3: movies with at least two awards
S4: comedies with at least one Oscar
+
Example: Part 1
Q1 ( x) : Movie( x)
S1 is relevant to Q1 because Comedy is a
subclass of Movie (by subsumption)
+
Example: Part 2
Q2 (x) : -Comedy(x)
S2 is irrelevant to Q2 because Comedy and
Documentary are declared disjoint.
+
Example: Part 3(a)
Q3 (x) : -Comedy(x), Award(x, y)
S3 is relevant to Q3 because movies with two awards
will definitely satisfy the second subgoal.
+
Example: Part 3(b)
Q3 (x) : -Comedy(x), Award(x, y)
S4 is relevant to Q3 because oscar is a subproperty of award.
+
Description Logics: Introduction

Description Logics are a subset of first-order logic:



Only unary predicates (called concepts) and binary predicates (called
roles, properties).
Knowledge bases are composed of:

T-box: defining the concepts and the roles

A-box: including ground facts about individuals
Complex concepts are defined by concept descriptions:

The expressive power of the language is determined by the set of
constructors in the grammar of concept descriptions

Complex roles can also be defined via constructors
+
T-Boxes

Can include statements of the form:
A : -C
A Í C (it should really be a square inclusion)

A is a base concept and C can be a concept description.

Example grammar for concept descriptions – see next slide.
+
An example Grammar for Concept
Descriptions.
C,D are complex concepts. A is a base concept.
Many other constructors possible: union, existential
quantification, equality on role paths,…
+
Example Terminology
a1: Italians are people
a2: Comedies are movies
a3: Comedies are disjoint from documentaries
a4: Movies have at most one director
a5: Award movies are those that have at least one award
a6: Italian hits are award movies whose director is Italian
+
Abox: the Ground Facts

A set of assertions of the form C(a), or R(a,b)


b is called an R-filler of a.
C and R can be concept descriptions

Akin to asserting that a tuple is in a view rather than in base relations

Below, we state that LaStrada is an Italian hit, we’re not given the
director or the award it won.
+
Semantics of Description Logics
 Semantics
are based on interpretations.
a knowledge base Δ, the models of Δ are the
interpretations that are consistent Δ’s T-box and A-box.
 Given
fact that is true in all models of Δ are said to be
entailed by Δ.
 Any
+
Interpretations: Formally

An interpretation I contains a non-empty domain of objects, OI .

I assigns an object aI in to every constant a in the A-box.
 We make the unique names assumption: a≠b implies that aI≠bI

I assigns CI , a subset of OI, to every concept C

I assigns a binary relation RI, a subset of OI x OI to every role R.
+
Extensions of Complex Expressions
The extensions of concept and role descriptions
are given by the following equations. (#S
denotes the cardinality of the set S).
+Conditions on Models

An interpretation of Δ is a model if the following conditions hold:
A Í C for every inclusion A Í C
I
I
AI = CI for every statement A := C
aI Í CI for every statement C(a)
(a , b ) Í R for every statement R(a, b)
I
I
I
+
Example Interpretation
Assume an interpretation with the identity
mapping on individuals in the knowledge base
and a few extra elements (Director1, Award1,
Actor2, …).
The following interpretation is a model:
+
Example Interpretation

Notes:

We do not know the director of LaStrada or its award.

Removing LifeIsBeautiful from Comedy would make it a non-model.

Adding another director would also make it a non-model.
+
Inference in Description Logics

This is where all the action is: coming up with efficient algorithms for
inference and understanding the complexity of the problems.

Subsumption (only for the T-box):

A concept C is said to be subsumed by concept D w.r.t. a T-box T, if in
every model, I, of T,

Examples:
CI Í D I
MovieÙ(³2award) is subsumed by AwardMovies
MovieÙ("director.Person) is subsumed by ItalianHits
+
Query answering with DLs

The simple case – instance checking:


Does Δ entail C(a) or R(a,b)? i.e., does C(a)/R(a,b) hold in every model
of Δ?
The more general problem is query answering. Find the answers to a
conjunctive query
Q(X) : -g1 (X1 ),..., gn (Xn )

where g1,…, gn can be concept and/or role names.
+
Semantics of Conjunctive Queries
Q(X) : -g1 (X1 ),..., gn (Xn )
• Compute the answer to Q in every model of Δ
• Any tuple that is in the intersection of the
answers is entailed by Δ.
• This should remind you of the semantics of
certain answers!
• Let’s look at a few examples.
+
Query Answering: Example 1

Consider the Q1 over the following A-box

Applying Q1 directly to the A-box would yield no answers (award would
not be matched)

However, ItalianHits(LifeIsBeautiful) implies that LifeIsBeautiful won at
least one award.

Hence, LifeIsBeautiful should be in the answer!
+
Query Answering: Example 2

Consider Q2:

With the following A-box

{Comedy(LaFunivia), director(LaFunivia,Antonioni), Italian(Antonioni)}

Neither conjunctive query will yield an answer because we know nothing
about awards.

However, we can reason by cases that the following is entailed by Q2.
+
End of Example 2

Ok, we have

But that’s not enough to infer that LaFunivia should be in the answer.

However, we also know that movies have at most one director, so:
Italian(Antonioni)Ù director(LaFunivia, Antonioni) Þ
"director.Italian(LaFunivia)

Hence, LaFunivia is an answer to Q2.
+
Comparing DLs to OODB


Object-oriented databases:

Also focus on unary and binary relations

OODB’s are more focused on modeling the physical aspects of objects
and their properties

An object can only belong to a single (most specific) class.
Description logics are about knowledge and complex relationships:

Class membership can be inferred

An individual can belong to multiple classes.
+
Comparing DLs to Relational Views


In principle, concept descriptions are view definitions

Relational views employ: selection, projection, join, union and apply to
more than unary and binary relations.

DLs: universal quantification, number restrictions, intersection, …
Subsumption = query containment


Universal quantification and number restrictions would require
negation in conjunctive queries. Hence containment would be
undecidable
In DL’s you can put facts directly in views (i.e., complex concept).
+
More Issues To Tackle
+ More Issues to Tackle

Performance issues
 the size of the problem, remember?
 network behavior (delay, bursting data)

Source Discovery

Schema Discovery

Approximate answers
 a less precise answer is often better than no
(precise) answer at all

Choosing the right sources wrt to user
demands
 not everybody has the same evaluation
criterion for an answer especially when
multimedia are involved, e.g. (e.g. most
“vivid”, with highest resolution, fastest, most
reliable, etc. )
Readings

Chapter 9 in “Web Data Management” Book

Chapter 12 in “Principles of Data Integration”

For formal foundations and query rewriting algorithms read:
 Alon Halevy
 Answering Queries Using Views: A Survey, VLDB Journal, 2001
 MiniCon: A scalable algorithm for answering queries using
views, VLDB Journal, 2001
 Data Integration: a Status Report Invited Talk German Database
Conference (BTW), 2003
 Maurizio Lenzerini
 Data Integration: A Theoretical Perspective Invited Talk PODS
2002
+ Acknowledgements

Diego Calvanese
 Query Processing in Data Integration Lecture Slides
www.inf.unibz.it/~calvanese/teaching/05-06-data-integration/

Jeffrey D. Ullman
 Lecture Slides www-db.stanford.edu/~ullman/cs345-notes.html

Laks VS Lakshmanan
 Lecture Slides http://www.cs.ubc.ca/~laks/534b/notes.html

Zachary G. Ives
 Lecture Slides www.seas.upenn.edu/~zives/05s/cis650/

Michalis Petropoulos
 Lecture Slides http://www.cse.buffalo.edu/~mpetropo/CSE636-FA08/
+
Advanced Database
Topics
(ΗΥ562)
Data Integration
Questions?
Spring 2014
http://www.csd.uoc.gr/~hy562/
“The scientific mind does not so much provide the
right answers as ask the right questions.”
-strauss
+ Complexity classes: what you always wanted
to know but never dared to ask!

The big divide: PTIME (computable in polynomial time, i.e.O(nk) for
some fixed k)

Inside PTIME: tractable queries (although high-degree polynomial
are intractable)

Outside PTIME: intractable queries (efficient algorithms are
unlikely)

Way outside PTIME: provably intractable queries (efficient
algorithms do not exist)

EXPTIME: cn-algorithms for a constant c. Could still be ok for not
very large inputs

Even further – 2-EXPTIME: cc . Cannot be ok even for small inputs
(compare 2 10 and 2 2 10 ).
n
63
+ Inside PTIME
AC0 ⊆ TC0 ⊆ NC1 ⊆ DLOG ⊆ NLOG ⊆ PTIME

AC0: very efficient parallel algorithms (constant time, simple
circuits)


TC0: very efficient parallel algorithms in a more powerful
computational model with counting gates


regular languages
DLOG: very little – O(log n) – space is required


basic SQL (relational calculus/grouping/aggregation)
NC1: efficient parallel algorithms


relational calculus
SQL + (restricted) transitive closure
NLOG: O(log n) space is required if nondeterminism is allowed

SQL + transitive closure (as in the SQL3 standard)
64
+ Beyond PTIME

PTIME: can solve a problem in polynomial time

NP: can check a given candidate solution in polynomial time


coNP: complement of NP – verify that all “reasonable” candidates
are solutions to a given problem.


another way of looking at it: guess a solution, and then verify if
you guessed it right in polynomial time
Appears to be harder than NP but the precise relationship isn’t
known
PSPACE: can be solved using memory of polynomial size (but
perhaps an exponential-time algorithm)
65
+ Complete problems

These are the hardest problems in a class.

If our problem is as hard as a complete problem, it is very unlikely it
can be done with lower complexity.

For NP:


SAT (satisfiability of Boolean formulae)

many graph problems (e.g. 3-colourability)

Integer linear programming etc
For PSPACE:

Quantified SAT

Two XML DTDs are equivalent
66
+ Complexity of Relational Calculus



We consider the complexity of the recognition problem, i.e., checking
whether a tuple of constants is in the answer to a query:

measured wrt the size of the database; data complexity

measured wrt the size of the query and the database; combined complexity
Complexity of relational calculus

data complexity: polynomial, actually in LogSpace

combined complexity: PSpace-complete
Complexity of conjunctive queries

data complexity: in LogSpace

combined complexity: NP-complete
Complexity of Query Containment


Conjunctive Queries (CQ) (NPComplete)
 Q1: p(X,Z):- a(X,Y) & a(Y,Z)
 Q2: p(X,Z):- a(X,Y) & a(V,Z)
CQ’s With Negation ( 2-Complete)
 Q1: p(X,Z) :- a(X,Y) & a(Y,Z) &
NOT a(X,Z)
p

CQ’s
With Arihmetic Comparison (
p
-Complete)
2
 Q1: p(X,Z) :- a(X,Y) & a(Y,Z) &
X<Y

Datalog Programs
 p(A,C) :- a(A,B) & b(B,C)

p
k
 k  
p
p
k 1
 k  k
p
p
+ Complexity of View-based Query Answering
User Query
Source Descriptions
Serge Abiteboul, Oliver Duschka 2010
1/--pages
Report inappropriate content