+ Advanced Database Topics (ΗΥ562) Query Rewritting (contd.) Haridimos Kondylakis kondylak@csd.uoc.gr Spring 2014 http://www.csd.uoc.gr/~hy562/ + Assignment 1 Due to March 13 23:59.00 What’s in there? Query Containment LAV & GAV mappings Bucket & Minicon 2 + Languages for Schema Mapping Global Schema Q Mediator Schema Mediated GAV GLAV LAV Q’ Q’ Source Local Schema Q’ Source Local Schema Q’ Source Local Schema Q’ Source Local Schema Source Local Schema + GAV: Example revisited Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Global Schema Integrating View Create View Movie AS SELECT title, dir, year, NULL S2(title,dir, genre) FROM S1 Union S1(title,dir,year) SELECT title, dir, NULL, genre FROM S2 Query answering via unfolding Select title,dir from Movie select title, dir from (SELECT title, dir, year, NULL FROM S1 Union SELECT 4 title, dir, NULL, genre FROM S2) + Global-as-View Summary Query reformulation boils down to view unfolding Very easy conceptually Can build hierarchies of global schemas You sometimes loose information Mediator Relations Source Relations Not always natural Adding sources is hard Need to consider all other sources that are available 5 + LAV: Example revisited Global Schema Global Schema: Movie(title, dir, year, genre) Schedule(cinema, title, time) Source Views Create View S1 AS SELECT * FROM Movie Create View S3 AS S5(title,dir,year) S1(title,dir,year,genre) SELECT title, dir FROM Movie Create View S5 AS S3(title,dir) SELECT title, dir, year FROM Movie WHERE year > 1960 AND genre=‘Comedy’ 6 + Example revisited V1(Std,Crs,Sem,Title) :- reg(Std,Crs,Sem), course(Crs,Title), Crs ≥ 500, Sem ≥ Aut2010 V2(Std,Prof,Crs,Sem) :- reg(Std,Crs,Sem),teaches(Prof,Crs,Sem) V3(Std,Crs) :- reg(Std,Crs,Sem), Sem ≤ Aut2005 V4(Prof,Crs,Title,Sem) :- reg(Std,Crs,Sem), course(Crs,Title), teaches(Prof,Crs,Sem), Sem ≤ Aut2008 Query answering?? q(S,C,P) :- teaches(P,C,Q), reg(S,C,Q), course(C,T), C ≥ 300, Q ≥ Aut2006 + Query answering using views: a naïve algorithm Given: conjunctive queries V1, . . . , Vk over schema σ a query Q over σ …if there is a rewriting of Q using V1, . . . , Vk, then there is a rewriting with no more conjuncts than in Q… Algorithm: guess a query Q′ over the views Unfold Q′ in terms of the views Check if the unfolding is contained in Q If one unfolding is equivalent to Q, then Q′ is a rewriting Otherwise take the union of all unfoldings contained in Q it is a maximally contained rewriting 8 + The State of Affairs Bucket algorithm: deals well with predicates, Cartesian product can be large (containment check required for every candidate rewriting) Inverse rules: modular (extensible to binding patterns, FD’s) no treatment of predicates resulting rewritings need significant further optimization Neither scales up The MiniCon algorithm: change perspective: look at query variables + The MiniCon Algorithm A “much smarter” bucket algorithm: In many cases, we don’t need to perform the cross-product of all items in all buckets Eliminates the need for the containment check Basically, a modification to the bucket approach concentrating on variables rather than subgoals (MiniCon Descriptions – MCD) “head homomorphism” – defines what variables must be equated variable-substituted version of the subgoals (mapping of variable names) info about what’s covered + The MiniCon Algorithm Property 1: If a variable occurs in the head of a query, then there must be a corresponding variable in the head of the MCD view If a variable participates in a join predicate in the query, then it must be in the head of the MCD view For each subgoal of the query For each subgoal of each view Choose the least restrictive head homomorphism to match the subgoal of the query If we can find a way of mapping the variables, then add MCD for each possible “maximal” extension of the mapping that satisfies Property 1 + MiniCon Example distinguished variable existential variable Query: q(X) :- cites(X,Y), cites(Y,X), sameTopic(X,Y) Views: V4(A) :- cites(A,B), cites(B,A) V5(C,D) :- sameTopic(C,D) V6(F,H) :- cites(F,G), cites(G,H), sameTopic(F,G) Form buckets (aka MCDs) more intelligently: Ask what is the minimal set of query subgoals that must be covered (via mappings) by each view First, look at join conditions in q Then, follow joins on existential variables in views MiniCon Example q(X) :- cites(X,Y), cites(Y,X), sameTopic(X,Y) Consider V4(A) :- cites(A,B), cites(B,A) Can cover query subgoal cites(X,Y) To do this, we map: XA, YB Y is an existential join variable So V4 needs to cover the query subgoals that contain Y, which are: cites(Y,X) and sameTopic(X,Y) V4 can cover the cites subgoal (XA, YB), but not the sameTopic Hence, no MCD is created for V4 + MiniCon Example q(X) :- cites(X,Y), cites(Y,X), sameTopic(X,Y) Consider V5(C,D) :- sameTopic(C,D) • Can cover sameTopic(X,Y) (and nothing else) • To do this, we map: XC, YD • The MCD says how the query subgoals may be covered by V5 MiniCon Descriptions (MCDs) View V5 Mappings XC, YD Query Subgoals Covered 3 MiniCon Example q(X) :- cites(X,Y), cites(Y,X), sameTopic(X,Y) Consider V6(F,H) :- cites(F,G), cites(G,H), sameTopic(F,G) Can cover query subgoal sameTopic(X,Y) To do this, we map: XF, YG Y is an existential join variable So V6 needs to cover the query subgoals that contain Y, which are: cites(X,Y) and cites(Y,X) To do this, we map: XF, YG and XH + MiniCon Example We get the following MCD for V6(F,H) :- cites(F,G), cites(G,H), sameTopic(F,G) MiniCon Descriptions (MCDs) View V5 Mappings XC, YD Query Subgoals Covered 3 V6 XF, YG, XH 1,2,3 Problem: Mappings do not define a function (homomorphism) X is mapped to both F and H Can we fix this? Yes, if both F and H are distinguished in V6 No, otherwise We fix the problem by making F and H equal when producing the rewriting: Query rewriting: q’(F) :- V6(F,F) Since V6 covers all query subgoals, no other view is needed !! + MiniCon Algorithm When forming MCDs: Join obligations in query via existential variables cannot be fulfilled by another view Unless those variables are mapped to the view’s distinguished variables! Current view should fulfill them all When combining MCDs: Combined MCDs must cover pairwise disjoint sets of query subgoals Greatly reduces the number of candidates! Also proven to be correct without the use of a containment check! For every possible way of covering all query subgoals, “chain” the corresponding MCDs of views in the covering, and form a rewriting All query subgoals must be covered The union of all such rewritings is the rewriting of the query w.r.t. the given views + MiniCon Second Example Query: q(X) :- cites(X,Y), cites(Z,X), inSIGMOD(X) Views: V7(A) :- cites(A,B), inSIGMOD(A) V8(C) :- cites(D,C), inSIGMOD(C) MiniCon Descriptions (MCDs) Step 1: View V7 Mappings XA, YB Query Subgoals Covered 1 V7 XA 3 V8 ZD, XC 2 V8 XC 3 MiniCon Second Example MiniCon Descriptions (MCDs) View V7 Mappings XA, YB Query Subgoals Covered 1 V7 XA 3 V8 ZD, XC 2 V8 XC 3 Step 2: Query rewriting 1: q1(X) :- V7(X), V8(X), V7(X) Query rewriting 2: q2(X) :- V7(X), V8(X), V8(X) Final rewriting: q’(X) :- V7(X), V8(X) + MiniCon and LAV Summary Variations need to be made for: Constants in general (I sneaked those in) “Semi-interval” predicates (x <= c) Note that full-blown inequality predicates are co-NP-hard in the size of the data, so they don’t work State-of-the-art algorithm for answering queries using views (AQUV) in the relational world of data integration It’s been extended to support “conjunctive XQuery” as well Scales to large numbers of views, which we need in LAV data integration Empirically shown to scale to 1000s of views, i.e., sources A similar approach: Chase & Backchase by Tannen et al. Slightly more general in some ways – but: Produces equivalent rewritings, not maximally contained ones Not always polynomial in the size of the data + MiniCon Performance, Many Rewritings Chain queries with 5 subgoals and all variables distinguished MiniCon Inverse Bucket Time (sec) 10 8 6 4 2 0 1 2 3 4 5 6 7 Number of Views 8 9 10 11 + Larger Query, Fewer Rewritings Chain queries; 2 variables distinguished, query of length 12, views of lengths 2, 3, and 4 Time (sec) 2 Minicon Inverse 1.5 1 0.5 0 0 50 100 Number of Views 150 + View-Based Query Answering Maximally-contained rewritings are parameterized by query language. More general question: Given a set of view definitions, view instances and a query, what are all the answers we can find? We introduce certain answers as a mechanism for providing a formal answer. + View Instances = Possible DB’s Consider the two views: V8 (dir ) : -Movie(ID, dir, actor) V9 (actor) : -Movie(ID, dir, actor) And suppose the extensions of the views are: V8: {Allen, Copolla} V9: {Keaton, Pacino} + Possible Databases There are multiple databases that satisfy the above view definitions: (we ignore the first argument of Movie below) DB1. {(Allen, Keaton), (Coppola, Pacino)} DB2. {(Allen, Pacino), (Coppola, Keaton)} If we ask whether Allen directed a movie in which Keaton acted, we can’t be sure. Certain answers are those true in all databases that are consistent with the views and their extensions. +Certain Answers: Formal Definition [Open-world Assumption] Given: Views: V1,…,Vn View extensions v1,…vn A query Q A tuple t is a certain answer to Q under the open-world assumption if t Q(D) for all databases D such that: Vi(D) vi for all i. +Certain Answers [Closed-world Assumption] Given: Views: V1,…,Vn View extensions v1,…vn A query Q A tuple t is a certain answer to Q under the open-world assumption if t Q(D) for all databases D such that: Vi(D) = vi for all i. + Certain Answers: Example V8 (dir) : -Director(ID,dir) V9 (actor) : -Actor(ID,actor) V8: {Allen} V9: {Keaton} Q(dir,actor) : -Director(ID,dir), Actor(ID,actor) Under closed-world assumption: single DB possible (Allen, Keaton) Under open-world assumption: no certain answers. + The Good News The MiniCon algorithm produce all certain answers Assuming no interpreted predicates in the query (ok to have them in the views) Under open-world assumption Corollary: they produce a maximally-contained rewriting + Summary + Paradigms of Data Integration Integration global defined from local CWA global-schema-as-view Database Schema Integration global “independent” of local OWA global-as-viewof-local local-as-viewof-global Data Warehousing Mediation + Ontology Based Data Integration + Knowledge Representation Knowledge representation (KR) focuses on more expressive languages that database schemata and integrity constraints: Designed for artificial intelligence applications (e.g., natural language understanding, planning) where complex relationships exist between objects. KR uses ontologies to represent relationships between elements in a knowledge base. KR is relevant to data integration because relationships between data sources can be complex. The use of KR in data integration was investigated since the early days of the field. +KR in Data Integration: Example Mediated schema: ontology with classes and relationships Data sources: S1 has comedies and S2 documentaries S3: movies with at least two awards S4: comedies with at least one Oscar + Example: Part 1 Q1 ( x) : Movie( x) S1 is relevant to Q1 because Comedy is a subclass of Movie (by subsumption) + Example: Part 2 Q2 (x) : -Comedy(x) S2 is irrelevant to Q2 because Comedy and Documentary are declared disjoint. + Example: Part 3(a) Q3 (x) : -Comedy(x), Award(x, y) S3 is relevant to Q3 because movies with two awards will definitely satisfy the second subgoal. + Example: Part 3(b) Q3 (x) : -Comedy(x), Award(x, y) S4 is relevant to Q3 because oscar is a subproperty of award. + Description Logics: Introduction Description Logics are a subset of first-order logic: Only unary predicates (called concepts) and binary predicates (called roles, properties). Knowledge bases are composed of: T-box: defining the concepts and the roles A-box: including ground facts about individuals Complex concepts are defined by concept descriptions: The expressive power of the language is determined by the set of constructors in the grammar of concept descriptions Complex roles can also be defined via constructors + T-Boxes Can include statements of the form: A : -C A Í C (it should really be a square inclusion) A is a base concept and C can be a concept description. Example grammar for concept descriptions – see next slide. + An example Grammar for Concept Descriptions. C,D are complex concepts. A is a base concept. Many other constructors possible: union, existential quantification, equality on role paths,… + Example Terminology a1: Italians are people a2: Comedies are movies a3: Comedies are disjoint from documentaries a4: Movies have at most one director a5: Award movies are those that have at least one award a6: Italian hits are award movies whose director is Italian + Abox: the Ground Facts A set of assertions of the form C(a), or R(a,b) b is called an R-filler of a. C and R can be concept descriptions Akin to asserting that a tuple is in a view rather than in base relations Below, we state that LaStrada is an Italian hit, we’re not given the director or the award it won. + Semantics of Description Logics Semantics are based on interpretations. a knowledge base Δ, the models of Δ are the interpretations that are consistent Δ’s T-box and A-box. Given fact that is true in all models of Δ are said to be entailed by Δ. Any + Interpretations: Formally An interpretation I contains a non-empty domain of objects, OI . I assigns an object aI in to every constant a in the A-box. We make the unique names assumption: a≠b implies that aI≠bI I assigns CI , a subset of OI, to every concept C I assigns a binary relation RI, a subset of OI x OI to every role R. + Extensions of Complex Expressions The extensions of concept and role descriptions are given by the following equations. (#S denotes the cardinality of the set S). +Conditions on Models An interpretation of Δ is a model if the following conditions hold: A Í C for every inclusion A Í C I I AI = CI for every statement A := C aI Í CI for every statement C(a) (a , b ) Í R for every statement R(a, b) I I I + Example Interpretation Assume an interpretation with the identity mapping on individuals in the knowledge base and a few extra elements (Director1, Award1, Actor2, …). The following interpretation is a model: + Example Interpretation Notes: We do not know the director of LaStrada or its award. Removing LifeIsBeautiful from Comedy would make it a non-model. Adding another director would also make it a non-model. + Inference in Description Logics This is where all the action is: coming up with efficient algorithms for inference and understanding the complexity of the problems. Subsumption (only for the T-box): A concept C is said to be subsumed by concept D w.r.t. a T-box T, if in every model, I, of T, Examples: CI Í D I MovieÙ(³2award) is subsumed by AwardMovies MovieÙ("director.Person) is subsumed by ItalianHits + Query answering with DLs The simple case – instance checking: Does Δ entail C(a) or R(a,b)? i.e., does C(a)/R(a,b) hold in every model of Δ? The more general problem is query answering. Find the answers to a conjunctive query Q(X) : -g1 (X1 ),..., gn (Xn ) where g1,…, gn can be concept and/or role names. + Semantics of Conjunctive Queries Q(X) : -g1 (X1 ),..., gn (Xn ) • Compute the answer to Q in every model of Δ • Any tuple that is in the intersection of the answers is entailed by Δ. • This should remind you of the semantics of certain answers! • Let’s look at a few examples. + Query Answering: Example 1 Consider the Q1 over the following A-box Applying Q1 directly to the A-box would yield no answers (award would not be matched) However, ItalianHits(LifeIsBeautiful) implies that LifeIsBeautiful won at least one award. Hence, LifeIsBeautiful should be in the answer! + Query Answering: Example 2 Consider Q2: With the following A-box {Comedy(LaFunivia), director(LaFunivia,Antonioni), Italian(Antonioni)} Neither conjunctive query will yield an answer because we know nothing about awards. However, we can reason by cases that the following is entailed by Q2. + End of Example 2 Ok, we have But that’s not enough to infer that LaFunivia should be in the answer. However, we also know that movies have at most one director, so: Italian(Antonioni)Ù director(LaFunivia, Antonioni) Þ "director.Italian(LaFunivia) Hence, LaFunivia is an answer to Q2. + Comparing DLs to OODB Object-oriented databases: Also focus on unary and binary relations OODB’s are more focused on modeling the physical aspects of objects and their properties An object can only belong to a single (most specific) class. Description logics are about knowledge and complex relationships: Class membership can be inferred An individual can belong to multiple classes. + Comparing DLs to Relational Views In principle, concept descriptions are view definitions Relational views employ: selection, projection, join, union and apply to more than unary and binary relations. DLs: universal quantification, number restrictions, intersection, … Subsumption = query containment Universal quantification and number restrictions would require negation in conjunctive queries. Hence containment would be undecidable In DL’s you can put facts directly in views (i.e., complex concept). + More Issues To Tackle + More Issues to Tackle Performance issues the size of the problem, remember? network behavior (delay, bursting data) Source Discovery Schema Discovery Approximate answers a less precise answer is often better than no (precise) answer at all Choosing the right sources wrt to user demands not everybody has the same evaluation criterion for an answer especially when multimedia are involved, e.g. (e.g. most “vivid”, with highest resolution, fastest, most reliable, etc. ) Readings Chapter 9 in “Web Data Management” Book Chapter 12 in “Principles of Data Integration” For formal foundations and query rewriting algorithms read: Alon Halevy Answering Queries Using Views: A Survey, VLDB Journal, 2001 MiniCon: A scalable algorithm for answering queries using views, VLDB Journal, 2001 Data Integration: a Status Report Invited Talk German Database Conference (BTW), 2003 Maurizio Lenzerini Data Integration: A Theoretical Perspective Invited Talk PODS 2002 + Acknowledgements Diego Calvanese Query Processing in Data Integration Lecture Slides www.inf.unibz.it/~calvanese/teaching/05-06-data-integration/ Jeffrey D. Ullman Lecture Slides www-db.stanford.edu/~ullman/cs345-notes.html Laks VS Lakshmanan Lecture Slides http://www.cs.ubc.ca/~laks/534b/notes.html Zachary G. Ives Lecture Slides www.seas.upenn.edu/~zives/05s/cis650/ Michalis Petropoulos Lecture Slides http://www.cse.buffalo.edu/~mpetropo/CSE636-FA08/ + Advanced Database Topics (ΗΥ562) Data Integration Questions? Spring 2014 http://www.csd.uoc.gr/~hy562/ “The scientific mind does not so much provide the right answers as ask the right questions.” -strauss + Complexity classes: what you always wanted to know but never dared to ask! The big divide: PTIME (computable in polynomial time, i.e.O(nk) for some ﬁxed k) Inside PTIME: tractable queries (although high-degree polynomial are intractable) Outside PTIME: intractable queries (eﬃcient algorithms are unlikely) Way outside PTIME: provably intractable queries (eﬃcient algorithms do not exist) EXPTIME: cn-algorithms for a constant c. Could still be ok for not very large inputs Even further – 2-EXPTIME: cc . Cannot be ok even for small inputs (compare 2 10 and 2 2 10 ). n 63 + Inside PTIME AC0 ⊆ TC0 ⊆ NC1 ⊆ DLOG ⊆ NLOG ⊆ PTIME AC0: very efficient parallel algorithms (constant time, simple circuits) TC0: very efficient parallel algorithms in a more powerful computational model with counting gates regular languages DLOG: very little – O(log n) – space is required basic SQL (relational calculus/grouping/aggregation) NC1: efficient parallel algorithms relational calculus SQL + (restricted) transitive closure NLOG: O(log n) space is required if nondeterminism is allowed SQL + transitive closure (as in the SQL3 standard) 64 + Beyond PTIME PTIME: can solve a problem in polynomial time NP: can check a given candidate solution in polynomial time coNP: complement of NP – verify that all “reasonable” candidates are solutions to a given problem. another way of looking at it: guess a solution, and then verify if you guessed it right in polynomial time Appears to be harder than NP but the precise relationship isn’t known PSPACE: can be solved using memory of polynomial size (but perhaps an exponential-time algorithm) 65 + Complete problems These are the hardest problems in a class. If our problem is as hard as a complete problem, it is very unlikely it can be done with lower complexity. For NP: SAT (satisfiability of Boolean formulae) many graph problems (e.g. 3-colourability) Integer linear programming etc For PSPACE: Quantified SAT Two XML DTDs are equivalent 66 + Complexity of Relational Calculus We consider the complexity of the recognition problem, i.e., checking whether a tuple of constants is in the answer to a query: measured wrt the size of the database; data complexity measured wrt the size of the query and the database; combined complexity Complexity of relational calculus data complexity: polynomial, actually in LogSpace combined complexity: PSpace-complete Complexity of conjunctive queries data complexity: in LogSpace combined complexity: NP-complete Complexity of Query Containment Conjunctive Queries (CQ) (NPComplete) Q1: p(X,Z):- a(X,Y) & a(Y,Z) Q2: p(X,Z):- a(X,Y) & a(V,Z) CQ’s With Negation ( 2-Complete) Q1: p(X,Z) :- a(X,Y) & a(Y,Z) & NOT a(X,Z) p CQ’s With Arihmetic Comparison ( p -Complete) 2 Q1: p(X,Z) :- a(X,Y) & a(Y,Z) & X<Y Datalog Programs p(A,C) :- a(A,B) & b(B,C) p k k p p k 1 k k p p + Complexity of View-based Query Answering User Query Source Descriptions Serge Abiteboul, Oliver Duschka 2010

1/--pages