Querying Semi-Structured Data ? Serge Abiteboul INRIA-Rocquencourt [email protected] 1 Introduction The amount of data of all kinds available electronically has increased dramat- ically in recent years. The data resides in di(cid:11)erent forms, ranging from un- structured data in (cid:12)le systems to highly structured in relational database sys- tems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-speci(cid:12)c interfaces, or data exchange for- mats.Some ofthis data is rawdata,e.g.,images or sound.Some ofit has struc- ture even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to beextractedfromthedata.Sometimesalsoitexists butweprefertoignoreitfor certain purposes such as browsing. We call here semi-structured data this data thatis(fromaparticular viewpoint)neither rawdatanorstrictly typed,i.e.,not table-oriented as in a relational model or sorted-graph as in object databases. As will seen later when the notion of semi-structured data is more precisely de(cid:12)ned, theneed forsemi-structured dataarises naturallyin thecontext ofdata integration,evenwhenthedatasourcesarethemselveswell-structured.Although data integration is an old topic, the need to integrate a wider variety of data- formats (e.g., SGML or ASN.1 data) and data found on the Web has brought the topic of semi-structured data to the forefront of research. The main purpose of the paper is to isolate the essential aspects of semi- structured data. We also survey some proposals of models and query languages for semi-structured data. In particular, we consider recent works at Stanford U. and U. Penn on semi-structured data. In both cases, the motivation is found in the integration of heterogeneous data. The \lightweight" data models they use (based on labelled graphs) are very similar. As we shall see, the topic of semi-structured data has no precise boundary. Furthermore, a theory of semi-structured data is still missing. We will try to highlight some important issues in this context. The paper is organized as follows.InSection 2,wediscuss the particularities ofsemi-structured data.InSection 3,weconsider the issue ofthe datastructure and in Section 4, the issue of the query language. ? Currently visiting theComputerScience Dept.,Stanford U.Worksupported inpart by CESDIS, NASAGoddard Space Flight Center;by the Air ForceWright Labora- tory Aeronautical Systems Centerunder ARPAContractF33615-93-1-1339, and by the Air Force Rome Laboratories under ARPAContract F30602-95-C-0119. Report Documentation Page Form Approved OMB No. 0704-0188 Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number. 1. REPORT DATE 2. REPORT TYPE 3. DATES COVERED 1997 N/A - 4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER Querying Semi-Sttuctured Data 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER 6. AUTHOR(S) 5d. PROJECT NUMBER 5e. TASK NUMBER 5f. WORK UNIT NUMBER 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION Air Force Wright Patterson AFB, OH 45433 REPORT NUMBER 9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S) 11. SPONSOR/MONITOR’S REPORT NUMBER(S) 12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release, distribution unlimited 13. SUPPLEMENTARY NOTES 14. ABSTRACT 15. SUBJECT TERMS 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF 18. NUMBER 19a. NAME OF ABSTRACT OF PAGES RESPONSIBLE PERSON a. REPORT b. ABSTRACT c. THIS PAGE UU 16 unclassified unclassified unclassified Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18 2 Semi-Structured Data In this section, we make more precise what we mean by semi-structured data, how such data arises, and emphasize its main aspects. Roughlyspeaking, semi-structured datais datathat is neither rawdata,nor very strictly typed as in conventional database systems. Clearly, this de(cid:12)nition is imprecise. Forinstance, would aBibTex (cid:12)le be considered structured orsemi- structured?Indeed,thesamepieceofinformationmaybeviewedasunstructured at some early processing stage, but later become very structured after some analysishasbeenperformed.Inthissection,wegiveexamplesofsemi-structured data,makemoreprecisethisnotionanddescribeimportantissuesinthiscontext. 2.1 Examples We will often discuss in this paper BibTex (cid:12)les [Lam94] that present the ad- vantage of being more familiar to researchers than other well-accepted formats such asSGML [ISO86]or ASN.1[ISO87].Datain BibTex (cid:12)les closely resembles relational data. Such a (cid:12)le is composed of records. But, the structure is not as regular. Some (cid:12)elds may be missing. (Indeed, it is customary to even (cid:12)nd com- pulsory(cid:12)eldsmissing.)Other(cid:12)eldshavesomemeaningfulstructure,e.g.,author. Therearecomplexfeaturessuchasabbreviationsorcrossreferences thatarenot easy to describe in some database systems. The Web also provides numerous popular examples ofsemi-structured data. IntheWeb,dataconsists of(cid:12)lesinaparticular format,HTML,withsomestruc- turing primitives such as tags and anchors. A typical example is a data source aboutrestaurants intheBayArea(fromthe PaloAltoWeeklynewspaper), that we will call Guide. It consists of an HTML (cid:12)le with one entry per restaurant and provides some information on prices, addresses, styles of restaurants and reviews. Data in Guide resides in (cid:12)les oftext with some implicit structure. One can write a parser to extract the underlying structure. However, there is a large degree of irregularity in the structure since (i) restaurants are not all treated in a uniform manner (e.g., much less information is given for fast-food joints) and (ii) informationis entered asplain textbyhumanbeings thatdonotpresent the standard rigidity of your favorite data loader. Therefore, the parser will have to be tolerant and accept to fail parsing portions of text that will remain as plain text. Also, semi-structured data arises often when integrating several (possibly structured) sources. Dataintegration ofindependent sources hasbeen apopular topicofresearch since theveryearly daysofdatabases.(Surveys canbefoundin [SL90,LMR90,Bre90],andmorerecentworkontheintegrationofheterogeneous + + sources in e.g., [LRO96, QRS 95, C 95].) It has gained a new vigor with the recent popularity ofthe Web.Consider the integration ofcar retailer databases. Some retailers will represent addresses as strings and others as tuples. Retailers will probably use di(cid:11)erent conventions for representing dates, prices, invoices, etc. We should expect someinformation to be missing fromsome sources. (E.g., someretailers maynotrecord whether non-automatictransmission isavailable). More generally, a wide heterogeneity in the organization of data should be ex- pected from the car retailer data sources and not all can be resolved by the integration software. Semi-structured dataarisesunderavarietyofformsforawiderangeofappli- cations suchasgenomedatabases,scienti(cid:12)c databases,libraries ofprogramsand moregenerally,digital libraries, on-line documentations,electronic commerce.It isthusessential tobetter understand theissue ofqueryingsemi-structured data. 2.2 Main aspects The structure is irregular: This must be clear from the previous discussion. In many of these applications, thelargecollections thataremaintainedoftenconsist ofheterogeneous elements. Someelementsmaybeincomplete.Ontheotherhand,otherelementsmayrecord extra information, e.g., annotations. Di(cid:11)erent types may be used for the same kind of information, e.g., prices may be in dollars in portions of the database and in francs in others. The same piece of information. e.g.,an address, maybe structured in some places as a string and in others as a tuple. Modelling and querying such irregular structures are essential issues. The structure is implicit: Inmanyapplications, althoughaprecise structuring exists, itis givenimplicitly. Forinstance, electronic documentsconsist oftenofatextandagrammar(e.g.,a DTDinSGML).Theparsingofthedocumentthenallowsonetoisolate pieces of information anddetect relationships between them.However,the interpretation ofthese relationships (e.g.,SGMLexceptions) maybebeyondthe capabilities of standarddatabasemodels andareleft tothe particular applications andspeci(cid:12)c tools. We view this structure as implicit (although speci(cid:12)ed explicitly by tags) since (i) some computation is required to obtain it (e.g., parsing) and (ii) the correspondence between theparse-tree andthelogicalrepresentation ofthedata is not always immediate. It is also sometimes the case, in particular for the Web, that the documents comeasplaintext.Somead-hocanalysisisthenneeded toextractthestructure. For instance, in the Guide data source, the description of restaurant is in plain text.Now,clearly,itispossibletodevelopsomeanalysistoolstorecognizeprices, addresses, etc. and then extract the structure of the (cid:12)le. The issue ofextracting the structure of some text (e.g., HTML) is a challenging issue. The structure is partial: Tocompletely structure thedataoftenremains anelusive goal.Partsofthedata maylackstructure(e.g.,bitmaps);otherpartsmayonlyunveilsomeverysketchy structure (e.g., unstructured text). Information retrieval tools may provide a limited form of structure, e.g., by computing occurrences of particular words or group of words and by classifying documents based on their content. An application may also decide to leave large quantities of data outside the database. This datathen remains unstructured from adatabase viewpoint. The loadingofthisexternaldata,itsanalysis,anditsintegrationtothedatabasehave to be performed e(cid:14)ciently. We maywantto also use optimization techniques to only load selective portions ofthis data,in the style of[ACM93].In general, the management and access of this external data and its interoperability with the data from the database is an important issue. Indicative structure vs. constraining structure: In standard database applications, a strict typing policy is enforced to protect data. We are concerned here with applications where such strict policy is often viewed as too constraining. Consider forinstance the Web.A person developing a personal Web site would be reluctant to accept strict typing restrictions. In the context of the Lore Project at Stanford, the term data guide was adoptedtoemphasizenon-conventionalapproachestotypingfoundinmostsemi- structured dataapplications. A schema (as in conventional databases) describes a strict type that is adhered to by all data managed by the system. An update notconformingissimplyrejected.Ontheotherhand,adataguideprovidessome information about the current type of the data. It does not have to be the most accurate. (Accuracy may be traded in for simplicity.) All new data is accepted, eventually at the cost of modifying the data guide. A-priori schema vs. a-posteriori data guide: Traditional databasesystems are basedonthehypothesis ofa(cid:12)xedschema that has to be de(cid:12)ned prior to introducing any data. This is not the case for semi- structured data where the notion of schema is often posterior to the existence of data. Continuing with the Web example, when all the members of anorganization have a Web page, there is usually some pressure to unify the style of these home-pages, or at least agree on some minimal structure to facilitate the design of global entry-points. Indeed, it is a general pattern for large Web sources to start withaveryloose structure andthen acquire some structure whenthe need for it is felt. Further on, we will brie(cid:13)y mention issues concerning data guides. The schema is very large: Often as a consequence of heterogeneity, the schema would typically be quite large. This is in contrast with relational databases where the schema was ex- pected to be orders of magnitude smaller than the data. For instance, suppose that we are interested in Californian Impressionist Painters. We may (cid:12)nd some data about these painters in many heterogeneous information sources on the Web, so the schema is probably quite large. But the data itself is not so large. Notethatasaconsequence, theuserisnotexpected toknowallthedetails of the schema. Thus,queries over the schema are as important as standard queries overthedata.Indeed,onecannotseparateanymorethese twoaspects ofqueries. The schema is ignored: Typically, it is useful to ignore the schema for some queries that have more ofa discovery nature. Such queries mayconsist in simply browsing through the data orsearchingforsomestringorpatternwithoutanyprecise indicationonwhereit may occur. Such searching or browsing are typically not possible with SQL-like languages. They pose new challenges: (i) the extension of the query languages; and(ii) theintegration ofnewoptimizationtechniques suchasfull-text indexing + [ACC 96] or evaluation of generalized path expressions [CCM96]. The schema is rapidly evolving: Instandarddatabasesystems,theschemaisviewedasalmostimmutable,schema updates as rare, andit is well-accepted thatschema updates arevery expensive. Now, in contrast, consider the case of genome data [DOB95]. The schema is expected to change quite rapidly, at the same speed as experimental techniques are improved or novel techniques introduced. As a consequence, expressive for- matssuch asASN.1orACeDB [TMD92]were preferred toarelational orobject database system approach. Indeed, the fact that schema evolves very rapidly is oftengiven asthe reason fornotusing databasesystems in applications thatare managing large quantities of data. (Other reasons include the cost of database systems and the interoperability with other systems, e.g., Fortran libraries.) Inthe contextofsemi-structured data,wehavetoassumethatthe schemais very(cid:13)exible andcanbe updatedaseasily asdatawhich poses serious challenges to database technology. The type of data elements is eclectic: Another aspect of semi-structured data is that the structure of a data element may depend on a point of view or on a particular phase in the data acquisition process. So, the type of a piece of information has to be more eclectic as, say in standard database systems where the structure of a record or that of an object is very precise. For instance, an object can be (cid:12)rst a (cid:12)le. It may become a BibTex (cid:12)le after classi(cid:12)cation using a tool in the style of [TPL95]. It may then obtain owner, creation-date, and other (cid:12)elds after some information extraction phase. Finally, it could become a collection of reference objects (with complex structures) once it has been parsed. In that respect also, the notion of type is much more (cid:13)exible. This is anissue ofobjects with multiple roles, e.g.,[ABGO93]andobjects in views, e.g.,[dSAD94]. The distinction between schema and data is blurred: In standard database applications, a basic principle is the distinction between theschema(thatdescribes thestructure ofthedatabase)anddata(thedatabase instance).Wealreadysawthatmanydi(cid:11)erences betweenschemaanddatadisap- pearinthecontextofsemi-structured data:schemaupdatesarefrequent,schema laws can be violated, the schema may be very large, the same queries/updates may address both the data and schema. Furthermore, in the context of semi- structured data, this distinction may even logically make little sense. For in- stance, the same classi(cid:12)cation information, e.g., the sex of a person, may be kept as data in one source (a boolean with true for male and false for female) andastypein theother (the objectis ofclass MaleorFemale).Wearetouching hereissues thatdramaticallycomplicate databasedesignanddatarestructuring. 2.3 Some issues To conclude this section, we consider a little more precisely important issues in the context of semi-structured data. Model and languages for semi-structured data: Which model should be used to describe semi-structured data and to manipu- late this data? By languages, we mean here languages to query semi-structured data but also languages to restructure such data since restructuring is essen- tial for instance to integrate data coming from several sources. There are two main di(cid:14)culties (i) we have only a partial knowledge of the structure; and (ii) the structure is potentially \deeply nested" or even cyclic. This second point in particular defeats calculi and algebras developed in the standard database context (e.g.,relational, complex value algebra) by requiring recursion. It seems thatlanguagessuchasDatalog(see[Ull89,AHV94])althoughtheyprovidesome form of recursion, are not completely satisfactory. These issues will be dealt with in more details in the next two sections. Extracting and using structure: The general idea is, starting with data with little explicit structure, to extract structuring information and organize the data to improve performance. To con- tinue with the bibliography example, suppose we have a number of (cid:12)les con- taining bibliography references in BibTex and other formats. We may want to extract (in a data warehousing style) the titles of the papers, lists of authors andkeywords,i.e.,the mostfrequently accessed datathatcan be foundin every format for references, and store them in a relational database. Note that this extraction phase may be di(cid:14)cult if some (cid:12)les are structured according to for- mats ignored by our system. Also, issues such as duplicate elimination have to be faced. In general, the issue of recognizing an object in a particular state or within a sequence of states (for temporal data) is a challenging issue. The relational database then contains links to pieces of information in the (cid:12)les,sothatalldataremainsaccessible. Suchastructured layerontopofairreg- ular and less controlled layer of (cid:12)les, can provide important gains in answering the most common queries. In general, we need tools to extract information from (cid:12)les including classi- (cid:12)ers, parsers, but also software to extract cross references (e.g., within a set of HTML documents), information retrieval packages to obtain statistics on words (or groups of words) occurrences and statistics for relevance ranking and rele- vance feedback. More generally, one could envision the use of general purpose data mining tools to extract structuring information. One can then use the information extracted from the (cid:12)les to build a struc- tured layer above the layer of more unformed data. This structured layer ref- erences the lower data layer and yields a (cid:13)exible and e(cid:14)cient access to the in- formation in the lower layer to provide the bene(cid:12)ts of standard database access methods. A similar concept is called structured map in [DMRA96]. More ways to use structure: the data guide We saw that manydi(cid:11)erences with standard databases come froma very di(cid:11)er- ent approach to typing. We used the term data guide to stress the di(cid:11)erences. A similar notion is considered in [BDFS97]. Now, since there is no schema to view as a constraint on the data, one may question the need for any kind of typing information, and for a data guide in particular. A data guide provides a computedloose description ofthestructure ofdata.Forinstance, in aparticular application, the data guide may say that persons possibly have ougoing edges labelled name, address, hobby and friend, that an address is either a string, but that it may have outgoing edges labelled street,and zipcode. This should be viewed as more or less accurate indications on the kind of data that is in the database at the moment. It turns out that there are many reasons for using a data guide: 1. graphical query language: Graphical interfaces use the schema in very es- sential ways. For instance, QBE [Zlo77] would present a query frame that consistsofthenamesofrelations andtheirattributes.Inthecontextofsemi- structureddata,onecanviewthedataguideasan\encompassingtype"that wouldserve the role ofatype in helping the user graphically express queries or browse through the data. 2. cooperative answer:Consider for instance the mistyping of a label. This will probablyresult inatypeerrorinatraditional databasesystem,butnothere since strict type enforcement is abandoned. Using a data guide, the system maystill explain whytheansweris empty(because suchlabel isabsent from the database. 3. queryoptimization:Typinginformationisveryusefulforqueryoptimization. Even when the structure is not rigid, some knowledge about the type (e.g., presence/absence ofsomeattributes) canprovetobeessential. Forinstance, if the query asks for the Latex sources of some documents and the data guides indicate that some sources do not provide Latex sources, then a call tothese sources canbe avoided.This is also aplace where the system hasto show some (cid:13)exibility. One ofthe sources maybe a very structured database (e.g.,relational), and the system should take advantage of that structure. The notion of the data guide associated to some particular data with vari- ous degrees of accuracy, its use for expressing and evaluating queries, and its maintenance, are important directions of research. System issues: Although this is not the main focus of the paper, we would like to brie(cid:13)y list some system issues. Wealready mentioned the need for new query optimization techniques, andfortheintegration ofoptimization techniques fromvarious(cid:12)elds (e.g., database indexes and full text indexes). Some standard database system issues such as transaction management, concurrency control or error recovery havetobereconsidered, inparticular,becausethenotionof\dataitem"becomes less clear: the same piece of data may have several representations in various parts of the system, some atomic, some complex. Physical design (in particular clustering) isseriously alteredinthiscontext.Finally,itshouldbeobservedthat, by nature, a lot of the data will reside outside the database. The optimization of external data access (in particular, the e(cid:14)cient and selective loading of (cid:12)le data) and the interoperability with other systems are therefore key issues. 3 Modeling Semi-Structured Data A (cid:12)rst fundamental issue is the choice of a model: should it be very rich and complex, or on the contrary, simple and lightweight? We will argue here that it should be both. Why a lightweight model? Consider accessing data over the Internet. If we obtain new data using the Web protocol, the data will be rather unstructured at (cid:12)rst. (Some protocols such as CORBA [OMG92] may provide a-priori more structured data.) Furthermore, if the data originates from a new source that we just discovered, it is very likely that it is structured in ways that are still unknown to our particular systems. This is because (i) the number of semantic constructs developers andresearchers maypossibly inventisextremelylargeand (ii) the standardization of a complex data model that will encompass the needs of all applications seems beyond reach. For such novel structures discovered over the network, a lightweight data model is preferable. Any data can be mapped to this exchange model, and be- comes therefore accessible without the use of speci(cid:12)c pieces of software. Why also a heavyweight data model? Using a lightweight model does not preclude the use of a compatible, richer model that allows the system to take advantage of particular structuring information. For instance, traditional rela- tions with indexes will be oftenimported. When using such anindexed relation, ignoring the fact that this particular data is a relation and that it is indexed would be suicidal for performance. Aswementionedintheprevioussection, thetypesofobjectsevolvebasedon ourcurrentknowledgepossiblyfromtotallyunstructured toverystructured,and a piece of information will often move from a very rich structure (in the system where it is maintained); to a lightweight structure when exchanged over the network; to a (possibly di(cid:11)erent) very rich structure when it has been analyzed and integrated to other pieces of information. It is thus important to dispose of a (cid:13)exible model allowing both a very light and a very rich structuring of data. In this section, we (cid:12)rst brie(cid:13)y consider some components of arich model for semi-structured data. This should be viewed as an indicative, non-exhaustive list ofcandidate features. In our opinion, speci(cid:12)c models for speci(cid:12)c application domains (e.g., Web databases or genome databases) are probably more feasible than an all-purpose model for semi-structured data. Then, we present in more details the Object Exchange Model that is pursuing a minimalist approach. 3.1 A maximalist approach We next describe primitives that seem to be required froma semantic model to allowthe description ofsemi-structured data.Ourpresentation israther sketchy and assumes knowledge of the ODMG model. The following primitives should be considered: 1. The ODMG model: the notions of objects, classes and class hierarchy; and structuring constructs such as set, list, bag, array seem all needed in our context. 2. Null values: these are given lip service in the relational and the ODMG models and more is needed here. 3. Heterogeneous collections: collections need often to be heterogeneous in the semi-structured setting. So, there is the need for some union types as found for instance in [AH87] or [AK89]. 4. Text with references: text is an important component for semi-structured information. Two important issues are (i) references to portions of a text (references andcitations inLaTex),and(ii) references fromthetext(HTML anchors). 5. Eclectic types: the same piece of information may be viewed with various alternative structures. 6. Version and time: it is clear that we are often more concerned by querying the recent changes in some data source that in examining the entire source. No matter how rich a model we choose, it is likely that some weird features of a given application or a particular data exchange format will not be covered (e.g., SGML exceptions). This motivates the use of an underlying minimalist data format. 3.2 A minimalistapproach + In this section, we present the Object Exchange Model (OEM) [AQM 96], a data model particularly useful for representing semi-structured data. The model consists of graph with labels on the edges. (In an early version of the model [PGMW95], labels were attached to vertices which leads to minor di(cid:11)erences in the description of information and in the corresponding query languages.)Averysimilarmodelwasindependently proposedin[BDHS96].This seems toindicate thatthis model indeed achieves the goalstobe simple enough, and yet (cid:13)exible and powerful enough to allow describing semi-structured data found in common data sources over the net. A subtle di(cid:11)erence is that OEM is based on the notion of objects with object identity whereas [BDHS96] uses tree markers and bisimulation. We will ignore this distinction here. Data represented in OEM can be thought of as a graph, with objects as the vertices andlabels onthe edges. Entities arerepresented byobjects.Eachobject has a unique object identi(cid:12)er (oid) from the type oid. Some objects are atomic and contain a value from one of the disjoint basic atomic types, e.g., integer, real,string,gif,html, audio,java,etc. All other objects are complex; their value is a set of object references, denoted as a set of (label;oid) pairs. The labels are taken from the atomic type string.Figure 1 provides an example of an OEM graph. OEM can easily model relational data, and, as in the ODMG model, hier- archical and graph data. (Although the structure in Figure 1 is almost a tree, there is a cycle via objects &19 and &35.) To model semi-structured informa- tion sources, we do not insist that data is as strongly structured as in standard database models. Observe that, for example, (i) restaurants have zero, one or more addresses; (ii) an address is sometimes a string and sometimes a complex structure; (iii) a zipcode may be a string or an integer; (iv) the zipcode occurs in the address for some and directly under restaurant for others; and (v) price information is sometimes given and sometimes missing. Guide &12 restaurant restaurant restaurant zipcode &19 &35 &54 &77 92310 nearby category name address nearby category name addressaddress nearby price category name &17 &13 &14 &66 &17 &23 &25 &55 &79 &80 gourmet Chef Chu Vietnamese Saigon Mountain Menlo Park cheap fast food McDonald’s View street city zipcode &44 &15 &16 El Camino Real Palo Alto 92310 Fig.1. An OEM graph WeconcludethissectionwithtwoobservationsrelatingOEMtotherelational and ODMG models: OEMvs.relational:OnecanviewanOEMdatabaseasarelationalstructure with a binary relation VAL(oid, atomic value) for specifying the values of atomicobjectsandaternaryrelationMEMBER(oid,label,oid)tospecifythe valuesofcomplexobjects.Thissimpleviewpointseemstodefeatalargepart oftheresearch onsemi-structured data.However,(i)sucharepresentation is possible only because of the presence of object identi(cid:12)ers, so we are already out of the relational model; (ii) we have to add integrity constraints to the relational structure (e.g.,toprohibit danglingreferences); and(iii) itisoften the case that we wantto recover an object together with its subcomponents and this recursively, which is certainly a feature that is out of relational calculus. OEM vs. ODMG: In the object exchange model, all objects have the same type,namelyOEM.Intuitively,thistypeisatuplewithone(cid:12)eldperpossible label containing a set of OEM’s. Based on this, it is rather straightforward to have a type system that would incorporate the ODMG types and the