Comparing Relational Databases to a Native Multi …...Consiglio Nazionale delle Ricerche Istituto...
Transcript of Comparing Relational Databases to a Native Multi …...Consiglio Nazionale delle Ricerche Istituto...
Consiglio Nazionale delle Ricerche Istituto di Calcolo e Reti ad Alte Prestazioni
Comparing Relational Databases to a Native Multi-Model Database System
A. Messina
Rapporto Tecnico N.: RT-ICAR-PA-18-01 Gennaio 2018
Consiglio Nazionale delle Ricerche, Istituto di Calcolo e Reti ad Alte Prestazioni (ICAR) – Sede di Cosenza, Via P. Bucci 41C, 87036 Rende, Italy, URL: www.icar.cnr.it – Sede di Napoli, Via P. Castellino 111, 80131 Napoli, URL: www.na.icar.cnr.it – Sede di Palermo, Viale delle Scienze, 90128 Palermo, URL: www.pa.icar.cnr.it
Non è possibile visualizzare l'immagine.
Consiglio Nazionale delle Ricerche Istituto di Calcolo e Reti ad Alte Prestazioni
Comparing Relational Databases to a Native Multi-Model Database System
A. Messina1
Rapporto Tecnico N.: RT-ICAR-PA-18-01 Gennaio 2018
1 Istituto di Calcolo e Reti ad Alte Prestazioni, ICAR-CNR, Sede di Palermo, Via Ugo
La Malfa n. 153, 90146 Palermo. I rapporti tecnici dell’ICAR-CNR sono pubblicati dall’Istituto di Calcolo e Reti ad Alte Prestazioni del Consiglio Nazionale delle Ricerche. Tali rapporti, approntati sotto l’esclusiva responsabilità scientifica degli autori, descrivono attività di ricerca del personale e dei collaboratori dell’ICAR, in alcuni casi in un formato preliminare prima della pubblicazione definitiva in altra sede.
Non è possibile visualizzare l'immagine.
Index
1 INTRODUCTION......................................................................................................5
2 DATABASECONCEPTS..........................................................................................6
2.1 Introduction ......................................................................................................................... 6
2.2 Relational Database System Concepts.................................................................................. 6
2.2.1 Table and Schema Definition ............................................................................................ 6
2.2.2 Insertion and Identification of Records ............................................................................. 7
2.2.3 Indexing ........................................................................................................................... 7
2.2.4 Transactions ..................................................................................................................... 8
2.3 Concepts in ArangoDB .......................................................................................................... 8
2.3.1 JSON Data Types .............................................................................................................. 8
2.3.2 Living without a Schema ................................................................................................... 9
2.3.3 Document Keys and Indexes ........................................................................................... 10
2.3.4 Graphs ........................................................................................................................... 11
2.3.5 Transactions ................................................................................................................... 12
2.4 Comparison of Terminology ............................................................................................... 12
3 DATAMODELING................................................................................................15
3.1 Introduction ....................................................................................................................... 15
3.2 Data Modeling in Relational Systems ................................................................................. 15
3.3 Data Modeling in ArangoDB ............................................................................................... 16
3.4 Comparison of Data Models ............................................................................................... 19
4 QUERYCONCEPTS...............................................................................................21
4.1 Introduction ....................................................................................................................... 21
4.2 SQL ..................................................................................................................................... 21
4.3 AQL .................................................................................................................................... 23
4.3.1 FOR Loop Construct........................................................................................................ 23
4.3.2 High-Level Operations .................................................................................................... 23
4.3.3 Query Optimization in AQL ............................................................................................. 24
4.3.4 Combine All of the Models ............................................................................................. 24
4.4 Query Language Comparison ............................................................................................. 26
1 Introduction
This technical report compares relational database management systems(RDBMS) to nativemulti-model database systems— in particular,MySQL andArangoDB. It describes their key concepts, contrasts both at the end of eachsection,andconcludesbyexplainingwhatsetsArangoDBapart.
Thisisahigh-leveloverviewoffeaturesanddifferences.Itcan,though,serveasastartingpointtounderstandthemindsetofbothtypesofsystems,forareadertobridgethegapfromwhattheyknowaboutdatabaseslikeMySQL,towhattheymaynotyetunderstandinArangoDB.
ThistechnicalreportwillalsofamiliarizethereaderwithprinciplesinArangoDB,suchasdocumentstoresandgraphdatabases.ThesearecombinedinArangoDBinthesamedatabasecore,accessiblewithonequery language.Thegoalof thisdocument isspecially tohelpthereaderappreciate thecomparableeaseof thetransitiontogreaterstorageflexibilitywithmultipledatamodelsthatArangoDBprovides.
ArangoDBisanativemulti-modelNoSQLdatabase.Itsupportsthreepopulardatamodels,namelykey-value,documentandgraph.ArangoDBiswritteninmodernC++andisavailableasopen-sourceversion(CommunityEdition)andalsocomeswith a commercial version including additional features for performance andsecurity(EnterpriseEdition).
MovingrelationaldatatoArangoDBissimple:exportthestoreddatatoaformatsuchasCSV, and thenusearangoimp to import thedata intoArangoDB.Oncethat’sdone,youmaycontinueworkingwiththesamedatamodelthatyouusedpreviously with the relational database. Because ArangoDB supports real joinoperations,thereisnoneedfornesteddatatocircumventnormalizationandjoins.Hence,youdon’thavetoreworkthedatamodel.
Still,therearesomedifferencesbetweenbothstorageapproaches.Thistechnicalreportwilldescribethedifferencebetweeneachdatabaseconcept,aswellasthedatamodeling possibilities and query options availablewith each. In the finalsections,thispaperwillalsointroducetheclusterarchitectureofArangoDBandtheoptionsthatareavailableforscalinghorizontally–evenwithcomplexqueries.
2 Database Concepts
2.1 Introduction
Beforedelvingintodatamodelingandmoreadvancedconcepts,let’sstartwithageneralexaminationofthedatabaseconceptsbehindrelationaldatabasesystemsandcontrastthatwiththebasicconceptsofadatabasesystemlikeArangoDB.
2.2 Relational Database System Concepts
Thedatamodelofarelationaldatabasemanagementsystem(i.e.,RDBMS)utilizesthe controllingmetaphor of a table,much like ones found in inventory ledgerbooks used for centuries in business. Records are stored in tables, with eachrecordbeingarowinthattable.Thecolumnheadsdefinetheavailablefieldsandthevaluesforeachofthecolumnsperrecordinindividualcells.Suchtablescanbe organized in databases (e.g., for multi tenancy use). There are a few coredatabases that contain tables which store system configuration variables andotherdataforinternalusebythesystem.
Data in a relationaldatabase systemcanbequeriedand retrieved ina tabularformat. It can be full rows or subsets of columns. Behind the scenes, a binarytransport protocol is used to send data. Although one can choose betweendifferentstorageengines–InnoDBisthemostpopular–withdifferentfeaturesandperformanceproperties,howdataisactuallystoredisinvisibletotheuser.
2.2.1 Table and Schema Definition
Beforedatacanbeinsertedintoatable,adevelopermustfirstconsiderthedataitwillcontain,howitwillbeorganizedandthenatureof thatdata–oftentimeswithouthavingbeengivenmorethanasmallsampleofthedatainadvance.Hemustdecideonthenumberofcolumnsandrecognizablenamesforeachcolumn.More importantly,hemustdecideonthedatatype(e.g.,INT,CHAR,DATETIME)appropriatetothetypeofdataanticipated.Unlesshechoosestoacceptthedefaultsettings for each column, he may want to set some restrictions on the dataaccepted (e.g.,NOT NULL).Plus,hemaywantto takeaguessatwhich columnshouldbe indexedwithoutanyusage information,andtowhatextentan indexshouldbeused,asinthecaseofmulti-columnindexes.
Thedefinitionofatablewithallofitscolumnsettingsiscalledaschema.Schemas
canbemodifiedinmostrelationaldatabasesystemsaftertheinitialdefinition,butinsomesystems,especiallywithclusters,itcanbeanexpensiveoperationbecausethe data needs to be re-organized on schema changes. Although a skilleddevelopercananticipatepotentialtableusageandtheneedsofitsusers,designingaschemacanbemuchlikeawrittenone:makingchangeslatertotheorganizationofthedatacanbecumbersome–evenwhenyouusedapencilinsteadofapen.
2.2.2 Insertion and Identification of Records
Wheninsertingrecordsintoatable,thevalueofallofthefieldscanbeprovidedinthesameorderofthetable’scolumnsbasedonitsschema,orasubsetofthecolumns can be givenwith the name and values of each field given. Fields notincludedwillbesetgenerallytoaNULLvalue,whichexpressestheabsenceofaproper value – unless a default value was specified in the schema. If columndefinitionhasnodefaultvalueanditwasdefinedasNOT NULL,anerrorwilloccurwheninsertarowwithoutgivingavalueforthecolumn.DependingontheSQLmodeused,thiswillpreventtherowfrombeinginserted.Ifit’spartofamultiplerowinsert,itwillpreventallrowsfrombeinginserted.
Each row in a table can usually be uniquely identified by an automaticallyincrementednumber.Thecolumnforthisidentifierisdefinedastheprimarykey.It’sacceptabletouseadifferentdatatype,orforegoaprimarykeyaltogetherandidentifyrowsbyothermeans,evenbyacombinationofvaluesincertaincolumns.However,thisisnotacommonpractice.Ifaprimarykeyisreferencedbyanothertable,itiscalledaforeignkey.
2.2.3 Indexing
TheprimarykeyindexforfastaccessofrecordsisusuallyaBTREE,thetypicalindextypeused inRDBMS.Therearesomeidiosyncrasiesrelatedtothis indextype.Theprimarykeymustbeunique.Theuser can create secondary indexesusingoneormorefields,withsettingslikeuniqueorfulltextindexes.Dataqueriesexecuted on a huge dataset,withoutusing indexes can be significantly slower.Therefore, users are dependent on the database designer to define well andproperlyindexesforthebestperformancebasedonthedatacontentandhowitwillbeused.
2.2.4 Transactions
Akey strengthofRDBMS is that they support transactional guarantees. InSQLqueries, transactions are startedwith explicitBEGIN orSTART TRANSACTIONcommand.Followingaseriesofdataretrievalormodificationoperations,anSQLtransactioniscompletedwithaCOMMITcommand,orrolledbackwithaROLLBACKcommandor if the session is terminated.Theremaybe client communicationswith the server between the start and the commitment or rollback of an SQLtransaction.
2.3 Concepts in ArangoDB
At its core ArangoDB is a transactional document store, in the sense of JSONobjectsasdocuments.
Internally,dataisstoredinabinaryform(VelocyPack),butwhatisenteredandreturned from the system is typically JSON. By default, JSON data is sent andreceived over the well-known HTTP protocol by the server, which provides aRESTfulAPI.OtheroptionsareVelocyPackoverHTTPandVelocyStream,abinarytransportprotocol.
Howdataisactuallypersistedisnotvisibletotheuser.Adevelopercanchoosebetween two storage engines with different characteristics according to theparticularusecase:
2.3.1 JSON Data Types
JSONsupportsafewprimitiveandcompounddatatypesthatcanbecombinedtocomplex data structures. The primitive types are: null, boolean (true, false),number (integer and floating-point numbers), and string. These are fairly
straightforwardandthesameinanydatabasesystem.
Thecompoundtypesaremorecomplicated.Theyaretwotypes:arrayandobject.Anarrayisanorderedcollectionofelements,eachidentifiedbyanindexstartingat0;arrayscanbemultidimensional.Anobjectisahashmapwhichmapsstringkeys(i.e.,attributekeys)tovaluesofarbitrarytypes(i.e.,attributevalues).Objectscancontainprimitivetypesornestedcompoundtypes,withsupportforarbitrarydeepnesting,ifdesired.EachdocumentisessentiallyaJSONobjectatthetoplevel,witharbitrarynamedattributesthatcanhaveprimitivevaluesorbenestedarraysandobjects.
Documents are stored in collections. Collections have names, which ideallydescribe what kind of information the documents contain. Collections can beorganizedinmultipledatabases(e.g.,formulti-tenancyuse).Thedefaultdatabasein ArangoDB is called _system. It contains hidden collections for internalpurposes,buttheusermayalsoaddcollectionstothedatabase.
2.3.2 Living without a Schema
ArangoDBisschema-free.Thesystemdoesnotrequireyoutodeclareordefinefieldattributes.Norisitnecessarytospecifyinadvancethedatatypesoffields.Incaseaschemavalidationisnecessary,itcanbeintegratedbyaJOIvalidationinaArangoDBFoxxmicroservice.
Documentscanhaveanarbitrarystructureandcanuseanysupporteddatatype.Inprinciple,eachdocumentinacollectioncanbestructureddifferently.However,therewillusuallybesomefieldsthatalldocumentshaveincommoninthesenseofattributekeys.Theattributevaluesmayusedifferentdatatypes,nonetheless,eveniftheattributekeysmatch.
There is no such thing as filling in a subset of fields in a schema-free system,becauseforeverydocument,whatshallbepersistedisdefinedbythedocumentitself. It is self-contained. A document can have as few or as many attributesincludingnestedattributesasneeded.Both,attributekeysandvalues,havetobeexplicitlyspecifiedallthetime.
2.3.3 Document Keys and Indexes
Eachdocumentrequiresakeythatuniquelyidentifiesitwithinacollection.Itcanbeassignedbytheuseruponcreation,orArangoDBwillgenerateone.Thedatatypeisalwaysstring.Theautomaticallygeneratedkeysareincreasingnumbers(withgaps),butconvertedtoanon-numericcharactersequence.Documentkeyscannot be modified after document creation (immutable). There is a virtualattribute (i.e., the document ID) which is comprised of the collection name, aforwardslashandthedocumentkeytoidentifyadocumentwithinthedatabase.Documentkeysanddocument identifiersarealways indexed.The indexonthe_keyattributeiscalledaprimaryindex.Itexistsforeachcollectionandcan’tberemoved.It’sahashindexwiththeoptionsnon-sparseandunique.Thismeansitisalwaysnon-emptyandthattherearenoduplicatekeysinthecollection.
Additionalindexescanbedefinedbytheuseroveroneormultiplefields,aswellastheindividualelementsofarrays.Availableindextypes:
• Hash:constantlookuptime(superfast),onlyapplicableifanexactvalueissearched
• Skiplist: sorted index, slower than hash index, but suitable for rangequeries
• Geospatial:canindex2Dcoordinatesforfastretrievalofdocumentsnearareferencepoint(i.e.,closestdistancefirst),fullGeoJSONsupportwillbeaddedinArangoDB3.4
• Fulltext: fullwordandprefixmatching instringswithareverse lookupindex(completetextsearchandrankingengineincludinginvertedindexeswillbeintegratedinArangoDB3.4)
• Persistent: WithMMfilesstorageengine,persistentskiplistindexescanbecreatedviathistype.WithRocksDB,allindexesarepersisted.
ArangoDB can be used as a key-value store by retrieving documents via theirdocumentkeysorIDs.Theprimaryindexishash-basedwithaconstantlookuptimeandhasaselectivityof100%becauseeachkeymustbeuniqueinacollection,whichenablesquickretrievalviathekey.Thedocumentasawholeisreturnedasavalueinthekey-valueaccesspattern.
Ifsecondaryindexesareinvolved,oradocumentisreturnedpartially(i.e.,onlycertainattributes),itisnotusedstrictlyspeakingaskey-valuestore.Instead,it’susedlikeadocumentstore.Thereisnorealdistinction,however,betweentheminArangoDB.
2.3.4 Graphs
ThethirddatamodelsupportedbyArangoDBisgraphs.Graphsarelikenetworksconsistingof nodes that are linked together,which can expressmany-to-manyrelationships.Therecanbevarioustopologies, likeagraphforminga tree(e.g.,corporate hierarchy, electricity grid) rather than a mesh (e.g., social network,fraudnetworks).
TherearetwotypesofcollectionsinArangoDB:vertexandedgecollections.Bothstoredocumentsaspreviouslydescribed.Documentsinanedgecollectionhavetwoadditionalattributes,_fromand_to.BothhavetobeassigneddocumentIDstolinktogetherdocuments.Thedocumentthatlinksthemiscalledanedgeandthe linkeddocumentsare calledvertices in thegraphmodel.Edgesarealwaysdirected inArangoDB.The edge is leading from the source (i.e.,_from) to thetarget(i.e.,_to).EdgesdohaveadirectioninArangoDBbutcanbeignoredbytheANYstatement.
Additionally, there is a special edge index that can receiveall of theedgesof avertexinconstanttime.Sincereceivingedgesisthemostusedoperationinanygraphquery, this constant time lookup is a requirement forgraphdatabase tooperateperformantly.Thisindexcannotberemoved.Usersmaydefineadditionalindexesforedgecollectionsingeneral,andindexesoverthe_fromand_tofieldsinparticular,tocreatevertex-centricindexestospeedupcertaingraphqueries.
ArangoDBcanalsoefficientlyhandleothertypesofdata,suchasgeo-spatialandtext.WithArangoDB3.4,thedatatypestextandgeo-spatialwillreceiveamuchricher featuresetandnewindexestoextendthenumberofpossibleusecases.Thesedatatypesareintentionallynotcalleddatamodels,astheprocessingofthetwoismoreamatterofcomputationthanstorage.
2.3.5 Transactions
InArangoDB,atransactionisalwaysaserver-sideoperation.Itisexecutedontheserver in one go, without any client interaction. All operations to be executedwithin a transaction need to be known by the serverwhen the transaction isstarted.
Thereareno individualBEGIN,COMMITorROLLBACKstatements inArangoDB'squerylanguageAQL.Instead,atransactioninArangoDBisstartedbyprovidingadescriptionof the transactionwrittenas JavaScript function.This functionwillthen automatically start a transaction, execute all required retrieval andmodificationoperations,andautomaticallycommittransactionsattheend.Ifanerroroccursduringtheexecutionofatransaction,thetransactionisautomaticallyaborted,andallchangesarerolledback.
Transactions in ArangoDB can be multi-document and multi-collectiontransactions in a single instance. For cluster scenarios, ArangoDB alreadysupportsatomicoperations.Ineverydocument,ArangoDBstoresautomaticallythe so-called revision key,_rev in alldocuments. For a single instance, this isbeingused tosupport fullMultiVersionConcurrencyControl (MVCC,RocksDBstorage engine) and in a cluster setting to support atomic operations likeCompare-and-Swap.ArangoDBwillsoonsupportmoreoptionsfortransactionalguaranteesinclustersettingsonourpathtocluster-wideMVCC.
2.4 Comparison of Terminology
HavingexaminedandcomparedthekeyconceptsandmethodsofbothRDBMSandArangoDB,let’sconsiderandcomparetheterminology.Therearesometermsused in an RDBMS that are identical to its counterpart in ArangoDB, althoughviewedalittledifferently.Thereareothers,though,forwhichadifferenttermisusedtoreflectmoreclearlythedifferentperspectiveofArangoDB.Belowisalistofthesekeytermsforbothsystems:
InanRDBMS,entitiesofthesametypearestoredintables.Collectionsareroughlytheequivalentof tables indocument stores, as theyusuallyholddocumentsofsimilarentities.Therenormallywillbesomeormanycommonattributes.Becauseofthisvaryingnatureofdatastructures,collectionasisamoreaccuratetermthantable, which has a fixed structure. Documents are also not tabular, becauseunneeded attributes (i.e., fields) can be omitted and it’s possible to nest data,whichallowsforatree-likestructure.
The document structure can differ between any two documents in a singlecollection. For example, as new features are added to a product that usesArangoDBasabackend,additionalattributesmaybecomenecessary.Theycanbeaddedprogressivelyon-the-flyfornewlycreateddocuments,withouttheneedtoupdateallpreviouslyexistingdocuments.InSQL,thetablestructure(i.e.,schema)wouldneedtobeupdatedwithadditionalcolumns.This iscanbeasignificantoperationinsomesystemsandrequiresplanning.
While there are a few primitive data types in ArangoDB, the systemdoes notimposeanyrestrictionssuchashavingtodefinewhatdatatypesmustbeusedforwhichattributes—itisschema-free.Ifyouwishtovalidatedocumentstructuresontheserver-side,ArangoDBprovidesJOIobjectschemavalidationviaFoxx.
There are many more primitive data types in relational database systems. InArangoDB,thereisonlyasingleprimitivetypefornumbers,whichcoversintegersaswellasfloatingpointnumbers.There'salsoonlyonedatatypeforcharactersequences.TherearenolimitstothelengthofstringsandstringsarestoredinnormalizedUTF-8encoding.Therearen’tdatatypessuchasenumsorsets,whichrequirethedefiningacceptablevalues.Setscanbeemulatedbyusingafunctiontorestrictanarraytouniqueelements.AmissingorunavailablevalueinanRDBMSisberepresentedbyNULL.ArangoDBalsohasanullvalue,butit’srarelyneeded.Theabsenceofavaluecansimplybeexpressedbythelackoftheattribute.
The way ArangoDB implements graphs is different from many other graphdatabases,whichuse“index-freeadjacency”.Edgesstoredinaseparatecollection
withaspecialindexonthe_fromand_toattributesarenottoodifferentfromcross tables for storing many-to-many relationships in an RDBMS. Theperformanceguaranteesfortraversalsare,nonetheless,onparwithothergraphdatabases, but there is more potential to scale graphs. Combined with theEnterprise Edition features (i.e., SmartGraphs and SatelliteCollections), graphperformancecanbefurtherimproved.
3 Data Modeling
3.1 Introduction
Modelingdataforadatabasesystemstartswithdefiningwhatdataneedstobestoredtosolvetheproblem.Thisusuallyrequiresbeingclearaboutthequestionswhichshouldbeanswered,whatentities thereareneeded, andwhatare theirrelationshipstocreatea logicalmodel.Thismodelcanbe iterateduponuntil itaddressestheentireproblem.Themodelneedstobetranslatedtoadesignthatthechosendatabasesystemsallowsand ideallysupports inaneasytouseandperformantway.
3.2 Data Modeling in Relational Systems
Thefirststepincreatingalogicalmodelistodeterminetheentities.Forexample,considera simpledatabaseaboutmusic. Supposewechose tostore songsandalbumsbyartisttogetherwithgenres.
Thenextstepistodefinethepropertiesoftheseentitiesandtheirrelationships.Anentity-relationship-model(ERM)isthestandardwayofrepresentingallofthis.
It'seasytocastthisintoarelationaldatabasemodel.Foreverytypeofentity,onetableisdefinedwiththeentitypropertiestranslatedtothecolumns.
Relationshipsbetweenrecordscanbeexpressedwithmatchingvaluesbetween
tables.Theycanbeone-to-onerelations(i.e.,1:1),inwhicheachrecordintableAmatchesonerecordintableB.Thatistosay,afieldinAandanotherinBhavethesame value — usually by way of a primary key. Not every record needs acounterpart;theremayalsobeone-to-zerorelations,buttherecan’tbemorethanonerecordoneitherside.It’scommontointegratethefieldsoftableBintotableAinsuchasituation,unlesstherearespecialconsiderationstoholdthedataindifferenttables(e.g.,accesspermission).
Incaseofone-to-manyrelations(i.e.,1:n),eachrecordintableAmatcheswithoneor more records in table B. Applying this to our example, albums and theirrespectivesongswouldbestoredseparatelyinthismodel.Eachsongrecordcanonlybelinkedtoonealbuminthissimplemodel.Inareality,onewouldprobablyallowthesamesongstobelinkedtomultiplereleases.
Therelationshipbetweenalbumsandgenresisdifferent:amany-to-manyrelation(i.e.,n:m).Asinglealbumcanhavemultiplegenres,andmultiplealbumscanhavethesamegenre.Albumsandgenrescanbestoredintwotables,andathirdtableknownasacrosstablebeintroducedtolinktherecordsofthetwoformertables.Arecordinthecrosstablestoresitsownprimarykey,andonealbumIDandonegenreID,aforeignkey.
Tomakeuseof stored relations, toresolve foreignkeys to fields fromanothertable,thetables—orrathersubsetsoftheirrecords—arejoinedonthefieldswhichholdtherecordkeys.Thesefieldsshouldbeindexed,astherecanbeahighcomputationalcomplexityifallofthecombinationsofrecordsneedtobechecked.The result isoftenadatabasewithmany tablesandmanycross tables, alwaysrequiringjoinstoquerythedata,thusimpactingperformance.
3.3 Data Modeling in ArangoDB
Asdescribed intheDatabaseConceptssectionabove,ArangoDBisadocumentstore at its core. A JSONdocument containingmultiple entities embedded in asingledocumentcouldlookliketheexampleontheright.
CurlybracketssurroundJSONobjects,whichcancontainattributes.Attributesaregiven in pairs, key/value pairs. Each key and value are separated by a colon.
Attributepairsareseparatedbycommas.
Attribute keys are enclosed in double quote marks,which signifies a string. Keys are always string datatypesinJSON.Theyaredepictedintealbluehere.
The values can be of any type. In the example here,name, country and other nested attribute values arealsostrings,coloredingreen.Thealbumyearandsongdurationvaluesarenumbersandhighlightedinblue.
Squarebracketssurroundvaluelists.Inthisexample,albums, genresandsongsare sucharrays.Regardinggenres,thearrayscontainmultiplestringswithgenrenames.
Thearrayforalbumscontainstwonestedobjects,oneperalbum.Eachobjecthasfourattributes:Title,Year,Genres and Songs. The Songs array holds evenmoreobjects,onepertrackofeachalbum.Eachobjecthasatitleandadurationattribute.Bytheway, thedashedlinesdenoteomissions,tofittheexampledocumentonthispage.
Thisisacontrivedexamplewithallofthedataaboutanartistinasingledocument,todemonstrateinformationnesting.
Usingmultiplelevelsofnestingcanbeverypowerful,butit’snottheonlywaytostructure data. Based on the data model draft and ERM diagram from thebeginningofthischapter,therearefourentitiesinthenormalizedmodel:Artist;Album;Song;andGenre.
Eachtypeofentitycanbestoredinacollection:alloftheartistdocumentscanbestored inanArtistcollection;andalbumdocumentscanbestored inanAlbumcollection.Entitiesordocumentscanthenbelinkedinmultipleways.
Anadditionalcollectioncouldbecreated,tostoreanalbumdocumentkeyandasong document key, similar to a cross table, to be joined in a query based onmatching keys. It’s a permitted method, but not recommended. Instead, crosstables can be translated to edge collections almost unedited. Instead ofestablishingaconnectionwithapairofforeignkeys,the_fromand_toattributesofanedgecanbeusedtoexpresstherelationandopenthepossibilityforgraphtraversals.
It’snotnecessarytousethegraphmodelforeverything.Itmaynotbetheidealsolutionrelatedtoperformancetodothisforverylimited,fixeddepthtraversals—especiallyaone-steptraversalaswewouldhereforalbumsandtheassociatedgenres.
Embeddinggenredata into thealbumdocuments is apossible solution.Onlyasingle document read would be needed to retrieve the album information,includingthegenredata.Ontheotherhand, itbecomeshardertomaintainthedata.Forexample,renamingagenrewithoutintroducinginconsistencieswouldbetedious.Allalbumdocumentscontainingato-be-renamedgenrewouldneedtobemodified.Furthermore,therewouldn'tbeanefficientwaytoretrievealistofallgenres.
Aviablemiddlewayistohavetwocollections,oneforalbumsandoneforgenres,andstoreanarrayofgenredocumentkeysinthealbumdocumentsformany-to-manyrelations.Thiswaythereisacentralplacetoretrieveallavailablegenresandanindividualgenre.Therewouldbeonlyasingledocumenttomaintainallofthe information related to a genre in the Genre collection. The link between
albums and genres is established via the genreKeys attribute in each Albumdocumentandcanbeeasilyresolvedinaquery,withoutanadditionalcollection.Alookuporcollectionjoinofgenresleadstothesameresultsaswiththegraphmodel,butperformsbetter thana full graph traversal. Ifperformance is lessaconcern,though,youmayaswellusethegraphapproachsinceitresemblesmorecloselythewaywethinkofrelations.
Another consideration should be the amount of links between documents.Chancesarethatatleastsomeofthegenreswillbereferencedbythousandsorevenmillionsofalbums.Usingagraphmodelwouldresultinsomegenrevertices,having a huge amount of edges connected to it. Such vertices are called supernodes.Theyshouldbeavoided.Traversalsrunningintosupernodeswillhavetofollowalloftheseedges.Thecollectionjoinmodelscalesmuchbetter.Thedesiredapproach,though,canbechosenbytheuser,dependingontheusecase.
3.4 Comparison of Data Models
Data ismodeledas tables foranRDBMS. It’s commonpractice tonormalizeaninitialmodelso thatonearrivesat amodelwithout redundancieswithatomicvalues, in the sense that pieces of information aren’t being split into separatetablesandcolumns.Forexample,youwouldn’thaveasinglefieldcontaining"ToysintheAttic(1975)",withbothanalbumtitleandthereleaseyear. Instead,youwouldhavetwofields,onecontainingthetitleandtheothertheyear.Youwouldalsohavedistincttablesforeachentitytype.Thenormalizationprocessinvolvestheadditionofcrosstablestostorerelationsbetweenrecords.
Whilethereisaplethoraofdatatypesfromwhichtochoosebeforestoringdatainrelationalsystems,thereareonlyafewbasicdatatypesinArangoDBthatcanbecombinedtostructuredatainmanyways.Theyallowforsimpleandcomplexalike,independentforeachdocument.Everythingexpressiblewithtablescanbetranslated to JSON documents, without the need to specify in advance thestructureofdocuments.
A direct conversion of relational records to documents would result in JSONdocumentswithtop-levelattributesonly(i.e.,nonestedobjects).Thepossibility
of having sub-attributes and arrays, however, enables quite a different datamodelingapproach.Thisapproachpermitstheembeddingofdataindocumentsandeliminatessomeextratablesorcollections.Translatingcrosstablestoedgecollections isagreatwaytostoremany-to-manyrelationships inArangoDB, togain graph features such as shortest path computation and graph traversal.Distributed graph processing based on the Pregel computing model is alsoavailable starting inversion3.2ofArangoDB. If youdon'thaveaneed for realgraph,amodelwithcollectionjoinscanbechoseninstead.Multiplemodelscanbecombinedfreely.ArangoDBprovidesflexibility,allowingthedevelopertostrikeabalance between an understandable data model, fast development iterations,neededfeaturesandperformance.
4 Query Concepts
4.1 Introduction
Storingdataisonlypartofthefunctionofadatabasesystem.Theabilitytoretrievedata in variousways from a database iswhere an electronic database ismostuseful. Otherwise, youmay use a paper-based system aswell. To facilitate theretrievalofdata,aquerylanguageisprovided.RelationaldatabasesystemslikeMySQL use a StructuredQuery Language (i.e., SQL), while ArangoDB providesArangoQueryLanguage(i.e.,AQL).
4.2 SQL
SQLcannotonly retrieve records from thedatabase system, it canalso insert,update and delete records. Tables and databases can be created, altered anddropped,anduserpermissionscanbemanagedwithSQLstatements.Theseusesaregroupedandclassifiedbasedon theirsimilarities:data retrievalqueriesasDQL(i.e.,dataquerylanguage);inserting,updatinganddeletingdataasDML(i.e.,datamanipulationlanguage),structureandschemastatementasDDL(i.e.,datadefinitionlanguage);andcreatingusersandsettingpermissionsasDCL(i.e.,datacontrol language). There are also procedural elements in the form of storedproceduresformorecomplexlogic.
SQLbecameastandardin1986,withseveralrevisionsovertheyearsuptothecurrentSQL:2016standard.MostRDBMSimplementasubsetofthesestandardsoravariantwithadditionalvendor-specific featureswithadialectof thequerylanguage.Thus,otherthanthemostbasicSQLstatements,SQLcodethatworksinonesystemmaynotwork inanothersystem,withoutmodificationsbecauseofsyntaxandfeaturedifferences.
SQL comes with several language constructs to interact with databases. EachconstructorSQLstatement,readslikeanEnglishsentencewithafixedorderofclauses.Manyclausesareoptional.Aclauseisstartedbyakeyword,followedbyan expression. For example, to retrieve records from a database, a SELECTstatementisusedandspecifieswhattoreturn:
Eachclausemustappearinthedesignatedplaceifitisused.Forexample,itwouldbea syntaxerror if theWHERE clausewasenteredbefore theFROM clause.TheWHERE clause can beomitted to return unconditionally thedesired fields of allrecords.HAVINGallowsmoreconditionstobeadded,butcanonlybeusedafterGROUP BYasapost-filter.TheWHERE clausedescribes filterconditionsappliedbeforetheaggregationinthecombinationofboth.TheORDER BYclauseisusedtosortthematchingrecordsbyoneormorefieldsinascendingordescendingorder.
WedidnotincludeaLIMITclauseinthisgenericexampletodefineanoptionaloffsetandamaximumnumberofrecordstoreturn.TheLIMITclauseiscommonlyusedwithMySQL, but it’s actually non-standard. TSQL, used inMicrosoft SQL,supports the syntax SELECT TOP 10, for instance. There are many morevariations and some support for multiple syntaxes in various systems. TheSQL:2008standardprovides foraFETCH clause(e.g.,FETCH FIRST 10 ROWS ONLY),withvaryingsupportacrosssystems.
Themost common joiningof tables inSQL is the inner join,which returns theintersectionoftwosets.Acolumnofthefirstsetisusedtomatchthevaluesofatablewith a specified columnof another table, the second set. For instance, ifthere'sarecordwithaspecificvalue inaparticularcolumninonetable,andarecordwith the same value in a corresponding column in another table, thoserecordswiththeselectedcolumnsarereturned.
Unlessemp_id isdefinedsoastoallowonlyuniquevalues,thesameemployeeidentificationnumbermayoccurmultipletimesineithertable.ThiswillproduceadditionalresultsinSELECTstatements.AJOINisusedtoexpresstheothertableyouwanttojoin.AnINNER JOINsaysonlytoreturnrecordsinwhichthevaluesinthegivencolumnsareequal.Itcanalsobeexpressedlikethis:
If you want all of the records from the employees table are supposed to bereturned,regardlessofamatchintheemp_contact_infotable,aleftjoincanbeusedlikeso:
Basically,thetablebeingselectedisthetableontheleft,andtheonebeingjoinedthe table on the right. A left join will return every row in the left table (i.e.,employeeshere)thatmeettheconditionsoftheWHEREclause,iftherewereone,aswellasthematchingdataintherighttable.However,ifthereisn’tamatchingrowintherighttable,NULLvaluesaredisplayedforfieldsrelatedtoit.
4.3 AQL
AQLisclassifiedasDQLandDML,whichmeans itcancreate,read,updateanddeletedocuments.Itcannotbeusedtocreatecollectionsanddatabases,managepermissionsordefineschemas.Assuch,AQLhassomesimilaritiestoSQL,butalsodistinctdifferences,mainlybecauseit’sdesignedforcombiningmultiplemodelsandgraphtraversalsinparticular.
AQL was invented because when ArangoDB was created, there was no SQLstandardwhich covered itsmulti-model needs. A standard does still not existwhichcoversallthatisArangoDB,inparticulargraphs.ThisisprobablybecauseNoSQLsystems involveawiderangeofdataconceptsandpracticalgoals,withvastly different technological approaches. ArangoDB supports multiple datamodelswithitssinglequerylanguage,AQL.It’sacustom-tailoredfit.Becauseit'snot led by a standard, it can improve quicklyand freely, based onmarket anddeveloperneeds.
4.3.1 FOR Loop Construct
AQLhasalanguageconstructforloopingthroughdata,typicalinprogramminglanguagesmorethandatabasequery languages.Thisallowsdeveloperstobothqueryandprograminoneversatilelanguage,ratherthanhavetolearnandusetwo.It’sthecentralbuildingblockfordocumentretrieval.ThemostcommonformistheFORloopusedwithacollectionofdocuments,muchlikeSELECTinSQL.Itcan also be used to iterate through arrays, such as an array attribute, forintermediateresultsfromasubqueryoravariabledefinedinline.
4.3.2 High-Level Operations
Otheressentialhigh-leveloperationsareFILTER,SORTandLIMIT.Theycanbeenteredindifferentlocations,usuallywithvaryingresultsbasedontheprocessingorderdescribed by the query.You can definewhat a query should return in aRETURNstatement.Onecanchoosewhichpartsofadocumenttoreturn,or the
documentasawhole,optionallywithadditionaloralteredattributescomputed.
Datamodificationqueries(i.e.,INSERT,UPDATE,REPLACE,DELETE)don’trequireaRETURN statement. It’smandatoryonly indataaccessqueries. It canbeusednonetheless inmodification queries, such as for returning old or new state ofdocuments. Modification queries come in a few syntax variants, but thepredominantformsarelikethefollowing:
ThedifferencebetweenUPDATE andREPLACE is thatUPDATE allows forpartialmodifications,suchaschangingexistingattributesoraddingnewones,whereasREPLACEretainsonlythedocumentkey,andswitchestherestofthedocumentwiththeprovidednewcontent.
4.3.3 Query Optimization in AQL
In order to execute the above described queries, ArangoDB uses a two-phasequery optimization technique. In the first phase, the query is parsed andtransformed intoan internal representation, anabstract syntax tree (i.e.,AST).ThenoptimizerrulestransformtheASTwithoutchangingtheresult.Forinstance,theoptimizerwilltrytouseindexeswhereverpossible.
Thereareoftenseveralpossiblewaystoexecutethesamequery.Forexample,ina join itcouldstart thesearch inanyof the involvedcollections.Theoptimizerestimates what’s required to solve the query for each way possible, and willexecutetheonethatrequiresthelowestamountofoperations.IfyouarecurioustoseewhatArangoDBcreatesfromyourqueryyoucanchecktheexplainoutput.Thistoolcanhelptoimprovequeryperformance.
4.3.4 Combine All of the Models
Nowlet’sseewherewecanmakeuseofthemulti-modelapproachandcombineagraphsearchwithjoins.Let’sassumethatwehaveaverylargedatasetofsongsfromseveralyearsandseveralgenres.Inthiscase,somegenreswillmostlikelybesuper-nodes,verticeswithmanyconnectededges(e.g.,“Pop”).It’salwaysabadidearelatedtoperformancetoiteratethroughthemwithinaquery.
Considerthefolloweruserrequest:“Ijustlistenedtoasongcalled,TributeandI
likeditverymuch.IsuspectthattheremaybeothersongsofthesamegenreasthissongthatImightenjoy.So,Iwanttofindallofthealbumsofthesamegenrethatwerereleasedinthesameyear”.
Let’s break this into logical steps, utilizing the names we have assigned tocomponentsofourdatabase.We’lldothisinthepuregraphway,inthisorder:
1. StartwithSong,‘Tribute’2. Onestep(PartOf)tofindtheAlbum3. Onestep(Has)tofindtheGenre4. Onestepback(Has)tofindallotherAlbum’softhisGenre5. YearofotherAlbumisidenticaltoyearofAlbum
WritteninAQL,thequerywouldlooklikethis:
This query is straightforward and will find the solution. However, it has thedrawbackthatismost-likelytraversingoverasuper-nodeinGenre.Also,thelistofgenresisunlikelytogrowrapidly.Althoughwemayaddmanynewsongs,weseldomlyaddnewgenres.Therefore,itwouldbebetterfirsttoselectallAlbumsofthesameyearandthenvalidatethatthegenreisidentical.ThiswaywegetalimitedsetofAlbums,andeachhasonlyonegenre.Thatquerywouldberesilienttodatagrowth.
So,letusmodifyourquerytothefollowing:
1. StartwithSong,‘Tribute’2. Onestep(PartOf)tofindtheAlbum3. OneStep(Has)tofindtheGenre4. JoinallAlbumswithidenticalYear5. ForeachotherAlbum,onesteptofinditsgenre6. FilterallotherAlbumwherethegenreisdifferent
ThesestepscanbewritteninAQLlikethis:
4.4 Query Language Comparison
Atthecore,SQLandAQLarebothdeclarativelanguages,withfunctionsthatcanbecalledforadditionaloperations.ThekeydifferencebetweenSQLandAQListhatinSQLyoudescribetheresultyouwant,andinAQLyoudescribetheprocesstogetthere.TheconceptbehindAQLisclosertocoding,andthereforeeasierthanswitchingbetweenacodinglanguageandqueryinglanguage.
SQLsyntaxhasafixedstructure,whereashighleveloperationsinAQL,suchasfilters,canbeputinvariousplacesfordifferentpurposesandresults.TheflowismorelogicalthanadheringtoSQLstatementorder.Plus,high-leveloperationslikeFILTERandSORTaremoreversatile.FILTERequalsWHEREandHAVING inSQL,dependingonwhereitappears.SORTisadirectequivalentofORDER BYinSQL.LIMIT is the same as in MySQL, although there are some things to considerregardingscopeinconjunctionwithsubqueries.RETURNdoesn’texistinSQL,butyoucanspecifywhichfieldstoreturninSELECT.InSQLwhatistobereturnedisgivenearly in thestatement,whereas inAQL,RETURN isplacedat theendof aqueryandofsubqueries.
AggregationwithCOLLECT isn’t too different fromGROUP BY in SQL, but theexistenceofcompoundtypesinAQLmakesitverydifferent.WhileinAQLthereisadirectalternativetotheconstructofaGROUP BYclauseinSQL,withAQLtherearemanymoretwistsandwaystoaggregatedata.Forexample,alldocumentsthatfallintoagroupbasedonthegroupingcriteriacanbekeptandfurtherprocessed.With the COLLECT AGGREGATE clause, you can efficiently compute statisticalfigureswhilethedataisbeinggrouped.Groupingisfastevenwithoutanindex,usingahash-basedapproach.
FORloopsinalltheirvariantsareakeyconceptofAQL.TheymakeAQLfeelmore
likeaprogramminglanguagethanaquerylanguage.It’samajorbuildingblockandaddsasenseofflowtoqueries.
Joins,whichareheavilyusedinmostapplicationsbasedonanRDBMS,arealsopossible in AQL. Joins are expressed with nested FOR loops combined withFILTER,insteadofaseparatesyntaxforeachJOIN.It’squitesimilartothesyntaxforinnerjoinsinSQL,butwithoutaJOINstatement.Multipletypesofjoinscanbecarried outwith AQL, although itmay not be necessary to do so. An array ofdocumentkeystogetherwiththeDOCUMENT()functionisaviablealternativetoresolveone-to-manyrelationships.
Recursion is not possible in SQL, except with a stored procedure—which isbasicallyproceduralprogrammingandnotstrictlypartoftheSQLstandards;theyare largely vendor-specific. Using nested FOR loops in AQL allows for thetraversingofmultiplelevelsofattributeswithinadocument.Thegraphtraversalvariant of the FOR loops allows a query to traverse a graph with a definableminimumandmaximumdepthwitha fewsimple linesofcode.Thissametaskwould be extremely complex in a relational database system with a storedprocedure.