Comparing Relational Databases to a Native Multi …...Consiglio Nazionale delle Ricerche Istituto...

Consiglio Nazionale delle Ricerche Istituto di Calcolo e Reti ad Alte Prestazioni

Comparing Relational Databases to a Native Multi-Model Database System

A. Messina

Rapporto Tecnico N.: RT-ICAR-PA-18-01 Gennaio 2018

Consiglio Nazionale delle Ricerche, Istituto di Calcolo e Reti ad Alte Prestazioni (ICAR) – Sede di Cosenza, Via P. Bucci 41C, 87036 Rende, Italy, URL: www.icar.cnr.it – Sede di Napoli, Via P. Castellino 111, 80131 Napoli, URL: www.na.icar.cnr.it – Sede di Palermo, Viale delle Scienze, 90128 Palermo, URL: www.pa.icar.cnr.it

Non è possibile visualizzare l'immagine.

Consiglio Nazionale delle Ricerche Istituto di Calcolo e Reti ad Alte Prestazioni

Comparing Relational Databases to a Native Multi-Model Database System

A. Messina1

Rapporto Tecnico N.: RT-ICAR-PA-18-01 Gennaio 2018

1 Istituto di Calcolo e Reti ad Alte Prestazioni, ICAR-CNR, Sede di Palermo, Via Ugo

La Malfa n. 153, 90146 Palermo. I rapporti tecnici dell’ICAR-CNR sono pubblicati dall’Istituto di Calcolo e Reti ad Alte Prestazioni del Consiglio Nazionale delle Ricerche. Tali rapporti, approntati sotto l’esclusiva responsabilità scientifica degli autori, descrivono attività di ricerca del personale e dei collaboratori dell’ICAR, in alcuni casi in un formato preliminare prima della pubblicazione definitiva in altra sede.

Non è possibile visualizzare l'immagine.

Index

1 INTRODUCTION......................................................................................................5

2 DATABASECONCEPTS..........................................................................................6

2.1 Introduction ......................................................................................................................... 6

2.2 Relational Database System Concepts.................................................................................. 6

2.2.1 Table and Schema Definition ............................................................................................ 6

2.2.2 Insertion and Identification of Records ............................................................................. 7

2.2.3 Indexing ........................................................................................................................... 7

2.2.4 Transactions ..................................................................................................................... 8

2.3 Concepts in ArangoDB .......................................................................................................... 8

2.3.1 JSON Data Types .............................................................................................................. 8

2.3.2 Living without a Schema ................................................................................................... 9

2.3.3 Document Keys and Indexes ........................................................................................... 10

2.3.4 Graphs ........................................................................................................................... 11

2.3.5 Transactions ................................................................................................................... 12

2.4 Comparison of Terminology ............................................................................................... 12

3 DATAMODELING................................................................................................15

3.1 Introduction ....................................................................................................................... 15

3.2 Data Modeling in Relational Systems ................................................................................. 15

3.3 Data Modeling in ArangoDB ............................................................................................... 16

3.4 Comparison of Data Models ............................................................................................... 19

4 QUERYCONCEPTS...............................................................................................21

4.1 Introduction ....................................................................................................................... 21

4.2 SQL ..................................................................................................................................... 21

4.3 AQL .................................................................................................................................... 23

4.3.1 FOR Loop Construct........................................................................................................ 23

4.3.2 High-Level Operations .................................................................................................... 23

4.3.3 Query Optimization in AQL ............................................................................................. 24

4.3.4 Combine All of the Models ............................................................................................. 24

4.4 Query Language Comparison ............................................................................................. 26

1 Introduction

This technical report compares relational database management systems(RDBMS) to nativemulti-model database systems— in particular,MySQL andArangoDB. It describes their key concepts, contrasts both at the end of eachsection,andconcludesbyexplainingwhatsetsArangoDBapart.

Thisisahigh-leveloverviewoffeaturesanddifferences.Itcan,though,serveasastartingpointtounderstandthemindsetofbothtypesofsystems,forareadertobridgethegapfromwhattheyknowaboutdatabaseslikeMySQL,towhattheymaynotyetunderstandinArangoDB.

ThistechnicalreportwillalsofamiliarizethereaderwithprinciplesinArangoDB,suchasdocumentstoresandgraphdatabases.ThesearecombinedinArangoDBinthesamedatabasecore,accessiblewithonequery language.Thegoalof thisdocument isspecially tohelpthereaderappreciate thecomparableeaseof thetransitiontogreaterstorageflexibilitywithmultipledatamodelsthatArangoDBprovides.

ArangoDBisanativemulti-modelNoSQLdatabase.Itsupportsthreepopulardatamodels,namelykey-value,documentandgraph.ArangoDBiswritteninmodernC++andisavailableasopen-sourceversion(CommunityEdition)andalsocomeswith a commercial version including additional features for performance andsecurity(EnterpriseEdition).

MovingrelationaldatatoArangoDBissimple:exportthestoreddatatoaformatsuchasCSV, and thenusearangoimp to import thedata intoArangoDB.Oncethat’sdone,youmaycontinueworkingwiththesamedatamodelthatyouusedpreviously with the relational database. Because ArangoDB supports real joinoperations,thereisnoneedfornesteddatatocircumventnormalizationandjoins.Hence,youdon’thavetoreworkthedatamodel.

Still,therearesomedifferencesbetweenbothstorageapproaches.Thistechnicalreportwilldescribethedifferencebetweeneachdatabaseconcept,aswellasthedatamodeling possibilities and query options availablewith each. In the finalsections,thispaperwillalsointroducetheclusterarchitectureofArangoDBandtheoptionsthatareavailableforscalinghorizontally–evenwithcomplexqueries.

2 Database Concepts

2.1 Introduction

Beforedelvingintodatamodelingandmoreadvancedconcepts,let’sstartwithageneralexaminationofthedatabaseconceptsbehindrelationaldatabasesystemsandcontrastthatwiththebasicconceptsofadatabasesystemlikeArangoDB.

2.2 Relational Database System Concepts

Thedatamodelofarelationaldatabasemanagementsystem(i.e.,RDBMS)utilizesthe controllingmetaphor of a table,much like ones found in inventory ledgerbooks used for centuries in business. Records are stored in tables, with eachrecordbeingarowinthattable.Thecolumnheadsdefinetheavailablefieldsandthevaluesforeachofthecolumnsperrecordinindividualcells.Suchtablescanbe organized in databases (e.g., for multi tenancy use). There are a few coredatabases that contain tables which store system configuration variables andotherdataforinternalusebythesystem.

Data in a relationaldatabase systemcanbequeriedand retrieved ina tabularformat. It can be full rows or subsets of columns. Behind the scenes, a binarytransport protocol is used to send data. Although one can choose betweendifferentstorageengines–InnoDBisthemostpopular–withdifferentfeaturesandperformanceproperties,howdataisactuallystoredisinvisibletotheuser.

2.2.1 Table and Schema Definition

Beforedatacanbeinsertedintoatable,adevelopermustfirstconsiderthedataitwillcontain,howitwillbeorganizedandthenatureof thatdata–oftentimeswithouthavingbeengivenmorethanasmallsampleofthedatainadvance.Hemustdecideonthenumberofcolumnsandrecognizablenamesforeachcolumn.More importantly,hemustdecideonthedatatype(e.g.,INT,CHAR,DATETIME)appropriatetothetypeofdataanticipated.Unlesshechoosestoacceptthedefaultsettings for each column, he may want to set some restrictions on the dataaccepted (e.g.,NOT NULL).Plus,hemaywantto takeaguessatwhich columnshouldbe indexedwithoutanyusage information,andtowhatextentan indexshouldbeused,asinthecaseofmulti-columnindexes.

Thedefinitionofatablewithallofitscolumnsettingsiscalledaschema.Schemas

canbemodifiedinmostrelationaldatabasesystemsaftertheinitialdefinition,butinsomesystems,especiallywithclusters,itcanbeanexpensiveoperationbecausethe data needs to be re-organized on schema changes. Although a skilleddevelopercananticipatepotentialtableusageandtheneedsofitsusers,designingaschemacanbemuchlikeawrittenone:makingchangeslatertotheorganizationofthedatacanbecumbersome–evenwhenyouusedapencilinsteadofapen.

2.2.2 Insertion and Identification of Records

Wheninsertingrecordsintoatable,thevalueofallofthefieldscanbeprovidedinthesameorderofthetable’scolumnsbasedonitsschema,orasubsetofthecolumns can be givenwith the name and values of each field given. Fields notincludedwillbesetgenerallytoaNULLvalue,whichexpressestheabsenceofaproper value – unless a default value was specified in the schema. If columndefinitionhasnodefaultvalueanditwasdefinedasNOT NULL,anerrorwilloccurwheninsertarowwithoutgivingavalueforthecolumn.DependingontheSQLmodeused,thiswillpreventtherowfrombeinginserted.Ifit’spartofamultiplerowinsert,itwillpreventallrowsfrombeinginserted.

Each row in a table can usually be uniquely identified by an automaticallyincrementednumber.Thecolumnforthisidentifierisdefinedastheprimarykey.It’sacceptabletouseadifferentdatatype,orforegoaprimarykeyaltogetherandidentifyrowsbyothermeans,evenbyacombinationofvaluesincertaincolumns.However,thisisnotacommonpractice.Ifaprimarykeyisreferencedbyanothertable,itiscalledaforeignkey.

2.2.3 Indexing

TheprimarykeyindexforfastaccessofrecordsisusuallyaBTREE,thetypicalindextypeused inRDBMS.Therearesomeidiosyncrasiesrelatedtothis indextype.Theprimarykeymustbeunique.Theuser can create secondary indexesusingoneormorefields,withsettingslikeuniqueorfulltextindexes.Dataqueriesexecuted on a huge dataset,withoutusing indexes can be significantly slower.Therefore, users are dependent on the database designer to define well andproperlyindexesforthebestperformancebasedonthedatacontentandhowitwillbeused.

2.2.4 Transactions

Akey strengthofRDBMS is that they support transactional guarantees. InSQLqueries, transactions are startedwith explicitBEGIN orSTART TRANSACTIONcommand.Followingaseriesofdataretrievalormodificationoperations,anSQLtransactioniscompletedwithaCOMMITcommand,orrolledbackwithaROLLBACKcommandor if the session is terminated.Theremaybe client communicationswith the server between the start and the commitment or rollback of an SQLtransaction.

2.3 Concepts in ArangoDB

At its core ArangoDB is a transactional document store, in the sense of JSONobjectsasdocuments.

Internally,dataisstoredinabinaryform(VelocyPack),butwhatisenteredandreturned from the system is typically JSON. By default, JSON data is sent andreceived over the well-known HTTP protocol by the server, which provides aRESTfulAPI.OtheroptionsareVelocyPackoverHTTPandVelocyStream,abinarytransportprotocol.

Howdataisactuallypersistedisnotvisibletotheuser.Adevelopercanchoosebetween two storage engines with different characteristics according to theparticularusecase:

2.3.1 JSON Data Types

JSONsupportsafewprimitiveandcompounddatatypesthatcanbecombinedtocomplex data structures. The primitive types are: null, boolean (true, false),number (integer and floating-point numbers), and string. These are fairly

straightforwardandthesameinanydatabasesystem.

Thecompoundtypesaremorecomplicated.Theyaretwotypes:arrayandobject.Anarrayisanorderedcollectionofelements,eachidentifiedbyanindexstartingat0;arrayscanbemultidimensional.Anobjectisahashmapwhichmapsstringkeys(i.e.,attributekeys)tovaluesofarbitrarytypes(i.e.,attributevalues).Objectscancontainprimitivetypesornestedcompoundtypes,withsupportforarbitrarydeepnesting,ifdesired.EachdocumentisessentiallyaJSONobjectatthetoplevel,witharbitrarynamedattributesthatcanhaveprimitivevaluesorbenestedarraysandobjects.

Documents are stored in collections. Collections have names, which ideallydescribe what kind of information the documents contain. Collections can beorganizedinmultipledatabases(e.g.,formulti-tenancyuse).Thedefaultdatabasein ArangoDB is called _system. It contains hidden collections for internalpurposes,buttheusermayalsoaddcollectionstothedatabase.

2.3.2 Living without a Schema

ArangoDBisschema-free.Thesystemdoesnotrequireyoutodeclareordefinefieldattributes.Norisitnecessarytospecifyinadvancethedatatypesoffields.Incaseaschemavalidationisnecessary,itcanbeintegratedbyaJOIvalidationinaArangoDBFoxxmicroservice.

Documentscanhaveanarbitrarystructureandcanuseanysupporteddatatype.Inprinciple,eachdocumentinacollectioncanbestructureddifferently.However,therewillusuallybesomefieldsthatalldocumentshaveincommoninthesenseofattributekeys.Theattributevaluesmayusedifferentdatatypes,nonetheless,eveniftheattributekeysmatch.

There is no such thing as filling in a subset of fields in a schema-free system,becauseforeverydocument,whatshallbepersistedisdefinedbythedocumentitself. It is self-contained. A document can have as few or as many attributesincludingnestedattributesasneeded.Both,attributekeysandvalues,havetobeexplicitlyspecifiedallthetime.

2.3.3 Document Keys and Indexes

Eachdocumentrequiresakeythatuniquelyidentifiesitwithinacollection.Itcanbeassignedbytheuseruponcreation,orArangoDBwillgenerateone.Thedatatypeisalwaysstring.Theautomaticallygeneratedkeysareincreasingnumbers(withgaps),butconvertedtoanon-numericcharactersequence.Documentkeyscannot be modified after document creation (immutable). There is a virtualattribute (i.e., the document ID) which is comprised of the collection name, aforwardslashandthedocumentkeytoidentifyadocumentwithinthedatabase.Documentkeysanddocument identifiersarealways indexed.The indexonthe_keyattributeiscalledaprimaryindex.Itexistsforeachcollectionandcan’tberemoved.It’sahashindexwiththeoptionsnon-sparseandunique.Thismeansitisalwaysnon-emptyandthattherearenoduplicatekeysinthecollection.

Additionalindexescanbedefinedbytheuseroveroneormultiplefields,aswellastheindividualelementsofarrays.Availableindextypes:

• Hash:constantlookuptime(superfast),onlyapplicableifanexactvalueissearched

• Skiplist: sorted index, slower than hash index, but suitable for rangequeries

• Geospatial:canindex2Dcoordinatesforfastretrievalofdocumentsnearareferencepoint(i.e.,closestdistancefirst),fullGeoJSONsupportwillbeaddedinArangoDB3.4

• Fulltext: fullwordandprefixmatching instringswithareverse lookupindex(completetextsearchandrankingengineincludinginvertedindexeswillbeintegratedinArangoDB3.4)

• Persistent: WithMMfilesstorageengine,persistentskiplistindexescanbecreatedviathistype.WithRocksDB,allindexesarepersisted.

ArangoDB can be used as a key-value store by retrieving documents via theirdocumentkeysorIDs.Theprimaryindexishash-basedwithaconstantlookuptimeandhasaselectivityof100%becauseeachkeymustbeuniqueinacollection,whichenablesquickretrievalviathekey.Thedocumentasawholeisreturnedasavalueinthekey-valueaccesspattern.

Ifsecondaryindexesareinvolved,oradocumentisreturnedpartially(i.e.,onlycertainattributes),itisnotusedstrictlyspeakingaskey-valuestore.Instead,it’susedlikeadocumentstore.Thereisnorealdistinction,however,betweentheminArangoDB.

2.3.4 Graphs

ThethirddatamodelsupportedbyArangoDBisgraphs.Graphsarelikenetworksconsistingof nodes that are linked together,which can expressmany-to-manyrelationships.Therecanbevarioustopologies, likeagraphforminga tree(e.g.,corporate hierarchy, electricity grid) rather than a mesh (e.g., social network,fraudnetworks).

TherearetwotypesofcollectionsinArangoDB:vertexandedgecollections.Bothstoredocumentsaspreviouslydescribed.Documentsinanedgecollectionhavetwoadditionalattributes,_fromand_to.BothhavetobeassigneddocumentIDstolinktogetherdocuments.Thedocumentthatlinksthemiscalledanedgeandthe linkeddocumentsare calledvertices in thegraphmodel.Edgesarealwaysdirected inArangoDB.The edge is leading from the source (i.e.,_from) to thetarget(i.e.,_to).EdgesdohaveadirectioninArangoDBbutcanbeignoredbytheANYstatement.

Additionally, there is a special edge index that can receiveall of theedgesof avertexinconstanttime.Sincereceivingedgesisthemostusedoperationinanygraphquery, this constant time lookup is a requirement forgraphdatabase tooperateperformantly.Thisindexcannotberemoved.Usersmaydefineadditionalindexesforedgecollectionsingeneral,andindexesoverthe_fromand_tofieldsinparticular,tocreatevertex-centricindexestospeedupcertaingraphqueries.

ArangoDBcanalsoefficientlyhandleothertypesofdata,suchasgeo-spatialandtext.WithArangoDB3.4,thedatatypestextandgeo-spatialwillreceiveamuchricher featuresetandnewindexestoextendthenumberofpossibleusecases.Thesedatatypesareintentionallynotcalleddatamodels,astheprocessingofthetwoismoreamatterofcomputationthanstorage.

2.3.5 Transactions

InArangoDB,atransactionisalwaysaserver-sideoperation.Itisexecutedontheserver in one go, without any client interaction. All operations to be executedwithin a transaction need to be known by the serverwhen the transaction isstarted.

Thereareno individualBEGIN,COMMITorROLLBACKstatements inArangoDB'squerylanguageAQL.Instead,atransactioninArangoDBisstartedbyprovidingadescriptionof the transactionwrittenas JavaScript function.This functionwillthen automatically start a transaction, execute all required retrieval andmodificationoperations,andautomaticallycommittransactionsattheend.Ifanerroroccursduringtheexecutionofatransaction,thetransactionisautomaticallyaborted,andallchangesarerolledback.

Transactions in ArangoDB can be multi-document and multi-collectiontransactions in a single instance. For cluster scenarios, ArangoDB alreadysupportsatomicoperations.Ineverydocument,ArangoDBstoresautomaticallythe so-called revision key,_rev in alldocuments. For a single instance, this isbeingused tosupport fullMultiVersionConcurrencyControl (MVCC,RocksDBstorage engine) and in a cluster setting to support atomic operations likeCompare-and-Swap.ArangoDBwillsoonsupportmoreoptionsfortransactionalguaranteesinclustersettingsonourpathtocluster-wideMVCC.

2.4 Comparison of Terminology

HavingexaminedandcomparedthekeyconceptsandmethodsofbothRDBMSandArangoDB,let’sconsiderandcomparetheterminology.Therearesometermsused in an RDBMS that are identical to its counterpart in ArangoDB, althoughviewedalittledifferently.Thereareothers,though,forwhichadifferenttermisusedtoreflectmoreclearlythedifferentperspectiveofArangoDB.Belowisalistofthesekeytermsforbothsystems:

InanRDBMS,entitiesofthesametypearestoredintables.Collectionsareroughlytheequivalentof tables indocument stores, as theyusuallyholddocumentsofsimilarentities.Therenormallywillbesomeormanycommonattributes.Becauseofthisvaryingnatureofdatastructures,collectionasisamoreaccuratetermthantable, which has a fixed structure. Documents are also not tabular, becauseunneeded attributes (i.e., fields) can be omitted and it’s possible to nest data,whichallowsforatree-likestructure.

The document structure can differ between any two documents in a singlecollection. For example, as new features are added to a product that usesArangoDBasabackend,additionalattributesmaybecomenecessary.Theycanbeaddedprogressivelyon-the-flyfornewlycreateddocuments,withouttheneedtoupdateallpreviouslyexistingdocuments.InSQL,thetablestructure(i.e.,schema)wouldneedtobeupdatedwithadditionalcolumns.This iscanbeasignificantoperationinsomesystemsandrequiresplanning.

While there are a few primitive data types in ArangoDB, the systemdoes notimposeanyrestrictionssuchashavingtodefinewhatdatatypesmustbeusedforwhichattributes—itisschema-free.Ifyouwishtovalidatedocumentstructuresontheserver-side,ArangoDBprovidesJOIobjectschemavalidationviaFoxx.

There are many more primitive data types in relational database systems. InArangoDB,thereisonlyasingleprimitivetypefornumbers,whichcoversintegersaswellasfloatingpointnumbers.There'salsoonlyonedatatypeforcharactersequences.TherearenolimitstothelengthofstringsandstringsarestoredinnormalizedUTF-8encoding.Therearen’tdatatypessuchasenumsorsets,whichrequirethedefiningacceptablevalues.Setscanbeemulatedbyusingafunctiontorestrictanarraytouniqueelements.AmissingorunavailablevalueinanRDBMSisberepresentedbyNULL.ArangoDBalsohasanullvalue,butit’srarelyneeded.Theabsenceofavaluecansimplybeexpressedbythelackoftheattribute.

The way ArangoDB implements graphs is different from many other graphdatabases,whichuse“index-freeadjacency”.Edgesstoredinaseparatecollection

withaspecialindexonthe_fromand_toattributesarenottoodifferentfromcross tables for storing many-to-many relationships in an RDBMS. Theperformanceguaranteesfortraversalsare,nonetheless,onparwithothergraphdatabases, but there is more potential to scale graphs. Combined with theEnterprise Edition features (i.e., SmartGraphs and SatelliteCollections), graphperformancecanbefurtherimproved.

3 Data Modeling

3.1 Introduction

Modelingdataforadatabasesystemstartswithdefiningwhatdataneedstobestoredtosolvetheproblem.Thisusuallyrequiresbeingclearaboutthequestionswhichshouldbeanswered,whatentities thereareneeded, andwhatare theirrelationshipstocreatea logicalmodel.Thismodelcanbe iterateduponuntil itaddressestheentireproblem.Themodelneedstobetranslatedtoadesignthatthechosendatabasesystemsallowsand ideallysupports inaneasytouseandperformantway.

3.2 Data Modeling in Relational Systems

Thefirststepincreatingalogicalmodelistodeterminetheentities.Forexample,considera simpledatabaseaboutmusic. Supposewechose tostore songsandalbumsbyartisttogetherwithgenres.

Thenextstepistodefinethepropertiesoftheseentitiesandtheirrelationships.Anentity-relationship-model(ERM)isthestandardwayofrepresentingallofthis.

It'seasytocastthisintoarelationaldatabasemodel.Foreverytypeofentity,onetableisdefinedwiththeentitypropertiestranslatedtothecolumns.

Relationshipsbetweenrecordscanbeexpressedwithmatchingvaluesbetween

tables.Theycanbeone-to-onerelations(i.e.,1:1),inwhicheachrecordintableAmatchesonerecordintableB.Thatistosay,afieldinAandanotherinBhavethesame value — usually by way of a primary key. Not every record needs acounterpart;theremayalsobeone-to-zerorelations,buttherecan’tbemorethanonerecordoneitherside.It’scommontointegratethefieldsoftableBintotableAinsuchasituation,unlesstherearespecialconsiderationstoholdthedataindifferenttables(e.g.,accesspermission).

Incaseofone-to-manyrelations(i.e.,1:n),eachrecordintableAmatcheswithoneor more records in table B. Applying this to our example, albums and theirrespectivesongswouldbestoredseparatelyinthismodel.Eachsongrecordcanonlybelinkedtoonealbuminthissimplemodel.Inareality,onewouldprobablyallowthesamesongstobelinkedtomultiplereleases.

Therelationshipbetweenalbumsandgenresisdifferent:amany-to-manyrelation(i.e.,n:m).Asinglealbumcanhavemultiplegenres,andmultiplealbumscanhavethesamegenre.Albumsandgenrescanbestoredintwotables,andathirdtableknownasacrosstablebeintroducedtolinktherecordsofthetwoformertables.Arecordinthecrosstablestoresitsownprimarykey,andonealbumIDandonegenreID,aforeignkey.

Tomakeuseof stored relations, toresolve foreignkeys to fields fromanothertable,thetables—orrathersubsetsoftheirrecords—arejoinedonthefieldswhichholdtherecordkeys.Thesefieldsshouldbeindexed,astherecanbeahighcomputationalcomplexityifallofthecombinationsofrecordsneedtobechecked.The result isoftenadatabasewithmany tablesandmanycross tables, alwaysrequiringjoinstoquerythedata,thusimpactingperformance.

3.3 Data Modeling in ArangoDB

Asdescribed intheDatabaseConceptssectionabove,ArangoDBisadocumentstore at its core. A JSONdocument containingmultiple entities embedded in asingledocumentcouldlookliketheexampleontheright.

CurlybracketssurroundJSONobjects,whichcancontainattributes.Attributesaregiven in pairs, key/value pairs. Each key and value are separated by a colon.

Attributepairsareseparatedbycommas.

Attribute keys are enclosed in double quote marks,which signifies a string. Keys are always string datatypesinJSON.Theyaredepictedintealbluehere.

The values can be of any type. In the example here,name, country and other nested attribute values arealsostrings,coloredingreen.Thealbumyearandsongdurationvaluesarenumbersandhighlightedinblue.

Squarebracketssurroundvaluelists.Inthisexample,albums, genresandsongsare sucharrays.Regardinggenres,thearrayscontainmultiplestringswithgenrenames.

Thearrayforalbumscontainstwonestedobjects,oneperalbum.Eachobjecthasfourattributes:Title,Year,Genres and Songs. The Songs array holds evenmoreobjects,onepertrackofeachalbum.Eachobjecthasatitleandadurationattribute.Bytheway, thedashedlinesdenoteomissions,tofittheexampledocumentonthispage.

Thisisacontrivedexamplewithallofthedataaboutanartistinasingledocument,todemonstrateinformationnesting.

Usingmultiplelevelsofnestingcanbeverypowerful,butit’snottheonlywaytostructure data. Based on the data model draft and ERM diagram from thebeginningofthischapter,therearefourentitiesinthenormalizedmodel:Artist;Album;Song;andGenre.

Eachtypeofentitycanbestoredinacollection:alloftheartistdocumentscanbestored inanArtistcollection;andalbumdocumentscanbestored inanAlbumcollection.Entitiesordocumentscanthenbelinkedinmultipleways.

Anadditionalcollectioncouldbecreated,tostoreanalbumdocumentkeyandasong document key, similar to a cross table, to be joined in a query based onmatching keys. It’s a permitted method, but not recommended. Instead, crosstables can be translated to edge collections almost unedited. Instead ofestablishingaconnectionwithapairofforeignkeys,the_fromand_toattributesofanedgecanbeusedtoexpresstherelationandopenthepossibilityforgraphtraversals.

It’snotnecessarytousethegraphmodelforeverything.Itmaynotbetheidealsolutionrelatedtoperformancetodothisforverylimited,fixeddepthtraversals—especiallyaone-steptraversalaswewouldhereforalbumsandtheassociatedgenres.

Embeddinggenredata into thealbumdocuments is apossible solution.Onlyasingle document read would be needed to retrieve the album information,includingthegenredata.Ontheotherhand, itbecomeshardertomaintainthedata.Forexample,renamingagenrewithoutintroducinginconsistencieswouldbetedious.Allalbumdocumentscontainingato-be-renamedgenrewouldneedtobemodified.Furthermore,therewouldn'tbeanefficientwaytoretrievealistofallgenres.

Aviablemiddlewayistohavetwocollections,oneforalbumsandoneforgenres,andstoreanarrayofgenredocumentkeysinthealbumdocumentsformany-to-manyrelations.Thiswaythereisacentralplacetoretrieveallavailablegenresandanindividualgenre.Therewouldbeonlyasingledocumenttomaintainallofthe information related to a genre in the Genre collection. The link between

albums and genres is established via the genreKeys attribute in each Albumdocumentandcanbeeasilyresolvedinaquery,withoutanadditionalcollection.Alookuporcollectionjoinofgenresleadstothesameresultsaswiththegraphmodel,butperformsbetter thana full graph traversal. Ifperformance is lessaconcern,though,youmayaswellusethegraphapproachsinceitresemblesmorecloselythewaywethinkofrelations.

Another consideration should be the amount of links between documents.Chancesarethatatleastsomeofthegenreswillbereferencedbythousandsorevenmillionsofalbums.Usingagraphmodelwouldresultinsomegenrevertices,having a huge amount of edges connected to it. Such vertices are called supernodes.Theyshouldbeavoided.Traversalsrunningintosupernodeswillhavetofollowalloftheseedges.Thecollectionjoinmodelscalesmuchbetter.Thedesiredapproach,though,canbechosenbytheuser,dependingontheusecase.

3.4 Comparison of Data Models

Data ismodeledas tables foranRDBMS. It’s commonpractice tonormalizeaninitialmodelso thatonearrivesat amodelwithout redundancieswithatomicvalues, in the sense that pieces of information aren’t being split into separatetablesandcolumns.Forexample,youwouldn’thaveasinglefieldcontaining"ToysintheAttic(1975)",withbothanalbumtitleandthereleaseyear. Instead,youwouldhavetwofields,onecontainingthetitleandtheothertheyear.Youwouldalsohavedistincttablesforeachentitytype.Thenormalizationprocessinvolvestheadditionofcrosstablestostorerelationsbetweenrecords.

Whilethereisaplethoraofdatatypesfromwhichtochoosebeforestoringdatainrelationalsystems,thereareonlyafewbasicdatatypesinArangoDBthatcanbecombinedtostructuredatainmanyways.Theyallowforsimpleandcomplexalike,independentforeachdocument.Everythingexpressiblewithtablescanbetranslated to JSON documents, without the need to specify in advance thestructureofdocuments.

A direct conversion of relational records to documents would result in JSONdocumentswithtop-levelattributesonly(i.e.,nonestedobjects).Thepossibility

of having sub-attributes and arrays, however, enables quite a different datamodelingapproach.Thisapproachpermitstheembeddingofdataindocumentsandeliminatessomeextratablesorcollections.Translatingcrosstablestoedgecollections isagreatwaytostoremany-to-manyrelationships inArangoDB, togain graph features such as shortest path computation and graph traversal.Distributed graph processing based on the Pregel computing model is alsoavailable starting inversion3.2ofArangoDB. If youdon'thaveaneed for realgraph,amodelwithcollectionjoinscanbechoseninstead.Multiplemodelscanbecombinedfreely.ArangoDBprovidesflexibility,allowingthedevelopertostrikeabalance between an understandable data model, fast development iterations,neededfeaturesandperformance.

4 Query Concepts

4.1 Introduction

Storingdataisonlypartofthefunctionofadatabasesystem.Theabilitytoretrievedata in variousways from a database iswhere an electronic database ismostuseful. Otherwise, youmay use a paper-based system aswell. To facilitate theretrievalofdata,aquerylanguageisprovided.RelationaldatabasesystemslikeMySQL use a StructuredQuery Language (i.e., SQL), while ArangoDB providesArangoQueryLanguage(i.e.,AQL).

4.2 SQL

SQLcannotonly retrieve records from thedatabase system, it canalso insert,update and delete records. Tables and databases can be created, altered anddropped,anduserpermissionscanbemanagedwithSQLstatements.Theseusesaregroupedandclassifiedbasedon theirsimilarities:data retrievalqueriesasDQL(i.e.,dataquerylanguage);inserting,updatinganddeletingdataasDML(i.e.,datamanipulationlanguage),structureandschemastatementasDDL(i.e.,datadefinitionlanguage);andcreatingusersandsettingpermissionsasDCL(i.e.,datacontrol language). There are also procedural elements in the form of storedproceduresformorecomplexlogic.

SQLbecameastandardin1986,withseveralrevisionsovertheyearsuptothecurrentSQL:2016standard.MostRDBMSimplementasubsetofthesestandardsoravariantwithadditionalvendor-specific featureswithadialectof thequerylanguage.Thus,otherthanthemostbasicSQLstatements,SQLcodethatworksinonesystemmaynotwork inanothersystem,withoutmodificationsbecauseofsyntaxandfeaturedifferences.

SQL comes with several language constructs to interact with databases. EachconstructorSQLstatement,readslikeanEnglishsentencewithafixedorderofclauses.Manyclausesareoptional.Aclauseisstartedbyakeyword,followedbyan expression. For example, to retrieve records from a database, a SELECTstatementisusedandspecifieswhattoreturn:

Eachclausemustappearinthedesignatedplaceifitisused.Forexample,itwouldbea syntaxerror if theWHERE clausewasenteredbefore theFROM clause.TheWHERE clause can beomitted to return unconditionally thedesired fields of allrecords.HAVINGallowsmoreconditionstobeadded,butcanonlybeusedafterGROUP BYasapost-filter.TheWHERE clausedescribes filterconditionsappliedbeforetheaggregationinthecombinationofboth.TheORDER BYclauseisusedtosortthematchingrecordsbyoneormorefieldsinascendingordescendingorder.

WedidnotincludeaLIMITclauseinthisgenericexampletodefineanoptionaloffsetandamaximumnumberofrecordstoreturn.TheLIMITclauseiscommonlyusedwithMySQL, but it’s actually non-standard. TSQL, used inMicrosoft SQL,supports the syntax SELECT TOP 10, for instance. There are many morevariations and some support for multiple syntaxes in various systems. TheSQL:2008standardprovides foraFETCH clause(e.g.,FETCH FIRST 10 ROWS ONLY),withvaryingsupportacrosssystems.

Themost common joiningof tables inSQL is the inner join,which returns theintersectionoftwosets.Acolumnofthefirstsetisusedtomatchthevaluesofatablewith a specified columnof another table, the second set. For instance, ifthere'sarecordwithaspecificvalue inaparticularcolumninonetable,andarecordwith the same value in a corresponding column in another table, thoserecordswiththeselectedcolumnsarereturned.

Unlessemp_id isdefinedsoastoallowonlyuniquevalues,thesameemployeeidentificationnumbermayoccurmultipletimesineithertable.ThiswillproduceadditionalresultsinSELECTstatements.AJOINisusedtoexpresstheothertableyouwanttojoin.AnINNER JOINsaysonlytoreturnrecordsinwhichthevaluesinthegivencolumnsareequal.Itcanalsobeexpressedlikethis:

If you want all of the records from the employees table are supposed to bereturned,regardlessofamatchintheemp_contact_infotable,aleftjoincanbeusedlikeso:

Basically,thetablebeingselectedisthetableontheleft,andtheonebeingjoinedthe table on the right. A left join will return every row in the left table (i.e.,employeeshere)thatmeettheconditionsoftheWHEREclause,iftherewereone,aswellasthematchingdataintherighttable.However,ifthereisn’tamatchingrowintherighttable,NULLvaluesaredisplayedforfieldsrelatedtoit.

4.3 AQL

AQLisclassifiedasDQLandDML,whichmeans itcancreate,read,updateanddeletedocuments.Itcannotbeusedtocreatecollectionsanddatabases,managepermissionsordefineschemas.Assuch,AQLhassomesimilaritiestoSQL,butalsodistinctdifferences,mainlybecauseit’sdesignedforcombiningmultiplemodelsandgraphtraversalsinparticular.

AQL was invented because when ArangoDB was created, there was no SQLstandardwhich covered itsmulti-model needs. A standard does still not existwhichcoversallthatisArangoDB,inparticulargraphs.ThisisprobablybecauseNoSQLsystems involveawiderangeofdataconceptsandpracticalgoals,withvastly different technological approaches. ArangoDB supports multiple datamodelswithitssinglequerylanguage,AQL.It’sacustom-tailoredfit.Becauseit'snot led by a standard, it can improve quicklyand freely, based onmarket anddeveloperneeds.

4.3.1 FOR Loop Construct

AQLhasalanguageconstructforloopingthroughdata,typicalinprogramminglanguagesmorethandatabasequery languages.Thisallowsdeveloperstobothqueryandprograminoneversatilelanguage,ratherthanhavetolearnandusetwo.It’sthecentralbuildingblockfordocumentretrieval.ThemostcommonformistheFORloopusedwithacollectionofdocuments,muchlikeSELECTinSQL.Itcan also be used to iterate through arrays, such as an array attribute, forintermediateresultsfromasubqueryoravariabledefinedinline.

4.3.2 High-Level Operations

Otheressentialhigh-leveloperationsareFILTER,SORTandLIMIT.Theycanbeenteredindifferentlocations,usuallywithvaryingresultsbasedontheprocessingorderdescribed by the query.You can definewhat a query should return in aRETURNstatement.Onecanchoosewhichpartsofadocumenttoreturn,or the

documentasawhole,optionallywithadditionaloralteredattributescomputed.

Datamodificationqueries(i.e.,INSERT,UPDATE,REPLACE,DELETE)don’trequireaRETURN statement. It’smandatoryonly indataaccessqueries. It canbeusednonetheless inmodification queries, such as for returning old or new state ofdocuments. Modification queries come in a few syntax variants, but thepredominantformsarelikethefollowing:

ThedifferencebetweenUPDATE andREPLACE is thatUPDATE allows forpartialmodifications,suchaschangingexistingattributesoraddingnewones,whereasREPLACEretainsonlythedocumentkey,andswitchestherestofthedocumentwiththeprovidednewcontent.

4.3.3 Query Optimization in AQL

In order to execute the above described queries, ArangoDB uses a two-phasequery optimization technique. In the first phase, the query is parsed andtransformed intoan internal representation, anabstract syntax tree (i.e.,AST).ThenoptimizerrulestransformtheASTwithoutchangingtheresult.Forinstance,theoptimizerwilltrytouseindexeswhereverpossible.

Thereareoftenseveralpossiblewaystoexecutethesamequery.Forexample,ina join itcouldstart thesearch inanyof the involvedcollections.Theoptimizerestimates what’s required to solve the query for each way possible, and willexecutetheonethatrequiresthelowestamountofoperations.IfyouarecurioustoseewhatArangoDBcreatesfromyourqueryyoucanchecktheexplainoutput.Thistoolcanhelptoimprovequeryperformance.

4.3.4 Combine All of the Models

Nowlet’sseewherewecanmakeuseofthemulti-modelapproachandcombineagraphsearchwithjoins.Let’sassumethatwehaveaverylargedatasetofsongsfromseveralyearsandseveralgenres.Inthiscase,somegenreswillmostlikelybesuper-nodes,verticeswithmanyconnectededges(e.g.,“Pop”).It’salwaysabadidearelatedtoperformancetoiteratethroughthemwithinaquery.

Considerthefolloweruserrequest:“Ijustlistenedtoasongcalled,TributeandI

likeditverymuch.IsuspectthattheremaybeothersongsofthesamegenreasthissongthatImightenjoy.So,Iwanttofindallofthealbumsofthesamegenrethatwerereleasedinthesameyear”.

Let’s break this into logical steps, utilizing the names we have assigned tocomponentsofourdatabase.We’lldothisinthepuregraphway,inthisorder:

1. StartwithSong,‘Tribute’2. Onestep(PartOf)tofindtheAlbum3. Onestep(Has)tofindtheGenre4. Onestepback(Has)tofindallotherAlbum’softhisGenre5. YearofotherAlbumisidenticaltoyearofAlbum

WritteninAQL,thequerywouldlooklikethis:

This query is straightforward and will find the solution. However, it has thedrawbackthatismost-likelytraversingoverasuper-nodeinGenre.Also,thelistofgenresisunlikelytogrowrapidly.Althoughwemayaddmanynewsongs,weseldomlyaddnewgenres.Therefore,itwouldbebetterfirsttoselectallAlbumsofthesameyearandthenvalidatethatthegenreisidentical.ThiswaywegetalimitedsetofAlbums,andeachhasonlyonegenre.Thatquerywouldberesilienttodatagrowth.

So,letusmodifyourquerytothefollowing:

1. StartwithSong,‘Tribute’2. Onestep(PartOf)tofindtheAlbum3. OneStep(Has)tofindtheGenre4. JoinallAlbumswithidenticalYear5. ForeachotherAlbum,onesteptofinditsgenre6. FilterallotherAlbumwherethegenreisdifferent

ThesestepscanbewritteninAQLlikethis:

4.4 Query Language Comparison

Atthecore,SQLandAQLarebothdeclarativelanguages,withfunctionsthatcanbecalledforadditionaloperations.ThekeydifferencebetweenSQLandAQListhatinSQLyoudescribetheresultyouwant,andinAQLyoudescribetheprocesstogetthere.TheconceptbehindAQLisclosertocoding,andthereforeeasierthanswitchingbetweenacodinglanguageandqueryinglanguage.

SQLsyntaxhasafixedstructure,whereashighleveloperationsinAQL,suchasfilters,canbeputinvariousplacesfordifferentpurposesandresults.TheflowismorelogicalthanadheringtoSQLstatementorder.Plus,high-leveloperationslikeFILTERandSORTaremoreversatile.FILTERequalsWHEREandHAVING inSQL,dependingonwhereitappears.SORTisadirectequivalentofORDER BYinSQL.LIMIT is the same as in MySQL, although there are some things to considerregardingscopeinconjunctionwithsubqueries.RETURNdoesn’texistinSQL,butyoucanspecifywhichfieldstoreturninSELECT.InSQLwhatistobereturnedisgivenearly in thestatement,whereas inAQL,RETURN isplacedat theendof aqueryandofsubqueries.

AggregationwithCOLLECT isn’t too different fromGROUP BY in SQL, but theexistenceofcompoundtypesinAQLmakesitverydifferent.WhileinAQLthereisadirectalternativetotheconstructofaGROUP BYclauseinSQL,withAQLtherearemanymoretwistsandwaystoaggregatedata.Forexample,alldocumentsthatfallintoagroupbasedonthegroupingcriteriacanbekeptandfurtherprocessed.With the COLLECT AGGREGATE clause, you can efficiently compute statisticalfigureswhilethedataisbeinggrouped.Groupingisfastevenwithoutanindex,usingahash-basedapproach.

FORloopsinalltheirvariantsareakeyconceptofAQL.TheymakeAQLfeelmore

likeaprogramminglanguagethanaquerylanguage.It’samajorbuildingblockandaddsasenseofflowtoqueries.

Joins,whichareheavilyusedinmostapplicationsbasedonanRDBMS,arealsopossible in AQL. Joins are expressed with nested FOR loops combined withFILTER,insteadofaseparatesyntaxforeachJOIN.It’squitesimilartothesyntaxforinnerjoinsinSQL,butwithoutaJOINstatement.Multipletypesofjoinscanbecarried outwith AQL, although itmay not be necessary to do so. An array ofdocumentkeystogetherwiththeDOCUMENT()functionisaviablealternativetoresolveone-to-manyrelationships.

Recursion is not possible in SQL, except with a stored procedure—which isbasicallyproceduralprogrammingandnotstrictlypartoftheSQLstandards;theyare largely vendor-specific. Using nested FOR loops in AQL allows for thetraversingofmultiplelevelsofattributeswithinadocument.Thegraphtraversalvariant of the FOR loops allows a query to traverse a graph with a definableminimumandmaximumdepthwitha fewsimple linesofcode.Thissametaskwould be extremely complex in a relational database system with a storedprocedure.

Comparing Relational Databases to a Native Multi …...Consiglio Nazionale delle Ricerche Istituto...

Documents

Transcript of Comparing Relational Databases to a Native Multi …...Consiglio Nazionale delle Ricerche Istituto...