BERT Goes Shopping - arXiv

BERT Goes ShoppingComparing Distributional Models for Product Representations

Federico BianchiBocconi University

Milano Italyfbianchiunibocconiitlowast

Bingqing YuCoveo

Montreal CAcyu2coveocom

Jacopo TagliabueCoveo AI Labs

New York United Statesjtagliabuecoveocomdagger

AbstractWord embeddings (eg word2vec) have beenapplied successfully to eCommerce productsthrough prod2vec Inspired by the recentperformance improvements on several NLPtasks brought by contextualized embeddingswe propose to transfer BERT-like architecturesto eCommerce our model ndash Prod2BERTndash is trained to generate representations ofproducts through masked session modelingThrough extensive experiments over multipleshops different tasks and a range of designchoices we systematically compare the ac-curacy of Prod2BERT and prod2vec embed-dings while Prod2BERT is found to be supe-rior in several scenarios we highlight the im-portance of resources and hyperparameters inthe best performing models Finally we pro-vide guidelines to practitioners for training em-beddings under a variety of computational anddata constraints

1 Introduction

Distributional semantics (Landauer and Dumais1997) is built on the assumption that the meaning ofa word is given by the contexts in which it appearsword embeddings obtained from co-occurrence pat-terns through word2vec (Mikolov et al 2013)proved to be both accurate by themselves in repre-senting lexical meaning and very useful as compo-nents of larger Natural Language Processing (NLP)architectures (Lample et al 2018) The empiricalsuccess and scalability of word2vec gave rise tomany domain-specific models (Ng 2017 Groverand Leskovec 2016 Yan et al 2017) in eCom-merce prod2vec is trained replacing words in asentence with product interactions in a shoppingsession (Grbovic et al 2015) eventually generat-ing vector representations of the products The key

lowastFederico and Bingqing contributed equally to this re-search

daggerCorresponding author

intuition is the same underlying word2vec ndash youcan tell a lot about a product by the company itkeeps (in shopping sessions) The model enjoyedimmediate success in the field and is now essentialto NLP and Information Retrieval (IR) use casesin eCommerce (Vasile et al 2016a Bianchi et al2020)

As a key improvement over word2vec the NLPcommunity has recently introduced contextualizedrepresentations in which a word like play wouldhave different embeddings depending on the gen-eral topic (eg a sentence about theater vs soccer)whereas in word2vec the word play is going tohave only one vector Transformer-based architec-tures (Vaswani et al 2017) in large-scale models- such as BERT (Devlin et al 2019) - achievedSOTA results in many tasks (Nozza et al 2020Rogers et al 2020) As Transformers are beingapplied outside of NLP (Chen et al 2020) it isnatural to ask whether we are missing a fruitfulanalogy with product representations It is a priorireasonable to think that a pair of sneakers can havedifferent representations depending on the shop-ping context is the user interested in buying theseshoes because they are running shoes or becausethese shoes are made by her favorite brand

In this work we explore the adaptation of BERT-like architectures to eCommerce through exten-sive experimentation on downstream tasks andempirical benchmarks on typical digital retailerswe discuss advantages and disadvantages of con-textualized embeddings when compared to tradi-tional prod2vec We summarize our main contribu-tions as follows

1 we propose and implement a BERT-basedcontextualized product embeddings model(hence Prod2BERT) which can be trainedwith online shopper behavioral data and pro-duce product embeddings to be leveraged by

arX

iv2

012

0980

7v2

[cs

CL

] 2

3 Ju

n 20

21

downstream tasks

2 we benchmark Prod2BERT against prod2vecembeddings showing the potential accuracygain of contextual representations across dif-ferent shops and data requirements By testingon shops that differ for traffic catalog anddata distribution we increase our confidencethat our findings are indeed applicable to avast class of typical retailers

3 we perform extensive experiments by vary-ing hyperparameters architectures and fine-tuning strategies We report detailed resultsfrom numerous evaluation tasks and finallyprovide recommendations on how to besttrade off accuracy with training cost

4 we share our code1 to help practitioners repli-cate our findings on other shops and improveon our benchmarks

11 Product Embeddings an IndustryPerspective

The eCommerce industry has been steadily grow-ing in recent years according to US Departmentof Commerce (2020) 16 of all retail transac-tions now occur online worldwide eCommerce isestimated to turn into a $45 trillion industry in2021 (Statista Research Department 2020) In-terest from researchers has been growing at thesame pace (Tsagkias et al 2020) stimulated bychallenging problems and by the large-scale im-pact that machine learning systems have in thespace (Pichestapong 2019) Within the fast adop-tion of deep learning methods in the field (Ma et al2020 Zhang et al 2020 Yuan et al 2020) prod-uct representations obtained through prod2vec playa key role in many neural architectures after train-ing a product space can be used directly (Vasileet al 2016b) as a part of larger systems for rec-ommendation (Tagliabue et al 2020b) or in down-stream NLPIR tasks (Tagliabue and Yu 2020)Combining the size of the market with the pastsuccess of NLP models in the space investigatingwhether Transformer-based architectures result insuperior product representations is both theoreti-cally interesting and practically important

Anticipating some of the themes below it isworth mentioning that our study sits at the intersec-tion of two important trends on one side neural

1Code available at httpsgithubcomvinidprodb

models typically show significant improvementsat large scale (Kaplan et al 2020) ndash by quantify-ing expected gains for ldquoreasonable-sizedrdquo shopsour results are relevant also outside a few publiccompanies (Tagliabue et al 2021) and allow for aprincipled trade-off between accuracy and ethicalconsiderations (Strubell et al 2019) on the otherside the rise of multi-tenant players2 makes sophis-ticated models potentially available to an unprece-dented number of shops ndash in this regard we designour methodology to include multiple shops in ourbenchmarks and report how training resources andaccuracy scale across deployments For these rea-sons we believe our findings will be interesting toa wide range of researchers and practitioners

2 Related Work

Distributional Models Word2vec (Mikolov et al2013) enjoyed great success in NLP thanks to itscomputational efficiency unsupervised nature andaccurate semantic content (Levy et al 2015 Al-Saqqa and Awajan 2019 Lample et al 2018) Re-cently models such as BERT (Devlin et al 2019)and RoBERTa (Liu et al 2019) shifted much ofthe community attention to Transformer architec-tures and their performance (Talmor and Berant2019 Vilares et al 2020) while it is increasinglyclear that big datasets (Kaplan et al 2020) andsubstantial computing resources play a role in theoverall accuracy of these architectures in our ex-periments we explicitly address robustness by i)varying model designs together with other hyper-parameters and ii) test on multiple shops differingin traffic industry and product catalog

Product Embeddings Prod2vec is a straightfor-ward adaptation to eCommerce of word2vec (Gr-bovic et al 2015) Product embeddings quicklybecame a fundamental component for recommenda-tion and personalization systems (Caselles-Dupreet al 2018 Tagliabue et al 2020a) as well asNLP-based predictions (Bianchi et al 2020) Tothe best of our knowledge this work is the first toexplicitly investigate whether Transformer-basedarchitectures deliver higher-quality product rep-resentations compared to non-contextual embed-dings Eschauzier (2020) uses Transformers on cart

2As an indication of the market opportunity in the spaceof AI-powered search and recommendations we recently wit-nessed Algolia (Techcrunch 2019a) and Lucidworks rais-ing 100M USD (Techcrunch 2019c) Coveo raising 227MCAD (Techcrunch 2019b) Bloomreach raising 115M USD(Techcrunch 2021)

co-occurrence patterns with the specific goal ofbasket completion ndash while similar in the maskingprocedure the breadth of the work and the evalua-tion methodology is very different as convincinglyargued by Requena et al (2020) benchmarkingmodels on unrealistic datasets make findings lessrelevant for practitioners outside of ldquoBig TechrdquoOur work features extensive tests on real-worlddatasets which are indeed representative of a largeportion of the mid-to-long tail of the market more-over we benchmark several fine-tuning strategiesfrom the latest NLP literature (Section 52) shar-ing ndash together with our code ndash important practicallessons for academia and industry peers The clos-est work in the literature as far as architecture goesis BERT4Rec (Sun et al 2019) ie a model basedon Transformers trained end-to-end for recommen-dations The focus of this work is not so muchthe gains induced by Transformers in sequencemodelling but instead is the quality of the rep-resentations obtained through unsupervised pre-training ndash while recommendations are importantthe breadth of prod2vec literature (Bianchi et al2021ba Tagliabue and Yu 2020) shows the needfor a more thorough and general assessment Ourmethodology helps uncover a tighter-than-expectedgap between the models in downstream tasks andour industry-specific benchmarks allow us to drawnovel conclusions on optimal model design acrossa variety of scenarios and to give practitioners ac-tionable insights for deployment

3 Prod2BERT

31 Overview

The Prod2BERT model is taking inspiration fromBERT architecture and aims to learn context-dependent vector representation of products fromonline session logs By considering a shoppingsession as a ldquosentencerdquo and the products shoppersinteract with as ldquowordsrdquo we can transfer maskedlanguage modeling (MLM) from NLP to eCom-merce Framing sessions as sentences is a naturalmodelling choice for several reasons first it mim-ics the successful architecture of prod2vec secondby exploiting BERT bi-directional nature each pre-diction of a masked tokenproduct will make useof past and future shopping choices if a shoppingjourney is (typically) a progression of intent fromexploration to purchase (Harbich et al 2017) itseems natural that sequential modelling may cap-ture relevant dimensions in the underlying vocabu-

Figure 1 Overall architecture of Prod2BERT pre-trained on MLM task

larycatalog Once trained Prod2BERT becomescapable of predicting masked tokens as well asproviding context-specific product embeddings fordownstream tasks

32 Model Architecture

As shown in Figure 1 Prod2BERT is based ona transformed based architecture Vaswani et al(2017) emulating the successful BERT modelPlease note that different from BERTrsquos originalimplementation a white-space tokenizer is firstused to split an input session into tokens each onerepresenting a product ID tokens are combinedwith positional encodings via addition and fed intoa stack of self-attention layers where each layercontains a block for multi-head attention followedby a simple feed forward network After obtain-ing the output from the last self-attention layer thevectors corresponding to the masked tokens passthrough a softmax to generate the final predictions

33 Training Objective

Similar to Liu et al (2019) Sun et al (2019) wetrain Prod2BERT from scratch with the MLM ob-jective A random portion of the tokens (ie theproduct IDs) in the original sequence are chosenfor possible replacements with the MASK tokenand the masked version of the sequence is fed intothe model as input Figure 2 shows qualitativelythe relevant data transformations from the original

Figure 2 Transformation of sequential data from theoriginal data generating process ndash ie a shopping ses-sion ndash to telemetry data collected by the SDK to themasked sequence fed into Prod2BERT

shopping session to the telemetry data to the finalmasking sequence The target output sequence isexactly the original sequence without any maskingthus the training objective is to predict the originalvalue of the masked tokens based on the contextprovided by their surrounding unmasked tokensThe model learns to minimize categorical cross-entropy loss taking into account only the predictedmasked tokens ie the output of the non-maskedtokens is discarded for back-propagation

34 Hyperparameters and Design ChoicesThere is growing literature investigating how dif-ferent hyperparameters and architectural choicescan affect Transformer-based models For exam-ple Lan et al (2020) observed diminishing returnswhen increasing the number of layers after a cer-tain point Liu et al (2019) showed improved per-formance when modifying masking strategy andusing duplicated data finally Kaplan et al (2020)reported slightly different findings from previousstudies on factors influencing Transformers perfor-mance Hence it is worth studying the role of hy-perparameters and model designs for Prod2BERTin order to narrow down which settings are the bestgiven the specific target of our work ie productrepresentations Table 1 shows the relevant hy-perparameter and design variants for Prod2BERTfollowing improvement in data generalization re-ported by Liu et al (2019) when duplicated = 1we augmented the original dataset repeating eachsession 5 times3 We set the embedding size to64 after preliminary optimizations as other valuesoffered no improvements we report results only

3This procedure ensures that each sequence can be maskedin 5 different ways during training

Parameter Values

epochs [e] 10 20 50 100 layers [l] 4 8masking probability [m] 015 025duplicated [d] 1 0

Table 1 Hyperparameters and their ranges

for one size

4 Methods

41 Prod2vec a Baseline ModelWe benchmark Prod2BERT against the industrystandard prod2vec (Grbovic et al 2015) Morespecifically we train a CBOW model with neg-ative sampling over shopping sessions (Mikolovet al 2013) Since the role of hyperparame-ters in prod2vec has been extensively studied be-fore (Caselles-Dupre et al 2018) we prepare em-beddings according to the best practices in Bianchiet al (2020) and employ the following config-uration window = 15 iterations = 30ns exponent = 075 dimensions = [48 100]While prod2vec is chosen because of our focuson the quality of the learned representations ndash andnot just performance on sequential inference perse ndash it is worth nothing that kNN (Latifi et al2020) over appropriate spaces is also a surprisinglyhard baseline to beat in many practical recommen-dation settings It is worth mentioning that forboth prod2vec and Prod2BERT we are mainly in-terested in producing a dense space capturing thelatent similarity between SKUs other importantrelationships between products (substitution (Zuoet al 2020) hierarchy (Nickel and Kiela 2017)etc) may require different embedding techniques(or extensions such as interaction-specific embed-dings (Zhao et al 2020))

42 DatasetWe collected search logs and detailed shopping ses-sions from two partnering shops Shop A and ShopB similarly to the dataset released by Requena et al(2020) we employ the standard definition of ldquoses-sionrdquo from Google Analytics4 with a total of fivedifferent product actions tracked detail add pur-chase remove click5 Shop A and Shop B are

4httpssupportgooglecomanalyticsanswer2731565hl=en

5Please note that as in many previous embedding stud-ies (Caselles-Dupre et al 2018 Bianchi et al 2020) action

Shop Sessions Products 5075 pct

Shop A 1970832 38486 5 7Shop B 3992794 102942 5 7

Table 2 Descriptive statistics for the training datasetpct shows 50th and 75th percentiles of the sessionlength

mid-sized digital shops with revenues between 25and 100 millions USDyear however they differin many aspects from traffic to conversion rate tocatalog structure Shop A is in the sport apparelcategory whereas Shop B is in home improvementSessions for training are sampled with undisclosedprobability from the period of March-December2019 testing sessions are a completely disjointdataset from January 2020 After pre-processing6descriptive statistics for the training set for Shop Aand Shop B are detailed in Table 2 For fairness ofcomparison the exact same datasets are used forboth Prod2BERT and prod2vec

Testing on fine-grained recent data from multi-ple shops is important to support the internal valid-ity (ie ldquois this improvement due to the model orsome underlying data quirksrdquo) and the externalvalidity (ie ldquocan this method be applied robustlyacross deployments eg Tagliabue et al (2020b)rdquo)of our findings

5 Experiments

51 Experiment 1 Next Event Prediction

Next Event Prediction (NEP) is our first evaluationtask since it is a standard way to evaluate the qual-ity of product representations (Letham et al 2013Caselles-Dupre et al 2018) briefly NEP consistsin predicting the next action the shopper is going toperform given her past actions Hence in the caseof Prod2BERT we mask the last item of every ses-sion and fit the sequence as input to a pre-trainedProd2BERT model7 Provided with the modelrsquosoutput sequence we take the top K most likelyvalues for the masked token and perform compar-ison with the true interaction As for prod2vecwe perform the NEP task by following indus-try best practices (Bianchi et al 2020) given a

type is not considered when preparing session for training6We only keep sessions that have between 3 and 20 prod-

uct interactions to eliminate unreasonably short sessions andensure computation efficiency

7Note that this is similar to the word prediction task forcloze sentences in the NLP literature (Petroni et al 2019)

trained prod2vec we take all the before-last itemsin a session to construct a session vector by aver-age pooling and use kNN to predict the last item8Following industry standards nDCGK (Mitraand Craswell 2018) with K = 10 is the chosenmetric9 and all tests ran on 10 000 testing cases(test set is randomly sampled first and then sharedacross Prod2BERT and prod2vec to guarantee afair comparison)

511 Results

Model Config Shop A Shop B

Prod2BERT e = 10 l = 4m = 025 d = 0

0433 0259

Prod2BERT e = 5 l = 4m = 025 d = 1

0458 0282

Prod2BERT e = 10 l = 8m = 025 d = 0

0027 0260

Prod2BERT e = 100 l = 4m = 025 d = 0

0427 0255

Prod2BERT e = 10 l = 4m = 015 d = 0

0416 0242

prod2vec dimension = 48 0326 0214


Table 3 nDCG10 on NEP task for both shops withProd2BERT and prod2vec (bold are best scores forProd2BERT underline are best scores for prod2vec)

Table 3 reports results on the NEP task by high-lighting some key configurations that led to com-petitive performances Prod2BERT is significantlysuperior to prod2vec scoring up to 40 higherthan the best prod2vec configurations Since shop-ping sessions are significantly shorter than sentencelengths in Devlin et al (2019) we found that chang-ing masking probability from 015 (value fromstandard BERT) to 025 consistently improved per-formance by making the training more effectiveAs for the number of layers similar to Lan et al(2020) we found that adding layers helps only upuntil a point with l = 8 training Prod2BERT withmore layers resulted in a catastrophic drop in modelperformance for the smaller Shop A however the

8Previous work using LSTM in NEP (Tagliabue et al2020b) showed some improvements over kNN howeverthe differences cannot explain the gap we have found be-tween prod2vec and Prod2BERT Hence kNN is chosen herefor consistency with the relevant literature

9We also tracked HR10 but given insights were similarwe omitted it for brevity in what follows

Model Time A-B Cost A-B

prod2vec 4-20 0006-0033$Prod2BERT 240-1200 4896-2448$

Table 4 Time (minutes) and cost (USD) for trainingone model instance per shop prod2vec is trained ona c4large instance Prod2BERT is trained (10 epochs)on a Tesla V100 16GB GPU from p38xlarge instance

same model trained on the bigger Shop B obtaineda small boost Finally duplicating training datahas been shown to bring consistent improvementswhile keeping all other hyperparameters constantusing duplicated data results in an up to 9 in-crease in nDCG10 not to mention that after only5 training epochs the model outperforms other con-figurations trained for 10 epochs or more

While encouraging the performance gap be-tween Prod2BERT and prod2vec is consistent withTransformers performance on sequential tasks (Sunet al 2019) However as argued in Section 11product representations are used as input to manydownstream systems making it essential to evalu-ate how the learned embeddings generalize outsideof the pure sequential setting Our second experi-ment is therefore designed to test how well contex-tual representations transfer to other eCommercetasks helping us to assess the accuracycost trade-off when difference in training resources betweenthe two models is significant as reported by Ta-ble 4 the difference (in USD) between prod2vecand Prod2BERT is several order of magnitudes10

52 Experiment 2 Intent Prediction

A crucial element in the success of Transformer-based language model is the possibility of adapt-ing the representation learned through pre-trainingto new tasks for example the original Devlinet al (2019) fine-tuned the pre-trained modelon 11 downstream NLP tasks However thepractical significance of these results is still un-clear on one hand Li et al (2020) Reimers andGurevych (2019) observed that sometimes BERTcontextual embeddings can underperform a sim-ple GloVe (Pennington et al 2014) model on the

10Costs are from official AWS pricing with 010USDh for the c4large (httpsawsamazoncomitec2pricingon-demand) and 1224 USDh forthe p38xlarge (httpsawsamazoncomitec2instance-typesp3) While obviously cost optimiza-tions are possible the ldquonaiverdquo pricing is a good proxy toappreciate the difference between the two methods

other Mosbach et al (2020) highlights catastrophicforgetting vanishing gradients and data varianceas important factors in practical failures Hencegiven the range of downstream applications andthe active debate on transferability in NLP we in-vestigate how Prod2BERT representations performwhen used in the intent prediction task

Intent prediction is the task of guessing whethera shopping session will eventually end in the useradding items to the cart (signaling purchasing in-tention) Since small increases in conversion canbe translated into massive revenue boosting thistask is both a crucial problem in the industry andan active area of research (Toth et al 2017 Re-quena et al 2020) To implement the intent pre-diction task we randomly sample from our dataset20 000 sessions ending with an add-to-cart actionsand 20 000 sessions without add-to-cart and splitthe resulting dataset for training validation andtest Hence given the list of previous products thata user has interacted with the goal of the intentmodel is to predict whether an add-to-cart eventwill happen or not We experimented with severaladaptation techniques inspired by the most recentNLP literature (Peters et al 2019 Li et al 2020)

1 Feature extraction (static) we extract the con-textual representations from a target hiddenlayer of pre-trained Prod2BERT and throughaverage pooling feed them as input to a multi-layer perceptron (MLP) classifier to generatethe binary prediction In addition to alternat-ing between the first hidden layer (enc 0) tothe last hidden layer (enc 3) we also triedconcatenation (concat) ie combining em-beddings of all hidden layers via concatena-tion before average pooling

2 Feature extraction (learned) we implementa linear weighted combination of all hiddenlayers (wal) with learnable parameters asinput features to the MLP model (Peters et al2019)

3 Fine-tuning we take the pre-trained model upuntil the last hidden layer and add the MLPclassifier on top for intent prediction (fine-tune) During training both Prod2BERT andtask-specific parameters are trainable

As for our baseline ie prod2vec we implementthe intent prediction task by encoding each productwithin a session with its prod2vec embeddings and

Model Method Shop Accuracy

Prod2BERT enc 0 Shop B 0567Prod2BERT enc 3 Shop B 0547Prod2BERT concat Shop B 0553Prod2BERT wal Shop B 0543Prod2BERT fine-tune Shop B 0560

prod2vec - Shop B 0558

Prod2BERT enc 0 Shop A 0593prod2vec - Shop A 0602

Table 5 Accuracy scores in the intent prediction task(best scores for each shop in bold)

feeding them to a LSTM network (so that it canlearn sequential information) followed by a binaryclassifier to obtain the final prediction

521 Results

From our experiments Table 5 highlights themost interesting results obtained from adaptingto the new task the best-performing Prod2BERTand prod2vec models from NEP As a first consid-eration the shallowest layer of Prod2BERT for fea-ture extraction outperforms all other layers andeven beats concatenation and weighted averagestrategies11 Second the quality of contextual rep-resentations of Prod2BERT is highly dependent onthe amount of data used in the pre-training phaseComparing Table 3 with Table 5 even though themodel delivers strong results in the NEP task onShop A its performance on the intent predictiontask is weak as it remains inferior to prod2vecacross all settings In other words the limitedamount of traffic from Shop A is not enough tolet Prod2BERT form high-quality product repre-sentations however the model can still effectivelyperform well on the NEP task especially sincethe nature of NEP is closely aligned with the pre-training task Third fine-tuning instability is en-countered and has a severe impact on model perfor-mance Since the amount of data available for in-tent prediction is not nearly as important as the datautilized for pre-training Prod2BERT overfittingproved to be a challenging aspect throughout ourfine-tuning experiments Fourth by comparing theresults of our best method against the model learntwith prod2vec embeddings we observed prod2vec

11This is consistent with Peters et al (2019) which statesthat inner layers of a pre-trained BERT encode more transfer-able features

embeddings can only provide limited values for in-tent estimation and the LSTM-based model stops toimprove very quickly in contrast the features pro-vided by Prod2BERT embeddings seem to encodemore valuable information allowing the model tobe trained for longer epochs and eventually reach-ing a higher accuracy score As a more generalconsideration ndash reinforced by a qualitative visualassessment of clusters in the resulting vector spacendash the performance gap is very small especially con-sidering that long training and extensive optimiza-tions are needed to take advantage of the contextualembeddings

6 Conclusion and Future Work

Inspired by the success of Transformer-basedmodels in NLP this work explores contextualizedproduct representations as trained through aBERT-inspired neural network Prod2BERTBy thoroughly benchmarking Prod2BERTagainst prod2vec in a multi-shop setting wewere able to uncover important insights on therelationship between hyperparameters adaptationstrategies and eCommerce performances on oneside and we could quantify for the first time qualitygains across different deployment scenarios onthe other If we were to sum up our findings forinterested practitioners these are our highlights

1 Generally speaking our experimental set-ting proved that pre-training Prod2BERT withMask Language Modeling can be applied suc-cessfully to sequential prediction problemsin eCommerce These results provide inde-pendent confirmation for the findings in Sunet al (2019) where BERT was used forin-session recommendations over academicdatasets However the tighter gap on down-stream tasks suggests that Transformersrsquo abil-ity to model long-range dependencies maybe more important than pure representationalquality in the NEP task as also confirmed byhuman inspection of the product spaces (seeAppendix A for comparative t-SNE plots)

2 Our investigation on adapting pre-trained con-textual embeddings for downstream tasks fea-tured several strategies in feature extractionand fine-tuning Our analysis showed thatfeature-based adaptation leads to the peak per-formance as compared to its fine-tuning coun-terpart

3 Dataset size does indeed matter as evi-dent from the performance difference in Ta-ble 5 Prod2BERT shows bigger gains withthe largest amount of training data avail-able Considering the amount of resourcesneeded to train and optimize Prod2BERT(Section 511) the gains of contextualizedembedding may not be worth the investmentfor shops outside the top 5k in the Alexaranking12 on the other hand our resultsdemonstrate that with careful optimizationshops with a large user base and significantresources may achieve superior results withProd2BERT

While our findings are encouraging there arestill many interesting questions to tackle whenpushing Prod2BERT further In particular our re-sults require a more detailed discussion with re-spect to the success of BERT for textual represen-tations with a focus on the differences betweenwords and products for example an important as-pect of BERT is the tokenizer that splits words intosubwords this component is absent in our settingbecause there exists no straightforward concept ofldquosub-productrdquo ndash while far from conclusive it shouldbe noted that our preliminary experiments using cat-egories as ldquomorphemesrdquo that attach to product iden-tifiers did not produce significant improvementsWe leave the answer to these questions ndash as wellas the possibility of adapting Prod2BERT to evenmore tasks ndash to the next iteration of this project

As a parting note we would like to emphasizethat Prod2BERT has been so far the largest and(economically) more significant experiment runby Coveo while we do believe that the methodol-ogy and findings here presented have significantpractical value for the community we also recog-nize that for example not all possible ablation stud-ies were performed in the present work As Bianchiand Hovy (2021) describe replicating and compar-ing some models is rapidly becoming prohibitive interm of costs for both companies and universitiesEven if the debate on the social impact of large-scale models often feels very complex (Thompsonet al 2020 Bender et al 2021) ndash and sometimesremoved from our day-to-day duties ndash Prod2BERTgave us a glimpse of what unequal access to re-sources may mean in more meaningful contextsWhile we (as in ldquohumanity werdquo) try to find a solu-tion we (as in ldquoauthors werdquo) may find temporary

12See httpswwwalexacomtopsites

solace knowing that good olrsquo prod2vec is still prettycompetitive

7 Ethical Considerations

User data has been collected by Coveo in the pro-cess of providing business services data is col-lected and processed in an anonymized fashion incompliance with existing legislation In particularthe target dataset uses only anonymous uuids tolabel events and as such it does not contain anyinformation that can be linked to physical entities

ReferencesSamar Al-Saqqa and Arafat Awajan 2019 The use

of word2vec model in sentiment analysis A sur-vey In Proceedings of the 2019 International Con-ference on Artificial Intelligence Robotics and Con-trol pages 39ndash43

Emily M Bender Timnit Gebru Angelina McMillan-Major and Shmargaret Shmitchell 2021 On thedangers of stochastic parrots Can language modelsbe too big In Proceedings of the 2021 ACM Confer-ence on Fairness Accountability and TransparencyFAccT rsquo21 page 610ndash623 New York NY USA As-sociation for Computing Machinery

Federico Bianchi Ciro Greco and Jacopo Tagliabue2021a Language in a (search) box Grounding lan-guage learning in real-world human-machine inter-action In Proceedings of the 2021 Conference ofthe North American Chapter of the Association forComputational Linguistics Human Language Tech-nologies pages 4409ndash4415 Online Association forComputational Linguistics

Federico Bianchi and Dirk Hovy 2021 On the gapbetween adoption and understanding in nlp In Find-ings of the Association for Computational Linguis-tics ACL-IJCNLP 2021 Association for Computa-tional Linguistics

Federico Bianchi Jacopo Tagliabue and Bingqing Yu2021b Query2Prod2Vec Grounded word em-beddings for eCommerce In Proceedings of the2021 Conference of the North American Chapterof the Association for Computational LinguisticsHuman Language Technologies Industry Paperspages 154ndash162 Online Association for Computa-tional Linguistics

Federico Bianchi Jacopo Tagliabue Bingqing YuLuca Bigon and Ciro Greco 2020 Fantastic embed-dings and how to align them Zero-shot inference ina multi-shop scenario In Proceedings of the SIGIR2020 eCom workshop July 2020 Virtual Event pub-lished at httpceur-wsorg (to appear)

Hugo Caselles-Dupre Florian Lesaint and JimenaRoyo-Letelier 2018 Word2vec applied to recom-mendation hyperparameters matter In Proceedings

of the 12th ACM Conference on Recommender Sys-tems RecSys 2018 Vancouver BC Canada Octo-ber 2-7 2018 pages 352ndash356 ACM

Mark Chen Alec Radford Rewon Child Jeffrey WuHeewoo Jun David Luan and Ilya Sutskever 2020Generative pretraining from pixels In Proceed-ings of the 37th International Conference on Ma-chine Learning ICML 2020 13-18 July 2020 Vir-tual Event volume 119 of Proceedings of MachineLearning Research pages 1691ndash1703 PMLR

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

Ruben Eschauzier 2020 ProdBERT Shopping basketcompletion using bidirectional encoder representa-tions from transformers In Bachelorrsquos Thesis

Mihajlo Grbovic Vladan Radosavljevic NemanjaDjuric Narayan Bhamidipati Jaikit Savla VarunBhagwan and Doug Sharp 2015 E-commercein your inbox Product recommendations at scaleIn Proceedings of the 21th ACM SIGKDD Inter-national Conference on Knowledge Discovery andData Mining Sydney NSW Australia August 10-132015 pages 1809ndash1818 ACM

Aditya Grover and Jure Leskovec 2016 node2vecScalable feature learning for networks In Proceed-ings of the 22nd ACM SIGKDD International Con-ference on Knowledge Discovery and Data MiningSan Francisco CA USA August 13-17 2016 pages855ndash864 ACM

Matthieu Harbich Gael Bernard P BerkesB Garbinato and P Andritsos 2017 Discov-ering customer journey maps using a mixture ofmarkov models In SIMPDA

Jared Kaplan Sam McCandlish Tom HenighanTom B Brown Benjamin Chess Rewon Child ScottGray Alec Radford Jeffrey Wu and Dario Amodei2020 Scaling laws for neural language modelsarXiv preprint arXiv200108361

Guillaume Lample Alexis Conneau MarcrsquoAurelioRanzato Ludovic Denoyer and Herve Jegou 2018Word translation without parallel data In 6th Inter-national Conference on Learning RepresentationsICLR 2018 Vancouver BC Canada April 30 - May3 2018 Conference Track Proceedings OpenRe-viewnet

Zhenzhong Lan Mingda Chen Sebastian GoodmanKevin Gimpel Piyush Sharma and Radu Soricut2020 ALBERT A lite BERT for self-supervisedlearning of language representations In 8th Inter-national Conference on Learning Representations

ICLR 2020 Addis Ababa Ethiopia April 26-302020 OpenReviewnet

Thomas K Landauer and Susan T Dumais 1997 Asolution to platorsquos problem The latent semanticanalysis theory of acquisition induction and rep-resentation of knowledge Psychological review104(2)211

Sara Latifi Noemi Mauro and D Jannach 2020Session-aware recommendation A surprising questfor the state-of-the-art ArXiv abs201103424

Benjamin Letham Cynthia Rudin and David Madigan2013 Sequential event prediction Machine learn-ing 93(2-3)357ndash380

Omer Levy Yoav Goldberg and Ido Dagan 2015 Im-proving distributional similarity with lessons learnedfrom word embeddings Transactions of the Associ-ation for Computational Linguistics 3211ndash225

Bohan Li Hao Zhou Junxian He Mingxuan WangYiming Yang and Lei Li 2020 On the sentenceembeddings from pre-trained language models InProceedings of the 2020 Conference on EmpiricalMethods in Natural Language Processing (EMNLP)pages 9119ndash9130 Online Association for Computa-tional Linguistics

Y Liu Myle Ott Naman Goyal Jingfei Du Man-dar Joshi Danqi Chen Omer Levy M LewisLuke Zettlemoyer and Veselin Stoyanov 2019RoBERTa A robustly optimized BERT pretrainingapproach ArXiv abs190711692

Yifei Ma Balakrishnan (Murali) NarayanaswamyHaibin Lin and Hao Ding 2020 Temporal-contextual recommendation in real-time In KDD

rsquo20 The 26th ACM SIGKDD Conference on Knowl-edge Discovery and Data Mining Virtual Event CAUSA August 23-27 2020 pages 2291ndash2299 ACM

Laurens van der Maaten and Geoffrey Hinton 2008Viualizing data using t-sne Journal of MachineLearning Research 92579ndash2605

Tomas Mikolov Kai Chen Gregory S Corrado andJeffrey Dean 2013 Efficient estimation of word rep-resentations in vector space CoRR abs13013781

Bhaskar Mitra and Nick Craswell 2018 An introduc-tion to neural information retrieval Foundationsand Trendsreg in Information Retrieval 13(1)1ndash126

Marius Mosbach Maksym Andriushchenko andD Klakow 2020 On the stability of fine-tuningbert Misconceptions explanations and strong base-lines ArXiv abs200604884

Patrick Ng 2017 dna2vec Consistent vector rep-resentations of variable-length k-mers ArXivabs170106279

Maximilian Nickel and Douwe Kiela 2017 Poincareembeddings for learning hierarchical representa-tions In Advances in Neural Information Process-ing Systems 30 Annual Conference on Neural In-formation Processing Systems 2017 December 4-92017 Long Beach CA USA pages 6338ndash6347

Debora Nozza Federico Bianchi and Dirk Hovy2020 What the [MASK] making sense oflanguage-specific BERT models arXiv preprintarXiv200302912

Jeffrey Pennington Richard Socher and ChristopherManning 2014 GloVe Global vectors for wordrepresentation In Proceedings of the 2014 Confer-ence on Empirical Methods in Natural LanguageProcessing (EMNLP) pages 1532ndash1543 DohaQatar Association for Computational Linguistics

Matthew E Peters Sebastian Ruder and Noah ASmith 2019 To tune or not to tune adapting pre-trained representations to diverse tasks In Proceed-ings of the 4th Workshop on Representation Learn-ing for NLP (RepL4NLP-2019) pages 7ndash14 Flo-rence Italy Association for Computational Linguis-tics

Fabio Petroni Tim Rocktaschel Sebastian RiedelPatrick Lewis Anton Bakhtin Yuxiang Wu andAlexander Miller 2019 Language models as knowl-edge bases In Proceedings of the 2019 Confer-ence on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP) pages 2463ndash2473 Hong Kong China As-sociation for Computational Linguistics

Ann Pichestapong 2019 Website personalization Im-proving conversion with personalized shopping ex-periences

Nils Reimers and Iryna Gurevych 2019 Sentence-BERT Sentence embeddings using Siamese BERT-networks In Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference on Natu-ral Language Processing (EMNLP-IJCNLP) pages3982ndash3992 Hong Kong China Association forComputational Linguistics

Borja Requena Giovanni Cassani Jacopo TagliabueCiro Greco and Lucas Lacasa 2020 Shopper intentprediction from clickstream e-commerce data withminimal browsing information Scientific Reports202016983

Anna Rogers Olga Kovaleva and Anna Rumshisky2020 A primer in BERTology What we knowabout how BERT works Transactions of the Associ-ation for Computational Linguistics 8842ndash866

Statista Research Department 2020 Global retail e-commerce sales 2014-2023

Emma Strubell Ananya Ganesh and Andrew McCal-lum 2019 Energy and policy considerations fordeep learning in NLP In Proceedings of the 57thAnnual Meeting of the Association for Computa-tional Linguistics pages 3645ndash3650 Florence ItalyAssociation for Computational Linguistics

Fei Sun Jun Liu Jian Wu Changhua Pei Xiao LinWenwu Ou and Peng Jiang 2019 Bert4rec Se-quential recommendation with bidirectional encoderrepresentations from transformer In Proceedings ofthe 28th ACM International Conference on Informa-tion and Knowledge Management CIKM 2019 Bei-jing China November 3-7 2019 pages 1441ndash1450ACM

Jacopo Tagliabue Ciro Greco Jean-Francis RoyBingqing Yu Patrick John Chia Federico Bianchiand Giovanni Cassani 2021 Sigir 2021 e-commerce workshop data challenge

Jacopo Tagliabue and Bingqing Yu 2020 Shoppingin the multiverse A counterfactual approach to in-session attribution ArXiv abs200710087

Jacopo Tagliabue Bingqing Yu and Marie Beaulieu2020a How to grow a (product) tree Personalizedcategory suggestions for eCommerce type-ahead InProceedings of The 3rd Workshop on e-Commerceand NLP pages 7ndash18 Seattle WA USA Associa-tion for Computational Linguistics

Jacopo Tagliabue Bingqing Yu and Federico Bianchi2020b The embeddings that came in from the coldImproving vectors for new and rare products withcontent-based inference In RecSys 2020 Four-teenth ACM Conference on Recommender SystemsVirtual Event Brazil September 22-26 2020 pages577ndash578 ACM

Alon Talmor and Jonathan Berant 2019 MultiQA Anempirical investigation of generalization and trans-fer in reading comprehension In Proceedings of the57th Annual Meeting of the Association for Com-putational Linguistics pages 4911ndash4921 FlorenceItaly Association for Computational Linguistics

Techcrunch 2019a Algolia finds $110m from acceland salesforce

Techcrunch 2019b Coveo raises 227m at 1b valua-tion

Techcrunch 2019c Lucidworks raises $100m to ex-pand in ai finds

Techcrunch 2021 Bloomreach raises $150m on$900m valuation and acquires exponea

Neil C Thompson Kristjan Greenewald Keeheon Leeand G Manso 2020 The computational limits ofdeep learning ArXiv abs200705558

Arthur Toth L Tan G Fabbrizio and Ankur Datta2017 Predicting shopping behavior with mixture ofrnns In eCOMSIGIR

Manos Tsagkias Tracy Holloway King SuryaKallumadi Vanessa Murdock and Maarten de Ri-jke 2020 Challenges and research opportunities inecommerce search and recommendations In SIGIRForum volume 54

US Department of Commerce 2020 Us census bu-reau news

Flavian Vasile Elena Smirnova and Alexis Conneau2016a Meta-prod2vec Product embeddings usingside-information for recommendation In Proceed-ings of the 10th ACM Conference on RecommenderSystems Boston MA USA September 15-19 2016pages 225ndash232 ACM

Flavian Vasile Elena Smirnova and Alexis Conneau2016b Meta-prod2vec Product embeddings usingside-information for recommendation In Proceed-ings of the 10th ACM Conference on RecommenderSystems Boston MA USA September 15-19 2016pages 225ndash232 ACM

Ashish Vaswani Noam Shazeer Niki Parmar JakobUszkoreit Llion Jones Aidan N Gomez LukaszKaiser and Illia Polosukhin 2017 Attention is allyou need In Advances in Neural Information Pro-cessing Systems 30 Annual Conference on NeuralInformation Processing Systems 2017 December 4-9 2017 Long Beach CA USA pages 5998ndash6008

David Vilares Michalina Strzyz Anders Soslashgaard andCarlos Grsquoomez-Rodrrsquoiguez 2020 Parsing as pre-training ArXiv abs200201685

Bo Yan Krzysztof Janowicz Gengchen Mai and SongGao 2017 From itdl to place2vec Reasoningabout place type similarity and relatedness by learn-ing embeddings from augmented spatial contexts InProceedings of the 25th ACM SIGSPATIAL Interna-tional Conference on Advances in Geographic Infor-mation Systems SIGSPATIAL rsquo17 New York NYUSA Association for Computing Machinery

Fajie Yuan Xiangnan He Alexandros Karatzoglouand Liguang Zhang 2020 Parameter-efficient trans-fer from sequential behaviors for user modeling andrecommendation In Proceedings of the 43rd Inter-national ACM SIGIR conference on research anddevelopment in Information Retrieval SIGIR 2020Virtual Event China July 25-30 2020 pages 1469ndash1478 ACM

Han Zhang Songlin Wang Kang Zhang ZhilingTang Yunjiang Jiang Yun Xiao Weipeng Yan andWenyun Yang 2020 Towards personalized andsemantic retrieval An end-to-end solution for e-commerce search via embedding learning In Pro-ceedings of the 43rd International ACM SIGIR con-ference on research and development in InformationRetrieval SIGIR 2020 Virtual Event China July 25-30 2020 pages 2407ndash2416 ACM

Xiaoting Zhao Raphael Louca Diane Hu and LiangjieHong 2020 The difference between a click and a

cart-add Learning interaction-specific embeddingsIn Companion Proceedings of the Web Conference2020 WWW rsquo20 page 454ndash460 New York NYUSA Association for Computing Machinery

Zhen Zuo L Wang Michinari Momma W WangYikai Ni Jianfeng Lin and Y Sun 2020 A flexi-ble large-scale similar product identification systemin e-commerce

A Visualization of Session Embeddings

Figures 3 to 6 represent browsing sessions pro-jected in two-dimensions with t-SNE (van derMaaten and Hinton 2008) for each browsing ses-sion we retrieve the corresponding type (eg shoespants etc) of each product in the session and usemajority voting to assign the most frequent prod-uct type to the session Hence the dots are color-coded by product type and each dot represents a

Figure 3 T-SNE plot of browsing session vector spacefrom Shop A and built with the first hidden layer ofpre-trained Prod2BERT

Figure 4 T-SNE plot of browsing session vector spacefrom Shop A and built with prod2vec embeddings

Figure 5 T-SNE plot of browsing session vector spacefrom Shop B and built with the first hidden layer ofpre-trained Prod2BERT

Figure 6 T-SNE plot of browsing session vector spacefrom Shop B and built with prod2vec embeddings

unique session from our logs It is easy to noticethat first both contextual and non-contextual em-beddings built with a smaller amount of data ieFigures 3 and 4 from Shop A have a less clearseparation between clusters moreover the qualityof Prod2BERT seems even lower than prod2vec asthere exists a larger central area where all types areheavily overlapping Second comparing Figure 5with Figure 6 both Prod2BERT and prod2vec im-prove which confirms Prod2BERT given enoughpre-training data is able to deliver better separa-tions in terms of product types and more meaning-ful representations

downstream tasks

2 we benchmark Prod2BERT against prod2vecembeddings showing the potential accuracygain of contextual representations across dif-ferent shops and data requirements By testingon shops that differ for traffic catalog anddata distribution we increase our confidencethat our findings are indeed applicable to avast class of typical retailers

3 we perform extensive experiments by vary-ing hyperparameters architectures and fine-tuning strategies We report detailed resultsfrom numerous evaluation tasks and finallyprovide recommendations on how to besttrade off accuracy with training cost

4 we share our code1 to help practitioners repli-cate our findings on other shops and improveon our benchmarks

11 Product Embeddings an IndustryPerspective

The eCommerce industry has been steadily grow-ing in recent years according to US Departmentof Commerce (2020) 16 of all retail transac-tions now occur online worldwide eCommerce isestimated to turn into a $45 trillion industry in2021 (Statista Research Department 2020) In-terest from researchers has been growing at thesame pace (Tsagkias et al 2020) stimulated bychallenging problems and by the large-scale im-pact that machine learning systems have in thespace (Pichestapong 2019) Within the fast adop-tion of deep learning methods in the field (Ma et al2020 Zhang et al 2020 Yuan et al 2020) prod-uct representations obtained through prod2vec playa key role in many neural architectures after train-ing a product space can be used directly (Vasileet al 2016b) as a part of larger systems for rec-ommendation (Tagliabue et al 2020b) or in down-stream NLPIR tasks (Tagliabue and Yu 2020)Combining the size of the market with the pastsuccess of NLP models in the space investigatingwhether Transformer-based architectures result insuperior product representations is both theoreti-cally interesting and practically important

Anticipating some of the themes below it isworth mentioning that our study sits at the intersec-tion of two important trends on one side neural

1Code available at httpsgithubcomvinidprodb

models typically show significant improvementsat large scale (Kaplan et al 2020) ndash by quantify-ing expected gains for ldquoreasonable-sizedrdquo shopsour results are relevant also outside a few publiccompanies (Tagliabue et al 2021) and allow for aprincipled trade-off between accuracy and ethicalconsiderations (Strubell et al 2019) on the otherside the rise of multi-tenant players2 makes sophis-ticated models potentially available to an unprece-dented number of shops ndash in this regard we designour methodology to include multiple shops in ourbenchmarks and report how training resources andaccuracy scale across deployments For these rea-sons we believe our findings will be interesting toa wide range of researchers and practitioners

2 Related Work

Distributional Models Word2vec (Mikolov et al2013) enjoyed great success in NLP thanks to itscomputational efficiency unsupervised nature andaccurate semantic content (Levy et al 2015 Al-Saqqa and Awajan 2019 Lample et al 2018) Re-cently models such as BERT (Devlin et al 2019)and RoBERTa (Liu et al 2019) shifted much ofthe community attention to Transformer architec-tures and their performance (Talmor and Berant2019 Vilares et al 2020) while it is increasinglyclear that big datasets (Kaplan et al 2020) andsubstantial computing resources play a role in theoverall accuracy of these architectures in our ex-periments we explicitly address robustness by i)varying model designs together with other hyper-parameters and ii) test on multiple shops differingin traffic industry and product catalog

Product Embeddings Prod2vec is a straightfor-ward adaptation to eCommerce of word2vec (Gr-bovic et al 2015) Product embeddings quicklybecame a fundamental component for recommenda-tion and personalization systems (Caselles-Dupreet al 2018 Tagliabue et al 2020a) as well asNLP-based predictions (Bianchi et al 2020) Tothe best of our knowledge this work is the first toexplicitly investigate whether Transformer-basedarchitectures deliver higher-quality product rep-resentations compared to non-contextual embed-dings Eschauzier (2020) uses Transformers on cart

2As an indication of the market opportunity in the spaceof AI-powered search and recommendations we recently wit-nessed Algolia (Techcrunch 2019a) and Lucidworks rais-ing 100M USD (Techcrunch 2019c) Coveo raising 227MCAD (Techcrunch 2019b) Bloomreach raising 115M USD(Techcrunch 2021)


3 Prod2BERT

31 Overview












Parameter Values



for one size

4 Methods






Shop A 1970832 38486 5 7Shop B 3992794 102942 5 7




5 Experiments







511 Results


Prod2BERT e = 10 l = 4m = 025 d = 0

0433 0259

Prod2BERT e = 5 l = 4m = 025 d = 1

0458 0282

Prod2BERT e = 10 l = 8m = 025 d = 0

0027 0260

Prod2BERT e = 100 l = 4m = 025 d = 0

0427 0255

Prod2BERT e = 10 l = 4m = 015 d = 0

0416 0242



























521 Results


























































































3 Prod2BERT

31 Overview












Parameter Values



for one size

4 Methods






Shop A 1970832 38486 5 7Shop B 3992794 102942 5 7




5 Experiments







511 Results


Prod2BERT e = 10 l = 4m = 025 d = 0

0433 0259

Prod2BERT e = 5 l = 4m = 025 d = 1

0458 0282

Prod2BERT e = 10 l = 8m = 025 d = 0

0027 0260

Prod2BERT e = 100 l = 4m = 025 d = 0

0427 0255

Prod2BERT e = 10 l = 4m = 015 d = 0

0416 0242



























521 Results





























































































Parameter Values



for one size

4 Methods






Shop A 1970832 38486 5 7Shop B 3992794 102942 5 7




5 Experiments







511 Results


Prod2BERT e = 10 l = 4m = 025 d = 0

0433 0259

Prod2BERT e = 5 l = 4m = 025 d = 1

0458 0282

Prod2BERT e = 10 l = 8m = 025 d = 0

0027 0260

Prod2BERT e = 100 l = 4m = 025 d = 0

0427 0255

Prod2BERT e = 10 l = 4m = 015 d = 0

0416 0242



























521 Results


























































































Shop A 1970832 38486 5 7Shop B 3992794 102942 5 7




5 Experiments







511 Results


Prod2BERT e = 10 l = 4m = 025 d = 0

0433 0259

Prod2BERT e = 5 l = 4m = 025 d = 1

0458 0282

Prod2BERT e = 10 l = 8m = 025 d = 0

0027 0260

Prod2BERT e = 100 l = 4m = 025 d = 0

0427 0255

Prod2BERT e = 10 l = 4m = 015 d = 0

0416 0242



























521 Results













































































































521 Results

























































































BERT Goes Shopping - arXiv

Documents

Transcript of BERT Goes Shopping - arXiv