
Hierarchical Training: Scaling Deep Recommendation Models on Large CPU Clusters

Yuzhen Huang, Xiaohan Wei, Xing Wang, Jiyan Yang, Bor-Yiing Su, Shivam Bharuka, Dhruv Choudhary, Zewei Jiang, Hai Zheng, Jack Langman

Facebook, 1 Hacker Way, Menlo Park, CA 94065
{yuzhenhuang,ubimeteor,xingwang,chocjy,boryiingsu,shivamb,choudharydhruv,zeweijiang,haizheng,youknowjack}@fb.com

ABSTRACT

Neural network based recommendation models are widely used to power many internet-scale applications including product recommendation and feed ranking. As the models become more complex and more training data is required during training, improving the training scalability of these recommendation models becomes an urgent need. However, improving the scalability without sacrificing the model quality is challenging. In this paper, we conduct an in-depth analysis of the scalability bottleneck in the existing training architecture on large scale CPU clusters. Based on these observations, we propose a new training architecture called Hierarchical Training, which exploits both data parallelism and model parallelism for the neural network part of the model within a group. We implement hierarchical training with a two-layer design: a tagging system that decides the operator placement and a net transformation system that materializes the training plans, and integrate hierarchical training into the existing training stack. We propose several optimizations to improve the scalability of hierarchical training including model architecture optimization, communication compression, and various system-level improvements. Extensive experiments at massive scale demonstrate that hierarchical training can speed up distributed recommendation model training by 1.9x without model quality drop.

CCS CONCEPTS

• Computing methodologies → Distributed algorithms; • Computer systems organization → Distributed architectures.

KEYWORDS

distributed training, optimization, system for machine learning

ACM Reference Format:
Yuzhen Huang, Xiaohan Wei, Xing Wang, Jiyan Yang, Bor-Yiing Su, Shivam Bharuka, Dhruv Choudhary, Zewei Jiang, Hai Zheng, Jack Langman. 2021. Hierarchical Training: Scaling Deep Recommendation Models on Large CPU Clusters. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3447548.3467084


1 INTRODUCTION

The wide adoption of recommendation and personalization applications, including apps/news/friends/videos recommendation, group suggestions, feed ranking, etc., has greatly improved the Internet and app experience [7, 9, 15, 49]. With the recent advent of deep learning, most of these personalization workloads are powered by neural network based machine learning models, e.g., DLRM [31] and Wide & Deep [6], for better recommendation accuracy. At Facebook, the training of these recommendation models takes up more than 50% of the total AI training cycles [2].

In these recommendation models [6, 13, 28, 31, 50], embeddings and higher-order interactions are used to learn from the sparse categorical features, and then a deep neural network is applied to improve the generalization of the models. The embedding tables learned for the large number of sparse categorical features are in general memory-intensive and can consume up to tens of terabytes of memory [15, 36, 47]. This makes the recommendation models different from the content understanding models used in computer vision, speech recognition, etc., which are mainly compute-intensive and can be better accelerated by GPUs. These recommendation models can be trained efficiently using large scale CPU clusters, where each training job contains tens to hundreds of CPU nodes at Facebook [15, 17]. A hybrid scheme of model parallelism and data parallelism is used to train the sparse part and the dense part of these recommendation models [48]. More specifically, the embedding tables of the sparse categorical features are partitioned onto a dedicated set of parameter servers which are accessed in a model parallel manner, while the deep neural network part (i.e., the dense part) of the model is replicated across all trainers and is trained using data parallelism.

To further improve the user experience and provide more accurate and satisfactory recommendations, the deep learning models are becoming larger and more complex. For example, more complicated layers, e.g., deep attention, skip layers, etc., and wider and heavier layers are added to the models to improve model quality [38, 41]. At the same time, more training data are used to train these complex recommendation models. The growing model size and data size call for improving the training scalability of the large scale distributed training system.

The existing distributed training architecture presented above has been demonstrated to have good scalability in terms of the training throughput [48]. To increase the number of sparse categorical features, one can add more sparse parameter servers to the system to hold the additional embedding tables. To boost the training throughput, more trainers can be added to improve the processing power. However, increasing the training throughput does not come for free in the existing architecture. When increasing the number of trainers in the system, we observe a non-negligible model quality drop which blocks us from scaling out.

Figure 1: The DLRM-like Model. Sparse features are processed by the sparse layer (embedding lookups) and dense features by the dense layer (FCs); their outputs are combined in the interaction layer, followed by the final layer of FCs and a sigmoid that produces the prediction.

In this paper, we identify through experiments that in the existing system, increasing the number of dense replicas hurts the model quality severely, but increasing the parallelism and staleness within one dense replica does not degrade the model quality much. Based on these observations, we propose a new distributed training architecture called Hierarchical Training to greatly improve the scalability. In hierarchical training, we replace a trainer with a group of training nodes, which we call a group. This setup greatly improves the processing power within a group while at the same time keeping the number of dense replicas small. To realize hierarchical training in our existing training stack, we design a two-layer system: (1) the Tagging System decides the operator placement strategy in a group using a rule-based tagging process and an iterative greedy refinement process; (2) the Net Transformation System is the back-end that handles the net partitioning and replication, and finally materializes the training plans for all nodes.

Our main contributions can be summarized as follows:

• We conduct an in-depth analysis of the scalability limitations of the current distributed training architecture on recommendation models and identify a potential opportunity to increase the training throughput while at the same time preserving the model quality. (Section 2)

• We design a novel architecture called Hierarchical Training (HT) based on these observations and implement it with a two-layer system. (Section 3)

• We propose a set of optimizations for HT including model architecture optimization and communication compression. We highlight challenges we faced when shipping complex models with HT. (Section 4)

• We conduct a comprehensive evaluation of HT on massive scale recommendation workloads. (Section 5)

2 BACKGROUND AND MOTIVATION

2.1 DLRM

The models powering most of the deep learning recommendation workloads at Facebook are in the form of DLRM [31]. As shown in Figure 1, a DLRM-like model is composed of four major layers. The sparse layer contains the embedding tables which are used to process the sparse categorical features. One can scale the models by adding more sparse features. The dense layer is several MLPs (multilayer perceptrons) processing the continuous features (i.e., the dense features). The interaction layer takes the output of the dense layer and the sparse layer and computes the second-order interactions among them by taking the dot product between all pairwise combinations of the embedding vectors and the processed dense features. Then the final layer of MLPs is applied on the interactions, and a sigmoid function is applied to obtain the final prediction. Note that this is a high-level description of the DLRM-like models, and each layer can be more complicated in reality.

Figure 2: Existing EASGD-based Architecture. Each trainer holds a dense replica and performs embedding lookups against the shared sparse PSs; the dense replicas are synchronized with the global dense parameters on the dense PSs via EASGD.

The model can be expressed as a directed graph in the context of Caffe2, where each node represents an operator and each directed edge represents a data dependency between two operators [3]. For example, an FC operator is followed by a Relu operator. The model graph is also known as the training net. It contains the forward and backward computation operators for one batch of the training data. The training process iteratively executes the training net until the model converges or a certain number of samples has been processed. A typical model graph for DLRM-like models has hundreds to thousands of operators.
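To make the operator-graph view concrete, the following is a minimal sketch of a train net as a list of operators connected by blob names; the Op structure and blob names here are illustrative, not the actual Caffe2 NetDef API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Op:
    type: str           # e.g., "FC", "Relu", "FCGradient"
    inputs: List[str]   # input blob names
    outputs: List[str]  # output blob names

# A tiny forward/backward fragment of a train net (illustrative blob names).
train_net = [
    Op("FC", ["dense_in", "fc_w", "fc_b"], ["fc_out"]),
    Op("Relu", ["fc_out"], ["relu_out"]),
    Op("ReluGradient", ["relu_out", "relu_out_grad"], ["fc_out_grad"]),
    Op("FCGradient", ["dense_in", "fc_w", "fc_out_grad"],
       ["fc_w_grad", "fc_b_grad", "dense_in_grad"]),
]

def data_edges(net):
    """Directed edges (producer index -> consumer index) implied by blob names."""
    producer = {blob: i for i, op in enumerate(net) for blob in op.outputs}
    return [(producer[b], j)
            for j, op in enumerate(net)
            for b in op.inputs if b in producer]

print(data_edges(train_net))  # [(0, 1), (1, 2), (2, 3)]
```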

2.2 The Existing Training Architecture

Here we describe how we map the DLRM-like models to the existing training system at Facebook. In the existing training architecture, there are three main types of nodes: trainers, sparse parameter servers, and dense parameter servers, as shown in Figure 2. From the system perspective, there are two parts in a DLRM-like model: the sparse part, which is the sparse layer holding the embedding tables as mentioned in Section 2.1, and the dense part, which contains everything else in the model, including the dense layer, the interaction layer, and the final layer. This split exists because the system scales the sparse part and the dense part individually using different mechanisms. The sparse part (the sparse embedding tables) is partitioned onto a set of dedicated sparse parameter servers (sparse PSs). These PSs are shared by all trainers and can easily be scaled by adding more machines when more embedding tables are added. The dense part, on the other hand, is replicated on each trainer. In other words, each trainer holds one copy of the dense parameters from the dense part of the model. It accesses and updates the local copy of the dense parameters through local training. The synchronization of the dense parameters across all trainers is executed in the background through the dense parameter servers (dense PSs) using the ShadowSync framework [48] with the EASGD algorithm [46]. This allows the system to scale linearly by adding more trainers, as the synchronization happens in the background and never blocks the iterative training process.

Figure 3: Effect of Inter-trainer Asynchronicity. (a) Training throughput (kEPS) vs. the number of trainers (10 to 40); (b) relative loss vs. the number of samples (billions) for 20, 30, and 40 trainers.

To fully utilize the multiple CPU cores of each machine, trainers are multi-threaded and each trainer iteratively processes a number of copies of the dense part of the training net. Although the dense part of the training net has multiple copies/replicas on one trainer, there is only one dense parameter copy on each trainer, shared by all training net replicas on the same trainer. The replicas of the training net are executed concurrently by a thread pool, and they access and update the shared local dense parameters in a lock-free manner (similar to Hogwild!-style updates [33]). This design introduces asynchronicity and staleness in each trainer, because when a thread/net updates the parameters after reading them, other threads/nets may have already updated them. We found that this kind of asynchronicity is acceptable within a certain range.

During the training execution, each net in each trainer executes the following logic iteratively: read a batch of data, send requests to the sparse PSs to perform embedding lookup operations, run the forward/backward computation on the local copy of the dense parameters, and finally update the local dense parameters as well as the embedding tables on the sparse PSs. In the background, each trainer has a dedicated net talking to the global dense parameter servers to synchronize the copy of the dense parameters asynchronously using the EASGD algorithm [46]. For simplicity, we call the existing training architecture the EASGD-based training system.
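A minimal sketch of this per-net loop, with hypothetical helper names (reader, sparse_ps, model, and optimizer are stand-ins, not the internal Caffe2 API); the background EASGD net is summarized in the trailing comment.

```python
def run_net_replica(reader, sparse_ps, model, optimizer, num_iters):
    """One concurrent net replica on a trainer (hypothetical object interfaces)."""
    for _ in range(num_iters):
        batch = reader.read_batch()
        # Model-parallel sparse part: fetch embedding rows from the sparse PSs.
        emb = sparse_ps.lookup(batch.sparse_ids)
        # Dense part: forward/backward against the trainer's single shared
        # local copy of the dense parameters (read lock-free, Hogwild!-style).
        loss = model.forward(batch.dense_features, emb, batch.labels)
        dense_grads, emb_grads = model.backward(loss)
        # Lock-free update of the shared local dense parameters ...
        optimizer.apply(model.dense_params, dense_grads)
        # ... and of the touched embedding rows on the sparse PSs.
        sparse_ps.push_gradients(batch.sparse_ids, emb_grads)

# In the background, a separate dedicated net per trainer periodically runs an
# EASGD step against the global dense PSs, roughly:
#   delta         = alpha * (local_dense - global_dense)
#   local_dense  -= delta
#   global_dense += delta
```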

The system expresses both model parallelism and data parallelism: model parallelism is used to partition and place the embedding tables on the sparse PSs; data parallelism is used within a trainer and across trainers, as multiple net replicas within each trainer process different batches of the data concurrently.

2.3 Asynchronicity Analysis in Existing System

To improve the model quality, more complex models and more training data are being explored. This calls for increasing the training throughput of the system, i.e., improving the training scalability, while at the same time preserving the model quality. Following the definition from ShadowSync [48], we use Examples Per Second (EPS), the number of examples processed by the system per second, to evaluate the throughput of a training system.

Here we analyze the asynchronicity in the EASGD-based system. There are two types of asynchronicity for the dense part: inter-trainer and intra-trainer. Inter-trainer asynchronicity is introduced by the background EASGD synchronization algorithm: because EASGD runs in the background, the dense parameter communication does not block the forward/backward training. Intra-trainer asynchronicity is incurred by the asynchronous updates within one trainer. In each trainer, there are multiple replicated nets executing at the same time, and they access and update the same local copy of the dense parameters asynchronously. We define the staleness of the dense parameters introduced here as the number of writes conducted by other net replicas between the time a net reads a parameter (during the forward pass) and the time it writes it back (during the backward pass). The maximum dense staleness for a trainer is bounded by the number of concurrent net replicas.

Figure 4: Effect of Intra-trainer Asynchronicity. (a) Training throughput (kEPS) vs. the number of net replicas (12 to 96); (b) relative loss vs. the number of samples (billions) for 24, 48, and 96 replicas.

Intuitively, both inter-trainer asynchronicity and intra-trainer asynchronicity may deteriorate the model quality to different extents, and we conduct experiments to better understand their effects. We first analyze the effect of inter-trainer asynchronicity by changing the number of trainers in the EASGD-based system. As shown in Figure 3a, the training throughput grows linearly with the number of trainers. However, we observed non-trivial model quality degradation when scaling out, as shown in Figure 3b (the lower the better). Specifically, with 40 trainers, we observe a 0.25% loss increase (relative loss compared with the 10-trainer case), which is significant. We believe this is due to the nature of the background model averaging algorithm: the model quality drops considerably when there are many dense replicas in the system. Besides, adding dense replicas makes each trainer see less training data, given that the total amount of data fed into the system is unchanged.

Observation 2.1. The EASGD-based system incurs a non-negligible model quality drop when increasing the number of trainers.

We then conduct experiments to test the effect of intra-trainer asynchronicity, as shown in Figure 4. Although the training throughput does not increase because each trainer has only 18 physical CPU cores, we found that increasing the number of concurrent nets within a trainer does not severely hurt the model quality. The loss value increases by around 0.02% for 96 nets compared with 12 nets when training with 9 billion examples, which is much smaller than the degradation brought by inter-trainer asynchronicity.

Observation 2.2. The EASGD-based system is not sensitive to intra-trainer asynchronicity, i.e., increasing the number of net replicas within a trainer will not severely hurt the model quality.

Based on these observations, we conclude that the model quality is more sensitive to inter-trainer asynchronicity than to intra-trainer asynchronicity. This motivates us to redesign the system and introduce hierarchical training.

Figure 5: Hierarchical Training Architecture. The model's operators are partitioned onto the trainers and group PSs within each group; each group jointly holds one dense replica, and the group PSs synchronize the dense parameters with the global dense PSs.

3 DESIGN

3.1 Hierarchical Training

We design Hierarchical Training (HT) to increase the processing power of a "trainer" while keeping the total number of "trainers", i.e., the total number of dense replicas, small. Concretely, we replace a trainer in the EASGD-based system with a group of nodes, which we call a group. A group consists of two types of nodes: a set of trainers and a set of group parameter servers (group PSs). The model graph (i.e., the training net) is partitioned across the group PSs and trainers within a group. The overall architecture is shown in Figure 5, where we have 2 groups and each group has 2 group PSs and 3 trainers. The trainers read data batches, conduct embedding lookups with the sparse PSs, and drive the iterative training loop as in the EASGD-based system (the sparse PSs are omitted in the figure for simplicity). The group PSs within a group participate in the computation, and they jointly store one copy of the dense parameters of the model; the dense copy in each group is partitioned among the group PSs. In this architecture, the total amount of data parallelism is num_trainers × num_concurrent_nets_per_trainer, the same as in the EASGD-based system. However, we only create num_groups dense replicas instead of num_trainers dense replicas as in the EASGD-based system.

Figure 6: Hierarchical Training System Components. An input config is fed to the tagging system (rule-based tagging followed by greedy refinement), which produces a tagged train net; the net transformation system (single-group and multi-group net generation) then materializes the distributed nets for the group PSs, trainers, and dense PSs.

3.1.1 Computation and Communication Pattern. We introduce the computation and communication pattern within a group for hierarchical training. We first consider the partitioning of one net replica. In each group, the training net replica is partitioned into k + 1 parts if there are k group PSs per group. Each group PS gets one partition of the net, and all the trainers within a group get a copy of the last partition. In other words, operators on trainers are replicated while operators on group PSs are partitioned; see Figure 5 for an example with k = 2: in each group, each group PS holds one operator, and each trainer holds the remaining 5 operators. Trainers control the training loop. When one or more operators are placed on a group PS, a trainer sends the input tensor to that group PS for execution. The group PS then finishes all the local execution, and the output is returned to the trainer that initiated the computation. As all the concurrent net replicas of all trainers within a group access and update the same copy of the dense parameters stored on the group PSs, if there are t trainers in a group, the maximum staleness on the dense parameters is t × r, where r is the number of concurrent nets per trainer. This effectively means we increase the number of threads in the Hogwild!-style update within a group in hierarchical training.
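A minimal sketch of the trainer-side view of this pattern, assuming a hypothetical RPC client per group PS; the helper names are illustrative, and the staleness bound simply restates t × r.

```python
class RemoteSubNet:
    """The partition of the train net owned by one group PS (hypothetical RPC wrapper)."""
    def __init__(self, group_ps_client, subnet_id):
        self.client = group_ps_client
        self.subnet_id = subnet_id

    def run(self, input_tensors):
        # The trainer ships the inputs; the group PS executes its local operators
        # against the group-shared dense parameters and returns the outputs to
        # the trainer that initiated the call.
        return self.client.execute(self.subnet_id, input_tensors)

def max_dense_staleness(trainers_per_group: int, nets_per_trainer: int) -> int:
    # All t * r concurrent nets in a group read and update the same dense copy
    # on the group PSs, so the Hogwild!-style staleness bound grows to t * r.
    return trainers_per_group * nets_per_trainer
```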

Following the EASGD-based training system, hierarchical training uses background averaging algorithms, e.g., EASGD, to synchronize the local dense parameter copies among different groups. The communication happens between the group PSs and the global dense PSs, as all the dense parameters are stored on the group PSs. Note that other background synchronization algorithms introduced in ShadowSync [48] can also be used here.

3.1.2 Parallelism Analysis. Hierarchical training introduces model parallelism within a group, as it distributes the operators in a net to different group PSs. At the same time, it also increases the data parallelism degree for a dense replica in a group, as HT replaces a trainer with a group of nodes, which results in more parallelism. HT also implies pipeline parallelism, as the group PSs in a group can be viewed as pipeline workers, each responsible for the computation of a certain part of the model. They also work on different batches of the data concurrently due to the data parallel execution.

3.2 System Implementation

We now address the following questions for hierarchical training: (1) How do we realize hierarchical training in the existing system? (2) How do we decide the net/graph partitioning for nodes in a group, i.e., which operators should be on trainers and which operators should be on group PSs? (3) As hierarchical training introduces extra communication overhead within a group, how do we minimize the communication overhead?

We answer these questions through a two-layer system design for hierarchical training, as shown in Figure 6. The first layer is the tagging system for operator placement on group PSs and trainers. It takes in a user config and generates a single tagged train net. Each operator in the train net is tagged with a node_name field, indicating which device this operator will run on. The second layer is the net transformation system, which takes the tagged train net from the tagging system as input, applies net transformations to the net, and finally generates the distributed nets for each device, e.g., trainers, group PSs, dense PSs, etc., in the system. We describe the hierarchical training implementation in the Caffe2 stack [3], while the design is general and can be applied to other commonly used deep learning frameworks, e.g., PyTorch [32] and TensorFlow [1].
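As a rough illustration of the hand-off between the two layers, the sketch below takes a net whose operators carry node_name tags and groups the operators per device, replicating trainer-tagged ops across all trainers in a group; the tag strings and the (op, node_name) representation are illustrative, not the actual Caffe2 mechanism.

```python
from collections import defaultdict

def split_tagged_net(tagged_net, trainer_names):
    """tagged_net: list of (op, node_name) pairs, with node_name such as
    "group_ps:0", "group_ps:1", or "trainer" (illustrative tag scheme)."""
    per_device = defaultdict(list)
    for op, node_name in tagged_net:
        if node_name == "trainer":
            # Data parallelism: trainer-tagged operators are replicated on
            # every trainer of the group.
            for trainer in trainer_names:
                per_device[trainer].append(op)
        else:
            # Model parallelism: group-PS-tagged operators are partitioned,
            # each placed on exactly one group PS.
            per_device[node_name].append(op)
    return per_device  # one distributed net (operator list) per device
```

Generating the nets for the remaining groups then amounts to replicating these per-device nets with the host names adjusted, as described in Section 3.2.2.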

3.2.1 The Tagging System. The tagging system decides the placement of the operators in the train net. Specifically, we need to decide (1) whether an operator should be placed on trainers or group PSs; and (2) if an operator is going to be placed on a group PS, which group PS it should be placed on. We consider the placement in one group, as the placements for all groups are identical. The goal of the tagging system is to generate an operator placement strategy such that: (1) the communication between trainers and group PSs is minimized; (2) the computation on trainers and group PSs is balanced; and (3) trainers and group PSs will not run out of memory.

The problem can be modeled as a mixed integer linear programming problem and solved by a solver like Xpress [43]. However, the train net can have up to thousands of operators, and solving the problem with such a solver at this scale is infeasible. To obtain a reasonable solution, we develop a rule-based tagging process based on prior knowledge of the models and a greedy refinement process to refine the operator placement produced by the rule-based process.

The rule-based tagging process works as follows. We consider the operator placement problem as a bi-partition problem that puts operators on trainers or group PSs. First, we tag all operators related to the initialization of the parameter weights, e.g., FC weights and biases, so that the initialization runs on the group PSs. This step ensures that there is only one copy of the parameters in each group. We also tag all the related operators in the same modules, e.g., the corresponding FC operators and FCGradient operators. Second, based on prior knowledge of the deep learning recommendation models, we develop several rules to tag some operators on the trainers. Third, we run a connected component detection algorithm on the train net to identify several connected components, each corresponding to a set of operators running on one group PS. Finally, we allocate these components to different group PSs in a way that balances the size of the dense weights on each group PS. The rule-based tagging process is not optimal but gives us a reasonably good initial operator placement. We can improve the placement by plugging in more advanced rules based on knowledge of the models.

We then apply a greedy refinement process to further tune the operator placement plan generated by the above rule-based process. The refinement runs as follows. Given the current placement, it identifies the device that is the bottleneck of the system, i.e., the device with the highest resource utilization; we consider three dimensions of utilization: CPU, network, and memory. The process then tries to move one of the operators from the bottleneck device to another device if the move can alleviate the bottleneck of the system. The algorithm runs the above process iteratively until the placement is balanced or a pre-defined search time limit is reached.
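A minimal sketch of this refinement loop, under the assumption that per-device CPU/network/memory utilization can be re-estimated for a tentative placement; estimate_util is a stand-in cost model, and the move search is simplified to picking the best single-operator move per iteration.

```python
def greedy_refine(placement, estimate_util, max_steps=100):
    """placement: dict mapping op -> device; estimate_util(placement) returns
    {device: {"cpu": ..., "net": ..., "mem": ...}} as fractional utilization."""
    for _ in range(max_steps):
        util = estimate_util(placement)
        # The bottleneck is the device with the highest utilization in any dimension.
        bottleneck, peak = max(((d, max(u.values())) for d, u in util.items()),
                               key=lambda x: x[1])
        best_move, best_peak = None, peak
        for op, dev in placement.items():
            if dev != bottleneck:
                continue
            for target in util:
                if target == bottleneck:
                    continue
                trial = {**placement, op: target}
                trial_peak = max(max(u.values())
                                 for u in estimate_util(trial).values())
                if trial_peak < best_peak:  # keep the move only if it relieves the bottleneck
                    best_move, best_peak = (op, target), trial_peak
        if best_move is None:               # balanced enough: no single move helps
            break
        moved_op, target_dev = best_move
        placement[moved_op] = target_dev
    return placement
```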

The tagging process is a combination of a knowledge-based process and a greedy process, but in practice it is sufficient to yield a satisfactory operator placement plan. We show the effectiveness of the tagging process in Section 5.3.

3.2.2 The Net Transformation System. We develop a net transformation system to transform a tagged training net into distributed training nets running on different hosts. The primary goal here is to partition some parts of the net to express model parallelism, and to replicate other parts to express data parallelism. The tagging system adds annotations/tags to the train net. With the annotations, the net transformation framework can infer the model parallelism and data parallelism schemes used for different parts of the net. For computations that are related to embedding lookups, the corresponding operators are partitioned and placed on the available sparse PSs. The hierarchical training architecture expresses model parallelism within a group and data parallelism among the groups. Therefore, we partition the train net and place operators on the group PSs based on the tagged train net. The remaining operators, which run on the trainers, are replicated and placed on all trainers. For the ShadowSync EASGD synchronization, we create one net per group PS that iterates over all the parameters owned by that group PS and syncs them with the dense PSs. Once the distributed nets for one group are created, we replicate them to each group with small manipulations to place the nets on the correct hosts.

The net transformation system is a generic design which allows us to define different combinations of model and data parallelism, and execute partitioned or replicated nets on different hosts.

4 SYSTEM OPTIMIZATIONS

In this section, we introduce several optimizations specific to HT.

4.1 Model Architecture Optimization

We first identify potential improvements from modifying the models. The fully-connected layer (FC) is one of the most common layers in the dense part of DLRM-like recommendation models. It is represented by the FC Op and FCGradient Op in the forward and backward computation respectively, which are essentially matrix multiplications [3], i.e., AB = C, where A is the input matrix with size input_dim × hidden_dim, B is the weight matrix with size hidden_dim × output_dim, and C is the output. To better parallelize the FC, we introduce two schemes for FC splits: vertical split and horizontal split. In a vertical split, we split the weight matrix vertically, and the final result of the FC is the concatenation of the sub matrix multiplications. In a horizontal split, we split both the input matrix and the weight matrix along the hidden_dim, and the result is the sum of the sub matrix multiplications. Both FC split schemes help parallelize the computation on group PSs, and each sub matrix operation can be placed on one group PS.
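Both split schemes can be checked with a small NumPy example (dimensions chosen arbitrarily for illustration): a vertical split partitions the weight along output_dim and concatenates the partial results, while a horizontal split partitions both operands along hidden_dim and sums the partial products; both recover AB.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 6, 8
A = rng.standard_normal((input_dim, hidden_dim))   # FC input
B = rng.standard_normal((hidden_dim, output_dim))  # FC weight
C = A @ B                                          # unsplit reference result

# Vertical split: partition B along output_dim; each group PS computes A @ B_i
# and the partial outputs are concatenated.
C_vertical = np.concatenate([A @ Bi for Bi in np.split(B, 2, axis=1)], axis=1)

# Horizontal split: partition A and B along hidden_dim; each group PS computes
# a partial product and the partial outputs are summed.
C_horizontal = sum(Ai @ Bi for Ai, Bi in zip(np.split(A, 2, axis=1),
                                             np.split(B, 2, axis=0)))

assert np.allclose(C, C_vertical) and np.allclose(C, C_horizontal)
```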

We also apply co-partitioning to two consecutive FCs to avoid unnecessary traffic. If the output_dim is larger than the hidden_dim for the first FC weight, and the hidden_dim is larger than the output_dim for the second FC weight, we apply co-partitioning to the two consecutive FCs: the first FC is better split vertically while the second is better split horizontally. Before this optimization, the first FC would be distributed to different group PSs and the results would be sent back to the trainers for aggregation (e.g., concat); then, for the second FC, we would redistribute the input (i.e., the output of the previous FC) to different group PSs. To avoid this unnecessary traffic, we co-partition the two FCs so that the intermediate results between them do not need to be sent back to the trainers, saving precious network bandwidth.


4.2 Communication Compression

We employ two techniques to reduce the network traffic introduced by the operator partitioning in hierarchical training [16]. Quantization can typically reduce the traffic by 2x-4x depending on the bit precision. We found that using specialized formats like bfloat16 can compress communication with almost no impact on model accuracy, at the cost of the added latency of the quantization and dequantization operations. Lower bit precisions like INT8 and INT4 can typically provide more compression but also come at the cost of model accuracy. We observed that the gradient information sent between sub-networks is extremely sensitive to lower bit precision, and hence INT8 is only applied to layer activations in the forward pass, while gradients are quantized to bfloat16. Sparsification is another technique we employ that further boosts the EPS by zeroing out low-magnitude values in gradients. This introduces bias into the system, which can be compensated by employing error correction techniques [24].
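A minimal sketch of the two levers, written against PyTorch tensors for illustration (the production path is in the Caffe2 stack): a low-precision cast for tensors sent between trainers and group PSs, and magnitude-based sparsification with a local error-feedback buffer; the keep ratio and the exact error-correction scheme are illustrative.

```python
import torch

def quantize_bf16(t: torch.Tensor) -> torch.Tensor:
    # Cast to bfloat16 before sending; the receiver casts back to float32.
    return t.to(torch.bfloat16)

def sparsify_with_error_feedback(grad: torch.Tensor, error: torch.Tensor,
                                 keep_ratio: float = 0.1):
    """Zero out low-magnitude gradient entries, carrying the dropped mass in a
    local error buffer that is added back on the next round (error feedback)."""
    corrected = (grad + error).flatten()
    k = max(1, int(keep_ratio * corrected.numel()))
    idx = corrected.abs().topk(k).indices
    sparse = torch.zeros_like(corrected)
    sparse[idx] = corrected[idx]
    new_error = corrected - sparse          # bias compensated on the next iteration
    return sparse.view_as(grad), new_error.view_as(grad)
```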

Another important lever that helps improve the efficiency of hierarchical training is the batch size. Larger batches lead to more computation during each batch of training, thereby boosting EPS and CPU utilization. But this also increases the network communication, which scales linearly with the number of layer activations. Hence communication compression and large batches can organically complement each other in boosting EPS.

4.3 System Optimizations for Scaling

Hierarchical training increases the number of distributed nodes from dozens to more than 200. This adds pressure on the underlying infrastructure to meet end-to-end training latency guarantees while maintaining high reliability. We applied several optimizations to improve the model publishing performance, the memory management scheme, and the overall fault-tolerance strategy.

During online training, models incrementally train on fresh data and publish new inference files at an interval of two hours. The models stop training to save inference files to disk with low latency and resume training quickly to avoid loss of training data. A background thread uploads the files from disk to remote storage to serve predictor traffic. Hierarchical training allows training of larger models, which increases the write latency of inference files to local disk and can lead to training data loss. Moreover, a lower end-to-end publishing latency between two consecutive snapshots is required to avoid model accuracy degradation from over-fitting. We optimized our model publishing logic to write directly to remote storage and avoid the slow temporary write to local disk.

Hierarchical training executes complex computations on the group PSs and trainers. The dynamic memory of these hosts is heavily utilized when training large models that place heavy computational components on a single node. Moreover, sparse PSs host large embedding tables, making these hosts susceptible to running out of memory. We utilize a detection framework to identify fused components which cannot be sharded across trainers and group PSs and fail early. We also deploy an auto-tuning framework with dynamic memory accounting to estimate the memory requirements of the sparse PSs and shard large embedding tables across multiple nodes.

Training at scale with a large number of nodes increases the surface area for failures, due to communication challenges between nodes and routine maintenance of hosts. We use a checkpoint-based failure recovery strategy for fault tolerance and increase the number of retries. We also implement a controller to defer host maintenance events so that we can group multiple events and perform maintenance on the hosts all at once.

Figure 7: Relative loss compared with the 40-trainer EASGD baseline, as a function of the number of samples (billions), for HT with 5, 8, and 10 groups. (a) Model-A; (b) Model-B.

Figure 8: Training throughput (kEPS) for EASGD with 40 trainers and for HT with 5, 8, and 10 groups. (a) Model-A; (b) Model-B.

5 EXPERIMENTS

In this section, we conduct experiments to evaluate the effectiveness of hierarchical training on DLRM-like models for click-through-rate prediction tasks. We use internal models and datasets for the experiments. The models consist of hundreds of sparse categorical features, and the dense part is several hundred megabytes. We omit the detailed descriptions of the models and data. Hyperparameters, e.g., the learning rate, batch size, etc., are the same for all the experiments. For all the comparisons except those in Section 5.3, we employed all the system optimization techniques introduced in Section 4. We use enough sparse PSs and readers for all jobs to ensure they are not the system bottleneck. The total number of nodes (gang size) we used ranges from 100 to 200.

5.1 HT vs. EASGD

First, we demonstrate the model quality and training throughput improvements of hierarchical training (HT) compared to the EASGD-based training system. We take two models, Model-A and Model-B, where Model-A uses full precision computations on fully-connected layers and Model-B uses 8-bit low precision computations for FC layers in the forward pass. Both scenarios are commonly used in real world recommendation model training. Both models are trained using 9 billion samples. The loss we use is the normalized entropy (NE) loss between the click-through-rate prediction probability and the true labels; a detailed description of the loss can be found in [19]. We use the averaged EPS (examples per second) throughout training to measure the throughput of the system. Note that the EPS is stable during training.

Figure 9: CPU and network utilization (%) of the trainers and of group PSs 0-5 within one group, for HT with 10 groups. (a) Model-A; (b) Model-B.

For each of the two models, we use EASGD with 40 trainers (EASGD 40t) as the baseline and compare it with hierarchical training (HT) with 5, 8, and 10 groups (HT 5g, HT 8g, HT 10g), where each group contains 6 trainers and 6 group PSs. We do not consider the use of more nodes in HT an efficiency loss, because sparse PSs and readers already take up more than half of the total nodes.

We first evaluate the relative loss (the lower the better) of the HT runs compared with the 40-trainer EASGD baseline over the training process in Figure 7. The HT versions achieve neutral to better loss compared with the baseline. The gain is larger when fewer groups, e.g., 5 groups, are used in HT. This is because fewer dense replicas yield better model quality, as also observed in Section 2. Specifically, with 5 groups, HT achieves more than 0.2% model quality gain compared with the baseline, which is significant.

Figure 8 shows the corresponding training throughput (EPS) for HT and the EASGD baseline. HT with 10 groups achieves 1.6x and 1.9x the EPS of the EASGD baseline for Model-A and Model-B respectively. Also note that with 10 groups, HT achieves neutral model quality for both models, as shown in Figure 7. This result demonstrates that HT can speed up the training by up to 1.9x while at the same time preserving the model quality. From another perspective, it also shows that with similar training throughput, HT can achieve up to 0.2% loss improvement compared with the EASGD-based system.

We also report the CPU and network utilization of the group PSs and trainers for the HT 10-group runs on both models in Figure 9. The utilization of group PSs from different groups is similar, so we only plot the utilization for the first group. Different trainers also have similar utilization because they are identical, so we only plot the averaged usage for trainers. Utilization for HT with 5 groups and 8 groups is similar to HT with 10 groups, because changing the number of groups does not change the utilization of the group PSs and trainers within a group. The result suggests that for Model-A, the CPU and network utilization of the training nodes (group PSs and trainers) is relatively balanced, and there is no obvious bottleneck. This demonstrates the effectiveness of our operator placement strategy, including the rule-based tagging and greedy refinement processes. For Model-B, the CPU and network are more heavily utilized, and more specifically, the CPU on group PS 4 becomes a bottleneck. Further improving the efficiency of hierarchical training is left as future work.

Figure 10: Scalability. (a) Inter-group scalability: relative throughput vs. the number of groups (2 to 10); (b) intra-group scalability: relative throughput vs. the number of nodes per group (4 to 20).

Figure 11: Effect of Optimizations. Relative throughput with all optimizations, without communication compression, without greedy refinement, without rule-based tagging and greedy refinement, and without model architecture optimization. (a) Model-A; (b) Model-B.

5.2 Scalability

In this section, we measure the scalability of hierarchical training on Model-A. We first evaluate the inter-group scalability by increasing the number of groups in HT, as shown in Figure 10a. Each group has 6 trainers and 6 group PSs. We report the relative speedup compared with the 2-group setting. The result shows that HT can scale linearly to 10 groups as long as other system components do not become the bottleneck. This is expected: as in the EASGD-based system, the model averaging for the dense part runs in the background, which allows HT to increase the training throughput almost linearly with more groups.

We then evaluate the effect of changing the (# trainers, # group PSs) pair in each group from (2, 2) to (10, 10) in an 8-group HT run on Model-A and measure the averaged EPS. Figure 10b shows the relative speedup compared with the (2 trainers, 2 group PSs) setting. We found that with more nodes per group, the model split becomes worse and the communication traffic among trainers and group PSs increases. Thus, the EPS grows sublinearly when increasing the number of nodes per group. A reasonable number of nodes per group is around 12 (6 trainers and 6 group PSs).

5.3 The Effectiveness of System Optimizations

Finally, we evaluate the effectiveness of the various optimization techniques for hierarchical training mentioned in Section 3.2 and Section 4. In particular, we look at the following techniques for both Model-A and Model-B: (1) the greedy refinement tagging process; (2) rule-based tagging + greedy refinement; (3) model architecture optimization; and (4) communication compression. We remove each of them independently from the HT baseline and measure the EPS. In both the Model-A and Model-B experiments, we use 5 groups with 6 trainers and 6 group PSs in each group.

Figure 11a and Figure 11b show that the tagging strategy is important to the efficiency of HT. Disabling both the rule-based tagging and the greedy refinement can reduce the EPS by more than 40%. The greedy refinement improves the efficiency considerably if the rule-based tagging strategy is not good enough, e.g., for Model-A, but gives little improvement if the rule-based process is already good, e.g., for Model-B. Model architecture optimizations can also boost the EPS, as they help model parallelism achieve more balanced computation and communication among the training nodes in a group. Finally, communication compression does not always help. The reason is that the compression/decompression operators also take CPU cycles, which might outweigh the benefit of the reduced communication volume, e.g., for Model-A. This indicates that taking the CPU cycles for compression/decompression into consideration in the tagging process can potentially further improve the training efficiency. We leave this as future work.

6 RELATED WORK

Models for Recommendation Systems. Recommendation models have evolved from large scale logistic regression [18] and factorization machines [13, 28, 34] to more advanced neural networks such as DLRM [31] from Facebook and Wide & Deep [6] from Google. Baidu and Alibaba use similar architectures for ads recommendation [47] and product recommendation [49] respectively. YouTube, Netflix, Microsoft, etc., also report similar architectures for various personalization workloads [7, 9, 10]. At Facebook, DLRM-like models power the majority of the recommendation products, including Instagram story ranking, news feed ranking, and group recommendation [2, 14, 15, 17]. Our previous work provides an in-depth analysis of model architecture, hardware, and system configurations [2]. DLRM-like models are also widely used in recommendation model research [12, 26, 37].

Parallelism Schemes. Data parallelism is a commonly used scheme for distributed training which partitions the input dataset among multiple devices. Each device holds a replica of the model, and the replicas are synchronized using parameter servers [1, 8, 21, 22, 27, 42] or collective communication primitives [11, 40]. Model parallelism splits the model across different devices to achieve parallelism, e.g., STRADS [25], NOMAD [44, 45] and Mesh-TensorFlow [35]. PipeDream [30] and GPipe [20] adopt pipeline parallelism: they partition the model onto multiple devices and split a batch of data into mini-batches to fully utilize all the devices in the pipeline. FlexFlow [23], Tofu [39] and TensorOpt [4] employ different strategies to discover the parallelism schemes for different operators automatically. Hierarchical training combines data parallelism and model parallelism in the execution of each group, and utilizes a two-phase tagging strategy for operator placement.

Parameter Synchronization. The parameter server architecture stores the model on distributed parameter servers and allows multiple workers to update the parameters in an RPC-like style, e.g., Parameter Server [27], Petuum [42], DistBelief [8], etc. At Facebook, we also use a dedicated set of sparse parameter servers to host the embeddings for recommendation models [2]. Hogwild! [33] exploits the sparsity of data and lets multiple threads update the shared model on one machine in a lock-free manner. Another line of work focuses on model averaging algorithms [51], e.g., EASGD [46] and BMUF [5]. Decentralized parallel SGD [29] relies on peer-to-peer communication to get rid of the centralized parameter servers. ShadowSync [48] makes the model averaging happen in the background and avoids having the synchronization block the training process. Hierarchical training extends ShadowSync across groups and uses Hogwild!-style updates within each group.

7 CONCLUSION

Scaling deep learning based recommendation model training without sacrificing the model quality is challenging. In this paper, we provide an in-depth analysis of the challenges and identify that the model quality is more sensitive to inter-trainer asynchronicity than to intra-trainer asynchronicity. Based on these observations, we propose Hierarchical Training to scale real-world recommendation models. We address the operator placement problem, develop optimization strategies for communication reduction, and integrate hierarchical training into our existing training stack. Our experiments verify the effectiveness of hierarchical training on real-world models. Our work demonstrates that a thorough understanding of the system asynchronicity and a design that leverages the characteristics of the models can significantly boost training scalability on large CPU clusters.

8 ACKNOWLEDGEMENT

We would like to thank Lin Yang, Chunzhi Yang, Xi Tao, Bangsheng Tang, Yunchen Pu, Ximing Chen, Huayu Li, Chonglin Sun, Ashwin Bahulkar, Shuai Yang, Xi Liu, Sherman Wong, Tristan Rice, Wenqi Cao, Hassan Eslami, James March, Jeff Kerzner, Dianshi Li, Isabel Gao, Ashot Melik-Martirosian, Joyce Xu, Carole-Jean Wu, Max Leung, Amit Nagpal, Bingyue Peng, John Bocharov, Rocky Liu, Wenlin Chen, Yantao Yao, Shuo Chang, Jason Chen, Liang Xiong, Hagay Lupesko, Stanley Wang, Shri Karandikar, Mohamed Fawzy for the helpful discussions and consistent support.

REFERENCES

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.

[2] Bilge Acun, Matthew Murphy, Xiaodong Wang, Jade Nie, Carole-Jean Wu, and Kim Hazelwood. 2020. Understanding Training Efficiency of Deep Learning Recommendation Models at Scale. arXiv preprint arXiv:2011.05497 (2020).

[3] Caffe2 Operators Catalog. 2021. https://caffe2.ai/docs/operators-catalogue.html

[4] Zhenkun Cai, Kaihao Ma, Xiao Yan, Yidi Wu, Yuzhen Huang, James Cheng, Teng Su, and Fan Yu. 2020. TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism. CoRR abs/2004.10856 (2020). arXiv:2004.10856 https://arxiv.org/abs/2004.10856

[5] Kai Chen and Qiang Huo. 2016. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5880–5884.

[6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 7–10.

[7] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. 191–198.

[8] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, et al. 2012. Large scale distributed deep networks. Advances in Neural Information Processing Systems 25 (2012), 1223–1231.

[9] Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web. 278–288.

[10] Carlos A Gomez-Uribe and Neil Hunt. 2015. The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems (TMIS) 6, 4 (2015), 1–19.

[11] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).

[12] Hui Guan, Andrey Malevich, Jiyan Yang, Jongsoo Park, and Hector Yuen. 2019. Post-training 4-bit quantization on embedding tables. arXiv preprint arXiv:1911.02079 (2019).

[13] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).

[14] Udit Gupta, Samuel Hsia, Vikram Saraph, Xiaodong Wang, Brandon Reagen, Gu-Yeon Wei, Hsien-Hsin S Lee, David Brooks, and Carole-Jean Wu. 2020. DeepRecSys: A system for optimizing end-to-end at-scale neural recommendation inference. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 982–995.

[15] Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Mark Hempstead, Bill Jia, et al. 2020. The architectural implications of Facebook's DNN-based personalized recommendation. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 488–501.

[16] Vipul Gupta, Dhruv Choudhary, Ping Tak Peter Tang, Xiaohan Wei, Xing Wang, Yuzhen Huang, Arun Kejariwal, Kannan Ramchandran, and Michael W. Mahoney. 2020. Fast Distributed Training of Deep Neural Networks: Dynamic Communication Thresholding for Model and Data Parallelism. CoRR abs/2010.08899 (2020). arXiv:2010.08899 https://arxiv.org/abs/2010.08899

[17] Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, et al. 2018. Applied machine learning at Facebook: A datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 620–629.

[18] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. 2014. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. 1–9.

[19] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. 2014. Practical Lessons from Predicting Clicks on Ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising (ADKDD'14). Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2648584.2648589

[20] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2018. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965 (2018).

[21] Yuzhen Huang, Tatiana Jin, Yidi Wu, Zhenkun Cai, Xiao Yan, Fan Yang, Jinfeng Li, Yuying Guo, and James Cheng. 2018. FlexPS: Flexible Parallelism Control in Parameter Server Architecture. Proc. VLDB Endow. 11, 5 (2018), 566–579. https://doi.org/10.1145/3187009.3177734

[22] Yuzhen Huang, Xiao Yan, Guanxian Jiang, Tatiana Jin, James Cheng, An Xu, Zhanhao Liu, and Shuo Tu. 2019. Tangram: bridging immutable and mutable abstractions for distributed data analytics. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 191–206.

[23] Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018. Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358 (2018).

[24] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. 2019. Error Feedback Fixes SignSGD and other Gradient Compression Schemes. In International Conference on Machine Learning. 3252–3261.

[25] Jin Kyu Kim, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth A Gibson, and Eric P Xing. 2016. Strads: A distributed framework for scheduled model parallel machine learning. In Proceedings of the Eleventh European Conference on Computer Systems. 1–16.

[26] Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2019. TensorDIMM: A practical near-memory processing architecture for embeddings and tensor operations in deep learning. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 740–753.

[27] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 583–598.

[28] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1754–1763.

[29] Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. 2018. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning. PMLR, 3043–3052.

[30] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15.

[31] Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. 2019. Deep Learning Recommendation Model for Personalization and Recommendation Systems. arXiv preprint arXiv:1906.00091 (2019).

[32] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems. 8026–8037.

[33] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. Advances in neural information processing systems 24 (2011), 693–701.

[34] S. Rendle. 2010. Factorization Machines. In 2010 IEEE International Conference on Data Mining. 995–1000. https://doi.org/10.1109/ICDM.2010.127

[35] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. 2018. Mesh-TensorFlow: Deep learning for supercomputers. arXiv preprint arXiv:1811.02084 (2018).

[36] Hao-Jun Michael Shi, Dheevatsa Mudigere, Maxim Naumov, and Jiyan Yang. 2020. Compositional embeddings using complementary partitions for memory-efficient recommendation systems. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 165–175.

[37] Qingquan Song, Dehua Cheng, Hanning Zhou, Jiyan Yang, Yuandong Tian, and Xia Hu. 2020. Towards automated neural interaction discovery for click-through rate prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 945–955.

[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017).

[39] Minjie Wang, Chien-chin Huang, and Jinyang Li. 2019. Supporting very large models using automatic dataflow graph partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019. 1–17.

[40] Yidi Wu, Kaihao Ma, Xiao Yan, Zhi Liu, Zhenkun Cai, Yuzhen Huang, James Cheng, Han Yuan, and Fan Yu. 2021. Elastic Deep Learning in Multi-Tenant GPU Clusters. IEEE Transactions on Parallel and Distributed Systems (2021).

[41] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1492–1500.

[42] Eric P Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. 2015. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data 1, 2 (2015), 49–67.

[43] Xpress Optimization. 2021. https://www.fico.com/en/products/fico-xpress-optimization. (2021).

[44] Fan Yang, Fanhua Shang, Yuzhen Huang, James Cheng, Jinfeng Li, Yunjian Zhao, and Ruihao Zhao. 2017. LFTF: A framework for efficient tensor analytics at scale. Proceedings of the VLDB Endowment 10, 7 (2017), 745–756.

[45] Hyokun Yun, Hsiang-Fu Yu, Cho-Jui Hsieh, SVN Vishwanathan, and Inderjit Dhillon. 2013. Nomad: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. arXiv preprint arXiv:1312.0193 (2013).

[46] Sixin Zhang, Anna E Choromanska, and Yann LeCun. 2015. Deep learning with elastic averaging SGD. Advances in neural information processing systems 28 (2015), 685–693.

[47] Weijie Zhao, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming Sun, and Ping Li. 2020. Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems. arXiv preprint arXiv:2003.05622 (2020).

[48] Qinqing Zheng, Bor-Yiing Su, Jiyan Yang, Alisson Azzolini, Qiang Wu, Ou Jin, Shri Karandikar, Hagay Lupesko, Liang Xiong, and Eric Zhou. 2020. ShadowSync: Performing Synchronization in the Background for Highly Scalable Distributed Training. arXiv preprint arXiv:2003.03477 (2020).

[49] Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 5941–5948.

[50] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068.

[51] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. 2010. Parallelized stochastic gradient descent. In Advances in neural information processing systems. 2595–2603.