Gagandeep Singh, Juan Gomez-Luna, Giovanni Mariani, Geraldo F. Oliveira, Stefano Corda, Sander Stuijk, Onur Mutlu, Henk Corporaal

56th Design Automation Conference (DAC), Las Vegas, 4th June 2019

Funded by the Horizon 2020 Framework Programme of the European Union (MSCA-ITN-EID)



Executive Summary

• Motivation: A promising paradigm to alleviate the data movement bottleneck is near-memory computing (NMC), which places compute units close to the memory subsystem
• Problem: Simulation is extremely slow, imposing long run times, especially in early-stage design space exploration
• Goal: A quick, high-level performance and energy estimation framework for NMC architectures
• Our contribution: NAPEL
  • Fast and accurate performance and energy prediction for previously-unseen applications using ensemble learning
  • Uses intelligent statistical techniques and microarchitecture-independent application features to minimize experimental runs
• Evaluation
  • NAPEL is, on average, 220x faster than a state-of-the-art NMC simulator
  • Average error rates of 8.5% and 11.6% for performance and energy estimation, respectively

We open source Ramulator-PIM: https://github.com/CMU-SAFARI/ramulator-pim/


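[Figure: data volumes of large-scale services compared with the Square Kilometre Array (SKA). Recoverable labels: "searches on", "uploads on"; values: 98 PB, 180 PB, 15 PB, 15 PB, 3 PB; SKA: 300 PB.]

Source: Michael Wise, ASTRON, "Science data Centre challenges", DOME Symposium, 18 May 2017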


Massive amounts of data


Compute-Centric Approach

• Memory hierarchies take advantage of locality
  • Spatial locality
  • Temporal locality
• Not suitable for all workloads
  • Graph processing
  • Neural networks
• Data access consumes a major part of system power
  – Applications are increasingly data hungry
• Data movement energy dominates compute
  – Especially true for off-chip movement

[Figure: System-level power breakdown*. Labels: Processor, Data Access, Data Movement, DDR I/O, DDR chip, Integer core, link.]

* R. Nair et al., "Active memory cube: A processing-in-memory architecture for exascale systems", IBM J. Research Develop., vol. 59, no. 2/3, 2015


Data movement bottleneck


Paradigm Shift - NMC

• Compute-centric to a data-centric approach
• Biggest enabler: stacking technology

[Figure: 3D-stacked memory. A logic layer (cores, interconnect, vault controllers, link controllers, external interface) is connected by TSVs to DRAM layers, which are organized into vaults and partitions.]


NMC Simulators

• Simulation for:
  • Design space exploration (DSE)
  • Workload suitability analysis
• NMC Simulators:
  • Sinuca, 2015
  • HMC-SIM, 2016
  • CasHMC, 2016
  • Smart Memory Cube (SMC), 2016
  • CLAPPS, 2017
  • Gem5+HMC, 2017
  • Ramulator-PIM¹, 2019

¹ Ramulator-PIM: https://github.com/CMU-SAFARI/ramulator-pim/


Simulation of real workloads can be 10000x slower than native execution!


Idea: Leverage ML with statistical techniques for quick NMC performance/energy prediction


NAPEL:


NAPEL Model Training


Phase 1: LLVM Analyzer


Phase 2: Microarchitecture Simulation


A central composite design (CCD) of experiments is used to minimize the number of simulation runs needed during data collection.
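As a rough illustration of how a central composite design keeps the number of simulator runs small, the sketch below enumerates CCD points in coded units for k parameters: the 2^k factorial corners, 2k axial points, and one center point. The function name, the choice of `alpha`, and the three-parameter example are assumptions made for illustration; the slides do not say which CCD variant or parameter levels NAPEL uses.

```python
from itertools import product


def central_composite_design(k, alpha=None):
    """Return CCD points in coded units for k factors (illustrative sketch)."""
    if alpha is None:
        # Rotatable design: alpha = (2^k)^(1/4). This choice is an assumption.
        alpha = (2 ** k) ** 0.25
    # Factorial (corner) points: every combination of low (-1) and high (+1).
    corners = [tuple(float(v) for v in p) for p in product((-1, 1), repeat=k)]
    # Axial (star) points: one factor at +/- alpha, all others at the center.
    axial = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * alpha
            axial.append(tuple(point))
    # Single center point with every factor at its middle level.
    center = [tuple([0.0] * k)]
    return corners + axial + center


# Hypothetical example: three architecture parameters.
design = central_composite_design(3)
print(len(design), "runs instead of", 5 ** 3, "for a full 5-level grid")
```

Each factor still takes five distinct levels (-alpha, -1, 0, +1, +alpha), but the number of runs grows as 2^k + 2k + 1 rather than 5^k.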


Phase 3: Ensemble ML Training


Application Features

• Instruction mix
• ILP
• Reuse distance
• Memory traffic
• Register traffic
• Memory footprint

Architecture Features

• Core type
• #PEs
• Core frequency
• Cache line size
• DRAM layers
• Cache access fraction
• DRAM access fraction
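The slides only state that NAPEL trains an ensemble model on application and architecture features. As a minimal sketch of that idea, assuming bagging (bootstrap aggregation) of one-split regression trees as the ensemble method, which the slides do not confirm, the following pure-Python example concatenates the two feature groups into one vector and averages the predictions of many weak learners. All feature values, names, and targets are made up for illustration.

```python
import random


def fit_stump(rows, targets):
    """Fit a one-split regression tree (stump) minimizing squared error."""
    mean_all = sum(targets) / len(targets)
    best = (float("inf"), None, None, mean_all, mean_all)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    return best[1:]  # (feature index, threshold, left mean, right mean)


def predict_stump(stump, row):
    f, threshold, lm, rm = stump
    if f is None:  # no valid split was found; fall back to the mean
        return lm
    return lm if row[f] <= threshold else rm


def fit_bagged_ensemble(rows, targets, n_models=25, seed=0):
    """Train each stump on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(rows)) for _ in rows]
        models.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx]))
    return models


def predict_ensemble(models, row):
    """Average the predictions of all ensemble members."""
    return sum(predict_stump(m, row) for m in models) / len(models)


# Made-up training samples: application features + architecture features.
# app = (memory-instruction fraction, ILP); arch = (#PEs, core frequency in GHz)
samples = [
    ((0.30, 2.1), (4, 1.0), 9.5),
    ((0.30, 2.1), (8, 2.0), 4.8),
    ((0.55, 1.4), (4, 1.0), 14.2),
    ((0.55, 1.4), (8, 2.0), 7.9),
    ((0.40, 1.8), (16, 1.5), 5.1),
]
X = [app + arch for app, arch, _ in samples]
y = [t for _, _, t in samples]

models = fit_bagged_ensemble(X, y)
unseen = (0.45, 1.6) + (8, 1.5)  # a configuration not in the training set
print(round(predict_ensemble(models, unseen), 2))
```

Averaging over resampled learners reduces the variance of any single tree, which is the usual motivation for ensembles when training data from slow simulations is scarce.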


NAPEL Framework


NAPEL Prediction


Experimental Setup

• Host System
  • IBM POWER9
  • Power: AMESTER
• NMC Subsystem
  • Ramulator-PIM¹
• Workloads
  • PolyBench and Rodinia
  • Heterogeneous workloads such as image processing, machine learning, graph processing, etc.
• Accuracy in terms of mean relative error (MRE)

¹ https://github.com/CMU-SAFARI/ramulator-pim/
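Mean relative error is commonly computed as the average of |predicted - actual| / |actual| over all test points. The sketch below uses that common definition, which is assumed here because the slides do not spell out the formula; the sample numbers are hypothetical.

```python
def mean_relative_error(predicted, actual):
    """Return the mean of |predicted - actual| / |actual| as a fraction."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical simulator results vs. model predictions (e.g., execution time).
simulated = [100.0, 200.0, 50.0]
predicted = [110.0, 190.0, 50.0]
print(f"MRE = {100 * mean_relative_error(predicted, simulated):.1f}%")  # MRE = 5.0%
```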


NAPEL Accuracy: Performance and Energy Estimates

[Figure: Mean relative error (%) per application (atax, bfs, bp, chol, gemv, gesu, gram, kme, lu, mvt, syrk, trmm) and geometric mean (gmean), comparing Decision tree, ANN, and NAPEL. Panels: (a) Performance prediction, (b) Energy prediction. Bar annotations recoverable from the extraction: 40.4, 16.3, 11.6 in one panel and 27.2, 14.7, 8.5 in the other.]


MRE of 8.5% and 11.6% for performance and energy


Speed of Evaluation

[Figure: NAPEL's prediction speedup over Ramulator (y-axis, 0 to 1200) across DoE configurations (x-axis); 256 DoE configurations for 12 evaluated applications.]


220x (up to 1039x) faster than NMC simulator


Use Case: NMC Suitability Analysis

• Assess the potential of offloading a workload to NMC
• NAPEL provides accurate prediction of NMC suitability
• MRE between 1.3% and 26.3% (average 14.1%)

[Figure: EDP reduction (y-axis, 0 to 6), Actual vs. NAPEL.]
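A suitability check along these lines might compare the energy-delay product (EDP, energy multiplied by execution time) on the host against the predicted EDP on the NMC system. The sketch below assumes EDP reduction is defined as host EDP divided by NMC EDP, so that values above 1 favor offloading; the slides do not state the exact definition, and all function names and numbers are hypothetical.

```python
def edp(energy_joules, time_seconds):
    """Energy-delay product: energy multiplied by execution time."""
    return energy_joules * time_seconds


def edp_reduction(host_energy, host_time, nmc_energy, nmc_time):
    """Ratio of host EDP to NMC EDP (assumed definition); > 1 favors NMC."""
    return edp(host_energy, host_time) / edp(nmc_energy, nmc_time)


# Hypothetical measurements (host) and predictions (NMC) for one workload.
reduction = edp_reduction(host_energy=12.0, host_time=2.0, nmc_energy=6.0, nmc_time=1.0)
print(f"EDP reduction: {reduction:.1f}x, offload to NMC: {reduction > 1.0}")
```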


Conclusion and Summary

• Motivation: A promising paradigm to alleviate the data movement bottleneck is near-memory computing (NMC), which places compute units close to the memory subsystem
• Problem: Simulation is extremely slow, imposing long run times, especially in early-stage design space exploration
• Goal: A quick, high-level performance and energy estimation framework for NMC architectures
• Our contribution: NAPEL
  • Fast and accurate performance and energy prediction for previously-unseen applications using ensemble learning
  • Uses intelligent statistical techniques and microarchitecture-independent application features to minimize experimental runs
• Evaluation
  • NAPEL is, on average, 220x faster than a state-of-the-art NMC simulator
  • Average error rates of 8.5% and 11.6% for performance and energy estimation, respectively

We open source Ramulator-PIM: https://github.com/CMU-SAFARI/ramulator-pim/
