MAIS Project - WP5: Exploration of Alternative Architectures
Report on Activities Carried Out
Andrea Pagni
STMicroelectronics
Advanced System Architectures Group
Milan, 17-18 November 2004
WP5 Report
Topics
Part 1: VLIW-SIM Overview.
Part 2: VLIW-SIM Performance.
Part 3: VLIW-SIM Library.
Part 4: Next Steps.
Part 1: VLIW-SIM Overview
Simulation Approach (1-7).
Modeled Target Architectures.
Supported platforms.
Simulation functionalities.
Simulation Approach 1/7: Overview
• Interpretative simulation approach
• Simulation technology based on a set of re-usable sub-blocks:
  • Pipeline modeling
  • Instruction execution
  • Memory modeling
  • Register file management
  • I/O simulation
• Efficient host resource allocation
• Target Architecture Description capability (IS, TAD)
• Challenging compromise between speed and accuracy
[Figure: simulator block diagram. Blocks: BIN loader, Instruction Fetch, Instruction Decode, Instruction Execute, Instruction Set, Register File, Pipeline, Memory (Program Memory, Data Memory). Flow: the BIN loader initializes Program Memory from the executable file; instructions are loaded and written into the pipeline, decoded and passed to the next phase, then executed and passed to the next phase; execution reads operands from and writes results to the Register File and performs Data Memory load/store.]
Simulation Approach 2/7: Pipeline modelling
[Figure: the pipeline as a 3-dimensional space. Axes: Phase (F, D, R, E1, E2, W), Operation (op1-op4, op5-op8), Time (t-1, t, t+1).]
During simulation, the pipeline is represented as a 3-dimensional space (phase, operation, time): operation is the instruction's position in the bundle, phase is the pipeline phase, and time is the given time stamp.
Simulation Approach 3/7: Pipeline modelling
The pipeline status is modelled as a two-dimensional array: the first index is the pipeline phase and the second is the position of an instruction in the fetch packet.
The simulation process uses two such arrays, one representing the current pipeline status and one representing the following status.
At each machine cycle the pipeline status is processed: actions are performed depending on which instructions are in each phase, and then the instructions are moved to the next pipeline phase.
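The two-array scheme described above can be sketched as follows. This is a minimal illustration, not the VLIW-SIM implementation: the phase names and the 4-slot fetch packet come from the slides, while the function names `empty_status` and `advance` and the use of `None` for an empty slot are assumptions made for the example.

```python
PHASES = ["F", "D", "R", "E1", "E2", "W"]  # six pipeline phases named on the slides
SLOTS = 4                                   # operations per fetch packet (4-issue core)


def empty_status():
    """Return a phases x slots array with every entry empty (None)."""
    return [[None] * SLOTS for _ in PHASES]


def advance(current, fetch_packet):
    """Build the following status from the current one for one machine cycle."""
    following = empty_status()
    retired = []
    for phase, row in enumerate(current):
        for slot, op in enumerate(row):
            if op is None:
                continue
            # A real simulator would perform the phase-specific action here.
            if phase + 1 < len(PHASES):
                following[phase + 1][slot] = op  # move to the next phase
            else:
                retired.append(op)               # leaves the pipeline after W
    following[0] = list(fetch_packet)            # load the new fetch packet into F
    return following, retired


status = empty_status()
packets = [["op1", "op2", "op3", "op4"], ["op5", "op6", "op7", "op8"]]
for packet in packets:
    # The following status becomes the current status of the next cycle.
    status, _ = advance(status, packet)
```

After the two cycles above, the second packet sits in phase F and the first packet has moved on to phase D. Building a fresh following array each cycle keeps the current status read-only while it is being processed.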
Simulation Approach 4/7: Pipeline status update
[Figure: Current Status and Following Status arrays. Each array has the pipeline phases (F, D, R, E1, E2, W) on one axis and the operations (op1-op4) on the other; the fetch packet is loaded into the array and moves through the phases.]
At each machine cycle the pipeline status is processed.
Simulation Approach 5/7: Instruction execution
[Figure: instruction dispatch. Labels: Dispatch unit, Decode, instr, category, opcode, opfield, Index (1, 2, 3, ..., n), Instruction Table, Pointer to instruction routine, Instruction execute.]
Instruction execution is simulated through an Instruction Table, which contains the address of each instruction routine and the instruction latency value.
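A table-driven dispatch of this kind might look like the sketch below. The encoding (opcode in the top 8 bits of a 32-bit word), the two sample routines and their latency values are all hypothetical and chosen only for illustration; the slides do not give the actual table layout.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TableEntry:
    routine: Callable[[int, int], int]  # function that simulates the instruction
    latency: int                        # cycles before the result is available


def op_add(a, b):
    return (a + b) & 0xFFFFFFFF         # 32-bit wrap-around


def op_mul(a, b):
    return (a * b) & 0xFFFFFFFF


# Hypothetical opcodes and latencies, indexed as in an instruction table.
INSTRUCTION_TABLE = {
    0x01: TableEntry(op_add, 1),
    0x02: TableEntry(op_mul, 3),
}


def decode(word):
    """Extract the table index from a hypothetical 32-bit instruction word."""
    return (word >> 24) & 0xFF


def execute(word, a, b):
    """Look up the routine for this word, run it, and return (result, latency)."""
    entry = INSTRUCTION_TABLE[decode(word)]
    return entry.routine(a, b), entry.latency


result, latency = execute(0x02000000, 6, 7)  # dispatches to op_mul
```

Keeping the latency next to the routine pointer lets the pipeline model schedule the write-back without inspecting the instruction again.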
Simulation Approach 6/7: Register file status update
The simulation environment progressively updates the pipeline status while maintaining data coherence in memory locations and in the register file.
To support data coherence, two register files are used: one for the current register file status and one for the following status.
Each time an instruction is executed, its operands are loaded from the current register file and its results are stored in the following one.
This allows sequential simulation of parallel instruction execution.
[Figure: two copies of the Register File (Reg 0, Reg 1, ..., Reg 63), the Current State at time t and the Next State at time t+1, shown alongside the pipeline phases (F, D, R, E1, E2, W); instructions read from the current state and write to the next state.]
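A small sketch shows why the two register files allow parallel operations to be simulated one after another. The class and method names (`RegisterFile`, `load`, `read`, `write`, `commit`) are assumptions for this example; only the current/following split and the 64-register size come from the slides.

```python
NUM_REGS = 64  # Reg 0 .. Reg 63, as in the figure


class RegisterFile:
    def __init__(self):
        self.current = [0] * NUM_REGS    # state at time t (read side)
        self.following = [0] * NUM_REGS  # state at time t+1 (write side)

    def load(self, index, value):
        """Initialise a register in both copies."""
        self.current[index] = value
        self.following[index] = value

    def read(self, index):
        return self.current[index]

    def write(self, index, value):
        self.following[index] = value & 0xFFFFFFFF

    def commit(self):
        """End of cycle: the following state becomes the current one."""
        self.current = list(self.following)


rf = RegisterFile()
rf.load(1, 10)
rf.load(2, 20)

# One bundle with two parallel moves (r1 <- r2, r2 <- r1), simulated sequentially.
rf.write(1, rf.read(2))
rf.write(2, rf.read(1))  # still sees the old r1, because reads use `current`
rf.commit()
```

With a single register file the second move would read the value just written by the first one, and the swap would be lost.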
Simulation Approach 7/7: I/O simulation
[Figure: I/O simulation flow. An I/O instruction between bundle (i-1) and bundle (i+1) issues a syscall and an exception is generated; the I/O Handling routine calls save_core, I/O operation Selection, Host File System I/O execution and restore_core, then returns with rfi.]
• Target-architecture-specific I/O features are separated from the simulation kernel.
• The SYSCALL pseudo-instruction manages the interface between internal I/O instructions (processor side) and file system I/O calls (OS side).
• SYSCALL also handles general exception handling.
• This mechanism is transparent to the other simulator modules: performance and data flow are not affected when no I/O operations are present.
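The SYSCALL path can be illustrated with a short sketch. Everything specific here is hypothetical: the register convention (r0 = operation selector, r1 = host file descriptor, r2 = buffer address, r3 = length), the selector value 1 for "write", and the `Core` class. Only the sequence save core, select operation, execute on the host file system, restore core is taken from the slide.

```python
import copy
import os
import tempfile


class Core:
    """Toy target state: 64 registers and a small data memory."""
    def __init__(self):
        self.regs = [0] * 64
        self.memory = bytearray(256)


def io_write(core):
    """Hypothetical convention: r1 = host fd, r2 = buffer address, r3 = length."""
    fd, addr, length = core.regs[1], core.regs[2], core.regs[3]
    return os.write(fd, bytes(core.memory[addr:addr + length]))


IO_OPERATIONS = {1: io_write}  # I/O operation selection table (selector in r0)


def syscall(core):
    saved = copy.deepcopy(core.regs)          # save_core
    try:
        handler = IO_OPERATIONS[core.regs[0]]  # I/O operation selection
        result = handler(core)                 # host file system I/O execution
    finally:
        core.regs = saved                      # restore_core
    core.regs[0] = result                      # return value visible to the program


core = Core()
core.memory[0:5] = b"hello"
fd, path = tempfile.mkstemp()
core.regs[0:4] = [1, fd, 0, 5]
syscall(core)
os.close(fd)
with open(path, "rb") as f:
    written = f.read()
os.remove(path)
```

Because the handlers live in a separate table, the simulation kernel only needs to know that a syscall occurred, not which host operation it maps to.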
Modeled Target Architectures
ST210
• Multi-cluster architecture
• 4-issue VLIW core
• I/D-cache memories
• 6-stage pipeline
• RISC-like Instruction Set
• 64 32-bit general registers, 8 1-bit special registers
TI C62x
• 8-issue VLIW core
• Optional I-cache memory
• 11-stage pipeline
• RISC-like Instruction Set
• 32 32-bit general registers
TI C64x
• 8-issue VLIW core
• I/D-cache memories
• 11-stage pipeline
• RISC/SIMD Instruction Set
• 64 32-bit general registers
Supported Platforms
Windows OS (Visual C++):
• text mode: project file in vliw_sim/vliw_sim
• graphical mode: project file in vliw_sim/gui/gui
Windows OS (Cygwin, gcc):
• text mode: makefile in vliw_sim/vliw_sim
• graphical mode (with XWindows on Cygwin)
Linux OS (RedHat, gcc):
• text mode: makefile in vliw_sim/vliw_sim
• graphical mode: makefile in vliw_sim/gui/gui
Sun OS (Solaris, gcc):
• text mode: makefile in vliw_sim/vliw_sim
• graphical mode: makefile in vliw_sim/gui/gui
vliw_sim:
• bin_loader
• cache
• gui/gui
• instruction_set
• io_interf
• memory
• pipeline
• profdebug
• registers
• vliw_sim
• vliw_sim_dll
Simulation functionalities
Debug Support
• Step-by-step execution
• Breakpoint
• Register & Memory access
• Pipeline Visibility (instructions & addresses)
Profiling
• Application Code region Profile
• Statistics extraction for profiled code
Simulator Dynamic Library
• Simulation API
• SoC simulation facilities
Exception Handling simulation
Efficient I/O interface simulation
Part 2: VLIW-SIM Performance
Tested Applications.
SW apps on ST210.
SW apps on TI C62x.
SW apps on TI C64x.
SW apps on ST210 (1-3).
Tested Applications
ST210
• MPEG-2 Intra Video Encoder (0.2 s, 5 frames, 15 Mbit/s)
• MPEG-1 Layer 2 Audio Encoder (1 s, 32 kHz, 256 kbit/s)
• MPEG-2 M=3 MP@ML Video Decoder (1 s, 25 frames/s, 15 Mbit/s)
• MPEG-4 QCIF SP@L3 Video Decoder (1 s, 25 frames/s, 512 kbit/s)
• MPEG-4 QCIF SP@L3 Video Encoder (27 frames, 64 kbit/s, QP=12)
• H.263+ QCIF Video Encoder (10 frames, no rate control)
• G.723.1 Audio Enc-Dec (20 frames, 8 kHz, 5.3 kbit/s)
• Automatic Speech Recognition (HMM, 5 words, 8 MEL, 50 active words)
TI C62x & C64x
• H.263+ Video Enc QCIF (5 frames, no rate control)
• G.726 Audio Enc-Dec (10 frames, 8 kHz, 32 kbit/s)
SW apps on TI-C62x
(- : not reported)

H.263+ Video Enc (5 fr QCIF, no rate control), with caches
                          CPU time (sec)  MOPS  bundles    operations  I$ misses  D$ misses
ME2DYA ISS                104             0.61  36670008   63448050    -          -
TI C62x Fast Simulator    509             0.12  36670011   -           -          -
DIFFERENCE                                      3          -           0          0

G.726 Audio Enc-Dec (10 fr, 8 kHz, 32 kbps), with caches
                          CPU time (sec)  MOPS  bundles    operations  I$ misses  D$ misses
ME2DYA ISS                7               0.59  2366855    4164831     -          -
TI C62x Fast Simulator    35              0.12  2366858    -           -          -
DIFFERENCE                                      3          -           0          0

Simulation Platform: Pentium II 400 MHz, 128 MB RAM, Windows NT4 SP6
Operation = one syllable (elementary 32-bit RISC instruction)
SW apps on TI-C64x
(- : not reported)

H.263+ Video Enc (5 fr QCIF, no rate control), with caches
                          CPU time (sec)  MOPS  bundles    operations  I$ misses  D$ misses
ME2DYA ISS                95              0.55  28533664   51919843    330117     684905
TI C64x C64_CSIM          897             0.06  28533667   -           387578     770661
DIFFERENCE                                      3          -           57461      85756

G.726 Audio Enc-Dec (10 fr, 8 kHz, 32 kbps), with caches
                          CPU time (sec)  MOPS  bundles    operations  I$ misses  D$ misses
ME2DYA ISS                8               0.52  2316134    4123399     45277      4286
TI C64x C64_CSIM          78              0.05  2316137    -           60623      4286
DIFFERENCE                                      3          -           15346      0

Simulation Platform: Pentium II 400 MHz, 128 MB RAM, Windows NT4 SP6
Bundle = one or more syllables (max 8 for TI C6xx, max 4 for ST210) per clock cycle
SW apps on ST210 1/3

MPEG-1 Layer 2 Audio Enc (1 s, 32 kHz, 256 kbps), with caches
                          CPU time (sec)  MOPS  bundles    operations  I$ misses  D$ misses
ME2DYA ISS                33              1.69  22244270   55912864    44833      2576
HP ISS v 2.34             36              1.55  22250327   55922242    47087      101529
DIFFERENCE%                                     0.03       0.02        4.79       97.46

MPEG-4 QCIF Video Dec (1 s, 25 fr/s, 512 kbps), with caches
                          CPU time (sec)  MOPS  bundles    operations  I$ misses  D$ misses
ME2DYA ISS                27              1.63  20937469   43948023    110051     157526
HP ISS v 2.34             27.2            1.62  20946900   43945855    103745     227471
DIFFERENCE%                                     0.05       0.00        -6.08      30.75

MPEG-2 Intra Video Enc (5 fr, 15 Mbps), with caches
                          CPU time (sec)  MOPS  bundles    operations  I$ misses  D$ misses
ME2DYA ISS                64              1.92  41767235   122788664   27438      412745
HP ISS v 2.34             67              1.83  41793358   122782323   27537      446544
DIFFERENCE%                                     0.06       -0.01       0.36       7.57

Simulation Platform: Pentium II 800 MHz, 256 MB RAM, Windows 2000
HP ISS configured with:
• ignore_non_cacheable_areas TRUE
• profile_gprof_on FALSE
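As a worked check of the derived columns, the sketch below recomputes MOPS and DIFFERENCE% for the MPEG-1 Layer 2 Audio Encoder row. MOPS is taken as millions of operations per second of CPU time. The slides do not state the DIFFERENCE% formula; the one used here, (HP ISS - ME2DYA ISS) / HP ISS x 100, is an inference that reproduces the reported figures for this row. The function and variable names are hypothetical.

```python
def mops(operations, cpu_seconds):
    """Millions of simulated operations per second of host CPU time."""
    return operations / cpu_seconds / 1e6


def difference_percent(me2dya, hp_iss):
    """Relative difference, taking the HP ISS value as the reference (inferred)."""
    return (hp_iss - me2dya) / hp_iss * 100.0


# MPEG-1 Layer 2 Audio Enc row, values as reported on the slide.
me2dya = {"cpu_s": 33, "bundles": 22244270, "operations": 55912864,
          "icache_misses": 44833, "dcache_misses": 2576}
hp_iss = {"cpu_s": 36, "bundles": 22250327, "operations": 55922242,
          "icache_misses": 47087, "dcache_misses": 101529}

me2dya_mops = round(mops(me2dya["operations"], me2dya["cpu_s"]), 2)
hp_iss_mops = round(mops(hp_iss["operations"], hp_iss["cpu_s"]), 2)

differences = {
    key: round(difference_percent(me2dya[key], hp_iss[key]), 2)
    for key in ("bundles", "operations", "icache_misses", "dcache_misses")
}
```

The results match the slide: 1.69 and 1.55 MOPS, and differences of 0.03, 0.02, 4.79 and 97.46 percent.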
SW apps on ST210 2/3
(- : not reported)

MPEG-2 Video Decoder MP@ML (Renata, M=3, 25 fr/s), with caches
                          CPU time (sec)  MOPS  bundles    operations  I$ misses  D$ misses
ME2DYA ISS                361             1.93  210332372  698079303   114211     2149829
HP ISS v 2.34             365             1.91  210408042  698072522   133596     2704002
DIFFERENCE%                                     0.04       0.00        14.51      -

MPEG-4 QCIF Video Enc SP@L3 (27 fr, 64 kbps, QP=12), with caches
                          CPU time (sec)  MOPS  bundles    operations  I$ misses  D$ misses
ME2DYA ISS                829             1.44  638253918  1190289114  3387100    186212
HP ISS v 2.34             740             1.61  638264837  1190299380  3427876    406397
DIFFERENCE%                                     0.05       0.00        -6.08      -

Simulation Platform: Pentium II 800 MHz, 256 MB RAM, Windows 2000
HP ISS configured with:
• ignore_non_cacheable_areas TRUE
• profile_gprof_on FALSE
SW apps on ST210 3/3

H.263+ Video Enc (10 fr QCIF, no rate control), with caches
                          CPU time (sec)  MOPS  bundles    operations  I$ misses  D$ misses
ME2DYA ISS                24              2.93  30381273   70272426    118303     33190
HP ISS v 2.22             29              2.42  30381792   70272489    116161     129534
DIFFERENCE                                      519        63          -2142      96344

G.723.1 Audio Enc-Dec (20 fr, 8 kHz, 5.3 kbps), with caches
                          CPU time (sec)  MOPS  bundles    operations  I$ misses  D$ misses
ME2DYA ISS                11              2.99  12323550   32934110    25160      791
HP ISS v 2.22             12.8            2.57  12323977   32934196    25229      3529
DIFFERENCE                                      427        86          69         2738

Speech Rec. (5 words, 8 Mels, 50 active words), with caches
                          CPU time (sec)  MOPS  bundles    operations  I$ misses  D$ misses
ME2DYA ISS                305             2.86  386584108  872624864   349878     1372934
HP ISS v 2.22             360             2.42  386585313  872624303   354724     2845644
DIFFERENCE                                      1205       -561        4846       1472710

Simulation Platform: Pentium IV 2.00 GHz, 256 MB RAM, Windows 2000 SP3
MOPS = Millions Of Operations Per Second
Part 3: VLIW-SIM Library
VLIW-SIM Library (1-2).
VLIW-SIM Library 1/2
VLIW-SIM can be built either as a stand-alone application or as a dynamic library (DLL).
• The library form is useful for interfacing VLIW-SIM with other applications (system-on-chip simulation environments, graphical user interfaces, etc.).
The functionalities exported by the simulator fall into two groups:
• Command functionalities: used to control the simulation (Run, Stop, Insert/Remove breakpoint, Continue, Step, etc.)
• Status functionalities: used to retrieve the simulator's internal status and resource allocation (pipeline status and size, register file content and size, etc.)
VLIW-SIM Library 2/2
The simulator DLL exports the following functionalities:
Control Functions
• Load
• Init
• Step / Step N / Stall
• Run
• Restart
Debug Support
• View simulator status (Pipeline, Register File, Memory)
• Breakpoint
Utility functions
• Code profiling
• Simulated Program Arguments
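To show how a client such as a GUI or an SoC simulation environment might drive these functions, here is a toy stand-in for the library. It is not the VLIW-SIM API: the class `SimulatorLibrary`, its method names and the use of a bundle index as the program counter are all assumptions. Only the grouping into control, debug and status functions follows the slides.

```python
class SimulatorLibrary:
    """Hypothetical stand-in for the exported control and debug functions."""

    def __init__(self):
        self.program = []
        self.pc = 0                 # index of the next bundle to execute
        self.breakpoints = set()

    # Control functions
    def load(self, program):
        self.program = list(program)

    def init(self):
        self.pc = 0

    def restart(self):
        self.init()

    def step(self, n=1):
        """Execute up to n bundles, stopping at the end of the program."""
        for _ in range(n):
            if self.pc >= len(self.program):
                break
            self.pc += 1

    def run(self):
        """Run until the end of the program or until a breakpoint is reached."""
        while self.pc < len(self.program):
            self.pc += 1
            if self.pc in self.breakpoints:
                break

    # Debug support
    def insert_breakpoint(self, address):
        self.breakpoints.add(address)

    def remove_breakpoint(self, address):
        self.breakpoints.discard(address)

    def status(self):
        return {"pc": self.pc, "finished": self.pc >= len(self.program)}


sim = SimulatorLibrary()
sim.load(["bundle0", "bundle1", "bundle2", "bundle3"])
sim.init()
sim.insert_breakpoint(2)
sim.run()                       # stops at the breakpoint
stopped_at = sim.status()["pc"]
sim.step()                      # single step past the breakpoint
sim.run()                       # runs to the end of the program
final = sim.status()
```

Separating the command calls from the status query mirrors the two groups of exported functionalities described for the library.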
Part 4: Next Steps
Where we are.
VLIW-SIM Developments.
Where we are
Released versions 2.0 and 3.0 of VLIW-SIM.
Substantial software engineering work to improve:
• Modularity
• Readability (doxygen-generated documentation)
• Simulation speed
• Architectural accuracy:
  • ST210: IPU, DPU, Interrupt Controller, Core Memory Controller, I-cache, D-cache
  • TI C6x: I-cache and D-cache for CPU style, program memory and data memory for DSP style
• Accurate and non-invasive flat profiling (GNU format compatible)
• Flexible architectural reconfigurability
• Host platform independence
• Future integration into high-level system tools
VLIW-SIM developments
• ST220 accurate modelling
• Integration inside MaxSim system simulation tools and related experiments
End
Questions?