The apeNEXT software environment: development and execution of applications


Alessandro Lonardo - 18/12/06

Alessandro Lonardo
I.N.F.N Roma - APE group*

[email protected]

*http://apegate.roma1.infn.it/APE


Index

• Machine Architecture

• Software Areas

• Programming Model

• Languages

• Example Applications

• Development Tools

• Execution Environment

apeNEXT Architecture: The Network

• 3D mesh of computing nodes
• Vertices are processors
• Each processor hosts its local memory
• Each processor supports 64-bit complex, vector2, double and integer types
• Edges are 3D torus network channels
• 6 bidirectional channels per processor
• Basic communication primitive is first-neighbour send-recv
• Processors synchronize on communications (send starts when recv is issued)
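To make the torus topology concrete, here is a minimal C sketch of how the six first neighbours of a node could be computed with wrap-around. The type `node_t`, the helper names and the mesh extents passed as arguments are illustrative assumptions, not part of the apeNEXT software.

```c
#include <assert.h>

/* Hypothetical node coordinate on an nx x ny x nz torus. */
typedef struct { int x, y, z; } node_t;

/* Wrap a coordinate into [0, n): stepping off one face re-enters
   on the opposite face, which is what makes the mesh a torus. */
static int wrap(int c, int n) { return ((c % n) + n) % n; }

/* First neighbour of p along axis (0 = x, 1 = y, 2 = z)
   in direction dir (+1 or -1). */
static node_t torus_neighbour(node_t p, int axis, int dir,
                              int nx, int ny, int nz)
{
    assert(dir == 1 || dir == -1);
    assert(axis >= 0 && axis <= 2);
    if (axis == 0)      p.x = wrap(p.x + dir, nx);
    else if (axis == 1) p.y = wrap(p.y + dir, ny);
    else                p.z = wrap(p.z + dir, nz);
    return p;
}
```

Calling `torus_neighbour` for the three axes and both directions yields the six channel endpoints of a node; on a 4x2x2 mesh the +x neighbour of node (3, 0, 1) is node (0, 0, 1).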

apeNEXT Architecture: The J&T Processor

[Block diagram of the J&T processor. Labels: Register File (256 x 128 bit), FILU, AGU, Net RX Queues (128 x 128 bit), Net TX Queues (128 x 128 bit), Local Queue (1K x 128 bit), VLIW instructions, memory buffer (8K x 128 bit), VLIW microcode, external DDR memory. Figures shown: 3.2 GB/s, 1.6 GFlops.]

apeNEXT Architecture: Very Long Instruction Word

[Diagram of the very long instruction word. Labels: S1, D1, D2, D3; 1, 2, 3; Control Word; Decoder & Scheduler.]

apeNEXT Architecture: The J&T FILU

FILU is the floating-point, integer and logical unit:

• MAC operation: A*B+C
• fully pipelined (1 result per cycle)
• ~12 cycles latency
• synthesizes to 200 MHz
• 4 multipliers
• 4 adders
• 1.6 GFlops on complex MAC
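The 1.6 GFlops figure follows from the operation count of a complex multiply-accumulate. The sketch below writes a complex A*B+C in real arithmetic (4 multiplications, 4 additions) and multiplies the resulting 8 flops per cycle by the clock rate. The names `cplx_t`, `cmac` and `peak_flops` are illustrative, and the calculation assumes one complex MAC completes every cycle, as the "fully pipelined" bullet suggests.

```c
#include <assert.h>

typedef struct { double re, im; } cplx_t;

/* Complex multiply-accumulate a*b + c written in real operations:
   4 multiplications and 4 additions/subtractions, i.e. 8 flops. */
static cplx_t cmac(cplx_t a, cplx_t b, cplx_t c)
{
    cplx_t r;
    r.re = c.re + a.re * b.re - a.im * b.im;
    r.im = c.im + a.re * b.im + a.im * b.re;
    return r;
}

/* Peak rate in flop/s if one complex MAC completes every cycle. */
static double peak_flops(double clock_hz)
{
    const double flops_per_cmac = 4.0 + 4.0;  /* 4 mul + 4 add */
    return flops_per_cmac * clock_hz;
}
```

With a 200 MHz clock, `peak_flops(200e6)` gives 1.6e9, matching the number quoted for the FILU.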

apeNEXT Software Areas

• Architecture design, development and validation: simulators, non-regression tools, …
• Application development: compilation chain, libraries, profiler, …
• Execution environment: operating system, batch system, …
• Applications.
• System administration.

apeNEXT SW development team

• ~5 people on average
• People at all the collaboration sites
  – INFN Roma & Ferrara
  – DESY Zeuthen
  – Univ. Bielefeld
  – INRIA (France)

apeNEXT Programming Model

• Single Program Multiple Data: each node executes the same program, but on its own data.
  – synchronization barriers at global condition evaluations, with an explicit statement, or at I/O operations;
  – node-to-node synchronization at remote communications.
• Nodes are connected by a 3D network; each node can efficiently transfer data with its first, second and third neighbours.

=> well suited for homogeneous problems with short-range interactions.

apeNEXT Programming Model: Data Decomposition (1)

• The application's discretized D-dimensional lattice domain is decomposed onto a 3-dimensional processor mesh (D may differ from 3) => each node holds a subset of the lattice sites in its own memory.
• In other cases no decomposition is done: the simulation is replicated in parallel, without communications, just to gather better statistics (FARM).
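A minimal sketch of the block decomposition along one axis, assuming the global lattice extent divides evenly by the mesh extent (an assumption of this example, not a stated apeNEXT requirement). The function names are hypothetical. Applying it to each decomposed axis gives the owner node and the local position of any global site.

```c
#include <assert.h>

/* Sites per node along one axis; assumes the global extent
   divides evenly by the number of nodes along that axis. */
static int local_extent(int global_extent, int mesh_extent)
{
    assert(mesh_extent > 0 && global_extent % mesh_extent == 0);
    return global_extent / mesh_extent;
}

/* Mesh coordinate of the node owning global coordinate g. */
static int owner_of(int g, int global_extent, int mesh_extent)
{
    return g / local_extent(global_extent, mesh_extent);
}

/* Coordinate of the same site inside its owner's sub-lattice. */
static int local_coord(int g, int global_extent, int mesh_extent)
{
    return g % local_extent(global_extent, mesh_extent);
}
```

For a lattice of extent 16 spread over 4 nodes, each node holds 4 sites per axis, and global coordinate 9 lives on node 2 at local coordinate 1. Lattice dimensions beyond the three that are mapped to the mesh would simply stay local to each node.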

apeNEXT Programming Model: Data Decomposition (2)

For each lattice site, and on each node in parallel, the program performs an “evolutionary step”.
Short-range interactions => first-neighbour inter-node communication.

[Figure: 2x2 grid of nodes labelled 00, 01, 10, 11, with x and y axes.]

apeNEXT Programming Model: Programming Languages (1)

TAO: a dedicated parallel language
• Fortran-like base syntax.
• Dynamic language: the (experienced) programmer can freely extend the syntax with new statements, data types and operators => libraries configure the language for specific application domains (LQCD, Spin Glass, …)
• Allows writing highly efficient code by exposing the features of the hardware architecture (registers, prefetch queues, cache)

apeNEXT Programming Model: Programming Languages (2)

C99 language
• Few extensions to the standard language.
• Eases the porting of applications and standard libraries.
• Allows writing highly efficient code by exposing the features of the hardware architecture (registers, prefetch queues, cache)

apeNEXT Programming Model: Parallel Language Constructs (1)

Few parallel language constructs (the same in C99 and TAO):

• Conditional execution on a subset of nodes, based on node-local conditions (where)
• Boolean operators that promote local conditions to global ones, for use in flow control statements (any, all, none).
• Communications between nodes in the 3D mesh, expressed as variable assignments; directions are specified by means of magic constants in the source address (X_PLUS, X_MINUS, …, Z_MINUS).
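The promotion of local conditions to a global one can be pictured as a reduction over all nodes. The following is a serial C emulation, not apeNEXT code: each array entry stands for the condition as evaluated on one node, and the `mesh_*` names are invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* local[n] is the condition as evaluated by node n. */

/* True if at least one node's local condition holds. */
static bool mesh_any(const bool *local, size_t nodes)
{
    for (size_t n = 0; n < nodes; ++n)
        if (local[n]) return true;
    return false;
}

/* True if every node's local condition holds. */
static bool mesh_all(const bool *local, size_t nodes)
{
    for (size_t n = 0; n < nodes; ++n)
        if (!local[n]) return false;
    return true;
}

/* True if no node's local condition holds. */
static bool mesh_none(const bool *local, size_t nodes)
{
    return !mesh_any(local, nodes);
}
```

Because every node receives the same global result, all nodes take the same branch of a flow control statement, which keeps the SPMD program in step.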

apeNEXT Programming Model: Parallel Language Constructs (2)

where statement - conditional execution on a mesh subset.

    where (x>=y)
      max_xy=x
      min_xy=y
    elsewhere
      max_xy=y
      min_xy=x
    endwhere
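The effect of the where/elsewhere example can be emulated serially in C by looping over the nodes, each holding its own x and y. This is only a model of the semantics, with a hypothetical function name; on the real machine there is no loop, since every node evaluates the condition on its own data at the same time.

```c
#include <assert.h>
#include <stddef.h>

/* Serial emulation: node n holds x[n] and y[n]; each node evaluates
   the condition locally and executes only the matching branch. */
static void where_max_min(const double *x, const double *y,
                          double *max_xy, double *min_xy, size_t nodes)
{
    for (size_t n = 0; n < nodes; ++n) {
        if (x[n] >= y[n]) {      /* nodes in the "where" subset */
            max_xy[n] = x[n];
            min_xy[n] = y[n];
        } else {                 /* nodes in the "elsewhere" subset */
            max_xy[n] = y[n];
            min_xy[n] = x[n];
        }
    }
}
```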

apeNEXT Programming Model: Parallel Language Constructs (3)

Inter-node communications:

    integer i
    real u[1024]
    register real rd1, rd2, rd3, rloc
    ...
    rd1  = u[i+Z_PLUS]
    rd2  = u[i+Y_PLUS+Z_PLUS]
    rloc = u[i+X_PLUS+X_MINUS]
    rd3  = u[i+X_PLUS+Y_MINUS+Z_PLUS]

The first assignment loads u[i] from node [x, y, z+1] into rd1.
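One way to read the summed direction constants is as a net node displacement, where opposite hops cancel. The sketch below models that reading with a small struct; it is purely hypothetical and says nothing about how X_PLUS and the other magic constants are actually encoded in apeNEXT addresses. The names `dir_t`, `DIR_*`, `dir_add` and `dir_is_local` are invented for illustration.

```c
#include <assert.h>

/* Hypothetical model: a direction is a displacement in node units. */
typedef struct { int dx, dy, dz; } dir_t;

static const dir_t DIR_X_PLUS  = { 1,  0,  0}, DIR_X_MINUS = {-1,  0,  0};
static const dir_t DIR_Y_PLUS  = { 0,  1,  0}, DIR_Y_MINUS = { 0, -1,  0};
static const dir_t DIR_Z_PLUS  = { 0,  0,  1}, DIR_Z_MINUS = { 0,  0, -1};

/* Summing directions composes the hops, so opposite hops cancel. */
static dir_t dir_add(dir_t a, dir_t b)
{
    dir_t r = { a.dx + b.dx, a.dy + b.dy, a.dz + b.dz };
    return r;
}

/* An access is local when the net displacement is zero. */
static int dir_is_local(dir_t d)
{
    return d.dx == 0 && d.dy == 0 && d.dz == 0;
}
```

Under this model, X_PLUS+X_MINUS has zero net displacement (consistent with the variable name rloc in the listing), while X_PLUS+Y_MINUS+Z_PLUS reaches the node at [x+1, y-1, z+1], a third neighbour.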

apeNEXT Programming Model: Parallel Language Constructs (4)

any()/all()/none() boolean operators

    !! evaluation of mesh size along X
    !! with systolic algorithm
    sum_ix=1
    sum_r[0]=node_abs_x
    sum_r[0]=sum_r[X_PLUS]   !! internode communication

    while(any(sum_r[0]!=node_abs_x))
      sum_ix=sum_ix+1
      sum_r[0]=sum_r[X_PLUS]
    endwhile
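The systolic loop can be checked with a serial C emulation of a ring of n nodes along X. Each node starts with its own x coordinate, repeatedly reads the value held by its +X neighbour, and counts the shifts until every node sees its own coordinate again. The function name, `MAX_NODES` and the array-based ring are assumptions of this sketch.

```c
#include <assert.h>

#define MAX_NODES 64

/* Serial emulation of the systolic loop on a ring of n nodes along X.
   Returns the value of sum_ix that every node ends up with. */
static int systolic_mesh_size_x(int n)
{
    int sum_r[MAX_NODES], next[MAX_NODES];
    int sum_ix = 1, any_differs;

    assert(n >= 1 && n <= MAX_NODES);
    for (int x = 0; x < n; ++x) sum_r[x] = x;   /* sum_r[0] = node_abs_x */

    do {
        /* sum_r[0] = sum_r[X_PLUS]: every node reads its +X neighbour. */
        for (int x = 0; x < n; ++x) next[x] = sum_r[(x + 1) % n];
        for (int x = 0; x < n; ++x) sum_r[x] = next[x];

        /* any(sum_r[0] != node_abs_x): global OR of local conditions. */
        any_differs = 0;
        for (int x = 0; x < n; ++x) any_differs |= (sum_r[x] != x);

        if (any_differs) sum_ix = sum_ix + 1;
    } while (any_differs);

    return sum_ix;
}
```

After k shifts node x holds the coordinate of node (x + k) mod n, so the values return home exactly when k equals n, and the counter then equals the mesh size along X.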

example 2D application kernel: C function

    T datain[LVOL], dataout[LVOL];        // LVOL is node local volume
    // precalculate neighbourhood tables
    int neighp[LVOL][2], neighm[LVOL][2];
    ...
    void kernel_fun() {
        register T res, d, dp0, dp1, dm0, dm1;
        int i;
        for (i=0; i<LVOL; ++i) {           // i is a linearized index
            d   = datain[i];               // always local access
            dp0 = datain[neighp[i][0]];    // local or remote access
            dp1 = datain[neighp[i][1]];
            dm0 = datain[neighm[i][0]];
            dm1 = datain[neighm[i][1]];
            res = calc(d,dp0,dp1,dm0,dm1); // big & inline
            dataout[i] = res;
        }
    }
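The kernel relies on precalculated neighbourhood tables that the slide does not show. A minimal sketch of how such tables could be filled is given below for the simplest case: a single node with periodic wrap-around, so that every entry is a local index. The extents `LX` and `LY`, the linearization i = x + LX*y and the function name are assumptions of this example; on a multi-node partition the boundary entries would also have to carry a remote direction, which is left out here.

```c
#include <assert.h>

#define LX 4                 /* hypothetical local extent along x */
#define LY 4                 /* hypothetical local extent along y */
#define LVOL (LX * LY)

static int neighp[LVOL][2], neighm[LVOL][2];

/* Fill the tables for an LX x LY local lattice with periodic
   wrap-around. Site (x, y) has linearized index i = x + LX * y. */
static void build_neighbour_tables(void)
{
    for (int y = 0; y < LY; ++y) {
        for (int x = 0; x < LX; ++x) {
            int i = x + LX * y;
            neighp[i][0] = (x + 1) % LX + LX * y;          /* +x */
            neighm[i][0] = (x + LX - 1) % LX + LX * y;     /* -x */
            neighp[i][1] = x + LX * ((y + 1) % LY);        /* +y */
            neighm[i][1] = x + LX * ((y + LY - 1) % LY);   /* -y */
        }
    }
}
```

Building the tables once keeps index arithmetic out of the inner loop, so the kernel body reduces to loads, the inlined calculation and one store.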

example 2D application kernel: domain decomposition

[Figure: domain decomposition of the datain[] array over the node mesh (x and y axes), showing the datain[LVOL] local domain of node xy and the boundary of the local domain.]

example 2D application kernel: First Neighbour Systolic Communication

[Figure: 2x2 grid of nodes 00, 10, 01, 11 (x and y axes) illustrating the access dm0 = datain[neighp[i][0]]. Labels: "neighp[i][0]: local displacement + X_MINUS" and "neighp[i][0]: local displacement".]

Example: Monte Carlo Pi Calculation

• Estimate Pi by throwing darts at a unit square
• Calculate the percentage that fall in the unit circle
  – Area of square = r² = 1
  – Area of circle quadrant = ¼ * π * r² = π/4
• Randomly throw darts at x,y positions
• If x² + y² < 1, then the point is inside the circle
• Compute ratio:
  – # points inside / # points total
  – π = 4*ratio
• Replicate the calculation on N nodes in parallel to have better statistics

[Figure: circle quadrant of radius r = 1 inside the unit square.]

Example: Monte Carlo Pi Calculation, C+OpenMP Code

    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>
    #include <omp.h>

    #define FIRST_SEED 3374

    /* Throw one dart; return 1 if it lands inside the circle quadrant. */
    static inline int hit(void)
    {
        double x = (double) rand() / (double) RAND_MAX;
        double y = (double) rand() / (double) RAND_MAX;
        if ((x*x + y*y) <= 1.0) return(1); else return(0);
    }

    int main(int argc, char **argv)
    {
        int i, hits = 0, trials = 0;
        int seeds_index = 0;
        const int max_threads = omp_get_max_threads();
        unsigned int seeds[max_threads];
        double pi;

        printf("MAX_THREADS = %d\n", max_threads);

        if (argc != 2)
            trials = 1000000;
        else
            trials = atoi(argv[1]);

        srand(FIRST_SEED);
        for (i=0; i<max_threads; i++) {   /* decorrelate the seeds */
            seeds[i] = rand();
            printf("seed%d=%u\n", i, seeds[i]);
        }

    #pragma omp parallel private(i, seeds_index) shared(seeds, hits, trials)
        {
            seeds_index = omp_get_thread_num();
            /* Note: rand()/srand() keep hidden state that many C libraries
               share between threads, so this call does not guarantee
               independent per-thread random streams. */
            srand(seeds[seeds_index]);

    #pragma omp for reduction(+:hits)
            for (i=0; i < trials; i++)
                hits += hit();
        }

        pi = 4.0*(double)hits/(double)trials;
        printf("PI estimated to %.10g\n", pi);
        return 0;
    }

Example: Monte Carlo Pi Calculation, apeNEXT C Code

    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>
    #include <sysvars.h>
    #include <topology.h>

    #define FIRST_SEED 3374   /* same value as in the C+OpenMP version */

    inline int hit()
    {
        double x = (double) rand() / (double) RAND_MAX;
        double y = (double) rand() / (double) RAND_MAX;
        if ((x*x + y*y) <= 1.0) return(1); else return(0);
    }

    int main(int argc, char **argv)
    {
        int i, hits = 0, trials = 0;
        int seeds_index = 0;
        /* product of the machine sizes along x, y and z */
        const int max_threads = *_mem_imachine_size_x_p *
                                *_mem_imachine_size_y_p *
                                *_mem_imachine_size_z_p;
        const int node_index = *_mem_inode_abs_id_p;
        unsigned int seeds[max_threads];
        double pi;

        printf("MAX_THREADS = %d\n", max_threads);

        if (argc != 2)
            trials = 1000000;
        else
            trials = atoi(argv[1]);

        srand(FIRST_SEED);
        for (i=0; i<max_threads; i++) {
            seeds[i] = rand();
            printf("seed%d=%d\n", i, seeds[i]);
        }

        /* each node draws from its own seed */
        srand(seeds[node_index]);
        for (i=0; i < trials; i++)
            hits += hit();

        hits = global_sum(hits);
        trials *= max_threads;
        pi = 4.0*(double)hits/(double)trials;
        printf("PI estimated to %.10g\n", pi);
        return 0;
    }

Example: Monte Carlo Pi Calculation, Results – Intel P4 Dual Core

    lonardo@marlin>env OMP_NUM_THREADS=16 ./monte_pi-gcc.o
    MAX_THREADS = 16
    seed0=1396293760
    seed1=1488115307
    seed2=1303873515
    seed3=37393359
    seed4=824846176
    seed5=1138759395
    seed6=1184683763
    seed7=1884735975
    seed8=443160774
    seed9=326610858
    seed10=878347714
    seed11=501308535
    seed12=1066424433
    seed13=1420631951
    seed14=391631339
    seed15=1730610200
    PI estimated to 3.14108
    lonardo@marlin>

Example: Monte Carlo Pi Calculation, Results - apeNEXT Board (16 Nodes)

    lonardo@ant>nrun -hib -board 033 -minit0 monte-api.mem
    MAX_THREADS = 16
    seed0=6556077425992558173
    seed1=4923530068770806084
    seed2=4637196908100545377
    seed3=6221712952809700854
    seed4=279065984179923185
    seed5=7751953660738243840
    seed6=7614450982016732205
    seed7=1120288809807653798
    seed8=4640801604175907269
    seed9=4885633457180056444
    seed10=905770433927994553
    seed11=1598073754810041858
    seed12=7232028785291230425
    seed13=6726612558212505416
    seed14=3567338195430110971
    seed15=5194800804163472670
    PI estimated to 3.13989775
    lonardo@ant>

apeNEXT Compilation Chain

rtc TAO compiler
• Retargetable TAO Compiler: produces an intermediate pseudo-assembly file, which is further translated into assembly for APEmille or apeNEXT.
• Based on the Zz dynamic parser.
• Relies on a separate module for assembly code optimizations.
• Stable, production-quality compiler

apeNEXT Compilation Chain

nlcc C compiler
• Port of the lcc 4.2 compiler to the apeNEXT architecture.
• Few optimizations.
• C99 + apeNEXT syntax extensions
• Low bug-report rate.

apeNEXT Compilation Chain

ngcc C compiler
• Port of the GNU C compiler (GCC) to the apeNEXT architecture
• Based on gcc version 4.1
• Optimization passes performed on the compiler’s internal representation of code (tree-SSA, RTL)
• Source language: C99 and GNU extensions to C99, plus apeNEXT extensions for parallel programming
• Possibility to integrate front ends for other source languages (C++, Fortran, TAO)
• Target language: apeNEXT user-level assembly (SASM)

apeNEXT Compilation Chain

ngcc status:
• Single node C compiler: DONE
• Vector data types and arithmetics: ALMOST DONE
• Exploitation of native complex types and arithmetics: TO DO
• Remote memory accesses implementation: DONE
• Prefetch instructions: ALMOST DONE
• Cache handling: TO DO
• Where(), any(), all(), none() constructs: TO DO
• libc adaptation: JUST STARTED
• Work in progress

apeNEXT Compilation Chain

mpp macro-assembler

• translates a “user-friendly” assembly into a micro-assembly representation

• macro expansion.

• label analysis.

• emission of masm-instructions for cache handling.

apeNEXT Compilation Chain

sofan micro-assembly optimizer

• based on the salto (INRIA) optimization toolkit

• Transforms the micro-assembly code in order to perform a series of optimizations, such as:

– mul-add fusion

– Dead code removal

– Copy propagation

– Address generation optimization

– Instruction pre-scheduling

apeNEXT Compilation Chain

shaker microcode scheduler

• generation of optimized microcode to exploit the Pipelined Very Long Instruction Word Processor Architecture

– scheduling

– Register renaming

– Register allocation

– Microcode compression

– Optional generation of executable for the functional simulator

apeNEXT Compilation Chain: shaker microcode scheduler

• generation of microcode patterns, t_exec = t_max

• “shake up” phase: try to schedule each pattern as early as possible, respecting:

– dependencies between instructions

– device occupation at each cycle

t_exec = t_su

[Figure: device-occupation chart (DEVICES versus CYCLES) of the numbered patterns before and after the “shake up” phase; the schedule length shrinks from t_max to t_su.]
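The idea of placing each pattern at the earliest cycle allowed by its dependencies and by device occupation can be illustrated with a greedy scheduler. This is a simplified sketch under stated assumptions, not the shaker algorithm: each instruction occupies one device for a single issue cycle, has at most two producers, and instructions are listed with producers before consumers. The type `insn_t`, the constants and the function name are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CYCLES 256
#define N_DEVICES  4

/* Hypothetical instruction record: one device used for one issue
   cycle, at most two producers, result ready `latency` cycles
   after issue. */
typedef struct {
    int device;
    int dep[2];     /* indices of producer instructions, -1 if unused */
    int latency;
} insn_t;

/* Greedy as-soon-as-possible placement. Writes the issue cycle of
   each instruction to cycle[] and returns the schedule length. */
static int schedule_asap(const insn_t *code, int n, int *cycle)
{
    static bool busy[MAX_CYCLES][N_DEVICES];
    int length = 0;

    for (int t = 0; t < MAX_CYCLES; ++t)
        for (int d = 0; d < N_DEVICES; ++d) busy[t][d] = false;

    for (int i = 0; i < n; ++i) {
        int t = 0;
        for (int k = 0; k < 2; ++k) {            /* data dependencies */
            int p = code[i].dep[k];
            assert(p < i);                       /* producers come first */
            if (p >= 0 && cycle[p] + code[p].latency > t)
                t = cycle[p] + code[p].latency;
        }
        while (t < MAX_CYCLES && busy[t][code[i].device])
            ++t;                                 /* device occupation */
        assert(t < MAX_CYCLES);
        busy[t][code[i].device] = true;
        cycle[i] = t;
        if (t + 1 > length) length = t + 1;
    }
    return length;
}
```

An independent instruction on a free device is issued at cycle 0 even if it appears late in the program order, which is the kind of overlap that shortens the schedule on a VLIW machine.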

apeNEXT Compilation Chain: shaker microcode scheduler

• “shake down” phase: try to schedule each pattern as late as possible, respecting:

– dependencies between instructions

– device occupation at each cycle

t_exec = t_su - t_sd

• Typically t_max / t_exec ~ 10 in compute-intensive code sections

[Figure: device-occupation chart (DEVICES versus CYCLES) of the numbered patterns before and after the “shake down” phase, marking t_su and t_sd.]

apeNEXT Compilation Chain

sf functional simulator
• Micro-assembly instruction-level simulator.
• Support for single-node and multi-node simulations (1x1x1, 2x2x2, 4x2x2).
• Fast simulation (multithreaded)
• Not cycle accurate.
• Bit-exact arithmetic (microcode scheduling may give differences).

apeNEXT Execution Environment: OS distributed architecture (1)

I2C: bootstrap, exception handling, debugging (1.5 MB/s)

7thLink:
• Program loading
• I/O operations
1 channel per unit
200 MB/s per channel

apeNEXT Execution Environment: OS distributed architecture (2)

• Master
  – resides on the front-end Linux PC
  – user interface (shell commands)
  – partitioning
  – dispatches I/O requests to the slaves
• Slave
  – resides on the blade PCs
  – handles communication with apeNEXT on the I2C and 7thLink PCI boards
• Tiny kernel of routines embedded in the apeNEXT program
  – loader
  – I/O (routing of data to and from the interface node)
  – system services (time counters, etc.)

apeNEXT Execution Environment

• Programs can be loaded and executed on a machine partition:
  – node (1x1x1)
  – board = 16 nodes (4x2x2)
  – unit = 4 boards (4x2x8)
  – crate = 4 units (4x8x8)
  – rack = 2 crates (8x8x8)
• The partition is reserved until the program execution finishes (no multitasking!)
  – Single process
  – No virtual memory
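The partition shapes translate directly into node counts. The small table below encodes the shapes listed on the slide and multiplies them out; the peak figure is an assumption of this sketch, obtained by scaling the 1.6 GFlops quoted for a single processor, and the type and function names are illustrative.

```c
#include <assert.h>

typedef struct { const char *name; int nx, ny, nz; } partition_t;

/* Partition shapes as listed on the slide. */
static const partition_t PARTITIONS[] = {
    {"node",  1, 1, 1},
    {"board", 4, 2, 2},
    {"unit",  4, 2, 8},
    {"crate", 4, 8, 8},
    {"rack",  8, 8, 8},
};

/* Number of nodes in a partition. */
static int partition_nodes(const partition_t *p)
{
    return p->nx * p->ny * p->nz;
}

/* Peak rate in GFlops, assuming 1.6 GFlops per node. */
static double partition_peak_gflops(const partition_t *p)
{
    return 1.6 * partition_nodes(p);
}
```

This gives 16, 64, 256 and 512 nodes for board, unit, crate and rack, consistent with "4 boards per unit", "4 units per crate" and "2 crates per rack".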

Batch system

• Torque/OpenPBS

• Currently FIFO scheduling; a user-group quota-based scheduler is being implemented.

• Queues:
  – rack
  – crate
  – unit

Batch System: Job Submission

• nsub: wrapper of the qsub command

    Usage: nsub [OPTIONS] script

    Submits a apeNEXT job

    where OPTIONS are:
      -a date_time    Declares the time after which the job is eligible for execution
                      the format is: [[[[CC]YY]MM]DD]hhmm[.SS]
      -c conf         chooses among available apeNEXT configurations
                      conf=board|unit|unit[01][0-3]|crate|crate[01]|rack (default=crate)
      -m host_name    requests a particular host
      -g group_name   overrides user group
      -o logfile      overrides logfile name
      -V              dumps version information
      -v              be verbose
      -h              shows this help

Batch System: Job submission example

    lonardo@theboss>nsub -c crate 7h_test.sh
    15942.theboss.ape
    lonardo@theboss>qstat -an1

    theboss.ape:
                                                                Req'd  Req'd   Elap
    Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
    -------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
    15813.theboss.ape    orifici  crate    stoc2        1826     1  --     -- 24:00 R 21:52   rack10/1
    15860.theboss.ape    simula   crate    run_tdilu   29170     1  --     -- 24:00 R 14:34   rack4/1
    15877.theboss.ape    zeidlew  crate    mu.056.cjo   5925     1  --     -- 24:00 R 11:38   rack8/0
    15880.theboss.ape    delia    crate    run.sh       6386     1  --     -- 24:00 R 10:59   rack8/1
    15896.theboss.ape    delia    unit     rum0175.sh  32291     1  --     -- 24:00 R 08:47   rack7/5
    15900.theboss.ape    frezzott rack     RUN_Rack5.  18099     1  --     -- 24:00 R 07:56   rack5/0
    15906.theboss.ape    delia    unit     run0175.sh   1072     1  --     -- 24:00 R 06:49   rack7/0
    15918.theboss.ape    frezzott rack     RUN_Rack2.  15890     1  --     -- 24:00 R 04:30   rack2/0
    15926.theboss.ape    simula   crate    run1_tdilu   4409     1  --     -- 24:00 R 02:47   rack4/0
    15927.theboss.ape    delia    unit     run0200.sh   3596     1  --     -- 24:00 R 02:47   rack7/4
    15928.theboss.ape    lacagnin crate    run.5.7.sh   4772     1  --     -- 24:00 R 02:36   rack1/0
    15930.theboss.ape    lacagnin crate    run.5.6.sh   2787     1  --     -- 24:00 R 02:34   rack9/0
    15932.theboss.ape    cosmai   crate    b5.450_n0.   2994     1  --     -- 24:00 R 02:02   rack9/1
    15933.theboss.ape    delia    unit     rum0200.sh   4216     1  --     -- 24:00 R 01:53   rack7/2
    15934.theboss.ape    delia    unit     run0225.sh   4552     1  --     -- 24:00 R 01:24   rack7/1
    15935.theboss.ape    devitiis crate    theboss.sh  28504     1  --     -- 24:00 R 01:22   rack3/0
    15936.theboss.ape    orifici  crate    stoc        28592     1  --     -- 24:00 R 00:57   rack3/1
    15937.theboss.ape    devitiis crate    theboss.sh  18065     1  --     -- 24:00 R 00:57   rack6/0
    15939.theboss.ape    cosmai   crate    b5.450_n1.   5845     1  --     -- 24:00 R 00:54   rack1/1
    15940.theboss.ape    devitiis crate    theboss.sh  18472     1  --     -- 24:00 R 00:22   rack6/1
    15941.theboss.ape    delia    unit     rum0225.sh   5290     1  --     -- 24:00 R 00:14   rack7/3
    15942.theboss.ape    lonardo  crate    7h_test.sh  11226     1  --     -- 24:00 R    --   rack10/0