Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP

Prof. Pier Luca Lanzi

Description

Slide del corso di Algoritmi e Calcolo Parallelo per il corso di laurea magistrale in Ingegneria Matematica 2012/2013 - Politecnico di Milano

Transcript of Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP

Page 1: Algoritmi e Calcolo Parallelo 2012/2013 - OpenMP

Prof. Pier Luca Lanzi


Page 2: References

• Using OpenMP: Portable Shared Memory Parallel Programming, by Barbara Chapman, Gabriele Jost, and Ruud van der Pas

• OpenMP.org: http://openmp.org/

• OpenMP Tutorial: https://computing.llnl.gov/tutorials/openMP/

Page 3: Prologue

Page 4: Threads for Dummies (1)

• A thread is a sequence of instructions to be executed within a program

• Usually a program consists of a single thread of execution that starts in main(). Each line of your code is executed in turn, exactly one line at a time.

• A (UNIX) process consists of: a virtual address space (i.e., stack, data, and code segments), system resources (e.g., open files), and a thread of execution

Page 5: Threads for Dummies (2)

• In a threaded environment, one process consists of multiple threads

• Each user process can have several threads of execution; different parts of the user’s code can execute simultaneously

• The collection of threads forms a single process!

• Each thread in the process sees the same virtual address space, files, etc.

Page 6

You can manage the multiple threads in your process yourself (but it is quite complex!)

Or you can let OpenMP do it almost automatically!

Page 7: The Parallelization Model of OpenMP

Page 8: Fork-Join Parallelization
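As a rough illustration of the fork-join model (a sketch, not taken from the course material): the master thread runs serially, forks a team of threads at a parallel region, and the team joins back into a single thread at the end of the region.

#include <omp.h>
#include <cstdio>

int main()
{
    printf("before: only the master thread runs\n");

    #pragma omp parallel                 // fork: a team of threads starts here
    {
        printf("inside: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                    // join: implicit barrier, the team ends

    printf("after: only the master thread runs\n");
    return 0;
}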

Page 9: Introduction

Page 10: OpenMP = Open Multi-Processing

API for multi-platform shared-memory multiprocessing programming

Supports C/C++ and Fortran; available on Unix and on MS Windows

Compiler directives, library routines, and environment variables

Page 11: History of OpenMP

• The OpenMP Architecture Review Board (ARB) published its first API specification, OpenMP for Fortran 1.0, in October 1997. In October of the following year, it released the C/C++ standard.

• 2000 saw version 2.0 of the Fortran specifications, with version 2.0 of the C/C++ specifications being released in 2002.

• Version 2.5 is a combined C/C++/Fortran specification that was released in 2005.

• Version 3.0, released in May 2008, is the current version of the API specifications. Among the new features in 3.0 is the concept of tasks and the task construct. These new features are summarized in Appendix F of the OpenMP 3.0 specifications.

Page 12: What are the Goals of OpenMP?

• Standardization: provide a standard among a variety of shared-memory architectures

• Lean and Mean: establish a simple and limited set of directives for programming shared-memory machines (significant parallelism can be implemented by using just 3 or 4 directives)

• Ease of Use: provide the capability to incrementally parallelize a serial program (unlike message-passing libraries) and to implement both coarse-grain and fine-grain parallelism

• Portability: supports Fortran (77, 90, and 95), C, and C++

Page 13: How Does It Work?

• OpenMP is an implementation of multithreading

• A master thread "forks" a specified number of slave threads

• Tasks are divided among slaves

• The slaves run concurrently, with the runtime environment allocating threads to different processors

Page 14

Most OpenMP constructs are compiler directives or pragmas

The focus of OpenMP is to parallelize loops

OpenMP offers an incremental approach to parallelism

Page 15: Overview of OpenMP

• A compiler directive in C/C++ is called a pragma (pragmatic information). It is a preprocessor directive, and is thus declared with a hash (#). Compiler directives specific to OpenMP in C/C++ are written as follows:

#pragma omp <rest of pragma>

Page 16: The OpenMP Programming Model

• Based on compiler directives

• Nested Parallelism Support: the API allows parallel constructs inside other parallel constructs

• Dynamic Threads: the API allows the number of threads used to execute different parallel regions to be changed dynamically

• Input/Output: OpenMP specifies nothing about parallel I/O; the programmer has to ensure that I/O is conducted correctly within the context of a multi-threaded program

• Memory Consistency: threads can "cache" their data and are not required to maintain exact consistency with real memory all of the time. When it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that the variable is FLUSHed by all threads as needed
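A hedged sketch of the FLUSH mechanism, using the classic producer-consumer handshake (assumes two threads; the variable names are illustrative, not from the course material):

#include <omp.h>
#include <cstdio>

int main()
{
    int data = 0, flag = 0;

    #pragma omp parallel num_threads(2) shared(data, flag)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;                  // produce the value
            #pragma omp flush(data)     // make data visible before raising the flag
            flag = 1;
            #pragma omp flush(flag)
        } else {
            int f;
            do {                        // spin until the producer raises the flag
                #pragma omp flush(flag)
                f = flag;
            } while (f == 0);
            #pragma omp flush(data)     // re-read memory: data is now visible
            printf("consumer read data = %d\n", data);
        }
    }
    return 0;
}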

Page 17: Basic Elements

Page 18: Hello World

#include <iostream>
using namespace std;

int main()
{
    cout << "Hello World\n";
}

$ ./hello
Hello World

Page 19: Hello World in OpenMP

#include <omp.h>
#include <iostream>
using namespace std;

int main()
{
    #pragma omp parallel num_threads(3)
    {
        cout << "Hello World\n";

        // what if
        // cout << "Hello World" << endl;
    }
}

$ ./hello
Hello World
Hello World
Hello World

Page 20: Compiling an OpenMP Program

• OpenMP consists of a set of directives and macros that are translated into C/C++ and thread instructions

• To compile a program using OpenMP, we need to include the flag -fopenmp:

g++ -o example ./source/openmp00.cpp -fopenmp

• The command compiles the file openmp00.cpp with OpenMP enabled and generates an executable named "example"

Page 21: What Really Happens?

• OpenMP consists of a set of directives and macros that are translated into C/C++ and thread instructions

• We can check how the original program using OpenMP is translated by using the command

g++ -fdump-tree-omplower ./source/openmp00.cpp -fopenmp

Page 22: OpenMP Directives

• A valid OpenMP directive must appear after #pragma omp

• name identifies an OpenMP directive

• The directive name can be followed by optional clauses (in any order)

• An OpenMP directive precedes the structured block which is enclosed by the directive itself

• General form:

#pragma omp name [clause, ...]
{
    ...
}

• Example:

#pragma omp parallel num_threads(3)
{
    cout << "Hello World\n";
}

Page 23: Directive Scoping

• Static (Lexical) Extent: the code textually enclosed between the beginning and the end of a structured block following a directive. The static extent of a directive does not span multiple routines or code files.

• Orphaned Directive: an OpenMP directive that appears independently from another enclosing directive. It exists outside of another directive's static (lexical) extent, and may span routines and possibly code files.

• Dynamic Extent: the dynamic extent of a directive includes both its static (lexical) extent and the extents of its orphaned directives.

Page 24: Directive Scoping: Example

void f()
{
    #pragma omp name2 [clause, ...]
    {
        ...
    }
}

int main()
{
    #pragma omp name1 [clause, ...]
    {
        f();
        ...
    }
}

Page 25: Directive Scoping: Example

Same code as above: the static extent of name1 is the structured block in main() that textually follows #pragma omp name1.

Page 26: Directive Scoping: Example

The dynamic extent of name1 also includes the body of f() called from inside the block; the name2 directive within f() is an orphaned directive.

Page 27: Parallel Construct

Page 28: The Parallel Construct/Directive

• Plays a crucial role in OpenMP; a program without a parallel construct will be executed sequentially

#pragma omp parallel [clause, clause, ...]
{
    // block
}

• When a thread reaches a parallel directive, it creates a team of threads and becomes the master of the team.

• The master is a member of that team and has thread number 0 within that team.

Page 29: The Parallel Construct/Directive

• Starting from the beginning of the parallel region, the code is duplicated and all threads will execute that code.

• There is an implied barrier at the end of a parallel section. Only the master thread continues execution past this point.

• If any thread terminates within a parallel region, all threads in the team will terminate, and the work done up until that point is undefined.

• Threads are numbered from 0 (master thread) to N-1

Page 30: How Many Threads?

• The number of threads in a parallel region is determined by the following factors, in order of precedence:
1. Evaluation of the if clause
2. Setting of the num_threads clause
3. Use of the omp_set_num_threads() library function
4. Setting of the OMP_NUM_THREADS environment variable
5. Implementation default, e.g., the number of CPUs on a node

• When present, the if clause must evaluate to true (i.e., non-zero integer) in order for a team of threads to be created; otherwise, the region is executed serially by the master thread.
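A small sketch of the first two factors, the if and num_threads clauses (the threshold and team size below are illustrative assumptions):

#include <omp.h>
#include <cstdio>

int main()
{
    int n = 100;   // illustrative problem size

    // Parallel only if n > 50; the team size is then capped at 4 threads.
    #pragma omp parallel if(n > 50) num_threads(4)
    {
        if (omp_get_thread_num() == 0)
            printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}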

Page 31: Run-Time Library Routines

Page 32: Overview

• The OpenMP standard defines an API for a variety of library functions:
query the number of threads/processors and set the number of threads to use
general-purpose locking routines (semaphores)
portable wall-clock timing routines

• Functions to set the execution environment: nested parallelism, dynamic adjustment of threads

• It may be necessary to include the header file "omp.h"

• For the lock routines/functions:
the lock variable must be accessed only through the locking routines
the lock variable must have type omp_lock_t or omp_nest_lock_t, depending on the function being used
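A minimal sketch of the general-purpose lock routines (the shared counter is just an illustrative resource):

#include <omp.h>
#include <cstdio>

int main()
{
    omp_lock_t lock;
    omp_init_lock(&lock);           // the lock variable is used only through the API

    int counter = 0;
    #pragma omp parallel num_threads(4)
    {
        omp_set_lock(&lock);        // acquire: one thread at a time past this point
        counter++;
        omp_unset_lock(&lock);      // release
    }

    omp_destroy_lock(&lock);
    printf("counter = %d\n", counter);   // 4 when 4 threads run
    return 0;
}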

Page 33: Run-Time Routines

• void omp_set_num_threads(int num_threads)
Sets the number of threads that will be used in the next parallel region. num_threads must be a positive integer. This routine can only be called from the serial portions of the code.

• int omp_get_num_threads(void)
Returns the number of threads that are currently in the team executing the parallel region from which it is called.

• int omp_get_thread_num(void)
Returns the thread number, within the team, of the thread making this call. This number will be between 0 and omp_get_num_threads()-1. The master thread of the team is thread 0.

Page 34: A Better Hello World in OpenMP

#include <omp.h>
#include <iostream>
using namespace std;

int main()
{
    #pragma omp parallel num_threads(3)
    {
        cout << "Hello World. I am thread ";
        cout << omp_get_thread_num() << endl;
    }
}

$ ./hello
Hello World. I am thread 0
Hello World. I am thread 2
Hello World. I am thread 1

Page 35: Conditional Compiling

// openmp03.cpp

#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_thread_num() 0
#endif

...

// number of the current thread
int TID = omp_get_thread_num();

...

Page 36: Data Scope Attribute Clauses

Page 37: Variable Scope in OpenMP

• OpenMP is based upon the shared-memory programming model; most variables are shared by default

• The OpenMP Data Scope Attribute Clauses are used to explicitly define how variables should be scoped

• Four main scope types: private, shared, default, reduction

Page 38: Variable Scope in OpenMP

• Data Scope Attribute Clauses are used in conjunction with several directives to:
define how and which data variables in the serial section of the program are transferred to the parallel sections
define which variables will be visible to all threads in the parallel sections and which variables will be privately allocated to each thread

• Data Scope Attribute Clauses are effective only within their lexical/static extent.

Page 39: The private Clause

• Syntax: private(list)

• The private clause declares the variables in its list to be private to each thread

• Private variables behave as follows:
a new object of the same type is declared once for each thread in the team
all references to the original object are replaced with references to the new object
variables should be assumed to be uninitialized for each thread

Page 40: Example: private Clause

#pragma omp parallel for private(i,a)
for (i=0; i<n; i++)
{
    a = i+1;
    printf("Thread %d has a value of a = %d for i = %d\n",
           omp_get_thread_num(), a, i);
} /*-- End of parallel for --*/

Page 41: The shared Clause

• Syntax: shared(list)

• The shared clause declares variables in its list to be shared among all threads in the team.

• A shared variable exists in only one memory location and all threads can read or write to that address.

• It is the programmer's responsibility to ensure that multiple threads properly access shared variables

Page 42: Example: shared Clause

#pragma omp parallel for shared(a)
for (i=0; i<n; i++)
{
    a[i] += i;
} /*-- End of parallel for --*/

Page 43: The default Clause

• Syntax: default(shared | none)

• The default scope of all variables is set either to shared or to none

• The default clause allows the user to specify a default scope for all variables in the lexical extent of any parallel region

• Specific variables can be exempted from the default using specific clauses (e.g., private, shared, etc.)

• Using none as a default requires that the programmer explicitly scope all variables.
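A minimal sketch of default(none) (the array and its size are illustrative assumptions):

#include <cstdio>

int main()
{
    int n = 8;
    int a[8];

    // With default(none), every variable referenced in the region must be
    // scoped explicitly; omitting shared(a, n) would be a compile-time error.
    // The loop variable i is predetermined private.
    #pragma omp parallel for default(none) shared(a, n)
    for (int i = 0; i < n; i++)
        a[i] = i * i;

    for (int i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}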

Page 44: The reduction Clause

• Syntax: reduction(operator: list)

• Performs a reduction on the variables that appear in its list

• Operator defines the reduction operation

• A private copy for each list variable is created for each thread

• At the end of the reduction, the reduction operation is applied to all private copies of the shared variable, and the final result is written to the global shared variable
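A minimal sketch of a sum reduction (the loop bound is an illustrative assumption):

#include <cstdio>

int main()
{
    int sum = 0;

    // Each thread accumulates into its own private copy of sum;
    // the + operation combines the copies into the shared sum at the end.
    #pragma omp parallel for reduction(+: sum)
    for (int i = 1; i <= 100; i++)
        sum += i;

    printf("sum = %d\n", sum);   // 5050 regardless of the number of threads
    return 0;
}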

Page 45: Work-Sharing

Page 46: Work-Sharing Constructs

• Work-sharing constructs divide the execution of the enclosed code region among the members of the team that encounter it

• A work-sharing construct must be enclosed dynamically within a parallel region in order for the directive to execute in parallel

• Work-sharing constructs do not launch new threads

• There is no implied barrier upon entry to a work-sharing construct; however, there is an implied barrier at the end of a work-sharing construct

• The three constructs:
for: shares iterations of a loop across the team (data parallelism)
sections: breaks work into separate, discrete sections, each executed by a thread (functional parallelism)
single: serializes a section of code

Page 47: The for Construct/Directive

• Syntax:

#pragma omp for [clause ...]
    for_loop

• Three main clause types: schedule, nowait, and the data-scope clauses

• schedule (type [,chunk]) describes how iterations of the loop are divided among the threads in the team (the default schedule is implementation dependent)

• nowait: if specified, threads do not synchronize at the end of the parallel loop

• The data-scope clauses (private, shared, reduction)

Page 48: The for Construct/Directive (Schedule Types)

• static: loop iterations are divided into blocks of size chunk and then statically assigned to threads

• If chunk is not specified, the iterations are evenly (if possible) divided contiguously among the threads

• dynamic: loop iterations are divided into blocks of size chunk, and dynamically scheduled among the threads

• When a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1
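A small sketch of how static scheduling hands out chunks (the chunk size and thread count are illustrative assumptions):

#include <omp.h>
#include <cstdio>

int main()
{
    // With schedule(static,2) and 2 threads, chunks are assigned round-robin:
    // thread 0 gets iterations {0,1} and {4,5}, thread 1 gets {2,3} and {6,7}.
    #pragma omp parallel for schedule(static, 2) num_threads(2)
    for (int i = 0; i < 8; i++)
        printf("i=%d executed by thread %d\n", i, omp_get_thread_num());
    return 0;
}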

Page 49: Example: Dot Product (1)

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    double sum;
    vector<double> a(10000000), b(10000000);

    generate(a.begin(), a.end(), drand48);
    generate(b.begin(), b.end(), drand48);

    sum = 0;
    for (int i=0; i<a.size(); ++i)
    {
        sum = sum + a[i]*b[i];
    }
    cout << sum << endl;
}

Page 50: Example: Dot Product (2)

// openmp03-dotp-openmp
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    double sum;
    vector<double> a(10000000), b(10000000);

    generate(a.begin(), a.end(), drand48);
    generate(b.begin(), b.end(), drand48);

    sum = 0;
    #pragma omp parallel for reduction(+: sum)
    for (int i=0; i<a.size(); ++i)
    {
        sum = sum + a[i]*b[i];
    }
    cout << sum << endl;
}

Page 51: Example: Dot Product (3)

// openmp03-dotp-openmp3

...
int ChunkSize = a.size()/omp_get_max_threads();

#pragma omp parallel
{
    #pragma omp for schedule(dynamic, ChunkSize) reduction(+: sum)
    for (int i=0; i<a.size(); ++i)
    {
        sum = sum + a[i]*b[i];
    }
}

Page 52: Example: Varying Loads

// openmp06

...
void f(vector<double> &a, vector<double> &b, int n)
{
    int i, j;

    #pragma omp parallel shared(n)
    {
        // schedule is dynamic with chunk size 1 to manage the varying load
        #pragma omp for schedule(dynamic,1) private(i,j) nowait
        for (i = 1; i < n; i++)
            for (j = 0; j < i; j++)
                b[j + n*i] = (a[j + n*i] + a[j + n*(i-1)]) / 2.0;
    }
}

// \time ./openmp04-seq 10000  (n is 10000^2)
// 11.20 real   5.68 user   0.68 sys
// \time ./openmp04 10000
//  4.55 real  13.07 user   0.62 sys

Page 53: The sections Construct/Directive

#pragma omp sections [clause ...]
{
    #pragma omp section
    {
        // execute A
    }
    #pragma omp section
    {
        // execute B
    }
}

• Specifies that the enclosed section(s) of code are to be divided among the threads in the team

• Each section is executed once by a thread in the team. Different sections may be executed by different threads. It is possible for a thread to execute more than one section.

• Supported clauses include private, reduction, nowait, etc.

Page 54: Example: Dot Product (Vector Initialization)

#pragma omp sections nowait
{
    #pragma omp section
    {
        int ita;
        #pragma omp parallel for shared(a) private(ita) schedule(dynamic,ChunkSize)
        for (ita=0; ita<a.size(); ++ita)
        {
            a[ita] = drand48();
        }
    }

    #pragma omp section
    {
        int itb;
        #pragma omp parallel for shared(b) private(itb) schedule(dynamic,ChunkSize)
        for (itb=0; itb<b.size(); ++itb)
        {
            b[itb] = drand48();
        }
    }
}

Page 55: The single Construct/Directive

• The single construct is associated with the structured block of code immediately following it and specifies that this block should be executed by one thread only

• It does not state which thread should execute the code block

• The thread chosen could vary from one run to another. It can also differ for different single constructs within one application

#pragma omp single [private | nowait | ...]
{
    ...
}

Page 56: Example: single Construct/Directive

#pragma omp parallel shared(a,b) private(i)
{
    #pragma omp single
    {
        a = 10;
        printf("Single construct executed by thread %d\n",
               omp_get_thread_num());
    } /* A barrier is automatically inserted here */

    #pragma omp for
    for (i=0; i<n; i++)
        b[i] = a;
} /*-- End of parallel region --*/

Page 57: Synchronization

Page 58: The critical Construct/Directive

• Syntax:

#pragma omp critical [name]
{
    ...
}

• The critical directive specifies a region of code that must be executed by only one thread at a time

• If a thread is currently executing inside a critical region and another thread reaches that critical region and attempts to execute it, it will block until the first thread exits that critical region

• The optional name enables multiple different critical regions:
names act as global identifiers
critical regions with the same name are treated as the same region
all unnamed critical sections are treated as the same section

Page 59: Example: critical Construct/Directive

sum = 0;
#pragma omp parallel shared(n,a,sum) private(TID)
{
    TID = omp_get_thread_num();
    double sumLocal = 0;   // declared inside the region, so private to each thread
    #pragma omp for
    for (size_t i=0; i<n; i++)
        sumLocal += a[i];
    #pragma omp critical (update_sum)
    {
        sum += sumLocal;
        cout << "TID=" << TID << " sumLocal=" << sumLocal
             << " sum=" << sum << endl;
    }
} /*-- End of parallel region --*/
cout << "Value of sum after parallel region: " << sum << endl;

Page 60: The barrier Construct/Directive

• Syntax: #pragma omp barrier

• The barrier directive synchronizes all threads in the team.

• When a barrier directive is reached, a thread will wait at that point until all other threads have reached that barrier. All threads then resume executing in parallel the code that follows the barrier.

• All threads in a team (or none) must execute the barrier region.
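A minimal sketch of a two-phase computation separated by a barrier (the thread count is an illustrative assumption):

#include <omp.h>
#include <cstdio>

int main()
{
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        printf("thread %d: phase 1\n", tid);

        #pragma omp barrier   // every thread waits here until all have arrived

        // No phase-2 line can print before all phase-1 lines have printed.
        printf("thread %d: phase 2\n", tid);
    }
    return 0;
}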

Page 61: When to Use OpenMP?

• The target platform is multicore or multiprocessor

• The application is cross-platform

• There are parallelizable loops

• Last-minute optimizations are needed