MPI/OpenMP
V.A. Bakhtin, Keldysh Institute of Applied Mathematics RAS
Moscow, March 9, 2010



Outline:
· The hybrid MPI/OpenMP programming model
· The OpenMP technology



The gap between processor and memory speed keeps widening: processor performance has grown roughly 1.5-2 times per generation, while memory access times improve much more slowly, so a single thread spends most of its time waiting for data. Typical load latencies on Intel Itanium:
· L1: 1-2 cycles
· L2: 5-7 cycles
· L3: 12-21 cycles
· Main memory: 180-225 cycles
Memory-subsystem performance is measured in GUPS (Giga Updates Per Second).



A hardware thread ("thread") is an independent stream of instructions that shares the resources of its processor core (caches, functional units) with the other threads of that core.

[Figure: Chip MultiThreading - several hardware threads share one core, so the memory stalls of one thread can be overlapped with the computation of another.]



The SKIF MSU «Chebyshev» supercomputer:
· Peak performance: 60 TFlop/s
· 1250 processors / 5000 cores
· Linpack: 47.04 TFlop/s (78.4% of peak)
· Power: 330 kW (720 kW including the cooling system)
Power efficiency (Megaflops/watt) is what pushes vendors towards Chip MultiProcessing, i.e. multicore processors.



Six-Core AMD Opteron:
· 6 cores
· 2 channels of DDR2-800 memory
· 3 «point-to-point» HyperTransport links



Intel Core i7 (Nehalem):
· 4 cores, 8 threads with Intel Hyper-Threading
· 8 MB of shared L3 - Intel Smart Cache
· 3 channels of DDR3 memory
· Intel QuickPath Interconnect



Sun UltraSPARC T2 Processor (Niagara 2):
· 8 cores, 64 hardware threads (8 per core)
· 4 integrated memory controllers
· Power: 60-123 W
· 2x10 Gigabit Ethernet integrated on chip



Hardware multithreading + multiple cores per chip => CMT (Chip MultiThreading)
Multiple cores per chip => CMP (Chip MultiProcessing, multicore processors)



Outline:
· The hybrid MPI/OpenMP programming model
· The OpenMP technology


The hybrid MPI/OpenMP model:
[Figure: MPI is used for communication between nodes; inside each node an OpenMP team of threads runs on cores 0..N.]


Why add OpenMP to MPI?
· Within a node, threads share memory, so boundary data does not have to be copied between processes.
· Fewer MPI processes per node means fewer, larger messages and less memory spent on MPI buffers.
· Load can be re-balanced dynamically between the threads of a node.
· Loops that are hard to distribute with MPI can still be parallelized with OpenMP.
A minimal hybrid «hello world» sketch is shown below.
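As an illustration of the hybrid model, here is a minimal sketch (not from the original slides); the MPI_Init_thread mode and the output format are assumptions:

/* hybrid_hello.c - compile e.g. with: mpicc -fopenmp hybrid_hello.c */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank, size;
    /* ask for an MPI library that tolerates threads outside MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* every OpenMP thread of every MPI process reports itself */
        printf("MPI process %d of %d, OpenMP thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}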


When is OpenMP attractive as the intra-node layer?
· When a few loop nests dominate the run time, they can be parallelized incrementally with OpenMP directives, without restructuring the whole program.
· On a multi-core node the threads exchange data through shared memory, which is cheaper than message passing.


National Institute for Computational Sciences, University of Tennessee
Kraken - Cray XT5-HE, Opteron Six Core 2.6 GHz, #3 in the TOP500 (http://nics.tennessee.edu)
· Peak performance: 1028.85 TFlop/s
· 16 288 processors / 98 928 cores
· Linpack: 831.7 TFlop/s (81% of peak)
Upgrade from quad-core to six-core AMD Opteron moved the system from 6th to 3rd place in the TOP500 lists of 2009.


National Institute for Computational Sciences, University of Tennessee
[Photo of the Kraken system.]



MVS-100K (Joint Supercomputer Center RAS) - #38 in the TOP500 (http://www.jscc.ru/)
· Peak performance: 140.16 TFlop/s
· 2 920 processors / 11 680 cores
· Linpack: 107.45 TFlop/s (76.7% of peak)
Upgrade from dual-core Intel Xeon 53xx to quad-core Intel Xeon 54xx moved the system from 57th to 36th place in the TOP500 lists of 2008.


Oak Ridge National Laboratory
Jaguar - Cray XT5-HE, Opteron Six Core 2.6 GHz, #1 in the TOP500 (http://computing.ornl.gov)
· Peak performance: 2331 TFlop/s
· 224 162 cores
· Linpack: 1759 TFlop/s (75.4% of peak)
Upgrade from quad-core to six-core AMD Opteron moved the system from 2nd to 1st place in the TOP500 lists of 2009.


Oak Ridge National Laboratory
Jaguar Scheduling Policy:

  MIN cores   MAX cores   Maximum wall-time (hours)
  135 000     -           24
  45 000      134 999     24
  4 500       44 999      12
  1 250       4 499       6
  1           1 249       2


Jacobi algorithm: serial version (part 1).

/* Jacobi program */
#include <stdio.h>
#define L 1000
#define ITMAX 100
int i, j, it;
double A[L][L];
double B[L][L];
int main(int an, char **as)
{
    printf("JAC STARTED\n");
    for (i = 0; i <= L-1; i++)
        for (j = 0; j <= L-1; j++)
        {
            A[i][j] = 0.;
            B[i][j] = 1. + i + j;
        }


Jacobi algorithm: serial version (part 2).

/****** iteration loop *************************/
/* the body of this loop was truncated in the extracted text;
   it is restored here from the parallel versions shown later */
for (it = 1; it <= ITMAX; it++)
{
    for (i = 1; i <= L-2; i++)
        for (j = 1; j <= L-2; j++)
            A[i][j] = B[i][j];
    for (i = 1; i <= L-2; i++)
        for (j = 1; j <= L-2; j++)
            B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4.;
}
return 0;
}


Jacobi algorithm: MPI version (1D decomposition).
[Figure: the L x L array is divided into horizontal strips, one strip of rows per MPI process; neighbouring processes exchange boundary (shadow) rows.]


Jacobi algorithm: MPI version (part 1).

/* Jacobi-1d program */
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#define m_printf if (myrank==0) printf
#define L 1000
#define ITMAX 100
int i, j, it, k;
int ll, shift;
double (*A)[L];
double (*B)[L];


Jacobi algorithm: MPI version (part 2).

int main(int argc, char **argv)
{
    MPI_Request req[4];
    int myrank, ranksize;
    int startrow, lastrow, nrow;
    MPI_Status status[4];
    double t1, t2, time;
    MPI_Init(&argc, &argv);                     /* initialize MPI system */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);     /* my place in MPI system */
    MPI_Comm_size(MPI_COMM_WORLD, &ranksize);   /* size of MPI system */
    MPI_Barrier(MPI_COMM_WORLD);
    /* rows of matrix I have to process */
    startrow = (myrank * L) / ranksize;
    lastrow  = (((myrank + 1) * L) / ranksize) - 1;
    nrow = lastrow - startrow + 1;
    m_printf("JAC1 STARTED\n");


Jacobi algorithm: MPI version (part 3).

    /* dynamically allocate data structures */
    A = malloc((nrow+2) * L * sizeof(double));
    B = malloc((nrow)   * L * sizeof(double));
    for (i = 1; i <= nrow; i++)
        for (j = 0; j <= L-1; j++)
        {
            A[i][j] = 0.;
            B[i-1][j] = 1. + startrow + i - 1 + j;
        }


Jacobi algorithm: MPI version (part 4).

    /****** iteration loop *************************/
    t1 = MPI_Wtime();
    for (it = 1; it <= ITMAX; it++)
    {
        for (i = 1; i <= nrow; i++)
        {
            if (((i==1) && (myrank==0)) || ((i==nrow) && (myrank==ranksize-1)))
                continue;
            for (j = 1; j <= L-2; j++)
            {
                A[i][j] = B[i-1][j];
            }
        }


Jacobi algorithm: MPI version (part 5).

        if (myrank != 0)
            MPI_Irecv(&A[0][0], L, MPI_DOUBLE, myrank-1, 1235,
                      MPI_COMM_WORLD, &req[0]);
        if (myrank != ranksize-1)
            MPI_Isend(&A[nrow][0], L, MPI_DOUBLE, myrank+1, 1235,
                      MPI_COMM_WORLD, &req[2]);
        if (myrank != ranksize-1)
            MPI_Irecv(&A[nrow+1][0], L, MPI_DOUBLE, myrank+1, 1236,
                      MPI_COMM_WORLD, &req[3]);
        if (myrank != 0)
            MPI_Isend(&A[1][0], L, MPI_DOUBLE, myrank-1, 1236,
                      MPI_COMM_WORLD, &req[1]);
        ll = 4; shift = 0;
        if (myrank == 0)          { ll = 2; shift = 2; }
        if (myrank == ranksize-1) { ll = 2; }
        MPI_Waitall(ll, &req[shift], &status[0]);


Jacobi algorithm: MPI version (part 6).

        for (i = 1; i <= nrow; i++)
        {
            if (((i==1) && (myrank==0)) || ((i==nrow) && (myrank==ranksize-1)))
                continue;
            for (j = 1; j <= L-2; j++)
                B[i-1][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4.;
        }
    } /* DO it */
    printf("%d: Time of task=%lf\n", myrank, MPI_Wtime() - t1);
    MPI_Finalize();
    return 0;
}


Jacobi algorithm: MPI version (2D decomposition).
[Figure: the array is divided into rectangular blocks over a 2D process grid; each process exchanges shadow rows and columns with up to four neighbours.]


Jacobi algorithm: MPI version, 2D decomposition (part 1).

/* Jacobi-2d program */
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#define m_printf if (myrank==0) printf
#define L 1000
#define LC 2
#define ITMAX 100
int i, j, it, k;
double (*A)[L/LC+2];
double (*B)[L/LC];


Jacobi algorithm: MPI version, 2D decomposition (part 2).

int main(int argc, char **argv)
{
    MPI_Request req[8];
    int myrank, ranksize;
    int srow, lrow, nrow, scol, lcol, ncol;
    MPI_Status status[8];
    double t1;
    int isper[] = {0,0};          /* no periodicity in either dimension */
    int dim[2];
    int coords[2];
    MPI_Comm newcomm;
    MPI_Datatype vectype;
    int pleft, pright, pdown, pup;
    MPI_Init(&argc, &argv);                     /* initialize MPI system */
    MPI_Comm_size(MPI_COMM_WORLD, &ranksize);   /* size of MPI system */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);     /* my place in MPI system */


Jacobi algorithm: MPI version, 2D decomposition (part 3).

    dim[0] = ranksize / LC;
    dim[1] = LC;
    if ((L % dim[0]) || (L % dim[1]))
    {
        m_printf("ERROR: array[%d*%d] is not distributed on %d*%d processors\n",
                 L, L, dim[0], dim[1]);
        MPI_Finalize();
        exit(1);
    }
    MPI_Cart_create(MPI_COMM_WORLD, 2, dim, isper, 1, &newcomm);
    MPI_Cart_shift(newcomm, 0, 1, &pup,   &pdown);
    MPI_Cart_shift(newcomm, 1, 1, &pleft, &pright);
    MPI_Comm_rank(newcomm, &myrank);            /* my place in the new communicator */
    MPI_Cart_coords(newcomm, myrank, 2, coords);


Jacobi algorithm: MPI version, 2D decomposition (part 4).

    /* rows of matrix I have to process */
    srow = (coords[0] * L) / dim[0];
    lrow = (((coords[0] + 1) * L) / dim[0]) - 1;
    nrow = lrow - srow + 1;
    /* columns of matrix I have to process */
    scol = (coords[1] * L) / dim[1];
    lcol = (((coords[1] + 1) * L) / dim[1]) - 1;
    ncol = lcol - scol + 1;
    /* derived datatype: one column of the local block (stride = row length) */
    MPI_Type_vector(nrow, 1, ncol+2, MPI_DOUBLE, &vectype);
    MPI_Type_commit(&vectype);
    m_printf("JAC2 STARTED on %d*%d processors with %d*%d array, it=%d\n",
             dim[0], dim[1], L, L, ITMAX);
    /* dynamically allocate data structures */
    A = malloc((nrow+2) * (ncol+2) * sizeof(double));
    B = malloc(nrow * ncol * sizeof(double));


Jacobi algorithm: MPI version, 2D decomposition (part 5).

    for (i = 0; i <= nrow-1; i++)
        for (j = 0; j <= ncol-1; j++)
        {
            A[i+1][j+1] = 0.;
            B[i][j] = 1. + srow + i + scol + j;
        }
    /****** iteration loop *************************/
    MPI_Barrier(newcomm);
    t1 = MPI_Wtime();
    for (it = 1; it <= ITMAX; it++)
    {
        for (i = 0; i <= nrow-1; i++)
        {
            if (((i==0) && (pup==MPI_PROC_NULL)) || ((i==nrow-1) && (pdown==MPI_PROC_NULL)))
                continue;
            for (j = 0; j <= ncol-1; j++)
            {
                if (((j==0) && (pleft==MPI_PROC_NULL)) || ((j==ncol-1) && (pright==MPI_PROC_NULL)))
                    continue;
                A[i+1][j+1] = B[i][j];
            }
        }


Jacobi algorithm: MPI version, 2D decomposition (part 6).

        /* the neighbour ranks pup/pdown/pleft/pright are defined in newcomm,
           so newcomm is used for the exchange */
        MPI_Irecv(&A[0][1],      ncol, MPI_DOUBLE, pup,    1235, newcomm, &req[0]);
        MPI_Isend(&A[nrow][1],   ncol, MPI_DOUBLE, pdown,  1235, newcomm, &req[1]);
        MPI_Irecv(&A[nrow+1][1], ncol, MPI_DOUBLE, pdown,  1236, newcomm, &req[2]);
        MPI_Isend(&A[1][1],      ncol, MPI_DOUBLE, pup,    1236, newcomm, &req[3]);
        MPI_Irecv(&A[1][0],      1, vectype, pleft,  1237, newcomm, &req[4]);
        MPI_Isend(&A[1][ncol],   1, vectype, pright, 1237, newcomm, &req[5]);
        MPI_Irecv(&A[1][ncol+1], 1, vectype, pright, 1238, newcomm, &req[6]);
        MPI_Isend(&A[1][1],      1, vectype, pleft,  1238, newcomm, &req[7]);
        MPI_Waitall(8, req, status);


Jacobi algorithm: MPI version, 2D decomposition (part 7).

        for (i = 1; i <= nrow; i++)
        {
            if (((i==1) && (pup==MPI_PROC_NULL)) || ((i==nrow) && (pdown==MPI_PROC_NULL)))
                continue;
            for (j = 1; j <= ncol; j++)
            {
                if (((j==1) && (pleft==MPI_PROC_NULL)) || ((j==ncol) && (pright==MPI_PROC_NULL)))
                    continue;
                B[i-1][j-1] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4.;
            }
        }
    }
    printf("%d: Time of task=%lf\n", myrank, MPI_Wtime() - t1);
    MPI_Finalize();
    return 0;
}




Jacobi algorithm: MPI/OpenMP version (part 1).

/* Jacobi-1d program */
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#define m_printf if (myrank==0) printf
#define L 1000
#define ITMAX 100
int i, j, it, k;
int ll, shift;
double (*A)[L];
double (*B)[L];


Jacobi algorithm: MPI/OpenMP version (part 2).

int main(int argc, char **argv)
{
    MPI_Request req[4];
    int myrank, ranksize;
    int startrow, lastrow, nrow;
    MPI_Status status[4];
    double t1, t2, time;
    MPI_Init(&argc, &argv);                     /* initialize MPI system */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);     /* my place in MPI system */
    MPI_Comm_size(MPI_COMM_WORLD, &ranksize);   /* size of MPI system */
    MPI_Barrier(MPI_COMM_WORLD);
    /* rows of matrix I have to process (L, not the undeclared N of the original listing) */
    startrow = (myrank * L) / ranksize;
    lastrow  = (((myrank + 1) * L) / ranksize) - 1;
    nrow = lastrow - startrow + 1;
    m_printf("JAC1 STARTED\n");


Jacobi algorithm: MPI/OpenMP version (part 3).

    /* dynamically allocate data structures */
    A = malloc((nrow+2) * L * sizeof(double));
    B = malloc((nrow)   * L * sizeof(double));
    for (i = 1; i <= nrow; i++)
        #pragma omp parallel for
        for (j = 0; j <= L-1; j++)
        {
            A[i][j] = 0.;
            B[i-1][j] = 1. + startrow + i - 1 + j;
        }


Jacobi algorithm: MPI/OpenMP version (part 4).

    /****** iteration loop *************************/
    t1 = MPI_Wtime();
    for (it = 1; it <= ITMAX; it++)
    {
        for (i = 1; i <= nrow; i++)
        {
            if (((i==1) && (myrank==0)) || ((i==nrow) && (myrank==ranksize-1)))
                continue;
            #pragma omp parallel for
            for (j = 1; j <= L-2; j++)
            {
                A[i][j] = B[i-1][j];
            }
        }


Jacobi algorithm: MPI/OpenMP version (part 5).

        if (myrank != 0)
            MPI_Irecv(&A[0][0], L, MPI_DOUBLE, myrank-1, 1235, MPI_COMM_WORLD, &req[0]);
        if (myrank != ranksize-1)
            MPI_Isend(&A[nrow][0], L, MPI_DOUBLE, myrank+1, 1235, MPI_COMM_WORLD, &req[2]);
        if (myrank != ranksize-1)
            MPI_Irecv(&A[nrow+1][0], L, MPI_DOUBLE, myrank+1, 1236, MPI_COMM_WORLD, &req[3]);
        if (myrank != 0)
            MPI_Isend(&A[1][0], L, MPI_DOUBLE, myrank-1, 1236, MPI_COMM_WORLD, &req[1]);
        ll = 4; shift = 0;
        if (myrank == 0)          { ll = 2; shift = 2; }
        if (myrank == ranksize-1) { ll = 2; }
        MPI_Waitall(ll, &req[shift], &status[0]);


Jacobi algorithm: MPI/OpenMP version (part 6).

        for (i = 1; i <= nrow; i++)
        {
            if (((i==1) && (myrank==0)) || ((i==nrow) && (myrank==ranksize-1)))
                continue;
            #pragma omp parallel for
            for (j = 1; j <= L-2; j++)
                B[i-1][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4.;
        }
    } /* DO it */
    printf("%d: Time of task=%lf\n", myrank, MPI_Wtime() - t1);
    MPI_Finalize();
    return 0;
}


Jacobi algorithm: MPI/OpenMP version with one parallel region (part 1).

    /****** iteration loop *************************/
    t1 = MPI_Wtime();
    #pragma omp parallel default(none) private(it,i,j) \
            shared(A,B,myrank,nrow,ranksize,ll,shift,req,status)
    for (it = 1; it <= ITMAX; it++)
    {
        #pragma omp barrier   /* not necessary in OpenMP 3.0 */
        for (i = 1; i <= nrow; i++)
        {
            if (((i==1) && (myrank==0)) || ((i==nrow) && (myrank==ranksize-1)))
                continue;
            #pragma omp for nowait
            for (j = 1; j <= L-2; j++)
            {
                A[i][j] = B[i-1][j];
            }
        }


Jacobi algorithm: MPI/OpenMP version with one parallel region (part 2).

        #pragma omp barrier
        #pragma omp single
        {
            if (myrank != 0)
                MPI_Irecv(&A[0][0], L, MPI_DOUBLE, myrank-1, 1235, MPI_COMM_WORLD, &req[0]);
            if (myrank != ranksize-1)
                MPI_Isend(&A[nrow][0], L, MPI_DOUBLE, myrank+1, 1235, MPI_COMM_WORLD, &req[2]);
            if (myrank != ranksize-1)
                MPI_Irecv(&A[nrow+1][0], L, MPI_DOUBLE, myrank+1, 1236, MPI_COMM_WORLD, &req[3]);
            if (myrank != 0)
                MPI_Isend(&A[1][0], L, MPI_DOUBLE, myrank-1, 1236, MPI_COMM_WORLD, &req[1]);
            ll = 4; shift = 0;
            if (myrank == 0)          { ll = 2; shift = 2; }
            if (myrank == ranksize-1) { ll = 2; }
            MPI_Waitall(ll, &req[shift], &status[0]);
        }   /* only one thread of the team performs the MPI calls */


Jacobi algorithm: MPI/OpenMP version with one parallel region (part 3).

        for (i = 1; i <= nrow; i++)
        {
            if (((i==1) && (myrank==0)) || ((i==nrow) && (myrank==ranksize-1)))
                continue;
            #pragma omp for nowait
            for (j = 1; j <= L-2; j++)
                B[i-1][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4.;
        }
    } /* DO it */
    printf("%d: Time of task=%lf\n", myrank, MPI_Wtime() - t1);
    MPI_Finalize();
    return 0;
}


NASA MultiZone benchmarks:
· BT (Block Tridiagonal Solver) - 3D Navier-Stokes equations, block-tridiagonal systems
· LU (Lower-Upper Solver) - 3D Navier-Stokes equations, SSOR
· SP (Scalar Pentadiagonal Solver) - 3D Navier-Stokes equations, Beam-Warming approximate factorization
http://www.nas.nasa.gov/News/Techreports/2003/PDF/nas-03-010.pdf


NASA MultiZone benchmarks.
[Figure: the mesh is partitioned into zones that exchange boundary values; zones are distributed between MPI processes and each zone can be processed by OpenMP threads.]


SP-MZ (class A) on IBM eServer pSeries 690 Regatta.
[Bar charts: execution time (seconds) for different «MPI processes x OpenMP threads» decompositions on 2, 4, 8 and 16 processors; e.g. on 16 processors the time ranges from about 9.9 to 13.0 seconds depending on the split.]


LU-MZ (class A) on IBM eServer pSeries 690 Regatta.
[Bar charts: execution time (seconds) for different «MPI processes x OpenMP threads» decompositions on 2, 4, 8 and 16 processors.]


BT-MZ (class A) on IBM eServer pSeries 690 Regatta. Zone sizes vary from 13 x 13 x 16 to 58 x 58 x 16, so the load per zone is unbalanced.
[Bar charts: execution time (seconds) for different «MPI processes x OpenMP threads» decompositions on 2, 4, 8 and 16 processors.]




       75      128     256     384     512
       19258   11284   5642    3761    2821
  Max  19296   11648   11648   11648   11648


Conclusions on the hybrid MPI/OpenMP model:
· OpenMP is used for parallelism within a multi-core node, MPI for parallelism between nodes (for example, between the zones of the MultiZone benchmarks).
· Reducing the number of MPI processes lowers the communication and buffer-memory overhead and gives an extra level for load balancing.



Outline:
· The hybrid MPI/OpenMP programming model
· The OpenMP technology


History of OpenMP:
· 1997 - OpenMP Fortran 1.0
· 1998 - OpenMP C/C++ 1.0
· 1999 - OpenMP Fortran 1.1
· 2000 - OpenMP Fortran 2.0
· 2002 - OpenMP C/C++ 2.0
· 2005 - OpenMP Fortran/C/C++ 2.5
· 2008 - OpenMP Fortran/C/C++ 3.0


OpenMP Architecture Review Board:
Permanent members: AMD, Cray, Fujitsu, HP, IBM, Intel, NEC, The Portland Group, Inc., SGI, Sun, Microsoft.
Auxiliary members: ASC/LLNL, cOMPunity, EPCC, NASA, RWTH Aachen University.


Compilers supporting OpenMP.
OpenMP 3.0:
· Intel 11.0: Linux, Windows and MacOS
· Sun Studio Express 11/08: Linux and Solaris
· PGI 8.0: Linux and Windows
· IBM 10.1: Linux and AIX
· GNU gcc (4.4.0)
Earlier versions of OpenMP:
· Absoft Pro FortranMP
· Lahey/Fujitsu Fortran 95
· PathScale
· HP
· Microsoft Visual Studio 2008 C++


A sampler of OpenMP directives, clauses, runtime calls and environment variables:
C$OMP FLUSH, C$OMP THREADPRIVATE(/ABC/), C$OMP PARALLEL DO SHARED(A,B,C), CALL OMP_INIT_LOCK(LCK), C$OMP SINGLE PRIVATE(X), #pragma omp critical, CALL OMP_SET_NUM_THREADS(10), CALL OMP_TEST_LOCK(LCK), C$OMP ATOMIC, C$OMP MASTER, SETENV OMP_SCHEDULE "STATIC,4", C$OMP ORDERED, C$OMP PARALLEL DO ORDERED PRIVATE(A,B,C), C$OMP PARALLEL REDUCTION(+: A, B), C$OMP SECTIONS, C$OMP BARRIER, #pragma omp parallel for private(a, b), C$OMP PARALLEL COPYIN(/blk/), C$OMP DO LASTPRIVATE(XX), omp_set_lock(lck), nthrds = OMP_GET_NUM_PROCS()



OpenMP directive format in C/C++:
    #pragma omp directive-name [clause [[,] clause] ...]
Example:
    #pragma omp parallel default(none) shared(i,j)
Stand-alone (executable) directives: barrier, taskwait, flush.
Declarative directive: threadprivate.



A directive applies to a structured block:
    #pragma omp directive-name [clause [[,] clause] ...]
    { /* structured block: single entry at the top, single exit at the bottom */ }

Allowed (the branch stays inside the block, and exit() terminates the whole program):
#pragma omp parallel
{
    ...
mainloop:
    res[id] = f(id);
    if (res[id] != 0) goto mainloop;
    ...
    exit(0);
}

Not allowed (the goto branches into the structured block from outside):
#pragma omp parallel
{
    ...
mainloop:
    res[id] = f(id);
    ...
}
if (res[id] != 0) goto mainloop;


The OpenMP execution model (Fork-Join):
The program starts with a single master thread. When the master thread reaches a parallel region it forks a team of threads; the members of the team execute the region together, and at the end of the region they join: the team disappears and only the master thread continues. A minimal sketch follows.
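A small illustration of fork-join (not from the original slides; the thread count 4 is an arbitrary assumption):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("before: only the master thread\n");        /* serial part */
    #pragma omp parallel num_threads(4)                /* fork */
    {
        printf("inside: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                                  /* join (implicit barrier) */
    printf("after: only the master thread again\n");   /* serial part */
    return 0;
}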


Creating a parallel region (the parallel construct):

#pragma omp parallel [clause [[,] clause] ...]

Possible clauses:
· default(shared | none)
· private(list)
· firstprivate(list)
· shared(list)
· reduction(operator: list)
· if(scalar-expression)
· num_threads(integer-expression)
· copyin(list)


The OpenMP memory model:
[Figure: in addition to the shared memory, each thread has its own temporary view (registers, cache) of the shared variables.]


The OpenMP memory model (continued):
[Figure: two threads read and update a shared variable i (static int i = 0, later set to 1); without #pragma omp flush(i) each thread may keep working with its own cached copy, so the value used in "... = i + 2" is not defined.]


The OpenMP memory model: making a value visible to another thread requires the following sequence of operations:
· thread 0: write(var)
· thread 0: flush(var)
· thread 1: flush(var)
· thread 1: read(var)
The flush directive:
    C/C++:   #pragma omp flush [(list)]
    Fortran: !$omp flush [(list)]
A sketch of this pattern is shown below.
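A sketch of the classic pairwise synchronization that relies on this write-flush-flush-read ordering (a simplified variant of the example in the OpenMP specification; variable names are arbitrary):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data = 0, flag = 0;
    #pragma omp parallel num_threads(2) shared(data, flag)
    {
        if (omp_get_thread_num() == 0)
        {
            data = 42;                    /* produce the value             */
            #pragma omp flush(data, flag) /* make data visible before flag */
            flag = 1;
            #pragma omp flush(flag)
        }
        else
        {
            while (1)                     /* busy-wait until flag is seen  */
            {
                #pragma omp flush(flag)
                if (flag) break;
            }
            #pragma omp flush(data)       /* now data is guaranteed visible */
            printf("consumer read data = %d\n", data);
        }
    }
    return 0;
}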


The OpenMP memory consistency rules for flush:
1. If the intersection of the flush-sets of two flushes performed by two different threads is non-empty, the two flushes must complete as if in some sequential order, seen by all threads.
2. If two operations performed by the same thread access, modify or flush the same variable, they must complete in that thread's program order, as seen by all threads.
3. If the intersection of the flush-sets of two flushes is empty, the threads may observe these flushes in any order.



Data sharing rules:
· Most variables are SHARED by default.
· Shared by default are:
  - Fortran: COMMON blocks, SAVE variables, MODULE variables;
  - C/C++: file-scope and static variables;
  - dynamically allocated objects (ALLOCATE, malloc, new).
· But not everything is shared:
  - stack (automatic) variables of functions called from a parallel region are PRIVATE;
  - local variables declared inside the structured block are private;
  - the loop variable of a for loop bound to a parallel for is private.


Example of data sharing:

double Array1[100];                 /* file scope -> shared              */
int main()
{
    int Array2[100];                /* shared inside the parallel region  */
    #pragma omp parallel
        work(Array2);
    printf("%d\n", Array2[0]);
}

extern double Array1[10];
void work(int *Array)
{
    double TempArray[10];           /* on each thread's stack -> private  */
    static int count;               /* static -> shared                   */
    ...
}

Shared: Array1, Array2, count.  Private: TempArray.



Data-sharing attribute clauses:
· SHARED
· PRIVATE
· FIRSTPRIVATE
· LASTPRIVATE
· THREADPRIVATE
· DEFAULT (PRIVATE | SHARED | NONE)


The PRIVATE clause:
private(var) creates a new, uninitialized local copy of the variable var in each thread. In OpenMP 2.5 the value of the original var after the construct is unspecified.

void wrong()
{
    int tmp = 0;
    #pragma omp for private(tmp)
    for (int j = 0; j < 1000; ++j)
        tmp += j;             /* tmp was not initialized in the threads */
    printf("%d\n", tmp);      /* tmp: 0 in OpenMP 3.0, unspecified in 2.5 */
}


The FIRSTPRIVATE clause:
firstprivate is a special case of private: each private copy is initialized with the value the variable had in the master thread before the construct.

void wrong()
{
    int tmp = 0;
    #pragma omp for firstprivate(tmp)
    for (int j = 0; j < 1000; ++j)
        tmp += j;             /* each private tmp starts from 0 */
    printf("%d\n", tmp);      /* tmp: 0 in OpenMP 3.0, unspecified in 2.5 */
}


The LASTPRIVATE clause:
lastprivate copies the value of the private variable from the sequentially last iteration back into the original variable.

void almost_right()
{
    int tmp = 0;
    #pragma omp for firstprivate(tmp) lastprivate(tmp)
    for (int j = 0; j < 1000; ++j)
        tmp += j;             /* each private tmp starts from 0 */
    printf("%d\n", tmp);      /* tmp holds the value from the last iteration (j == 999) */
}


The THREADPRIVATE directive:
Unlike PRIVATE, a THREADPRIVATE variable keeps its value across parallel regions: each thread gets its own persistent copy of a global (file-scope or static) variable.
    #pragma omp threadprivate(Var)
If one thread executes Var = 1 and another executes Var = 2, each thread still sees its own value of Var (... = Var) in the next parallel region. A small sketch follows.
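A small sketch of threadprivate behaviour (not from the original slides; dynamic adjustment is disabled so the team size stays the same between regions):

#include <stdio.h>
#include <omp.h>

int counter = 0;
#pragma omp threadprivate(counter)   /* one persistent copy per thread */

int main(void)
{
    omp_set_dynamic(0);              /* keep the same team size in both regions */
    #pragma omp parallel num_threads(4)
        counter = omp_get_thread_num();          /* each thread writes its copy */

    #pragma omp parallel num_threads(4)          /* values survive between regions */
        printf("thread %d still sees counter = %d\n",
               omp_get_thread_num(), counter);
    return 0;
}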


The DEFAULT clause:
· DEFAULT(SHARED) - the default behaviour.
· DEFAULT(PRIVATE) - available only in Fortran.
· DEFAULT(NONE) - the sharing attribute of every variable used in the region must be specified explicitly.

Example - the same fragment without and with default(none):

itotal = 100
#pragma omp parallel private(np,each)
{
    np = omp_get_num_threads()
    each = itotal/np
    .........
}

itotal = 100
#pragma omp parallel default(none) private(np,each) shared(itotal)
{
    np = omp_get_num_threads()
    each = itotal/np
    .........
}


Creating a parallel region (the parallel construct) - reminder:
#pragma omp parallel [clause [[,] clause] ...]
Clauses: default(shared | none), private(list), firstprivate(list), shared(list), reduction(operator: list), if(scalar-expression), num_threads(integer-expression), copyin(list).



Computing π.

    ∫₀¹ 4.0/(1+x²) dx = π

Numerically, with F(x) = 4.0/(1+x²) and step Δx = 1/N:

    π ≈ Σ (i=0..N) F(xᵢ)·Δx


Computing π: serial program.

#include <stdio.h>
int main()
{
    int n = 100000, i;
    double pi, h, sum, x;
    h = 1.0 / (double) n;
    sum = 0.0;
    for (i = 1; i <= n; i++)
    {
        x = h * ((double)i - 0.5);
        sum += (4.0 / (1.0 + x*x));
    }
    pi = h * sum;
    printf("pi is approximately %.16f", pi);
    return 0;
}


Computing π with a parallel region.

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 32
int main()
{
    int n = 100000, i;
    double pi, h, sum[NUM_THREADS], x;
    h = 1.0 / (double) n;
    #pragma omp parallel default(none) private(i,x) shared(n,h,sum)
    {
        int id = omp_get_thread_num();
        int numt = omp_get_num_threads();
        for (i = id + 1, sum[id] = 0.0; i <= n; i = i + numt)
        {
            x = h * ((double)i - 0.5);
            sum[id] += (4.0 / (1.0 + x*x));
        }
    }
    /* the final summation was truncated in the extracted text; restored: */
    for (i = 0, pi = 0.0; i < NUM_THREADS; i++)
        pi += h * sum[i];
    printf("pi is approximately %.16f", pi);
    return 0;
}


Computing π with the reduction clause.

#include <stdio.h>
#include <omp.h>
int main()
{
    int n = 100000, i;
    double pi, h, sum, x;
    h = 1.0 / (double) n;
    sum = 0.0;
    #pragma omp parallel default(none) private(i,x) shared(n,h) reduction(+:sum)
    {
        int id = omp_get_thread_num();
        int numt = omp_get_num_threads();
        for (i = id + 1; i <= n; i = i + numt)
        {
            x = h * ((double)i - 0.5);
            sum += (4.0 / (1.0 + x*x));
        }
    }
    pi = h * sum;
    printf("pi is approximately %.16f", pi);
    return 0;
}


The reduction clause:
reduction(operator:list) - the variables in list must be shared in the enclosing parallel region. A private copy of each variable is created in every thread and initialized with the identity of the operator (for example, 0 for «+»); the private copies are updated locally and combined into the original variable at the end of the construct. An example with two different operators is sketched below.

Operator and its initial value:
    +    0
    *    1
    -    0
    &    ~0
    |    0
    ^    0
    &&   1
    ||   0
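A small sketch combining two reduction operators (not from the original slides; the array contents are arbitrary):

#include <stdio.h>

int main(void)
{
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double sum = 0.0;        /* identity of «+»  */
    int all_positive = 1;    /* identity of «&&» */

    #pragma omp parallel for reduction(+:sum) reduction(&&:all_positive)
    for (int i = 0; i < 8; i++)
    {
        sum += a[i];                               /* accumulated per thread   */
        all_positive = all_positive && (a[i] > 0); /* combined with && at exit */
    }

    printf("sum = %.1f, all positive: %d\n", sum, all_positive);
    return 0;
}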


The num_threads clause:
num_threads(integer-expression) - the expression gives the number of threads requested for the parallel region.

#include <stdio.h>
#include <omp.h>
void func(int n, int id);   /* defined elsewhere */
int main()
{
    int n = 0;
    printf("Enter the number of intervals: (0 quits) ");
    scanf("%d", &n);
    omp_set_dynamic(1);
    #pragma omp parallel num_threads(10)
    {
        int id = omp_get_thread_num();
        func(n, id);
    }
    return 0;
}


The copyin clause:
copyin(list) - the threadprivate variables in list are initialized in every thread with the master thread's values.

#include <stdlib.h>
float *work;
int size;
float tol;
#pragma omp threadprivate(work, size, tol)

void build()
{
    int i;
    work = (float*)malloc(sizeof(float) * size);
    for (i = 0; i < size; ++i)
        work[i] = tol;
}

int main()
{
    read_from_file(&tol, &size);   /* defined elsewhere */
    #pragma omp parallel copyin(tol, size)
        build();
}



Work-sharing constructs:
· distributing loop iterations between the threads (the for construct);
· executing a block by a single thread of the team (the single construct).



Computing π (reminder):

    ∫₀¹ 4.0/(1+x²) dx = π,   numerically  π ≈ Σ F(xᵢ)·Δx,  F(x) = 4.0/(1+x²)


Computing π: the serial program (shown above) is the starting point.


Computing π in OpenMP: distributing the iterations manually.

#include <stdio.h>
#include <omp.h>
int main()
{
    int n = 100, i;
    double pi, h, sum, x;
    h = 1.0 / (double) n;
    sum = 0.0;
    #pragma omp parallel default(none) private(i,x) shared(n,h) reduction(+:sum)
    {
        int iam  = omp_get_thread_num();
        int numt = omp_get_num_threads();
        int start = iam * n / numt + 1;        /* this thread's block of iterations */
        int end   = (iam + 1) * n / numt;
        for (i = start; i <= end; i++)
        {
            x = h * ((double)i - 0.5);
            sum += (4.0 / (1.0 + x*x));
        }
    }
    pi = h * sum;
    printf("pi is approximately %.16f", pi);
    return 0;
}


Computing π in OpenMP: the for construct.

#include <stdio.h>
#include <omp.h>
int main()
{
    int n = 100, i;
    double pi, h, sum, x;
    h = 1.0 / (double) n;
    sum = 0.0;
    #pragma omp parallel default(none) private(i,x) shared(n,h) reduction(+:sum)
    {
        #pragma omp for schedule(static)
        for (i = 1; i <= n; i++)
        {
            x = h * ((double)i - 0.5);
            sum += (4.0 / (1.0 + x*x));
        }
    }
    pi = h * sum;
    printf("pi is approximately %.16f", pi);
    return 0;
}



The loop construct:

#pragma omp for [clause [[,] clause] ...]
for (init-expr; test-expr; incr-expr)

Clauses:
· private(list)
· firstprivate(list)
· lastprivate(list)
· reduction(operator: list)
· schedule(kind[, chunk_size])
· collapse(n)
· ordered
· nowait


Loop scheduling: the schedule clause (static).

#pragma omp parallel for schedule(static, 10)
for (int i = 1; i <= 100; i++)

With 4 threads the iterations are dealt out in chunks of 10, round-robin:
· thread 0: iterations 1-10, 41-50, 81-90
· thread 1: iterations 11-20, 51-60, 91-100
· thread 2: iterations 21-30, 61-70
· thread 3: iterations 31-40, 71-80


Loop scheduling: the schedule clause (dynamic).

#pragma omp parallel for schedule(dynamic, 15)
for (int i = 1; i <= 100; i++)

With 4 threads the chunks of 15 iterations are handed out on demand, for example:
· thread 0 takes iterations 1-15, thread 1 takes 16-30, thread 2 takes 31-45, thread 3 takes 46-60;
· thread 3 finishes first and takes 61-75;
· thread 2 finishes and takes 76-90;
· thread 0 finishes and takes 91-100.


Loop scheduling: the schedule clause (guided).
The chunk size starts large and shrinks; each next chunk is
    chunk_size = max(remaining_iterations / omp_get_num_threads(), chunk)

#pragma omp parallel for schedule(guided, 10)
for (int i = 1; i <= 100; i++)

With 4 threads, for example:
· thread 0 takes iterations 1-25, thread 1 takes 26-44, thread 2 takes 45-59, thread 3 takes 60-69;
· thread 3 finishes and takes 70-79;
· thread 2 finishes and takes 80-89;
· thread 3 finishes and takes 90-99;
· thread 1 finishes and takes iteration 100.


Loop scheduling: the schedule clause (runtime).

#pragma omp parallel for schedule(runtime)
for (int i = 1; i <= 100; i++)   /* the schedule is chosen at run time */

The schedule is taken from the OMP_SCHEDULE environment variable:
    csh:     setenv OMP_SCHEDULE "dynamic,4"
    ksh:     export OMP_SCHEDULE="static,10"
    Windows: set OMP_SCHEDULE=auto
or is set from the program:
    void omp_set_schedule(omp_sched_t kind, int modifier);


Loop scheduling: the schedule clause (auto).

#pragma omp parallel for schedule(auto)
for (int i = 1; i <= 100; i++)

The choice of schedule is left to the compiler and/or the runtime system.


The nowait clause removes the implied barrier at the end of a loop construct.

void example(int n, float *a, float *b, float *c, float *z)
{
    int i;
    float sum = 0.0;
    #pragma omp parallel
    {
        #pragma omp for schedule(static) nowait reduction(+:sum)
        for (i = 0; i < n; i++)     /* the loop bodies were truncated in the extracted  */
            sum += a[i] * b[i];     /* text; these bodies and the parameter name 'c'    */
        #pragma omp for schedule(static) nowait
        for (i = 0; i < n; i++)     /* are assumptions used for illustration            */
            z[i] = a[i] + c[i];
    }
}


Executing a block by one thread (the single construct).

#pragma omp single [clause [[,] clause] ...]

Clauses: private(list), firstprivate(list), copyprivate(list), nowait.
The other threads of the team wait at the end of the single block unless NOWAIT is specified; copyprivate broadcasts the values produced by the executing thread to the private copies of the other threads.

#include <stdio.h>
float x, y;
#pragma omp threadprivate(x, y)

void init(float a, float b)
{
    #pragma omp single copyprivate(a, b, x, y)
        scanf("%f %f %f %f", &a, &b, &x, &y);
}

int main()
{
    #pragma omp parallel
    {
        float x1, y1;
        init(x1, y1);
        parallel_work();   /* defined elsewhere */
    }
}



Synchronization constructs: master, critical, barrier, flush.


The master construct:
#pragma omp master - the following block is executed only by the master thread; there is no implied barrier.

#include <stdio.h>
void init(float *a, float *b)
{
    #pragma omp master
        scanf("%f %f", a, b);
    #pragma omp barrier
}

int main()
{
    float x, y;
    #pragma omp parallel
    {
        init(&x, &y);
        parallel_work(x, y);   /* defined elsewhere */
    }
}



Why synchronization is needed: a data race.

int i = 0;
#pragma omp parallel
{
    i++;
}

A possible interleaving of two threads:
    Thread 0              Thread 1
 1  load  i (i = 0)
 2  incr  i (i = 1)       load  i (i = 0)
 3                        incr  i (i = 1)
 4  store i (i = 1)
 5                        store i (i = 1)

Both threads increment i, but the final value is 1 instead of 2: the result depends on the timing of the threads. A possible fix is sketched below.
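One common way to make the increment correct is the atomic construct (not from the original slides; the critical construct discussed next works as well):

#include <stdio.h>
int main(void)
{
    int i = 0;
    #pragma omp parallel
    {
        #pragma omp atomic
        i++;                 /* the read-modify-write is now indivisible */
    }
    printf("i = %d\n", i);   /* equals the number of threads in the team */
    return 0;
}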



The critical construct:
A critical section is executed by at most one thread at a time; a thread arriving at an occupied critical section waits until it becomes free; all unnamed critical sections (and all sections with the same name) exclude one another.



Computing π (reminder):

    ∫₀¹ 4.0/(1+x²) dx = π,   numerically  π ≈ Σ F(xᵢ)·Δx,  F(x) = 4.0/(1+x²)


Computing π: starting again from the serial program shown above.


Computing π in OpenMP with a critical section:

#pragma omp critical [(name)]

#include <stdio.h>
#include <omp.h>
int main()
{
    int n = 100000, i;
    double pi, h, sum, x;
    h = 1.0 / (double) n;
    sum = 0.0;
    #pragma omp parallel default(none) private(i,x) shared(n,h,sum)
    {
        double local_sum = 0.0;
        #pragma omp for
        for (i = 1; i <= n; i++)
        {
            x = h * ((double)i - 0.5);
            local_sum += (4.0 / (1.0 + x*x));
        }
        #pragma omp critical          /* only one thread at a time updates sum */
            sum += local_sum;
    }
    pi = h * sum;
    printf("pi is approximately %.16f", pi);
    return 0;
}


The barrier construct synchronizes all threads of the team: each thread waits until all of them reach the barrier.

#pragma omp barrier

Implied barriers exist at the end of the parallel construct and at the end of the work-sharing constructs (for, single, sections) unless nowait is specified.

int size;
#pragma omp parallel
{
    #pragma omp master
    {
        scanf("%d", &size);
    }
    #pragma omp barrier      /* make sure size has been read before it is used */
    process(size);
}


The flush construct:

#pragma omp flush [(list)]

An implicit flush (without a list) is performed:
· in a barrier;
· on entry to and exit from the parallel, critical and ordered constructs;
· on exit from the work-sharing constructs (for, single, sections, workshare) unless nowait is specified;
· in omp_set_lock and omp_unset_lock;
· in omp_test_lock, omp_set_nest_lock, omp_unset_nest_lock and omp_test_nest_lock, if the lock state changes.
The atomic construct implies #pragma omp flush(x), where x is the variable updated by the atomic operation.


Controlling an OpenMP program: Internal Control Variables (ICVs).
The behaviour of an OpenMP program is controlled by internal control variables. They can be set and queried through environment variables and runtime library routines.


Internal Control Variables:
· for parallel regions: nthreads-var, thread-limit-var, dyn-var, nest-var, max-active-levels-var;
· for loop scheduling: run-sched-var, def-sched-var;
· for the execution environment: stacksize-var, wait-policy-var.


Internal Control Variables: nthreads-var.
Controls the number of threads requested for parallel regions.
Setting the initial value:
    C shell:    setenv OMP_NUM_THREADS 16
    Korn shell: export OMP_NUM_THREADS=16
    Windows:    set OMP_NUM_THREADS=16
Setting from the program:
    void omp_set_num_threads(int num_threads);
Querying:
    int omp_get_max_threads(void);
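A small sketch of setting and querying nthreads-var (not from the original slides):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(16);                 /* same effect as OMP_NUM_THREADS=16 */
    printf("max threads for the next region: %d\n", omp_get_max_threads());
    #pragma omp parallel
    {
        #pragma omp master
        printf("team size actually obtained: %d\n", omp_get_num_threads());
    }
    return 0;
}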


Internal Control Variables: run-sched-var.
Defines the schedule used by loops with schedule(runtime).
Setting the initial value:
    C shell:    setenv OMP_SCHEDULE "guided,4"
    Korn shell: export OMP_SCHEDULE="dynamic,5"
    Windows:    set OMP_SCHEDULE=static
Setting from the program:
    void omp_set_schedule(omp_sched_t kind, int modifier);
Querying:
    void omp_get_schedule(omp_sched_t *kind, int *modifier);

typedef enum omp_sched_t {
    omp_sched_static  = 1,
    omp_sched_dynamic = 2,
    omp_sched_guided  = 3,
    omp_sched_auto    = 4
} omp_sched_t;
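A sketch combining omp_set_schedule with schedule(runtime) (not from the original slides; the chunk size 4 is arbitrary):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_schedule(omp_sched_dynamic, 4);  /* run-sched-var := dynamic,4 */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < 16; i++)
        printf("iteration %2d executed by thread %d\n", i, omp_get_thread_num());
    return 0;
}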


OpenMP runtime routines: omp_get_num_threads.
int omp_get_num_threads(void); - returns the number of threads in the current team.

#include <omp.h>
void work(int i);
void test()
{
    int np;
    np = omp_get_num_threads();   /* outside a parallel region: np == 1 */
    #pragma omp parallel private(np)
    {
        np = omp_get_num_threads();
        #pragma omp for schedule(static)
        for (int i = 0; i < np; i++)
            work(i);
    }
}


OpenMP runtime routines: omp_get_thread_num.
int omp_get_thread_num(void); - returns the number of the calling thread, in the range [0 : omp_get_num_threads()-1].

#include <omp.h>
void work(int i);
void test()
{
    int iam;
    iam = omp_get_thread_num();   /* outside a parallel region: iam == 0 */
    #pragma omp parallel private(iam)
    {
        iam = omp_get_thread_num();
        work(iam);
    }
}


OpenMP runtime routines: omp_get_num_procs.
int omp_get_num_procs(void); - returns the number of processors available to the program.

#include <omp.h>
void work(int i);
void test()
{
    int nproc;
    nproc = omp_get_num_procs();
    #pragma omp parallel num_threads(nproc)
    {
        int iam = omp_get_thread_num();
        work(iam);
    }
}


Timing an OpenMP program:
double omp_get_wtime(void); - returns the wall-clock time, in seconds, elapsed since some fixed point in the past. The fixed point does not change while the program runs, but the values returned by different threads are not necessarily consistent with each other.

double start, end;
start = omp_get_wtime();
/* ... work to be timed ... */
end = omp_get_wtime();
printf("Work took %f seconds\n", end - start);

double omp_get_wtick(void); - returns the timer resolution in seconds (the time between two successive clock ticks).



The slides and additional materials are available at:
ftp://ftp.keldysh.ru/K_student/MSU2010/MSU2010_MPI_OpenMP.pdf



References:
· OpenMP Application Program Interface Version 3.0, May 2008. http://www.openmp.org/mp-documents/spec30.pdf
· MPI: A Message-Passing Interface Standard Version 2.2, September 2009. http://www.mpi-forum.org/docs/mpi-2.2/mpi22-report.pdf
· A.S. Antonov. Parallel Programming with OpenMP (in Russian). Moscow, MSU, 2009. http://parallel.ru/info/parallel/openmp/OpenMP.pdf
· A.S. Antonov. Parallel Programming with MPI (in Russian). Moscow, MSU, 2004. http://parallel.ru/tech/tech_dev/MPI/mpibook.pdf
· V.V. Voevodin, Vl.V. Voevodin. Parallel Computing (in Russian). St. Petersburg, 2002.



Questions about this lecture and about the use of MPI and OpenMP can be sent to:
V.A. Bakhtin, bakhtin@keldysh.ru


Appendix: the MPI functions used in the examples.

MPI initialization:
    int MPI_Init(int *argc, char ***argv)
It must be called by every MPI process before any other MPI function.
MPI termination:
    int MPI_Finalize(void)



Determining the number of processes in a communicator:
    int MPI_Comm_size(MPI_Comm comm, int *size)
Determining the rank of the calling process in a communicator:
    int MPI_Comm_rank(MPI_Comm comm, int *rank)



Non-blocking send:
    int MPI_Isend(void *buf, int count, MPI_Datatype type, int dest,
                  int tag, MPI_Comm comm, MPI_Request *request)
· buf - address of the data to be sent,
· count - number of elements in the message,
· type - datatype of the elements,
· dest - rank of the destination process,
· tag - message tag, used to distinguish messages,
· comm - communicator in which the transfer takes place,
· request - handle used later to complete the operation.
Non-blocking receive:
    int MPI_Irecv(void *buf, int count, MPI_Datatype type, int source,
                  int tag, MPI_Comm comm, MPI_Request *request)
· buf, count, type - the receive buffer, described as in MPI_Isend,
· source - rank of the sending process,
· tag - tag of the message to be received,
· comm - communicator,
· request - handle used later (e.g. by MPI_Waitall) to complete the receive.


MPI_Waitall.
Waiting for the completion of a set of non-blocking operations:
    int MPI_Waitall(int count, MPI_Request array_of_requests[],
                    MPI_Status array_of_statuses[])
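A minimal sketch of the Isend/Irecv/Waitall pattern used in the Jacobi examples (not from the original slides): each process sends one double to its right neighbour on a ring and receives one from its left neighbour.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, size;
    double sendval, recvval;
    MPI_Request req[2];
    MPI_Status  st[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendval = (double)rank;
    int right = (rank + 1) % size;          /* ring neighbours */
    int left  = (rank - 1 + size) % size;

    MPI_Irecv(&recvval, 1, MPI_DOUBLE, left,  7, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(&sendval, 1, MPI_DOUBLE, right, 7, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, st);                /* both operations complete here */

    printf("%d: received %.0f from %d\n", rank, recvval, left);
    MPI_Finalize();
    return 0;
}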


MPI_Cart_create.
Creating a communicator with Cartesian (grid) topology:
    int MPI_Cart_create(MPI_Comm oldcomm, int ndims, int *dims, int *periods,
                        int reorder, MPI_Comm *cartcomm)
· oldcomm - the original communicator,
· ndims - number of dimensions of the grid,
· dims - array of ndims elements with the size of each dimension,
· periods - array of ndims flags telling whether each dimension is periodic,
· reorder - whether MPI may renumber the processes,
· cartcomm - the resulting communicator with Cartesian topology.


MPI_Cart_shift.
Determining the neighbours along one dimension of the grid:
    int MPI_Cart_shift(MPI_Comm comm, int dir, int disp, int *source, int *dst)
Returns the ranks of the processes (source and dst) displaced by -disp and +disp from the calling process along dimension dir of the Cartesian communicator comm; MPI_PROC_NULL is returned where a non-periodic grid has no neighbour.
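A sketch of how these calls set up the 2D grid used in the Jacobi-2d example (not from the original slides; a 2 x 2 non-periodic grid run on 4 processes is assumed):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int rank, dims[2] = {2, 2}, periods[2] = {0, 0}, coords[2];
    int pup, pdown, pleft, pright;
    MPI_Comm grid;

    MPI_Init(&argc, &argv);                        /* run with 4 processes */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    MPI_Cart_shift(grid, 0, 1, &pup,   &pdown);    /* neighbours along rows    */
    MPI_Cart_shift(grid, 1, 1, &pleft, &pright);   /* neighbours along columns */

    printf("rank %d at (%d,%d): up=%d down=%d left=%d right=%d\n",
           rank, coords[0], coords[1], pup, pdown, pleft, pright);
    MPI_Finalize();
    return 0;
}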


MPI_Cart_coords.
Determining the Cartesian coordinates of a process from its rank:
    int MPI_Cart_coords(MPI_Comm comm, int rank, int ndims, int *coords)
· comm - communicator with Cartesian topology,
· rank - rank of the process whose coordinates are requested,
· ndims - number of dimensions of the grid,
· coords - returned coordinates of the process.


MPI_Type_vector.
MPI derived datatypes describe the layout of a message in memory as a sequence of blocks of elements of an existing type placed at a regular stride. The vector constructor is:
    int MPI_Type_vector(int count, int blocklen, int stride,
                        MPI_Datatype oldtype, MPI_Datatype *newtype)
· count - number of blocks,
· blocklen - number of elements in each block,
· stride - distance (in elements of oldtype) between the starts of consecutive blocks,
· oldtype - type of the elements,
· newtype - the resulting derived datatype.
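A sketch showing why MPI_Type_vector is used for column exchange in the Jacobi-2d example (not from the original slides): one column of a row-major block with NCOL+2 doubles per row is count=NROW elements with stride NCOL+2. Run with at least 2 processes.

#include <stdio.h>
#include "mpi.h"

#define NROW 4
#define NCOL 6

int main(int argc, char **argv)
{
    double a[NROW][NCOL + 2];
    MPI_Datatype coltype;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one element per row, NROW rows, consecutive rows are NCOL+2 doubles apart */
    MPI_Type_vector(NROW, 1, NCOL + 2, MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);

    if (rank == 0)
    {
        for (int i = 0; i < NROW; i++)
            for (int j = 0; j < NCOL + 2; j++)
                a[i][j] = 10.0 * i + j;
        MPI_Send(&a[0][1], 1, coltype, 1, 0, MPI_COMM_WORLD);   /* send column 1 */
    }
    else if (rank == 1)
    {
        MPI_Recv(&a[0][0], 1, coltype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (int i = 0; i < NROW; i++)                          /* received into column 0 */
            printf("row %d: %.0f\n", i, a[i][0]);
    }

    MPI_Type_free(&coltype);
    MPI_Finalize();
    return 0;
}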


MPI_Type_commit.
Before it is used, a derived datatype must be committed:
    int MPI_Type_commit(MPI_Datatype *type)
When it is no longer needed, it is released with:
    int MPI_Type_free(MPI_Datatype *type)