Beowulf Applications and User Experiences
Daniel S. Katz
Daniel.S.Katz@jpl.nasa.gov

High Performance Computing Group, Imaging and Spectrometry Systems Technology Section

Beowulf System at JPL (Hyglac)
- 16 Pentium Pro PCs, each with a 2.5 Gbyte disk, 128 Mbyte of memory, and a Fast Ethernet card.
- Connected by a 100Base-T network through a 16-way crossbar switch.
- Theoretical peak: 3.2 GFLOP/s; sustained: 1.26 GFLOP/s.
- Six applications analyzed in the paper by Katz et al., Advances in Engineering Software, v. 26, August 1998.



Hyglac Cost
- Hardware cost: $54,200 (as built, 9/96); $22,000 (estimate, 4/98)
  » 16 × (CPU, disk, memory, cables)
  » 1 × (16-way switch, monitor, keyboard, mouse)
- Software cost: $600 (+ maintenance)
  » Absoft Fortran compilers (should be $900)
  » NAG F90 compiler ($600)
  » public-domain OS, compilers, tools, libraries


Beowulf System at Caltech (Naegling)
- ~120 Pentium Pro PCs, each with a 3 Gbyte disk, 128 Mbyte of memory, and a Fast Ethernet card.
- Connected by a 100Base-T network through two 80-way switches joined by a 4 Gbit/s link.
- Theoretical peak: ~24 GFLOP/s; sustained: 10.9 GFLOP/s.


Naegling Cost
- Hardware cost: $190,000 (as built, 9/97); $154,000 (estimate, 4/98)
  » 120 × (CPU, disk, memory, cables)
  » 1 × (switch, front-end CPU, monitor, keyboard, mouse)
- Software cost: $0 (+ maintenance)
  » Absoft Fortran compilers (should be $900)
  » public-domain OS, compilers, tools, libraries


Performance Comparisons
                                     Hyglac   Naegling    T3D   T3E-600
CPU Speed (MHz)                         200        200    150       300
Peak Rate (MFLOP/s)                     200        200    300       600
Memory (Mbyte)                          128        128     64       128
Communication Latency (µs)              150        322     35        18
Communication Throughput (Mbit/s)        66         78    225      1200

(Communication results are for MPI code)
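
A rough back-of-the-envelope illustration (not from the talk) of what these numbers mean for a single 100 kbit message, using the simple model t ≈ latency + size/throughput:

\[
t_{\text{Hyglac}} \approx 150\,\mu\text{s} + \frac{100\,\text{kbit}}{66\,\text{Mbit/s}} \approx 1.7\,\text{ms},
\qquad
t_{\text{T3E-600}} \approx 18\,\mu\text{s} + \frac{100\,\text{kbit}}{1200\,\text{Mbit/s}} \approx 0.10\,\text{ms},
\]

a gap of roughly a factor of 16 per message, which is the kind of difference the communication-heavy applications below have to absorb.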



Message-Passing Methodology
- Issue (non-blocking) receive calls: CALL MPI_IRECV(...)
- Issue (synchronous) send calls: CALL MPI_SSEND(...)
- Issue (blocking) wait calls (wait for the receives to complete): CALL MPI_WAIT(...)
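
A minimal sketch of this pattern (an illustration, not code from the talk), assuming a periodic 1-D decomposition in which each process swaps a boundary strip with its left and right neighbors; the array names and sizes are invented for the example:

program exchange_demo
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 1024                  ! boundary strip length (arbitrary)
  double precision :: sendl(n), sendr(n), recvl(n), recvr(n)
  integer :: rank, nproc, left, right, req(2), ierr
  integer :: stat(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
  left  = mod(rank - 1 + nproc, nproc)            ! periodic neighbors
  right = mod(rank + 1, nproc)
  sendl = rank; sendr = rank                      ! stand-in for real boundary data

  ! 1. Post non-blocking receives first, so the buffers are ready.
  call MPI_IRECV(recvl, n, MPI_DOUBLE_PRECISION, left, 1, &
                 MPI_COMM_WORLD, req(1), ierr)
  call MPI_IRECV(recvr, n, MPI_DOUBLE_PRECISION, right, 2, &
                 MPI_COMM_WORLD, req(2), ierr)

  ! 2. Synchronous sends: they complete only once the matching receive
  !    has started, so no extra system buffering is needed.
  call MPI_SSEND(sendr, n, MPI_DOUBLE_PRECISION, right, 1, &
                 MPI_COMM_WORLD, ierr)
  call MPI_SSEND(sendl, n, MPI_DOUBLE_PRECISION, left, 2, &
                 MPI_COMM_WORLD, ierr)

  ! 3. Wait for both receives to complete before using the ghost data.
  call MPI_WAIT(req(1), stat, ierr)
  call MPI_WAIT(req(2), stat, ierr)

  call MPI_FINALIZE(ierr)
end program exchange_demo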


Finite-Difference Time-Domain Application

[Images produced at U of Colorado's Comp. EM Lab by Matt Larson, using SGI's LC FDTD code.]

Time steps of a Gaussian pulse travelling on a microstrip, showing coupling to a neighboring strip and crosstalk to a crossing strip. Colors showing currents are relative to the peak current on that strip. Pulse: rise time = 70 ps, frequency 0 to 30 GHz. Grid dimensions = 282 × 362 × 102 cells. Cell size = 1 mm³.


FDTD Algorithm
- Classic time-marching PDE solver (a 1-D sketch of the update loop follows below).
- Parallelized using a 2-dimensional domain decomposition with ghost cells.

[Figure: standard domain decomposition, showing interior cells and the required ghost cells.]
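
As a reminder of what the time-marching loop looks like, here is a deliberately tiny 1-D scalar sketch of the FDTD leapfrog update (an illustration only; the code discussed here is 3-D, vector-valued, and parallelized as described above):

program fdtd1d
  implicit none
  integer, parameter :: nx = 200, nsteps = 500
  double precision :: ez(nx), hy(nx)
  integer :: i, n
  ez = 0.0d0; hy = 0.0d0
  do n = 1, nsteps
     ! update the magnetic field from the curl of E (normalized units,
     ! Courant number 0.5)
     do i = 1, nx - 1
        hy(i) = hy(i) + 0.5d0 * (ez(i+1) - ez(i))
     end do
     ! update the electric field from the curl of H
     do i = 2, nx
        ez(i) = ez(i) + 0.5d0 * (hy(i) - hy(i-1))
     end do
     ! hard source: Gaussian pulse injected near the left end
     ez(2) = ez(2) + exp(-((n - 40) / 12.0d0)**2)
  end do
  print *, 'final ez(nx/2) =', ez(nx/2)
end program fdtd1d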



FDTD Results
Number of Processors   Naegling          T3D              T3E-600
 1                     2.44 - 0.00       2.71 - 0.000     0.851 - 0.000
 4                     2.46 - 0.097      2.79 - 0.026     0.859 - 0.019
16                     2.46 - 0.21       2.79 - 0.024     0.859 - 0.051
64                     2.46 - 0.32       2.74 - 0.076     0.859 - 0.052

Time (wall-clock seconds per time step) for a scaled problem size (69 × 69 × 76 cells per processor); entries are computation - communication.

- Initial tests indicate a 20% computational speed-up with a 300 MHz Pentium II.


FDTD Conclusions
- On all numbers of processors, Beowulf-class computers perform similarly to the T3D and worse than the T3E, as expected.
- A few large messages each time step take significantly longer on the Beowulf, but do not make an overall difference.



PHOEBUS Application (D. Katz, T. Cwik)
Typical applications:

[Figure: radiation pattern from a JPL choke-ring circular waveguide (from C. Zuffada et al., IEEE AP-S paper, 1/97), with the finite element region and integral equation boundary labeled.]

[Figure: radar cross section of a dielectric cylinder, with the finite element region and integral equation boundary labeled.]



PHOEBUS Coupled Equations
\[
\begin{bmatrix} K & C & 0 \\ C & 0 & Z_M \\ 0 & Z_0 & Z_J \end{bmatrix}
\begin{bmatrix} H \\ M \\ J \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ V^{\mathrm{inc}} \end{bmatrix}
\]
- This matrix problem is filled and solved by PHOEBUS:
  » The K submatrix is a sparse finite element matrix.
  » The Z submatrices are integral equation matrices.
  » The C submatrices are coupling matrices between the FE and IE matrices.


PHOEBUS Solution Process
\[
\begin{bmatrix} K & C & 0 \\ C & 0 & Z_M \\ 0 & Z_0 & Z_J \end{bmatrix}
\begin{bmatrix} H \\ M \\ J \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ V \end{bmatrix}
\]

- Eliminate H using \( H = -K^{-1} C M \), leaving the reduced system

\[
\begin{bmatrix} -C K^{-1} C & Z_M \\ Z_0 & Z_J \end{bmatrix}
\begin{bmatrix} M \\ J \end{bmatrix}
=
\begin{bmatrix} 0 \\ V \end{bmatrix}
\]
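
Spelling out the elimination (it follows directly from the first two block rows above):

\[
K H + C M = 0 \;\Rightarrow\; H = -K^{-1} C M,
\qquad
C H + Z_M J = 0 \;\Rightarrow\; -C K^{-1} C M + Z_M J = 0 ,
\]

while the third block row, \( Z_0 M + Z_J J = V \), is carried over unchanged.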



- Find -C K⁻¹ C using QMR on each column of C, building the corresponding rows of K⁻¹C and multiplying by -C.
- Solve the reduced system as a dense matrix.



PHOEBUS Algorithm
- Assemble the complete matrix
- Reorder to minimize and equalize the row bandwidth of K
- Partition the matrices into slabs
- Distribute the slabs among processors
- Solve the sparse matrix equation (step 1)
- Solve the dense matrix equation (step 2)
- Calculate observables


PHOEBUS Matrix Reordering

[Figure: non-zero structure of the matrices, for the original system and for the system after reordering for minimum bandwidth, using SPARSPAK's GENRCM reordering routine.]


PHOEBUS Matrix-Vector Multiply
[Figure: banded matrix (rows × columns) multiplying a distributed vector; each processor handles its own block of rows and communicates with the processors to its left and right for the vector entries just outside its block.]
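
A minimal sketch of the local piece of this operation (an assumption about the data layout, not the PHOEBUS source): each processor holds its slab of rows in compressed sparse row (CSR) form, and the vector it multiplies against consists of its owned entries plus a halo on each side, filled in by the left/right exchange pictured above using the IRECV/SSEND/WAIT pattern from the message-passing slide.

subroutine local_matvec(nloc, nhalo, ia, ja, a, x, y)
  implicit none
  integer, intent(in) :: nloc, nhalo          ! owned rows, halo width per side
  integer, intent(in) :: ia(nloc+1), ja(*)    ! CSR row pointers / column indices
  double complex, intent(in)  :: a(*)         ! matrix entries (complex-valued)
  double complex, intent(in)  :: x(1-nhalo:nloc+nhalo)  ! left halo | owned | right halo
  double complex, intent(out) :: y(nloc)
  integer :: i, k
  do i = 1, nloc
     y(i) = (0.0d0, 0.0d0)
     do k = ia(i), ia(i+1) - 1
        ! because of the bandwidth-reducing reordering, ja(k) only ever falls
        ! inside [1-nhalo, nloc+nhalo], i.e. within the owned rows or a halo
        y(i) = y(i) + a(k) * x(ja(k))
     end do
  end do
end subroutine local_matvec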



PHOEBUS Solver Timing
Model: dielectric cylinder with 43,791 edges, radius = 1 cm, height = 10 cm, permittivity = 4.0, at 5.0 GHz

                 Matrix-Vector Multiply   Matrix-Vector Multiply   Other Work   Total
                 Computation              Communication
T3D (shmem)          1290                      114                    407       1800
T3D (MPI)            1290                      272                    415       1980
Naegling (MPI)       1502                     1720                   1211       4433

Time to convergence (CPU seconds), solving on 16 processors with a pseudo-block QMR algorithm for 116 right-hand sides.



PHOEBUS Solver Timing
Model: dielectric cylinder with 100,694 edges, radius = 1 cm, height = 10 cm, permittivity = 4.0, at 5.0 GHz

                 Matrix-Vector Multiply   Matrix-Vector Multiply   Other Work   Total
                 Computation              Communication
T3D (shmem)           868                      157                    323       1348
T3D (MPI)             919                      254                    323       1496
Naegling (MPI)       1034                     2059                    923       4016

Time to convergence (CPU seconds), solving on 64 processors with a pseudo-block QMR algorithm for 116 right-hand sides.



PHOEBUS Conclusions
- The Beowulf is 2.4 times slower than the T3D on 16 nodes, and 3.0 times slower on 64 nodes.
- The slowdown will continue to increase for larger numbers of nodes.
- The T3D is about 3 times slower than the T3E.
- The cost ratio between a Beowulf and the other machines determines the balance points.


Physical Optics Application (D. Katz, T. Cwik)

[Figure: DSN antenna, 34 meter main reflector.]

[Figure: MIRO antenna, 30 cm main reflector.]



Physical Optics Algorithm
[Figure: dual-reflector geometry showing the feed horn, the sub-reflector (faceted into N triangles), and the main reflector (faceted into M triangles), annotated with steps 1-5 below.]

1. Create a mesh with N triangles on the sub-reflector.
2. Compute the N currents on the sub-reflector due to the feed horn (or read the currents from a file).
3. Create a mesh with M triangles on the main reflector.
4. Compute the M currents on the main reflector due to the currents on the sub-reflector (see the sketch below).
5. Compute the antenna pattern due to the currents on the main reflector (or write the currents to a file).
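
A heavily simplified scalar sketch of step 4 (an assumption for illustration; the real computation uses vector currents and the full physical-optics integrand): every one of the M main-reflector facets accumulates a spherical-wave contribution from every one of the N sub-reflector facets, which is why the work is O(M·N) and parallelizes naturally over the main-reflector currents, as the next slide describes.

subroutine main_currents(m, n, xm, xs, js, k0, jm)
  implicit none
  integer, intent(in) :: m, n
  double precision, intent(in) :: xm(3,m), xs(3,n)   ! facet centroids
  double complex,   intent(in) :: js(n)              ! sub-reflector currents
  double precision, intent(in) :: k0                 ! free-space wavenumber
  double complex,  intent(out) :: jm(m)              ! main-reflector currents
  integer :: i, j
  double precision :: r
  do i = 1, m
     jm(i) = (0.0d0, 0.0d0)
     do j = 1, n
        r = sqrt(sum((xm(:,i) - xs(:,j))**2))
        ! spherical-wave factor exp(-j*k0*r)/r from sub facet j to main facet i
        jm(i) = jm(i) + js(j) * exp((0.0d0, -1.0d0) * (k0 * r)) / r
     end do
  end do
end subroutine main_currents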



Parallelization of PO Algorithm
- Distribute the (M) main-reflector currents over all (P) processors.
- Store all (N) sub-reflector currents redundantly on all (P) processors.
- Creation of triangles is sequential, but computation of geometry information on the triangles is parallel, so steps 1 and 3 are partially parallel.
- Computation of currents (steps 2, 4, and 5) is parallel, though communication is required in step 2 (MPI_Allgatherv) and step 5 (MPI_Reduce); see the sketch at the end of this slide.

- Timing:
  » Part I: read input files, perform step 3
  » Part II: perform steps 1, 2, and 4
  » Part III: perform step 5 and write output files
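
A minimal sketch (invented for illustration, not the actual PO code) of the two collective operations named above: MPI_ALLGATHERV assembles the full sub-reflector current vector on every processor (step 2), and MPI_REDUCE sums the partial antenna patterns computed from each processor's share of the main-reflector currents (step 5). The sizes and data are placeholders.

program po_comm_demo
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 4900, npat = 181      ! illustrative sizes
  integer :: rank, nproc, ierr, p, nloc
  integer, allocatable :: counts(:), displs(:)
  double precision, allocatable :: jlocal(:), jsub(:)
  double precision :: pat(npat), pat_total(npat)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)

  ! block-distribute the N sub-reflector currents over the processors
  allocate(counts(nproc), displs(nproc))
  do p = 1, nproc
     counts(p) = n / nproc
     if (p <= mod(n, nproc)) counts(p) = counts(p) + 1
     displs(p) = sum(counts(1:p-1))
  end do
  nloc = counts(rank+1)
  allocate(jlocal(nloc), jsub(n))
  jlocal = dble(rank)                    ! stand-in for locally computed currents

  ! step 2: every processor ends up with the full sub-reflector current vector
  call MPI_ALLGATHERV(jlocal, nloc, MPI_DOUBLE_PRECISION, &
                      jsub, counts, displs, MPI_DOUBLE_PRECISION, &
                      MPI_COMM_WORLD, ierr)

  ! step 5: sum the partial antenna patterns onto processor 0
  pat = 1.0d0                            ! stand-in for a partial pattern
  call MPI_REDUCE(pat, pat_total, npat, MPI_DOUBLE_PRECISION, MPI_SUM, &
                  0, MPI_COMM_WORLD, ierr)

  if (rank == 0) print *, 'pattern sample:', pat_total(1)
  deallocate(counts, displs, jlocal, jsub)
  call MPI_FINALIZE(ierr)
end program po_comm_demo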



Physical Optics Results (Two Beowulf Compilers)
Number of Processors        1        4        16
Part I                  0.0850   0.0515    0.0437
Part II                   64.3     16.2      4.18
Part III                  1.64    0.431     0.110
Total                     66.0     16.7      4.33

Time (minutes) on Hyglac, using the GNU compiler (g77 -O2 -fno-automatic)
Number of Processors        1        4        16
Part I                  0.0482   0.0303    0.0308
Part II                   46.4     11.6      2.93
Part III                 0.932    0.237    0.0652
Total                     47.4     11.9      3.03

Time (minutes) on Hyglac, using the Absoft compiler (f77 -O -s)

M = 40,000, N = 4,900


Physical Optics Results
Number of Processors   Naegling    T3D    T3E-600
 4                         95.5    102       35.1
16                         24.8   26.4       8.84
64                         7.02   7.57       2.30

Time (minutes), N = 160,000, M = 10,000

- Initial tests indicate a 40% computational speed-up with a 300 MHz Pentium II.



PO Conclusions
- The performance of codes with very small amounts of communication is determined by CPU speed.
- Naegling's results fall between the T3D's and the T3E's.
- This is close to the best that can be attained with Beowulf-class computers.



Incompressible Fluid Flow Solver (John Lou)

[Image: vorticity projections in streamwise-vertical planes.]

- Flow problem: 3-D driven cavity flow, Re = 2,500
- Grid size: 256 × 256 × 256
- Algorithm: second-order projection method with a multigrid full V-cycle kernel (see the sketch below)
- Computer: Cray T3D with 256 processors
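
For readers unfamiliar with the kernel, here is a deliberately small 1-D sketch of a multigrid V-cycle for -u'' = f with homogeneous Dirichlet boundaries (an illustration only, far simpler than the 3-D solver used in the projection method): smooth, restrict the residual to a coarser grid, recurse, interpolate the correction back, and smooth again.

recursive subroutine vcycle(u, f, n, h)
  implicit none
  integer, intent(in) :: n                         ! interior points; assumes n = 2**k - 1
  double precision, intent(inout) :: u(0:n+1)      ! solution with boundary values u(0)=u(n+1)=0
  double precision, intent(in)    :: f(0:n+1)      ! right-hand side
  double precision, intent(in)    :: h             ! grid spacing
  double precision :: r(0:n+1), rc(0:(n-1)/2+1), ec(0:(n-1)/2+1)
  integer :: i, it, nc

  do it = 1, 3                                     ! pre-smoothing (Gauss-Seidel sweeps)
     do i = 1, n
        u(i) = 0.5d0 * (u(i-1) + u(i+1) + h*h*f(i))
     end do
  end do

  if (n <= 3) return                               ! coarsest grid: smoothing alone suffices

  do i = 1, n                                      ! residual r = f - A u
     r(i) = f(i) + (u(i-1) - 2.0d0*u(i) + u(i+1)) / (h*h)
  end do
  r(0) = 0.0d0; r(n+1) = 0.0d0

  nc = (n - 1) / 2                                 ! coarse grid keeps every other point
  do i = 1, nc                                     ! full-weighting restriction
     rc(i) = 0.25d0 * (r(2*i-1) + 2.0d0*r(2*i) + r(2*i+1))
  end do
  rc(0) = 0.0d0; rc(nc+1) = 0.0d0
  ec = 0.0d0
  call vcycle(ec, rc, nc, 2.0d0*h)                 ! coarse-grid correction (recursion)

  do i = 1, nc                                     ! interpolate the correction and apply it
     u(2*i) = u(2*i) + ec(i)
  end do
  do i = 1, nc + 1
     u(2*i-1) = u(2*i-1) + 0.5d0 * (ec(i-1) + ec(i))
  end do

  do it = 1, 3                                     ! post-smoothing
     do i = 1, n
        u(i) = 0.5d0 * (u(i-1) + u(i+1) + h*h*f(i))
     end do
  end do
end subroutine vcycle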



Incompressible Fluid Flow Solver (John Lou)
Grid Size    Number of Processors   Beowulf Time        T3D Time             T3E Time
128 × 128            1              6.4 - 6.4 - 0.0     13.8 - 13.8 - 0.0    5.8 - 5.8 - 0.0
256 × 256            4              22.2 - 7.0 - 15.2   19.1 - 14.7 - 4.4    7.8 - 5.9 - 1.9
512 × 512           16              36.6 - 7.3 - 29.3   22.7 - 15.4 - 7.3    9.6 - 6.0 - 3.6

Times are run times in seconds (total - computation - communication)
Grid Size      Number of Processors   Beowulf Time   T3D Time   T3E Time
128 × 128              64                 21.2           5.0        2.1
512 × 512              64                 52.7          11.5        5.2
2048 × 2048            64                  230          75.0       31.0

Times are total run times in seconds

- Initial tests indicate a 40% computational speed-up with a 300 MHz Pentium II.


Incompressible Fluid Flow Solver (John Lou)
- As the number of processors increases, Beowulf performance drops compared with the T3D and T3E.
- For a fixed number of processors, Beowulf performance improves with local problem size.
- Beowulf memory would need to grow as the number of processors increases to get scalable performance relative to the T3D/T3E.


General Conclusions
- The key factor in predicting code performance is the amount of communication.
- Beowulf has a place at JPL/Caltech.
  » Each machine should have:
    - a small number of processors
    - a limited number of codes/users

- Not a replacement for institutional supercomputers.