Документ взят из кэша поисковой машины. Адрес оригинального документа : http://classic.chem.msu.su/gran/gamess/qdpt2.pdf
Дата изменения: Sun Jun 25 16:15:22 2006
Дата индексирования: Mon Oct 1 19:33:36 2012
Кодировка:
The efficient implementation of the multi-reference perturbation theories at second order
Alex A. Granovsky

Laboratory of Chemical Cybernetics, M.V. Lomonosov Moscow State University, Moscow, Russia
September 21, 2005


Canonical single-reference MP
MP2:
E
mp 2

1 = 4


ijab

|< ij || ab >|2 ijab

Parameters: N, Nocc, Nvir Integral transformation - N5 step Only minor overhead due to PT power series summation itself (N4 step)

MP3 and above:
Integral transformation Intermediate quantities individual terms of the As in the case of MP2, N4 for MP3) - N5 step (amplitudes entering into numerators of the PT series) calculations - N6 and above PT summation itself has better scaling (e.g.,


Multi-reference (MR) MBPT Multi-reference theories
Additional parameters:
Nact, Ndet (Ncsf), Neff

More complex expressions both for energy correction itself and for computational costs
Third and higher orders of various formulations of the multi-reference (MR) MBPT
Calculation of various intermediates is the most computationally-demanding stage

Non-contracted and partially contracted MR-MBPT theories at second order
Most of the computational efforts are typically due to summation of the individual terms of the PT series themselves, especially for the case of large active spaces


Horrible MCQDPT2 example (H. Nakano, 1993)


Why the costs of PT summation are important?
Straightforward implementation of the summation of the PT series is very inefficient on modern computer architectures because:
At least one (or more) slow and typically not pipelined divide operation is required to calculate each individual term of the PT series Summation runs over large amount of data involving some combinations of transformed two-electron integrals, so that it is typically not processor cache-friendly


Our goals
Address both these problems:
1. Reformulate the rules of the summation of the PT series to completely eliminate the slow divide operations by
A. Removing redundant divides by replacing most of the work to be done by the fast matrix multiplications of some intermediate quantities B. Removing non-redundant divides by replacing them by few fast addition and multiplication operations

2. At the same time, develop efficient families of cachefriendly algorithms by introducing the appropriate intermediates and restructuring the order of loops used for summation of the PT series


The source of divides
Separate calculation of the contribution due to each separate term of PT
Myriad of terms - myriad of divides

Normally, we do not need to know the value of each separate term, only their sum of some kind
Way to reduce the number of divide operations


Redundant divides
Number of different numerators is greater than number of different denominators A simple example: a ij a ij j = S= bi ij bi i More realistic example (MCQDPT2):


Bpq

< A | E pq | B >


i

uiqu

pi B

p - i + E


1A. Redundant divides removal Redundant


Bpq

< A | E pq | B >


Bip q


Bip

u

pi


q

p - i + E uiqu pi < A | E pq | B > = p - i + E B
i



uiqu

pi B

=

uiq < A | E pq | B > =
B


Bip

u pi v

p - i + E
ABpi B

p - i + E

;v

ABpi

=


q

uiq < A | E pq B >


1B. Non-redundant divides removal Non-redundant

i n

ai a1 a2 an = + + ... + bi b1 b2 bn

a1 a2 a1b2 + a2b1 + = b1 b2 b1b2

ai -1 ai ai -1bi + ai bi += bi -1 bi bi -1bi

-1

Let us define:
A0 = 0; B0 = 1; Ai = Ai -1bi + Bi -1ai ; Bi = Bi -1bi

Then:


i

n

ai An = bi Bn

3 multiplications for 1 divide


2. Cache-friendly approach and loops restructuring


Non cache-friendly code sample
S=


B

C B C

B


ijab

Loop over B
Loop over i
Loop over j
· Loop over a

(ia | jb)[2(ia | jb) - ( ja | ib)] a + b - i - j + E B

­ Sum over b:

T =T +


b

(ia | jb)[2(ia | jb) - ( ja | ib)] a + b - i - j + EB

· End loop over a

End loop over j

End loop over i Accumulate S

End loop over B


Code sample analysis
Why this code sample is bad?
There is no data reuse in the inner loops

Our data:
huge very number of 2-e integrals; small number of orbital energies

How to improve the code
Restructuring the code in order to allow data reuse


Cache-friendly code version
S=


B

C B C

B


ijab

(ia | jb)[2(ia | jb) - ( ja | ib)] a + b - i - j + E B

Loop over i
Loop over j
Loop over a
· Calculate tb = (ia | jb)[2(ia | jb) - ( ja | ib)] · Loop over B
­ Calculate ­ Sum over b: ­ Accumulate

= a - i - j + EB
W =W +


b

tb b +

· End loop over B

End loop over a

End loop over j

End loop over i


Main results
Up to 50-100 times faster MCQDPT2