PCM Code Porting and Optimization

PCM Background

The Parallel Climate Model (PCM) is a tightly coupled earth system model, without flux correction, contained in a single executable. It is supported by DOE and developed in collaboration with LANL, NPS, NCAR and NPACI. Its components are

PCM’s Current Goal

Reach 1.0 wall-clock hours per model year. (From http://www.scd.ucar.edu/dir/cas98/19980702cas98/tsld014.htm)

Our Goal

Porting on the IBM SP2

Porting notes:

Porting Suggestions:

  • Determine whether the data files are single or double precision and
    how reading them is affected by the -qrealsize=8 flag.
  • Make use of /scratch space.
  • Minimize register spills, for example by splitting loops.
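The first suggestion above can be checked with a small test program. The following is a minimal sketch, not the procedure used for PCM: the file name sample.dat, the value count n, and the use of stream access are all assumptions made so the example is self-contained (real model data files may be record-based). It infers the stored precision from the file size and reads the data through a buffer with an explicit kind. The assumption is that -qrealsize=8 changes only default-sized REAL and leaves explicitly kinded declarations alone; confirm this against the compiler documentation before relying on it.

```fortran
program precision_check
  use, intrinsic :: iso_fortran_env, only: real32, real64
  implicit none
  integer, parameter :: n = 8      ! hypothetical number of values in the file
  real(real32) :: buf4(n)          ! explicit kind: stays 4 bytes per value
  real(real64) :: field(n)         ! working precision inside the model
  integer :: u, nbytes, i

  ! Write a stand-in single-precision file so the sketch is self-contained.
  buf4 = [(real(i, real32), i = 1, n)]
  open(newunit=u, file='sample.dat', access='stream', &
       form='unformatted', status='replace')
  write(u) buf4
  close(u)

  ! Infer the stored precision from the file size (no record markers here).
  inquire(file='sample.dat', size=nbytes)
  print '(a,i0)', 'bytes per value in file: ', nbytes / n

  ! Size of default REAL in this build; a default-size promotion flag
  ! would change this number but not the size of buf4.
  print '(a,i0)', 'bytes in default REAL:   ', storage_size(1.0) / 8

  ! Read through the explicitly sized buffer, then promote.
  buf4 = 0.0_real32
  open(newunit=u, file='sample.dat', access='stream', &
       form='unformatted', status='old')
  read(u) buf4
  close(u, status='delete')
  field = real(buf4, real64)
  print '(a,f4.1)', 'last value: ', field(n)
end program precision_check
```

If the two byte counts printed differ, reading the file directly into default REAL variables would misinterpret the data, so an explicitly kinded read buffer like buf4 is the safer choice.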

Verification Plot:

The plot below shows the difference in a reference variable over time for different computing platforms. The installation on the IBM SP platform falls within the same error range as the other platforms, which have already been verified.

[Figure: wpe4.gif]

 

Flat Profile:

[Figure: wpe5.gif]

Function Call Summary:

[Figure: wpe6.gif]

 

Library Statistics:

[Figure: wpe7.gif]

Call Graph Profile:

[Figure: wpe8.gif]

Original Code Performance Profile and Scalability Plots

    The graph below shows a scalability measure based on the timings of the 16-node and 64-node runs. The percentage is obtained by dividing the 16-node time by the 64-node time and then dividing the result by 4 (the ratio of 64 to 16). A value of 100% corresponds to ideal scalability. Some of the fcd values exceed 100%, possibly because the measure is taken relative to the 16-node timing rather than to a single-node timing.
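    The measure described above can be written compactly as follows. The symbols S, T_16 and T_64 are notation introduced here for illustration, and the timings in the worked example are hypothetical, not measurements from PCM.

```latex
\[
  S \;=\; \frac{T_{16} / T_{64}}{64 / 16} \times 100\%
\]
% Worked example with hypothetical timings:
\[
  T_{16} = 120\ \mathrm{s},\quad T_{64} = 40\ \mathrm{s}
  \;\Longrightarrow\;
  S = \frac{120 / 40}{4} \times 100\% = 75\%
\]
% With the same T_{16}, any T_{64} below 30 s gives S > 100\%,
% i.e. better than ideal scaling relative to the 16-node baseline.
```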

    The 16 PE and 64 PE bars show the measured time for different parts of the code, each normalized by the total time of its run. The bars therefore identify the most time-consuming parts of the PCM code. With the scalability value shown next to each timing value, the graph helps prioritize which parts of the code to optimize first.

    "ocn" (the ocean code, POP) is a good candidate to optimize first because it scales poorly (less than 50%) and takes a significant share of the time (40%). The second choice is "atm", with 50% scalability and 20% of the time. "fcd" and "ice" are the last to be considered for optimization because they are already well optimized and use the least time.

[Figure: wpe9.gif]

Single PE Optimization for POP Ocean Code

A kernel was chosen from the POP code. The kernel takes 8% of the total time, and at least three parts of POP have the same coding structure as the kernel.

The kernel was optimized by rewriting inefficient f90 constructs (mostly array operations and intrinsics) as f77 code and by eliminating some conditional statements. Overall, performance increased from 16 Mflops (0.25 s elapsed) to 140 Mflops (0.057 s elapsed), roughly a 4.4x reduction in elapsed time.

Kernel taken from the POP code

       do n = 1,nt
 
c        mt2 = min(n,size(VDC,DIM=4))
        mt2 = 2
        A = afac_t(1)*VDC(:,:,1,mt2)
        D = hfac_t(1) + A
        E(:,:,1) = A/D
        B = hfac_t(1)*E(:,:,1)
        F(:,:,1) = hfac_t(1)*TRACER(:,:,1,n,newtime)/D
 
        do k=2,km
 
          C = A
          A = afac_t(k)*VDC(:,:,k,mt2)
          D = merge(hfac_t(k)+B, hfac_t(k)+A+B, k == KMT)
          where (k .le. KMT)
            E(:,:,k) = A/D
            B = (hfac_t(k) + B)*E(:,:,k)
            F(:,:,k) = (hfac_t(k)*TRACER(:,:,k,n,newtime) 
     &               + C*F(:,:,k-1))/D
          elsewhere
            F(:,:,k)  = c0
          endwhere
 
        enddo
 
        do k=km-1,1,-1
          where (k .lt. KMT)
            F(:,:,k) = F(:,:,k) + E(:,:,k)*F(:,:,k+1)
          endwhere
        enddo
 
        do k = 1,km
          TRACER(:,:,k,n,newtime) = merge(TRACER(:,:,k,n,oldtime) + 
     &                                    F(:,:,k), c0, k .le. KMT)
        enddo
 
      enddo
        


Single PE Optimization of the Kernel

      do n = 1,nt
        do j = 1,jmt
          do i = 1,imt
c           Per-column temporaries: scalars A,B,C,D and 1-D E(k),F(k)
c           mt2 = min(n,size(VDC,DIM=4))
            mt2 = 2
            A = afac_t(1)*VDC(i,j,1,mt2)
            D = hfac_t(1) + A
            E(1) = A/D
            B = hfac_t(1)*E(1)
            F(1) = hfac_t(1)*TRACER(i,j,1,n,newtime)/D

            C = A
            do k = 2,km
              A = afac_t(k)*VDC(i,j,k,mt2)
c             kmflg1 replaces merge(); presumably 0 where
c             k == KMT(i,j) and 1 elsewhere
              D = hfac_t(k) + B + A*kmflg1(k,i,j)
              if (k .le. KMT(i,j)) then
                E(k) = A/D
                B = (hfac_t(k) + B)*E(k)
                F(k) = (hfac_t(k)*TRACER(i,j,k,n,newtime)
     &               + C*F(k-1))/D
              else
                F(k) = c0(i,j)
              endif
c             Keep this level's A as C for the next level
              C = A
            enddo

c           Back substitution
            do k = km-1,1,-1
              if (k .lt. KMT(i,j)) then
                F(k) = F(k) + E(k)*F(k+1)
              endif
            enddo

c           kmflg2 replaces merge(); presumably 0 where
c           k <= KMT(i,j) and 1 elsewhere
            do k = 1,km
              TRACER(i,j,k,n,newtime) = (1-kmflg2(k,i,j))*
     &          (TRACER(i,j,k,n,oldtime)+F(k))+kmflg2(k,i,j)*c0(i,j)
            enddo

          enddo
        enddo
      enddo
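The optimized kernel uses two flag arrays, kmflg1 and kmflg2, whose definitions are not shown. The following is a minimal sketch of how such flags could be precomputed, inferred only from comparing the two listings: kmflg1 is assumed to be 0 where k == KMT(i,j) and 1 elsewhere, and kmflg2 is assumed to be 0 where k <= KMT(i,j) and 1 elsewhere. The grid sizes, the KMT pattern, the sample operand values, and the scalar c0 are all hypothetical. The program checks that the arithmetic forms give the same results as the original merge() expressions.

```fortran
program mask_flags
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: imt = 4, jmt = 3, km = 5   ! hypothetical small grid
  real(dp), parameter :: c0 = 0.0_dp               ! scalar stand-in for c0(i,j)
  integer  :: KMT(imt,jmt)
  real(dp) :: kmflg1(km,imt,jmt), kmflg2(km,imt,jmt)
  real(dp) :: a, b, h, told, f
  real(dp) :: d_merge, d_flag, t_merge, t_flag
  integer  :: i, j, k, nbad

  ! Hypothetical map of active levels per column (values 0..km).
  do j = 1, jmt
    do i = 1, imt
      KMT(i,j) = mod(i + j, km + 1)
    enddo
  enddo

  ! Precompute the flags once, outside the time-critical loops.
  do j = 1, jmt
    do i = 1, imt
      do k = 1, km
        kmflg1(k,i,j) = merge(0.0_dp, 1.0_dp, k == KMT(i,j))
        kmflg2(k,i,j) = merge(0.0_dp, 1.0_dp, k <= KMT(i,j))
      enddo
    enddo
  enddo

  ! Sample operands, exactly representable so equality tests are safe.
  a = 2.0_dp;  b = 3.0_dp;  h = 5.0_dp;  told = 7.0_dp;  f = 11.0_dp

  nbad = 0
  do j = 1, jmt
    do i = 1, imt
      do k = 1, km
        ! Original f90 forms
        d_merge = merge(h + b, h + a + b, k == KMT(i,j))
        t_merge = merge(told + f, c0, k <= KMT(i,j))
        ! Arithmetic forms used in the optimized kernel
        d_flag = h + b + a*kmflg1(k,i,j)
        t_flag = (1.0_dp - kmflg2(k,i,j))*(told + f) + kmflg2(k,i,j)*c0
        if (d_merge /= d_flag .or. t_merge /= t_flag) nbad = nbad + 1
      enddo
    enddo
  enddo

  print '(a,i0)', 'mismatches: ', nbad
end program mask_flags
```

Indexing the flags as (k,i,j), as the optimized kernel does, keeps the innermost k loop contiguous in memory. One caveat of replacing merge() with multiplication by a 0/1 flag is that the masked-out operand must be finite, since 0 times a NaN or infinity is not 0.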