PCM's Current Goal
PCM is a tightly coupled earth system model, without flux correction, contained in a single executable. It is supported by DOE and developed as a collaboration among LANL, NPS, NCAR and NPACI. Its components are:
- Atmosphere: CCM3.2
- Ocean: POP (384x288)
- Ice: ICE based on Zhang et al. (320x640)
- Coupler: FCD originally based on CSM coupler
- PCM validated on T3E, Origin, HP SPP, & SP
- Currently, best 64PE performance is 6.5 hours/model year (no SP data yet)
(From http://www.scd.ucar.edu/dir/cas98/19980702cas98/tsld014.htm)
Our Goal
- reach 1.0 wall clock hours per model year
- CCM scales only to 64 PE; a 2D version is needed
- Rest of PCM scales well, 0.58 hours/year on 256PE T3E900 with atmospheric forcing.
- Goal attainable with 256 PE, some tuning, and using 2D decomposition of CCM which is under development.
- Higher resolution, 1000+ processors possible
Porting on the IBM SP2
- Port PCM and validate the porting on the current IBM SP2 and the coming teraflop machine
- Tune and optimize PCM components for the teraflop cluster environment
- Collaborate with the other PCM efforts to port the 2D version of CCM
- Benchmark the performance on the above systems
Porting notes and suggestions:
- link the ESSL library
- the total number of processors and the number of I/O processors should be defined in the makefile
- increase the spill buffer size by using the -NS1024 flag
- the ifdef directive should be #ifdef fcd_coupled instead of #if fcd_coupled
- the timer routine, timef, is modified
- fcd.f does not compile with the -dummy_rtm flag on at the -O3 optimization level
- namelist format in.dat.atm is changed from &end to /
- find out whether the data files are single or double precision and how that is affected by the -qrealsize=8 flag
- make use of the /scratch space
- minimize register spills by splitting loops, etc.
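The namelist note above reflects the fact that standard Fortran 90 namelist input ends a group with a slash, while "&end" is an extension accepted by some older compilers. The following is a minimal sketch of the "/" form. The group name demo_nl and the variables nsteps and dtime are hypothetical and are not taken from the PCM input files.

```fortran
program namelist_demo
  implicit none
  ! Hypothetical group and variable names, for illustration only;
  ! they are not taken from the actual PCM input.
  integer :: nsteps = 0
  real    :: dtime  = 0.0
  integer :: ios
  namelist /demo_nl/ nsteps, dtime

  ! Write a small namelist group that ends with "/" (Fortran 90 form)
  ! into a scratch file, then read it back.
  open(unit=10, status='scratch', form='formatted')
  write(10, '(a)') ' &demo_nl'
  write(10, '(a)') '  nsteps = 72,'
  write(10, '(a)') '  dtime  = 1200.0'
  write(10, '(a)') ' /'
  rewind(10)

  read(10, nml=demo_nl, iostat=ios)
  close(10)

  if (ios /= 0) then
     print *, 'namelist read failed, iostat = ', ios
  else
     print *, 'nsteps = ', nsteps, ' dtime = ', dtime
  end if
end program namelist_demo
```

A file that still ends its groups with "&end" may be rejected by a compiler that accepts only the standard form, which is one plausible reason for the change noted above.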
Verification Plot:
The plot below shows the difference in a reference variable over time for different computing platforms. It shows that the installation on the IBM SP platform falls within the same error range as the other platforms, which have already been verified.
Flat Profile:
Function Call Summary:
Library Statistics:
Call Graph Profile:
The graph below shows a scalability measure based on the timings of the 16-node and 64-node runs. The percentage is obtained by dividing the 16-node time by the 64-node time and then dividing the result by 4 (the ratio of 64 to 16). A value of 100% can be considered ideal scalability. Some of the fcd timings exceed 100%, which may be because the measure is based on the 16-node timing rather than on the single-node timing.
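As a worked illustration of the scalability measure described above, here is a minimal Fortran sketch. The timings of 120 and 40 seconds are hypothetical values chosen for the example and are not measurements from the PCM runs; the function name scalability is likewise an assumption.

```fortran
program scalability_demo
  implicit none
  real :: t16, t64

  ! Hypothetical timings in seconds, for illustration only.
  t16 = 120.0
  t64 = 40.0

  ! (120/40) / (64/16) = 3/4, i.e. 75 percent of ideal.
  print '(a,f6.1,a)', 'scalability = ', &
        scalability(t16, 16, t64, 64), ' %'

contains

  ! Percentage of ideal speedup when going from n_small to n_large PEs.
  pure function scalability(t_small, n_small, t_large, n_large) result(pct)
    real,    intent(in) :: t_small, t_large
    integer, intent(in) :: n_small, n_large
    real :: pct
    pct = 100.0 * (t_small / t_large) / (real(n_large) / real(n_small))
  end function scalability

end program scalability_demo
```

With these sample numbers the run on 64 PEs is 3 times faster than on 16 PEs, where 4 times would be ideal, so the measure is 75%.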
The 16 PE and 64 PE bars show the measured time for different parts of the code, each divided by the total time of its run. The bars therefore show which parts of the PCM code consume the most time. With the scalability value next to the timing value, the graph helps to rank the parts of the code by where optimization effort should go.
"ocn" (the ocean code, POP) is a good candidate to start optimizing because it has poor scalability (less than 50%) and takes a significant share of the time (40%). The second choice is "atm", with 50% scalability and 20% of the time. "fcd" and "ice" are the last to be considered for optimization because they are already well optimized and use the least time.
A kernel was chosen from the POP code. The kernel takes 8% of the total time. At least three parts of POP have the same coding structure as the kernel.
The kernel is optimized by rewriting inefficient f90 code (mostly array operations and intrinsics) as f77 code. Some conditional statements are eliminated. Overall, performance increased from 16 Mflops (0.25 seconds elapsed) to 140 Mflops (0.057 seconds elapsed), which cuts the elapsed time by a factor of about 4.4.
Kernel taken from the POP code
      do n = 1,nt
c        mt2 = min(n,size(VDC,DIM=4))
         mt2 = 2
         A = afac_t(1)*VDC(:,:,1,mt2)
         D = hfac_t(1) + A
         E(:,:,1) = A/D
         B = hfac_t(1)*E(:,:,1)
         F(:,:,1) = hfac_t(1)*TRACER(:,:,1,n,newtime)/D
         do k=2,km
            C = A
            A = afac_t(k)*VDC(:,:,k,mt2)
            D = merge(hfac_t(k)+B, hfac_t(k)+A+B, k == KMT)
            where (k .le. KMT)
               E(:,:,k) = A/D
               B = (hfac_t(k) + B)*E(:,:,k)
               F(:,:,k) = (hfac_t(k)*TRACER(:,:,k,n,newtime)
     &                    + C*F(:,:,k-1))/D
            elsewhere
               F(:,:,k) = c0
            endwhere
         enddo
         do k=km-1,1,-1
            where (k .lt. KMT)
               F(:,:,k) = F(:,:,k) + E(:,:,k)*F(:,:,k+1)
            endwhere
         enddo
         do k = 1,km
            TRACER(:,:,k,n,newtime) = merge(TRACER(:,:,k,n,oldtime) +
     &                                F(:,:,k), c0, k .le. KMT)
         enddo
      enddo
Single PE Optimization of the Kernel
      do n = 1,nt
      do j = 1,jmt
      do i = 1,imt
c        mt2 = min(n,size(VDC,DIM=4))
         mt2 = 2
         A = afac_t(1)*VDC(i,j,1,mt2)
         D = hfac_t(1) + A
         E(1) = A/D
         B = hfac_t(1)*E(1)
         F(1) = hfac_t(1)*TRACER(i,j,1,n,newtime)/D
         C = A
         do k=2,km
            A = afac_t(k)*VDC(i,j,k,mt2)
            D=hfac_t(k)+B+A*kmflg1(k,i,j)
            if(k .le. KMT(i,j)) then
               E(k) = A/D
               B = (hfac_t(k) + B)*E(k)
               F(k) = (hfac_t(k)*TRACER(i,j,k,n,newtime)
     &                + C*F(k-1))/D
            else
               F(k) = c0(i,j)
            endif
c           keep this level's A as C for the next level
            C = A
         enddo
         do k=km-1,1,-1
            if (k .lt. KMT(i,j)) then
               F(k) = F(k) + E(k)*F(k+1)
            endif
         enddo
         do k = 1,km
            TRACER(i,j,k,n,newtime) =(1-kmflg2(k,i,j))*
     &        (TRACER(i,j,k,n,oldtime)+F(k))+kmflg2(k,i,j)*c0(i,j)
         enddo
      enddo
      enddo
      enddo
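The flag arrays kmflg1 and kmflg2 used in the optimized loop are not defined in the listing. Comparing it with the original kernel suggests that kmflg1 is 0 at the bottom level (k equal to KMT) and 1 elsewhere, and that kmflg2 is 0 for active levels (k no greater than KMT) and 1 below them. The sketch below works under that assumption. It checks, for a single column with hypothetical sample values, that the flag arithmetic reproduces the two merge expressions it replaces.

```fortran
program mask_flag_demo
  implicit none
  integer, parameter :: km  = 5   ! vertical levels (hypothetical)
  integer, parameter :: kmt = 3   ! active levels in this column (hypothetical)
  real :: kmflg1(km), kmflg2(km)
  real :: a, b, h, told, f, c0
  real :: d_merge, d_flag, t_merge, t_flag
  integer :: k
  logical :: ok

  ! Flag values inferred from the two listings (an assumption):
  !   kmflg1 = 0 at the bottom level k == kmt, 1 elsewhere
  !   kmflg2 = 0 for active levels k <= kmt, 1 below the bottom
  do k = 1, km
     kmflg1(k) = merge(0.0, 1.0, k == kmt)
     kmflg2(k) = merge(0.0, 1.0, k <= kmt)
  end do

  ! Arbitrary sample values, exactly representable in binary.
  a = 0.5;  b = 0.25;  h = 2.0;  told = 4.0;  f = 1.5;  c0 = 0.0

  ok = .true.
  do k = 1, km
     ! Denominator: merge form versus flag form.
     d_merge = merge(h + b, h + a + b, k == kmt)
     d_flag  = h + b + a*kmflg1(k)
     ! Tracer update: merge form versus flag form.
     t_merge = merge(told + f, c0, k <= kmt)
     t_flag  = (1.0 - kmflg2(k))*(told + f) + kmflg2(k)*c0
     if (abs(d_merge - d_flag) > 1.0e-6) ok = .false.
     if (abs(t_merge - t_flag) > 1.0e-6) ok = .false.
  end do

  if (ok) then
     print *, 'flag arithmetic matches merge at every level'
  else
     print *, 'mismatch between flag arithmetic and merge'
  end if
end program mask_flag_demo
```

Precomputing such flags once, outside the time-stepping loop, replaces a conditional selection with a multiply and an add inside the innermost loop, which is consistent with the statement that some conditional statements were eliminated.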