PoP - Syracuse University Physics Computational Cluster

PoP - Performance

Single node

Alan Middleton has timed several of his simulation codes on a range of Intel and Alpha machines running Linux, including the PoP nodes. Have a look at his comparison table.
Whetstone
For what it is worth, I ran the Whetstone benchmark on a PoP node (2 x 300MHz PII, fort77 v1.14a (which invokes f2c), gcc v2.7.2.3, 10 outer loops, 10000 inner loops, double-precision version) and got:
No optimization flag (equiv -O0)
  Single job      ~119Mwhet/s
  Two jobs        ~119Mwhet/s  each  -> ~238Mwhet/s total
With optimization (-O3)
  Single job      ~177Mwhet/s
  Two jobs        ~177Mwhet/s  each  -> ~354Mwhet/s total

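This benchmark is a small loop of CPU-bound math functions, so it scales very well: two jobs on the two processors each run at the single-job rate. Overall performance is fine even when Linux has to deal with more jobs than processors. Optimization makes a big difference: -O3 raises the rate from ~119 to ~177Mwhet/s, roughly 1.5 times faster.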
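The arithmetic behind those totals and the optimization gain can be checked with a few lines. This is only a sketch using the rates quoted above; the names `speedup` and `node_total` are made up for illustration and are not part of the benchmark.

```python
# Whetstone rates quoted above for the Fortran version, in Mwhet/s per job.
RATE_O0 = 119.0  # no optimization flag
RATE_O3 = 177.0  # compiled with -O3


def speedup(optimized, baseline):
    """Ratio of the optimized rate to the baseline rate."""
    return optimized / baseline


def node_total(per_job, n_jobs):
    """Total node throughput if every job sustains the single-job rate.

    Assumption: n_jobs does not exceed the number of CPUs in the node.
    """
    return per_job * n_jobs


print(f"-O3 vs -O0 speedup: {speedup(RATE_O3, RATE_O0):.2f}x")
print(f"two jobs at -O0: {node_total(RATE_O0, 2):.0f} Mwhet/s")
print(f"two jobs at -O3: {node_total(RATE_O3, 2):.0f} Mwhet/s")
```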

There is a Mac-biased comparison table available which shows some strangely low performance figures for Intel processors but quotes 149Mwhet/s for a G3 processor at 317MHz and 169Mwhet/s for a 604e processor at 350MHz. It does not say which compiler was used.

I also found a C version of the Whetstone benchmark which has two of the 12 original sections removed. Under gcc v2.7.2.3 I got:

With optimization (-O3)
  Single job      ~227Mwhet/s
  Two jobs        ~227Mwhet/s  each  -> ~454Mwhet/s total
This is, however, slightly different code, so these numbers cannot be compared directly with the Fortran numbers.
Linpack
After some messing with the timing, I compiled the C version of the single-processor Linpack benchmark (C translation by Bonnie Toy 5/88, bug fix by Jack Dongarra 25/2/94). Using gcc 2.7.2.3 with -O4, I got:
  • Double precision, rolled ~25Mflops
  • Double precision, unrolled ~28Mflops
  • Single precision, rolled ~62Mflops
  • Single precision, unrolled ~85Mflops
These numbers are a little suspect since the timing seems very inaccurate (up to 20% variation between runs). However, they give a rough approximation.
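One common way to cope with noisy timings like these is to repeat the measurement several times, report the best run, and quote the spread alongside it. The sketch below illustrates that idea only: it is not the Linpack code, the `daxpy` loop is a pure-Python stand-in for a benchmark kernel, and the helper names `time_runs` and `spread` are made up for this example.

```python
import time


def daxpy(a, x, y):
    """Compute y <- a*x + y in place; a stand-in kernel, not the real benchmark."""
    for i in range(len(x)):
        y[i] += a * x[i]


def time_runs(kernel, repeats):
    """Call kernel() `repeats` times and return the wall-clock seconds of each run."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        kernel()
        times.append(time.perf_counter() - start)
    return times


def spread(times):
    """Relative run-to-run variation: (slowest - fastest) / fastest."""
    return (max(times) - min(times)) / min(times)


if __name__ == "__main__":
    n = 100_000
    x = [1.0] * n
    y = [2.0] * n
    times = time_runs(lambda: daxpy(0.5, x, y), repeats=5)
    best = min(times)
    # daxpy does one multiply and one add per element: 2*n flops per call.
    print(f"best {2 * n / best / 1e6:.1f} Mflops, spread {spread(times):.0%}")
```

The fastest run is usually the one least disturbed by other activity on the machine, which is why it is reported here rather than the mean.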

Overall

Watch this space for parallel benchmarks...
Aggregates
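From Whetstone at 227Mwhet/s/proc (the reduced C version with -O3) with 32 processors we have ~7.3Gwhet/s.

From Linpack at 85Mflops/proc (single precision, unrolled) with 32 processors we have ~2.7Gflops.

Both figures simply multiply the best single-processor rate by 32, so they are upper bounds that assume perfect scaling.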
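For anyone wanting to redo the aggregate sums with other per-processor figures, a minimal sketch follows. It assumes perfect scaling across all 32 processors, and the function name `aggregate` is made up for this example.

```python
def aggregate(per_proc, n_procs):
    """Aggregate rate assuming every processor sustains the single-processor rate."""
    return per_proc * n_procs


N_PROCS = 32  # processor count quoted above

whet_total = aggregate(227, N_PROCS)  # Mwhet/s, from the C Whetstone at -O3
flops_total = aggregate(85, N_PROCS)  # Mflops, single precision, unrolled

print(f"{whet_total / 1000:.2f} Gwhet/s, {flops_total / 1000:.2f} Gflops")
```

This gives 7264 Mwhet/s and 2720 Mflops before rounding.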


Written by Simeon Warner, maintained by Dan Kirkpatrick
Last updated 06 August 2010