Friday 21 May 2010

Performance enhancing shrugs, featuring Lizard

So you've just set up your brand spanking new cluster, and it all seems to work as designed. Diagnostic tests pass with flying colours, and you've run a handful of small test jobs through to get your eye in.
But could it perform better?
Well, that's a good question, and before you start looking for the answer you need to think about how you define current performance: in other words, how best to baseline your cluster.
Personally I perform a set of standard, well defined tests. The results of these, coupled with good awareness of expected results for the type of hardware, not only give benchmarks for the cluster, but also indicate whether everything is behaving itself. The tests I use are:
1. mpipingpong
2. High Performance Linpack
3. An appropriate ISV application benchmark
As you can see, they provide fairly high-level results and aim to build confidence rather than troubleshoot. Let's look at each a little more closely.


mpipingpong
mpipingpong is used to analyse the latency and bandwidth when passing a message between two processes on one or more computers using MPI. This is a lightweight test which completes in a short time, and a couple of simple runs are available in the diagnostic test suite which comes as part of Windows HPC Server. Result! Simply run the MPI Ping-Pong: Lightweight Throughput and MPI Ping-Pong: Quick Check tests across all cluster nodes to produce bandwidth and latency results respectively.
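To give a feel for what the test is doing under the covers, here's a minimal ping-pong sketch in Python using mpi4py. It's an illustration only, not the tool Windows HPC Server ships, and the message size and repetition count are my own arbitrary picks:

# Minimal MPI ping-pong sketch (mpi4py) - illustrative, not the HPC Server diagnostic.
# Run with two processes, e.g.: mpiexec -n 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

reps = 200
msg = np.zeros(4 * 1024 * 1024, dtype='b')   # 4 MB payload - an arbitrary choice

comm.Barrier()
start = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(msg, dest=1)     # ping
        comm.Recv(msg, source=1)   # pong
    else:
        comm.Recv(msg, source=0)
        comm.Send(msg, dest=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    mb_moved = 2 * reps * msg.nbytes / 1e6   # each rep sends the payload there and back
    print(f"Bandwidth ~{mb_moved / elapsed:.0f} MB/s")
    print(f"Half round-trip ~{elapsed / reps / 2 * 1e6:.1f} microseconds")

Note that a meaningful latency figure needs a tiny message rather than a 4 MB one; the real tests take care of that distinction for you.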


High Performance Linpack (HPL)
HPL is a software package that solves a (random) dense linear system in double precision (64-bit) arithmetic on distributed-memory computers. This is the application used to quantify supercomputer performance for Top500 qualification, and is generally regarded as the industry standard HPC benchmark. That's not to say that it shows how your cluster performs under your real-world workloads, but it certainly allows for analysis of performance when compared against other, similar machines. Once again, Microsoft have come up trumps and provide a packaged version of HPL wrapped in a marvellous application called Lizard. Lizard de-stresses the HPL run process by:
1. Providing a consistent, compiled HPL executable. If you've ever tried to compile HPL yourself you'll know exactly what a benefit this is.
2. Automatically tweaking HPL input parameters in order to obtain the best possible result for your cluster configuration. There are many Linpack parameters, and automation makes the tuning process very simple (the problem-size sketch below gives a flavour of what's being tuned).
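The single most influential of those inputs is the problem size N. A common rule of thumb (my own sketch, not something Lizard exposes) is to size the matrix to fill roughly 80% of the cluster's total memory, remembering that each double-precision element takes 8 bytes:

import math

# Rough HPL problem-size rule of thumb - an illustrative sketch only.
def suggest_hpl_n(nodes, mem_per_node_gb, mem_fraction=0.8, nb=192):
    total_bytes = nodes * mem_per_node_gb * 1024**3 * mem_fraction
    n = int(math.sqrt(total_bytes / 8))   # the N x N matrix of doubles has to fit in RAM
    return (n // nb) * nb                 # round down to a multiple of the block size NB

# Hypothetical cluster: 16 nodes with 24 GB of memory each
print(suggest_hpl_n(16, 24))

Lizard works through the problem size, block size and process grid automatically, which is exactly the drudgery you want to avoid doing by hand.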


ISV application
This is one where you'll almost certainly have more knowledge of your environment than I do. Let's just say that firing a known real-world workload across your cluster will give excellent feedback, particularly if you're able to compare results against other machines you own, or against published benchmarks. As an example, in the engineering field Ansys publish Fluent benchmark results online, which provide an independent comparison for onsite test runs.

But what do the results mean? Well, let's think about them one by one.
mpipingpong
Obviously the results you achieve will depend on the hardware configuration of your cluster, but for guidance you should expect the following:

Network Type      Throughput      Latency
GigE              ~110 MB/s       40-50 microseconds
10GigE            ~800 MB/s       10-15 microseconds
DDR IB            ~1400 MB/s      <2 microseconds
QDR IB            ~3000 MB/s      <2 microseconds

If you're way off these numbers you should start troubleshooting.
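If you'd like to automate that gut check, something as simple as the following will do; the thresholds are lifted straight from the table above, and the tolerances are my own guess:

# Quick sanity check of ping-pong results against the guidance figures above.
GUIDANCE = {
    # network type: (throughput in MB/s, latency in microseconds)
    "GigE":   (110, 50),
    "10GigE": (800, 15),
    "DDR IB": (1400, 2),
    "QDR IB": (3000, 2),
}

def check(network, measured_mb_per_s, measured_us):
    expected_bw, expected_lat = GUIDANCE[network]
    healthy = measured_mb_per_s >= 0.9 * expected_bw and measured_us <= 1.2 * expected_lat
    return "looks healthy" if healthy else "way off - time to troubleshoot"

print(check("GigE", 104, 46))   # hypothetical measured values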

Lizard
Lizard will provide both an actual performance number, in Flops, and a cluster efficiency number, which should give you a good idea of how well your cluster is performing against expected results (based on a comparison with your head node's processor). This is a good starting figure, but it's worth digging a bit deeper to determine how well your cluster is really doing. There are lots of resources out there which will tell you the optimum result for your processor type, but these should be taken with a pinch of salt, as you will lose performance through inefficiencies outside the processor (memory, interconnect and so on).
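To put a rough number on "expected results", a back-of-the-envelope calculation of theoretical peak is all you need. The figures below are purely illustrative, so plug in your own node count, cores, clock speed and the floating point operations per cycle for your processor generation:

# Back-of-the-envelope cluster efficiency - all figures are illustrative assumptions.
def theoretical_peak_gflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    return nodes * cores_per_node * clock_ghz * flops_per_cycle

rpeak = theoretical_peak_gflops(nodes=16, cores_per_node=8,
                                clock_ghz=2.66, flops_per_cycle=4)
rmax = 950.0   # hypothetical HPL result in GFlops from a Lizard run
print(f"Rpeak = {rpeak:.0f} GFlops, efficiency = {rmax / rpeak:.0%}")

The gap between the measured result and that theoretical peak is exactly where the memory and interconnect inefficiencies mentioned above show up.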

ISV application
Many ISVs publish well-defined benchmark figures which can be used for comparison. Just be aware that it's unlikely they will have benchmarked a hardware configuration exactly the same as yours. It's also worthwhile running a job on (for example) a powerful workstation or an alternative cluster. This will help you form a good view of where your cluster should be performing.

So what's next?
Well, that all depends on your results. Are they looking good? Great, keep an eye on things, let the users loose, and ask for their feedback. Users are quick to notice when their jobs are not running as they'd like. Slightly concerned about your benchmark results? Nice, we're into performance troubleshooting and diagnosis, which I'll cover in a separate post.

One last thing: make benchmarking a regular thing. It's not only a good thing to report on, it can also act as an early warning system for cluster issues.
