Wednesday 31 March 2010

Damn there are some clever people out there

The technical computing world is populated with some scarily intelligent people, seemingly brought up from an early age on complex algorithms and deep mathematical theory. Unfortunately I am not, and will never be, one of these people no matter how hard I try, so I have to look for other ways to make myself useful.
Luckily, skills developed in the enterprise world are also useful when managing Windows HPC Server, as a lot of the building blocks are common. Aside from a technical tool belt stocked with AD, SQL Server, WDS, DNS and so on, some of the softer skills can be very useful too. Knowledge and experience of support processes, project management, incident resolution, dealing with customers, and service delivery are all as applicable in the HPC world as in any other environment.
After all, your carefully constructed and configured cluster is providing a service to your customers which is often as critical as any enterprise offering.

Tuesday 30 March 2010

Receive-Side Scaling & MPI

I recently spent an interesting few days discovering, diagnosing & resolving an MPI performance issue on a small cluster with a Gigabit Ethernet interconnect. The problem first came to light after running the MPI quick check and Throughput diagnostics from within the HPC Cluster Manager diagnostics suite. Results were significantly down on expected values, returning an average of just over 200 microseconds latency and just over 60MB/s throughput. Expected figures are closer to 50 microseconds and 105MB/s, so something was quite obviously amiss.
Even after confirming that all appropriate firmware and drivers were up to date, the issue was still apparent. Time for some good old-fashioned detective work, driving the problem into an ever smaller box until the answer popped out. After trying many combinations of driver version and network settings, the culprit turned out to be the Receive-Side Scaling (RSS) feature when enabled on newer versions of the driver in question. With RSS turned on and configured to use more than one queue, performance was degraded. With it turned off, or turned on but configured to use a single queue, performance was as expected. Interestingly, with older versions of the driver, RSS could be on and configured to use multiple queues without any performance degradation.
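As a rough illustration of the comparison above, here is a minimal Python sketch that flags results falling outside a tolerance band around the expected figures. The expected values are the ones quoted in this post; the 20% tolerance and the function name `check_mpi_results` are assumptions made up for the example, not anything the HPC diagnostics suite provides.

```python
# Expected figures quoted in the post for this Gigabit Ethernet cluster.
EXPECTED_LATENCY_US = 50.0
EXPECTED_THROUGHPUT_MBS = 105.0


def check_mpi_results(latency_us, throughput_mbs, tolerance=0.20):
    """Return a list of problem descriptions; empty if results look healthy.

    The tolerance (20% by default) is an arbitrary assumption for illustration.
    """
    problems = []
    # Latency is a problem when it is too HIGH.
    if latency_us > EXPECTED_LATENCY_US * (1 + tolerance):
        ratio = latency_us / EXPECTED_LATENCY_US
        problems.append(
            f"latency {latency_us:.0f} us is {ratio:.1f}x the expected "
            f"{EXPECTED_LATENCY_US:.0f} us"
        )
    # Throughput is a problem when it is too LOW.
    if throughput_mbs < EXPECTED_THROUGHPUT_MBS * (1 - tolerance):
        pct = 100.0 * throughput_mbs / EXPECTED_THROUGHPUT_MBS
        problems.append(
            f"throughput {throughput_mbs:.0f} MB/s is only {pct:.0f}% of the "
            f"expected {EXPECTED_THROUGHPUT_MBS:.0f} MB/s"
        )
    return problems


if __name__ == "__main__":
    # The degraded figures observed on the cluster.
    for problem in check_mpi_results(200, 60):
        print(problem)
```

Fed the degraded figures from the diagnostics run, both checks trip; fed the expected figures, the list comes back empty.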

During the investigation I spoke to Xavier Pillons, a Windows Server performance guru at Microsoft, and he came up with some very useful tips which I'm sure he won't mind me sharing:

1. Check the driver release version.
2. Check the TCP global parameters (use the command line netsh int tcp show global).
3. On Windows Server 2008, try disabling RSS, and experiment with Chimney off/on.
4. For better latency you can disable Interrupt Moderation Rate on the network interface.
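
For anyone wanting to script tip 2 across several nodes, here is a minimal Python sketch that pulls the RSS and Chimney states out of the text printed by netsh int tcp show global. The sample text is an illustrative excerpt written for this example, not captured from the cluster in question, and the exact labels may differ between Windows versions, so treat the key names as assumptions. The helper names `parse_tcp_globals` and `rss_enabled` are made up for this sketch.

```python
# Illustrative excerpt of "netsh int tcp show global" output (an assumption:
# labels and values may differ between Windows versions and languages).
SAMPLE_OUTPUT = """\
TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Chimney Offload State               : automatic
"""


def parse_tcp_globals(text):
    """Turn 'Name : value' lines into a dict, skipping headers and rules."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        # Keep only lines that contain a colon followed by a non-empty value.
        if sep and value.strip():
            settings[name.strip()] = value.strip()
    return settings


def rss_enabled(settings):
    """True if the parsed settings report Receive-Side Scaling as enabled."""
    return settings.get("Receive-Side Scaling State", "").lower() == "enabled"


if __name__ == "__main__":
    parsed = parse_tcp_globals(SAMPLE_OUTPUT)
    print("RSS enabled:", rss_enabled(parsed))
    print("Chimney:", parsed.get("Chimney Offload State", "unknown"))
```

On a real node the text could come from running the command via Python's subprocess module instead of the hard-coded sample. Note that this only reports the global TCP state; the per-adapter RSS queue count that caused the trouble here lives in the driver's own settings.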

Problem solved, or at least a workaround found. Happy Days! :)

You get out what you put in

I'm a frequent contributor to the Microsoft HPC forums (http://social.microsoft.com/Forums/en-US/category/windowshpc). There are diverse and interesting threads to follow, and from a completely selfish perspective I've found getting involved there to be an extremely useful learning experience. It's also not uncommon to have a question answered by a member of the HPC product group, who are obviously amongst the most knowledgeable people available.
Basically, if you're embarking on a journey into Windows HPC you could do a lot worse than hang out there!

Friday 19 March 2010

High Performance is Relative

Whenever you read big announcements and stories about HPC in the press, they tend to concentrate on mega clusters dedicated to solving some of the giant problems of our times. It may be climate modelling, genome research, finding cures for terrible diseases, or unlocking the secrets of the universe. All very worthy causes I'm sure, but chances are your problems are slightly smaller scale, yet still require heroic solutions. Maybe you need to design your new product in half the time, or you need to make your current product 50% more efficient. You might be looking to calculate insurance premium returns, or recover from the global downturn in as short a time as possible.
Whatever you hope to achieve from HPC, the results can be heroic on a personal level, even if it just means you get a "well done" from the boss!

Tuesday 16 March 2010

Windows HPC Server Users Facebook Group

Have you ever felt the urge to join a Windows HPC Facebook group? If so you're in luck.
I should say that I'm the owner of this group so I apologise if this comes across as blatant advertising.
It'd be awesome to see you there; it's always nice to get together with other users of Windows HPC Server.

Trust the product group

This one speaks for itself. My experience of the Windows HPC Product Group has been consistently excellent. The official Windows HPC team blog is a goldmine of information, and questions to the Windows HPC forums are often investigated & answered by team members. If you have a question or issue, don't hesitate to ask on the forums.
These guys have always been very willing to help wherever possible, and for that they deserve credit.

Monday 15 March 2010

Diagnostics in HPC Server 2008 R2

There's a very interesting post on the Windows HPC Team blog regarding diagnostics in HPC Server V3. I'm really looking forward to third party custom diagnostic tests, which should be a very powerful feature indeed. Much of our work is based on management and control of individual jobs and groups of linked jobs, so I'm sure we'll be looking to create appropriate custom tests based on this framework.
I can certainly imagine a scenario where a System Center Operations Manager management pack kicks off in-house custom tests, then reports warnings and errors back to operations staff via the SCOM console where appropriate.

Increase your generic High Performance Computing smarts

Many new Windows HPC Sysadmins will be heading in from the Enterprise, having expertise in Microsoft platforms but relatively little knowledge of HPC. This is certainly the route I took, having worked in large enterprise environments (50K+ users) in previous roles.
The good news is that Windows HPC Server is built around standard, common, Microsoft (and other) technologies. If you have good knowledge of Active Directory, Windows Deployment Services, Windows Server, data storage and the like then you've got at least the platform component of the cluster under your belt.
The flip side is that HPC demands knowledge of technology which is rarely used in the enterprise world. Primary amongst those is Message Passing Interface (MPI), a message-passing standard which is used extensively for distributed memory type jobs running across multiple compute nodes. While it's possible to maintain a Windows HPC cluster without in-depth knowledge of MPI, from experience I can say that the time spent learning about how this area works is a very worthwhile investment. If I had to specify the information I've found most useful in this area it would be knowledge of process placement and affinity.
I've also found that a little knowledge of non-Windows HPC technology is useful. There are some very interesting and innovative solutions out there.
I use a couple of HPC specific news sites on a semi-regular basis:
http://insidehpc.com/
and
http://www.hpcwire.com/
These are informative, and give an overview of trends and developments in the field.

Friday 12 March 2010

The Hit Parade

So, I've been thinking about things from my time as a Windows HPC Sysadmin which I'd like to share, & the more I think, the more top level bullet points I come up with. I thought I'd kick off this blog with a couple of top tens. The first covers somewhat abstract topics, the second is specific to Windows HPC Server. I'll then follow them up with some more detailed blog posts which dive a little deeper into each topic.

So, without further ado, I give to you:
The Ballpark, and The Strike Zone

The Ballpark

Increase your generic High Performance Computing smarts.
Many new Windows HPC Sysadmins will be heading in from the Enterprise, having expertise in Microsoft platforms but relatively little knowledge of HPC.

Trust the product group.
You know, no-one's perfect, but the HPC product group come pretty close.

High Performance is relative.
You don't necessarily need a multi-thousand core set up to produce heroic results.

You get out what you put in.
Contributing to the Windows HPC forum has been a very useful process.

Damn there are some clever people out there.
Sometimes I just feel inadequate, desperately trying to wrap my feeble brain around complex problems discussed by some of my peers.

Do you really need it?
There's a bunch of people out there intent on selling you a bunch of kit, but is it right for you?

The methodical process of troubleshooting.
My advice - start from the bottom & work up... or start at the top and work down.

Don't forget the basics.
There's pleasure to be found in the simple things in life. Or, if you want to be blunt, keep it simple, stupid!

The business is (nearly) always right.
I find there's direct correlation between good business reporting and being left alone to get on with work.

Try to get to SuperComputing.
No, really, it's a great show, with great people.


The Strike Zone

Know your underlying infrastructure.
You only need to look at the Top 500 results to know that a lot of HPC kit does not currently run Windows HPC Server. This means that many Windows HPC Sysadmins will be transferring their expertise from Linux (or other OS) based systems.

It's all in the name
How does Windows HPC handle name resolution?

Digging through the versions
SP1? Well, yes, but SP1 of what? And R2? Is that beta or RTM? So you want to install what on where now?

Performance enhancing shrugs
Well, it seems to run OK, but is OK good enough?

The database interloper
SQL Server? In every Windows HPC Server deployment? Surely not.

PowerShell is powerful
OK, it might sound like an obvious statement, but it's true! At the risk of offending those not listed, I have some favourite HPC PowerShell cmdlets I'd like to share.

Network topology - choices choices
RRAS, dedicated router, managed firewall and the like.

Monitor Lizard, the thinking person's test rig
If you've ever tried to compile HPL to run on Windows HPC you'll absolutely love Lizard. Performance results aside, it's a great starting point for cluster verification.

Node deployment
If one person with one DVD takes 2 hours to build one compute node, how long would it take that person to build a tree house?

A job a day keeps the user at bay
Become one with your users, know what they know, type what they type.