Archive for January, 2007

Fileservers rebooted

Monday, January 22nd, 2007

To correct a potential kernel bug affecting the SCSI driver in the storage nodes, we had to build patched Linux kernels and reboot into them today.

This should not have affected any running jobs or sessions; they would simply have paused while waiting for the NFS servers to come back.

Status Update - 8:50pm

Wednesday, January 17th, 2007

Brecca, Wexstan and Edda are all running jobs as normal.

Status Update - 7pm

Wednesday, January 17th, 2007

We are now bringing up nodes on Edda by hand, and we expect to have all clusters up and running jobs in the near future.

We are currently limiting jobs to a maximum walltime of 2 months. We expect to be purchasing a replacement for Brecca at around that point, and integrating it is likely to require a complete shutdown of all clusters.

Status Update - 5pm

Wednesday, January 17th, 2007

Brecca is now up and accepting jobs, but is not starting them yet.

Wexstan is now up and accepting jobs, but is not starting them yet.

Edda is still being worked on. We have managed to migrate all nodes to the existing management controller, but the cluster management software must be upgraded to cope with this.

Hardware problems on Brecca

Wednesday, January 17th, 2007

The management node on Brecca is displaying hardware problems that are preventing us from bringing it up reliably. We are working on getting it going, as well as on a workaround that would allow us to bring up the other clusters in its absence.

Status Update

Wednesday, January 17th, 2007

We are working to bring the clusters online, but one of the management controllers we need to bring Edda up has hardware problems. We are migrating the systems that the failed HMC manages onto the working one, but this has involved upgrading the software on the HMC (and downloading two install DVDs from the IBM website).

We have taken this opportunity to upgrade our storage nodes from Fedora Core 4 to Ubuntu 6.06 LTS. We are also defragmenting the filesystems on them, as we believe fragmentation has caused the recent poor performance of users' home directories.

More news as we have it.

VPAC clusters down - Victorian Power Outages

Wednesday, January 17th, 2007

The VPAC clusters were taken down late last night because of forecast power outages. The situation today appears very uncertain, with the electricity regulator forecasting more blackouts across the state due to excessive demand.

Because of the chance of more blackouts, the systems team at VPAC are currently working flat out to do essential systems maintenance and are monitoring the situation to decide when to bring the clusters back up.

Apologies for having to do this; the extraordinary circumstances are beyond our control.