2007/04/14 03:42: Servers and such

It's pretty bad when your laptop has better uptime than your servers.

I've been planning a new storage cluster server for a while. I'd finally bought the hardware, finished the design testing, and was about to load-test it when the storage array on my main server began to fail.

The main server's been out of storage for a while, so the new one was going in anyway, and I'd planned to do it just the way I am now.

So I bump up the plans to put the server in. Having made sure the design of the system was sound, I figure the stress test would only have shown me how much faster it could go than my current machine.

I come up to Ouray in the middle of the night and put the new server in the rack, move the data over to it and go home. Total downtime? 5 hours. So far, so good.

The only problem? The new NFS server that handles the storage occasionally locks up under heavy I/O.

Back to Ouray.

Update some things. Try to find the source of the lockup. It's an obscure, possibly SMP-only bug in the Linux kernel, or it's my hardware. I get "NMI Watchdog detected lockup on CPU", one of the CPUs in the system freezes, and the tasks running on it are killed. Then the other one goes some time later.

Apply a few trial fixes. Seems stable for hours. Go home.

Server's frozen.

Wash. Rinse. Repeat.

I gotta learn to drive.

This morning, same thing. Hitch to Ouray. I'm changing kernels now, and with luck, that'll take care of it. I think I'll just spend my day up here. Ouray's nice in the off-season.
