One of my clients had been having ongoing issues with their Xserve G5. After test after test and fix after fix, not only could I not get it permanently fixed, I couldn’t even find the root of the problem.
Here’s what I ended up putting in our SR:
[The Client]’s Xserve is running 10.4.9 and is connected to an Xserve RAID with the latest firmware. It is connected to the network via Gigabit Ethernet and is usually connected to an external LaCie backup drive via FireWire 800.
With increasing frequency, the server has been freezing up. Users are unable to connect to file services. The server does not respond to VNC, SSH or ping. When a keyboard, mouse and monitor are connected, the mouse pointer will move but Aqua and all programs remain unresponsive, without even a beach ball. By this point, all of the system fans are running at full speed and the only way to restore the server is to force it to power down from the front panel.
The server will quickly come back up. There is rarely anything to be found via the console. Crashdump almost never gets a chance to write an entry. Occasionally, Retrospect will warn that a crash or power interruption may have interfered with a backup.
The vast majority of these crashes happen at night. Nearly every one can be traced back to the time at which the backup runs. When the backup is manually run during the day, the server will crash within the first fifteen minutes nearly every time.
During normal operation for [the Client] there is very rarely a crash during business hours.
While trying various backup methods I have noticed what I believe to be the problem.
During sustained reads and writes, the System Controller Vcore voltage can be seen in Apple Server Monitor to fluctuate increasingly as time goes on.
Normally, it sits at a steady 1.70V, but on this Xserve it begins to drop after a while. Then, as it meanders up and down, if it drops below 1.00V the server locks up completely.
It seems that the added overhead of local disk writes (including FW and USB) pushes it over the edge during those sustained reads and writes.
Network usage (which seems to include the Fibre Channel for the RAID) does not have quite the same overhead, so the voltage has not yet dropped below 1.00V during it.
I believe that the System Controller or some associated component is defective and causing an improper drain on the system. When the core voltage drops too low, the system freezes.
Earlier in the day, I was able to run a 50 GB sustained duplication via the network to a backup HD located on another machine. I was also able to run a 200 GB sustained duplication to a second array on the Xserve RAID.
While running both of those I watched the System Controller Vcore voltage fall below 1.70V and fluctuate down to as low as 1.50V, but the server remained online. Unfortunately, while writing this, I have lost contact with the server as it was trying to run one of the alternate backups that I had hoped would get around the problem.
On another client’s Xserve G5 with a similar setup I ran a sustained duplication while watching the System Controller Vcore and the voltages never once wavered from 1.70V.

(bad server on the left, good one on the right)
Apple would not absolutely confirm my diagnosis without sending someone onsite. They would, however, send us a new Logic Board, for a price. The server was well out of warranty, but the cost of the board was much lower than that of a new server.
Once the new board was in, I ran several stress tests on the server. With transfers well over 3 GB a minute for over an hour, the System Controller Vcore never once wavered from 1.70V.
I wish I had found this one sooner, but at least it is fixed now.
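For anyone who wants to apply the same pass/fail criteria to readings jotted down during a stress test, here is a minimal Python sketch. It is only an illustration of the thresholds described above (1.70V nominal, lock-ups below 1.00V). The function name, the 0.01V tolerance and the sample readings are all hypothetical, and it does not read anything from Apple Server Monitor; the values have to be entered by hand.

```python
from typing import Sequence

NOMINAL_V = 1.70    # steady reading seen on the healthy server
LOCKUP_V = 1.00     # reading below which the faulty server froze
TOLERANCE_V = 0.01  # assumed allowance for display rounding (hypothetical)


def classify_vcore(samples: Sequence[float]) -> str:
    """Classify a run of manually recorded Vcore readings (in volts).

    Returns "stable" if no reading falls meaningfully below nominal,
    "sagging" if readings drop but stay at or above the lock-up level,
    and "lockup-risk" if any reading falls below the lock-up level.
    """
    if not samples:
        raise ValueError("need at least one voltage reading")

    lowest = min(samples)
    if lowest < LOCKUP_V:
        return "lockup-risk"
    if lowest < NOMINAL_V - TOLERANCE_V:
        return "sagging"
    return "stable"


if __name__ == "__main__":
    # Hypothetical readings, not measurements taken from the actual servers.
    print(classify_vcore([1.70, 1.70, 1.70]))
    print(classify_vcore([1.70, 1.62, 1.50, 1.66]))
    print(classify_vcore([1.70, 1.40, 0.98]))
```

By this measure, the replacement board's hour-long run would count as "stable", while the earlier network and RAID duplications on the old board would count as "sagging".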
