Hardware monitoring on Xserve

Posted: 25 Jul 2006, 21:54
by nelsonjm
When you run 10.4/10.3 on an Xserve you can get a fairly comprehensive log of hardware issues by reading /var/log/hwmond

Is there a comparable hardware monitor for YLD that reports ECC, core voltage, and other errors?

Posted: 26 Jul 2006, 14:40
by StarKnight83
If you look around, there might be a gkrellm plugin for that (I understand the plugin framework is fairly easy, so you might be able to make your own). Mainly you just cat the file with a few modifiers, like a text-based one (grep and the like).

Posted: 27 Jul 2006, 20:28
by nelsonjm
Thanks for the suggestion.

I am actually looking for something that will write to a log or provide another CLI method of informing me that voltages, fan speeds, or ECC RAM are out of whack.

One of the plugins that gkrellm has is for lm_sensors which seems to provide partially what I want.. if it works... however I still have not found a YLD compatible ecc error reporter. Do you know of any?

Posted: 28 Jul 2006, 08:36
by StarKnight83
Unfortunately no; out of my 7 computers not one of them is ECC capable, so no help here. What you could do is look at what lm_sensors outputs (it can be a plaintext file) and just do string analysis on it to get just the info you need. If certain parameters are out of range, then email the admin. A bash script and cron job could do this (I'll even try to help you with it if you'd like; just email me off-board and I'll see what I can do).
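
To make the idea concrete, here is a rough sketch of the "string analysis" step, written in Python rather than bash. It assumes the text looks roughly like what the lm_sensors `sensors` command prints ("label: value unit ..."). The labels and limits in LIMITS are made up for illustration and would need to match whatever the real hardware reports.

```python
import re

# Hypothetical limits (low, high); real labels and values depend on
# the board and sensor chip, so treat these as placeholders.
LIMITS = {
    "CPU Temp": (0.0, 70.0),
    "fan1": (1000.0, 6000.0),
    "Vcore": (1.20, 1.45),
}

# Matches lines such as "CPU Temp:  +45.0 C  (high = +80.0 C)":
# a label, a colon, then the first number on the line.
LINE_RE = re.compile(r"^\s*([^:]+):\s*([+-]?\d+(?:\.\d+)?)")


def out_of_range(sensors_text, limits=LIMITS):
    """Return (label, value, low, high) for each reading outside its limits."""
    problems = []
    for line in sensors_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # chip names, adapter lines, blank lines
        label = match.group(1).strip()
        value = float(match.group(2))
        if label in limits:
            low, high = limits[label]
            if not low <= value <= high:
                problems.append((label, value, low, high))
    return problems
```

A cron job could feed the saved sensor output into out_of_range() and send mail only when the returned list is not empty. The mail step is left out here because it depends on how the machine is set up.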

Posted: 24 May 2007, 16:58
by nelsonjm
I got in contact with an IBM PPC kernel hacker who was very helpful in working with me on this problem. We currently have a rudimentary EDAC-based ECC driver that, in my experience, works quite well. This will hopefully be merged into the kernel some time after version 2.6.21.

My next plan is to get everything set up with 5.0.1/5.1, then get the system identifier light working and write a userland app that checks SMART errors, temperatures (whose sensors still need some work to report correct locations), and ECC errors and writes them to the logs... :)
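
For the ECC part of such a userland checker, a minimal sketch in Python might look like the following. It assumes the driver exposes counters the way the mainline EDAC subsystem does, with a ce_count and ue_count file per memory controller under /sys/devices/system/edac/mc/mcN. That layout is an assumption here; the driver described in this thread may differ. The function names are made up for the example.

```python
import glob
import os


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} from EDAC-style sysfs files.

    The default path is an assumption about where the counters live.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(edac_root, "mc*"))):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            path = os.path.join(mc_dir, filename)
            try:
                with open(path) as handle:
                    entry[key] = int(handle.read().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[os.path.basename(mc_dir)] = entry
    return counts


def format_report(counts):
    """Turn the counts into one log-friendly line per memory controller."""
    lines = []
    for name, entry in sorted(counts.items()):
        lines.append(
            "%s: corrected=%s uncorrected=%s" % (name, entry["ce"], entry["ue"])
        )
    return lines
```

The lines from format_report() could then be handed to whatever logging mechanism the app ends up using. Taking the directory as a parameter keeps the reader testable on a machine without ECC hardware.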