
10 May 2009

Server Died


So I got home last night to find the server had power but was not responding (mouse, keyboard, ssh, anything). I tried rebooting, but it kept hanging at "Hostname: serveris" and wouldn't go any further (even in single-user mode). I saw some chat logs online that suggested adding '-k -a -d verbose' and answering '/dev/null' to any questions (like the /etc/system replacement)...

I tried that and got this far (see left)...

Looking around some more, I saw that if I changed the '-k' to '-kd' it would drop into debug mode. At that point, I did the following:



[0] moddebug/W 80000000
[0] :c

This allowed me to see a few more details.... (sorry for the blurriness of the pic - it was about 3am)

After trying to find anything online that would help (and trying the IRC channel), I finally said screw it and decided to reinstall OpenSolaris on the root mirror.

I downloaded the USB version of OpenSolaris 0906 111a, but evidently my quad core machine does not have the option of booting from USB (WTF?). I reburned the CD version and installed it. One thing that confused me is that although my old system was 10/08 upgraded to 111a and the new version was supposed to be 0906 111a, it now says 101b.

When I tried to boot the new one, it hung again. At a different position, but... I was starting to think it was a hardware problem. I let it try to boot overnight and the next morning it was finally at the login prompt... with the old install.

The logs showed that it had tried to load the Belkin UPS a few steps after where it locked up, so I unplugged the UPS. I went ahead and applied all updates and rebooted. It took about 5 hours for it to finally boot again (though it did). It still says I am using 101b and that there are no new updates.

The xVM instance is there and I was able to start it. The whole root zone however is gone. The ZFS partition is there, and empty. zoneadm doesn't show anything but global. So, I am going to try to recreate the global zone, but... I still don't know what happened. I am also concerned that it currently takes about 5 hours to boot.

6 comments:

  1. I've just experienced the same problem you did! My UPS died overnight (caused by the mains power fuse I suspect as changing the power cable to the UPS appears to have fixed it).

    Booted OpenSolaris (2009.06) this morning and it hung. Did the "-v -m verbose" addition to the GRUB command line and watched it boot, hanging at:

    lx_systrace0 is /pseudo/lx_systrace@0

    After a couple of reboots, I left it to see if it was doing something. Came back 30 minutes later and OpenSolaris had finished whatever it was doing and had completed the boot.

    Then found your blog post... ;-)

  2. Yeah, I still haven't figured out what is causing it. Every time I reboot now it takes 4-5 hours. Really sucks.

  3. I just hit the same problem with my home machine as well. Everything was working fine until I removed the IDE DVD-ROM drive I used to install 2009.11. Luckily I can still boot to the original BE once every couple of tries. This is very annoying nonetheless - will keep searching...

  4. I don't know if either of you are still watching this, but it was just resolved. I updated the notes on the recent blog post.

    Hanging During Boot

  5. Hi Malachi

    Thanks for the update to this problem. At least this explains the problem (I had a significant number of auto created snapshots on several filesystems).

    Cheers

    JR
