Something a bit sad here. One of my happiest, best-running file servers (a white box) running Solaris 11.1 (x86) has seemingly failed me after about 78 days of uptime since the last patch. I'll explain the background:
* Board: GA-Z77-UD5H, LGA1155
* RAM: 16GB
* CPU: Intel Core i5, current gen (well, one back from Haswell)
* Dual 7200 RPM 2.5" SATA boot drives in a ZFS rpool mirror
* OS: Solaris 11.1 with support repositories.
Box had been up for about 76 days. Thought it was sane to give it a patch and a reboot!
pkg image-update -v etc etc.
It made a new boot environment, and off we went.
Reboots now hang at the banner, consistently. I have ripped all the disks out of the system, swapped all the DIMMs out, checked for extra PCI-E network devices (there were none), reset the CMOS to defaults, upgraded the BIOS, and tried booting from the CD. It still hangs at the banner. No matter what I do, it hangs at the banner. It has driven me nuts all day. I managed to boot Windows and Ubuntu just fine, so I am wondering what on earth is going on here. I started to assume hardware, but I don't think that's the case.
When I add a -v to my boot args, I see:
SMBIOS v2.7 loaded (10333 bytes)
initialized model-specific module 'cpu_ms.GenuineIntel' on chip 0 core 0 strand 0
root nexus = i86pc
pseudo0 at root
pseudo0 is /pseudo
scsi_vhci0 at root
scsi_vhci0 is /scsi_vhci
npe0 at root: space 0 offset 0
npe0 is /pci@0,0
Sorry, I don't know. If I could boot it at all, I could simply run a pkg info entire and it would tell me.
We can assume it's a very current SRU, on the basis that I kept it well updated and patched from the support repository on a monthly/bi-monthly basis. Sorry, I just don't remember the SRU string off the top of my head.
I wonder if somehow finding an older Solaris 11 disk would help it?
Nah, unfortunately, it doesn't seem to help. I flip back to the previous BE and the behaviour is the same. I actually now believe it's hardware related. I got the SRU out of it in the end, too. It's 188.8.131.52.1
I have a screenshot of it hanging on the npe driver module load, if that's useful?
Yes, I'll try to look up the npe driver message later, but if this is a hardware issue, it probably won't help.
If the problem is one of your root pool disks, I would expect a more obvious error message. There might be a better way, but if you can boot from media or an install server and attempt to import the root pool, then that would give us more clues.
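For what it's worth, a minimal sketch of that import check, assuming you can get a root shell from the live/install media. The function name `inspect_rpool`, the pool name `rpool` and the alternate root `/a` are only assumptions for illustration; adjust them to your setup.

```shell
# Hypothetical helper: run as root from a live/install media shell.
inspect_rpool() {
    pool="${1:-rpool}"     # root pool name (assumed default)
    altroot="${2:-/a}"     # scratch mount point (assumed default)

    # List the pools visible on the attached disks without importing anything.
    zpool import || return 1

    # Import under an alternate root so the pool's mount points do not
    # collide with the running media environment. -f forces the import,
    # since the pool was last in use by the installed system.
    zpool import -f -R "$altroot" "$pool" || return 1

    # Show device state and any read/write/checksum errors.
    zpool status -v "$pool"

    # Show the boot environment datasets stored under the pool.
    zfs list -r "$pool/ROOT"

    # Export again so the installed system can pick the pool up cleanly.
    zpool export "$pool"
}
```

Calling `inspect_rpool` with no arguments uses the defaults above. A failing disk would normally show up in the `zpool status -v` output as a DEGRADED or FAULTED device.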
Tried that. No media now boots at all on the host. It gets to exactly the same point trying to load the npe driver, then hangs the whole host, even from a live disc, even when all HDDs are unplugged, even with a full BIOS reset. Pushing the BIOS back one or two revisions made it sort-of-kind-of-work, but it didn't help. Still hangs and panics. I think it might be hardware related, but I cannot be sure.
Did it panic before? I thought it just hung. What is the panic string? I haven't found anything about npe hanging. There was a bug with the error "WARNING: npe1: no ranges property", but this isn't it. There was also an issue with some x86 systems needing ACPI disabled, but you would have had issues previously.
Now I have swapped out the motherboard for a completely different, less complex model that a friend had handy, and even now, with no HDDs plugged in, booting Solaris 11.1 Live or text, I still hang at the npe0 message. This is making absolutely no sense. So now I don't suspect the hardware, but I *do* suspect that Oracle has done something to the compatibility with current generation Z77 series motherboards/Ivy Bridge IOCH/MPH controllers.
Man, this is getting confusing.