From d7v@RedBrick.DCU.IE Fri Jan 22 14:56:41 1999
Path: Mother.RedBrick.DCU.IE!d7v
From: d7v@RedBrick.DCU.IE ()
Newsgroups: redbrick.news
Subject: Redbrick Downtime and stuff
Date: 22 Jan 1999 14:55:11 GMT
Organization: DCU Networking Society
Lines: 201
Distribution: local
Message-ID:
NNTP-Posting-Host: mother
User-Agent: slrn/0.9.5.4 (UNIX)
Xref: Mother.RedBrick.DCU.IE redbrick.news:3536

As you've probably noticed by now, redbrick is back after the best part of 2 days of downtime. The reason for this downtime was to replace the existing dual Pentium motherboard with a dual PPro motherboard which I loaned to redbrick for a while, probably until sometime in February, maybe March.

It was obvious to admins and members alike that redbrick was pretty fucked up for a while. There were 3 suspects: the motherboard/processors/RAM, the disks/SCSI cables, and the actual installation of Solaris. The first was the prime suspect due to the recent CPU fan failure, which caused serious overheating and may have made the CPUs unreliable. So the decision was taken to get a temporary motherboard and install it. Needless to say, this did not go according to plan, primarily because Solaris is shite. The exact details for the interested are laid out below.

First, for the less technically inclined, here is the basic situation.

1. Redbrick will stay up if the fault was in the motherboard/CPUs/RAM. If the drives/SCSI cables are bad, then it will crash. If the installation is corrupt, then it will crash.

2. It is currently running on a dual PPro 166MHz board, with 160MB RAM. In performance terms, this is both good and bad. It is good because the CPUs are a _lot_ faster than the old Pentium 200s. It is good because the CPUs have internal L2 cache, 512KB each, which is much faster than the shared L2 cache the old board had. It is bad, however, because 160MB RAM is only enough to handle about 50 users before swapping.
Once swapping starts getting serious, probably at about 80 users, maybe 100, swapping will make things crawl. Before anyone asks: no, we cannot take the RAM from the old board, because it is very different from that needed for the new board. If anyone has 32MB or greater of buffered ECC EDO 60ns DIMMs lying around gathering dust, now would be the time to donate them.

3. Chat is down and will probably stay that way. In order to resurrect mother, the 2940 SCSI card in nurse was needed. No nurse = no chat server running on nurse. Unless one of the admins can convince the chat server to run on mother, which it doesn't like for some crap reason.

4. The system will probably need to be taken down for an hour or so, to fix up some lingering remains of the brute-force approach to making the thing work.

The Team:

The team assembled consisted of orthan, valen, spinal, grimnar and myself. Orthan and myself were there for the entire saga, valen for most of it, with spinal and grimnar popping in occasionally, mostly to laugh. The details below don't specify who did or thought of what, because I can't remember for most of it.

The Plan:

Remove the old dual Pentium board and replace it with my nice, known-to-be-working dual PPro board. Solaris should reconfigure itself fairly easily, as the onboard SCSI was almost the same as on the old board.

The Details:

1. Mother was already down, so we started (about 11-ish Wednesday). The motherboard was removed, the spacers adjusted to hold the new board, and the new board fitted. Problem: the new board had only 68-pin onboard SCSI, but the system drive was 50-pin SCSI. A SCSI shape-shifter was looked for, but to no avail. Solution: remove the 2940 SCSI card from nurse and connect the system drive to that. Boot machine. Hardware fine. Everything detected at boot time. So far so good. There would be a minor problem, since one of the drives would be detected on controller 0 and one on controller 1. So vfstab would need to be fixed.
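For the curious, the vfstab fix is just a matter of pointing the entries at the drive's new controller number. The fragment below is a sketch from memory, not mother's actual file, and the target/slice numbers are invented for illustration:

```
# /etc/vfstab - system drive used to live on the onboard controller (c0),
# but after the 2940 went in it shows up on the new controller (c1).
#
# device to mount    device to fsck       mount pt  FStype  pass  boot  opts
#
# before:
# /dev/dsk/c0t0d0s0  /dev/rdsk/c0t0d0s0   /         ufs     1     no    -
# after:
/dev/dsk/c1t0d0s0    /dev/rdsk/c1t0d0s0   /         ufs     1     no    -
```

In theory a one-character edit per line. In practice, read on.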
But this was before we learned something important:

2. Solaris is shite.

3. Let Solaris boot. It notices that something has changed. Good. Detects all the new stuff on the new board. Finds the system drive. Gives us a boot menu: both drives, the CDROM and the network. Select the correct drive. Makes an attempt to boot. Complains that it can't find any of the filesystems. Wants to reboot. Demands a -b flag. At this point we confirm:

4. Solaris is shite.

5. Boot with -b. Struggles to boot. Mounts / read-only. Do an ls, get 'ls: not found'. Not good. Look in /bin to determine the available weaponry. /bin is a link to /usr/bin. /lib is a link to /usr/lib. Total available commands: echo * in lieu of an ls, and the contents of /sbin. Not a lot. Now convinced that:

6. Solaris is shite.

7. It's obviously confused by the presence of 2 SCSI controllers instead of one. Get it to rescan for devices during bootup and disable the onboard controller in the Solaris boot configuration proggy. Still doesn't work. Getting annoyed. Get hassle from members demanding instant fixing. Ask for suggestions. None offered. Get more annoyed. Hard to deny:

8. Solaris is shite.

9. Seriously outgunned by Solaris's suckiness, reinforcements were sought. Sent out requests for a Solaris install CD and decided to construct a modified copy of Tom's boot/root disk, the emergency Linux distribution for when stuff will not work, like now. Went home. Downloaded a copy of Tom's boot/root disk, compiled a new kernel with SCSI support and support for Solaris filesystems. Assembled a new Tom's boot/root as needed for mother. Pondering the fact that:

10. Solaris is shite.

11. It is now Thursday morning, about 11. Battle commences again. Boot off Tom's boot/root disk. Linux sees everything fine. All devices confirmed as working. Wish mother was a Linux installation. Mount mother's / and have a look. Fuck all useful there.
Mount mother's /usr and bludgeon together a few useful commands: ls, mknod, fsck, and most importantly drvconfig, the autoconfiguration tool used when drives and controllers change. Compared to Linux:

12. Solaris is shite.

13. Boot into Solaris. Moans a lot during bootup. Run drvconfig. Won't work. Fix libraries by booting into Linux. Still doesn't work, because it can't write to /etc/path_to_inst. Annoyed. Can't remount / rw because the device node doesn't work, and can't get a working device node without a rw /. Vicious circle detector goes off. Can't make nodes under Linux, because Linux UFS support doesn't do device nodes. Don't know the major and minor numbers anyway. Did I mention:

14. Solaris is shite.

15. Find a Solaris 2.6 install CD. Disconnect the /home drive just in case. Mangle a shell out of it. Mount mother's /. Tar up the device nodes from the successful boot and untar them into mother's /devices. Missing device node for the onboard controller. Seems to ignore it because no drives are connected. Connect /home back and hope not to damage it. Rinse and reboot. Make a new set of device nodes, with both controllers this time. They look ok. Not confident though. Make links from /dev/dsk and /dev/rdsk to the appropriate places. Looking good. Try a reboot and confirm:

16. Solaris is shite.

17. For some reason the same device nodes do not work when booting off the hard disk. Not impressed and, at this stage, not even surprised. Don't know why. Reboot off the CD. Mount mother's /. Do a drvconfig with an option to specify the location of root. Semi-ignores the parameter and tries to write to /etc, not /mnt/etc. Still sort of creates the /mnt/devices dir. Sort of. The evidence mounts:

18. Solaris is shite.

19. Formulate a plan. Mount mother's / over the CD's /etc, which was read-only, hence the problem. Run drvconfig. Complains about missing files in /etc. Unmount mother's /, mount as /mnt, copy the file and remount as /etc. Complains about another file. Curse at machine.
Remount mother's / as /mnt. Copy all of /etc to /mnt/etc.temp. Remount mother's / as /etc. Copy each file as drvconfig complains. Manage to get a valid /etc/path_to_inst file which matches the device nodes. Carefully copy it into /etc/etc and hope for the best. Working (hopefully) device nodes are now in mother's /pci@0,0/... hierarchy. Expect more problems, because after all:

20. Solaris is shite.

21. Reboot into Solaris. Try to mount the device nodes as / and /usr manually. Demands an fsck. Try to fsck; it claims fsck is not needed on UFS filesystems. Quelle surprise. Reboot into Linux to fix, because:

22. Solaris is shite.

23. Add links from /dev/dsk and /dev/rdsk to the new device nodes in /pci@0,0/whatever. Uncomment /usr and /var in /etc/vfstab. Reboot. For some reason, the auto fsck on startup works and mounts /, /usr and /var ok. Confused as to why. Remember that:

24. Solaris is shite.

25. Add links to device nodes for /usr/local, and now everything on the system disk should work. Fsck fails on /usr/local for some reason. Reboot again. Manual fsck works for some reason. All now working except for /home. Nothing seems to work on the onboard controller. Disconnect the UTP to prevent users and mail getting in. Not impressed. Run drvconfig again. Doesn't do SFA. Delete the entire hierarchy of device nodes for the first controller. Drvconfig again. Makes device nodes for the /home disk. Make links from /dev/dsk and /dev/rdsk. Still don't work. Fuck. Notice a typo in the links. Make the links again. Works. /home is now mounted. Happy. Retire undefeated from the world of Solaris administration, because no matter what anyone says about it, on x86:

26. Solaris is SHITE.

BTW, in case anyone has any bright suggestions on how we should have done it, fuck off, because we probably tried it and it didn't work.

Tony.
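P.S. For anyone morbidly curious, the eventual winning sequence boiled down to roughly the below. This is a sketch from memory, run from the install CD's shell; the disk slice and /devices paths are invented for illustration, so don't go pasting it anywhere:

```shell
# Boot off the Solaris 2.6 install CD and mangle a shell out of it.

# Mount mother's root over the CD's read-only /etc, so drvconfig can
# actually write /etc/path_to_inst (slice name illustrative):
mount /dev/dsk/c1t0d0s0 /etc

# Rebuild the /devices tree for the controllers actually present:
drvconfig

# The new nodes live under /devices/pci@0,0/...; link the usual
# /dev/dsk and /dev/rdsk names to them (paths invented here):
ln -s ../../devices/pci@0,0/pci9004,8178@11/sd@0,0:a /dev/dsk/c1t0d0s0
ln -s ../../devices/pci@0,0/pci9004,8178@11/sd@0,0:a,raw /dev/rdsk/c1t0d0s0

# Check the filesystem and mount it:
fsck -y /dev/rdsk/c1t0d0s0
mount /dev/dsk/c1t0d0s0 /mnt
```

About four lines of the above took two days. See points 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24 and 26 for why.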