syslog.txt

7Jun98 8pm
##### PoP lives! #####
All nodes (pop01..14) have RH5.0 with 2.0.32 SMP kernel. pop05 has only
one processor, all others have 2.

10Jun98 4pm
pop03 seems to have fallen over at around 12 noon. ping doesn't respond,
no display on the local screen, CTRL/ALT/DEL has no effect, so hit the
system reset button. Booted okay, /var/log/messages shows that a timed
election was the last thing to happen before the crash. Saved a copy of
messages to /var/log/messages.10Jun98. CPU temp was shown at 32degC when
pop03 rebooted.
--
All nodes up again (pop01..14), pop05 still has only 1 processor.

13Jun98 3pm
Re-partitioned disk on pop01 to be just like the other nodes:

[root@pop01]# fdisk
Using /dev/hda as default device!

Command (m for help): p

Disk /dev/hda: 128 heads, 63 sectors, 620 cylinders
Units = cylinders of 8064 * 512 bytes

   Device Boot   Begin    Start      End   Blocks   Id  System
/dev/hda1            1        1      254  1024096+  83  Linux native
/dev/hda2          255      255      620  1475712    5  Extended
/dev/hda5          255      255      589  1350688+  83  Linux native
/dev/hda6          590      590      620   124960+  82  Linux swap

Note that to make a swap filesystem you need to use mkswap, since
mkfs -t swap just ain't valid (didn't seem obvious to me).
--
All nodes up (pop01..14), pop05 still has only 1 processor.

15Jun98 7pm
Replacement processor arrived for pop05 (300MHz PII). Simply did
/sbin/shutdown -h, pulled the plugs, fitted the fan, put in the
processor, put the box back in the rack and ... hey presto! pop05 boots
with 596 BogoMIPS. top also now shows two processes at 99%.
--
All nodes up (pop01..14) with 2 processors.

16Jun98 7:30pm
Power to the PoP rack (including rock) was lost sometime after 9pm on
15Jun98, a little after Alan had submitted enough jobs to have both CPUs
in operation on each node. Visible symptom was that the trip had gone on
the Digital voltage conditioner (VC). While resetting this it seemed
that the power strip for pop09--pop14 might be faulty (not thoroughly
looked into yet). The strip was replaced with a new one.
The other possibility is that all nodes working was enough to trip the
20A trip in the VC, so I rewired to have rock and pop01--pop08 on the
20A VC outlet and pop09--pop14 on a wall socket. Then powered up the
switch and all the nodes. Switch didn't boot in reasonable time and it
seemed to want to get the command `boot' from the serial port. Taking
power away again then restoring it didn't reproduce the problem,
strange! Nodes pop01--pop04, pop06--pop08 and pop10 booted fine, the
others didn't:

pop05 - got hung up on the BIOS CDROM detection claiming not ATAPI
        compatible. Set BIOS to quick boot and rebooted, then okay.
pop09 - tried reboot again, no good. Ran fsck /dev/hda5 to fix a few
        things, then okay.
pop11 - fsck /dev/hda5 said that it couldn't fix the superblock. Ran
        fdisk to rewrite the partition table, then fsck again to fix
        other stuff, then okay.
pop12 - fsck /dev/hda5, then okay.
pop13 - fsck /dev/hda5, then okay.
pop14 - fsck /dev/hda5, then okay.
--
[root@pop01]# date; pop.uptime
Tue Jun 16 19:44:16 EDT 1998
pop01  7:44pm  up  4:58,   2 users,  load average: 0.00, 0.00, 0.00
pop02  7:44pm  up  4:46,   0 users,  load average: 0.00, 0.00, 0.00
pop03  7:44pm  up  4:46,   0 users,  load average: 0.00, 0.00, 0.00
pop04  7:44pm  up  4:46,   0 users,  load average: 0.00, 0.00, 0.00
pop05  7:44pm  up  4:09,   1 user,   load average: 0.00, 0.00, 0.00
pop06  7:44pm  up  4:46,   0 users,  load average: 0.00, 0.00, 0.00
pop07  7:44pm  up  4:46,   1 user,   load average: 2.14, 2.00, 1.92
pop08  7:44pm  up  4:46,   1 user,   load average: 2.00, 2.00, 1.92
pop09  7:44pm  up 24 min,  0 users,  load average: 0.00, 0.00, 0.00
pop10  7:44pm  up  4:46,   0 users,  load average: 0.00, 0.00, 0.00
pop11  7:44pm  up 23 min,  0 users,  load average: 0.00, 0.00, 0.00
pop12  7:44pm  up 21 min,  0 users,  load average: 0.00, 0.00, 0.00
pop13  7:44pm  up 20 min,  0 users,  load average: 0.00, 0.00, 0.00
pop14  7:44pm  up 18 min,  0 users,  load average: 0.07, 0.02, 0.00
--
All nodes up (pop01..14).

20Jun98--28Jun98
Several thunderstorms causing power problems.
Whole machine fell over with one outage of ~5s. Nodes 9--14 fell over
more times as they are not on the power conditioner. pop14 reported hard
disk problems.
(reboots etc. by Alan Middleton)

30Jun98 3:30pm
Alan Middleton had found that pop14 would not reboot after outages
caused by thunderstorms last week. Today I tried to boot it and the BIOS
complained of a hard disk fault. I noticed that the HD light on the
front panel was permanently on, which had happened before when the HD
connectors weren't properly plugged in. Took off the lid and pushed all
connectors home, then it rebooted okay.
(Simeon Warner)

30Jun98 4:30pm
New power cables arrived and 20A (NEMA 5-20P) plugs fitted to use the
20A outlets on the DEC power conditioner. All systems halted using the
/root/bin/pop.shutdown script, cables swapped, pop01 rebooted and then
all others. All nodes came up okay.
(Simeon Warner)

4Aug98 3pm
Extra 8.4GB disk added to pop09 as disk09a and NFS mounted on other
nodes (just restarted nfsd on them, didn't have to reboot). The disk was
purchased by smc.
(Simeon Warner)

31Aug98 2pm
Extra 384MB RAM added to pop12, 13 and 14 for aam, bringing each machine
to a total of 512MB RAM. Of course had to take the machines down to do
this and also managed to accidentally reboot pop01 (oops!).
/etc/lilo.conf files edited and /sbin/lilo run. Some problems getting
networking working on reboot, connectors in the network cards of pop13
and pop14 seem very fussy about being carefully reconnected.
Otherwise most machines up for 2 months now:

pop01  2:15pm  up  1:48,            3 users,  load average: 1.00, 0.96, 0.65
pop02  2:15pm  up 61 days, 21:45,   1 user,   load average: 2.08, 2.02, 2.01
pop03  2:15pm  up 61 days, 21:43,   0 users,  load average: 2.08, 2.02, 2.01
pop04  2:15pm  up 61 days, 21:42,   0 users,  load average: 2.00, 2.00, 2.00
pop05  2:15pm  up 61 days, 21:43,   1 user,   load average: 2.08, 2.02, 2.01
pop06  2:15pm  up 61 days, 21:42,   1 user,   load average: 1.99, 1.97, 1.99
pop07  2:15pm  up 61 days, 21:43,   1 user,   load average: 2.00, 2.00, 2.00
pop08  2:16pm  up 61 days, 21:42,   1 user,   load average: 2.00, 2.00, 2.00
pop09  2:16pm  up 26 days, 21:06,   0 users,  load average: 2.00, 2.00, 2.00
pop10  2:16pm  up 61 days, 21:41,   0 users,  load average: 2.21, 2.15, 2.10
pop11  2:16pm  up 61 days, 21:43,   0 users,  load average: 2.00, 2.00, 1.99
pop12  2:16pm  up 42 min,           1 user,   load average: 0.00, 0.00, 0.00
pop13  2:16pm  up 29 min,           0 users,  load average: 0.00, 0.00, 0.00
pop14  2:16pm  up 25 min,           0 users,  load average: 0.00, 0.00, 0.00

28Sep98
PoP was up for 87 days until electrical outage on Sat 26Sep98. Rebooted
on Sun 27Sep98 and all okay. Outage again at 9am Mon 28Sep98. Rebooted
again at 2pm Mon 28Sep98 (order pop01, pop09, rest, mount -a on pop01).
pop05 had problems with CDROM detection, needed second reboot but worked
without changes. pop10 complained about disk and fsck lost a couple of
files on /tmp. These were files currently open when the power was lost.
(Simeon Warner)

29Sep98
Extra 13.6GB disk added to pop08 as disk08a and NFS mounted to other
nodes. Did not partition, just 'mkfs -t ext2 /dev/hdb1'.
(Simeon Warner)

16Oct98
Changed backup strategy to use do_backup.pl script.
(Simeon Warner)

18Dec98
Modified /etc/rc.d/init.d/atd to start atd with -l 1.8 option as
requested by Dave McNamara.
(Simeon Warner)

-------------------------------------------------------------------------------
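
Note on the pop.uptime listings in the 16Jun98 and 31Aug98 entries: the
helper itself is not reproduced in this log. The following is only a
minimal sketch of how such a helper might work. The function names, the
REMOTE variable and the choice of rsh as the remote command are all
assumptions made for illustration, not the actual script.

```shell
#!/bin/sh
# Hypothetical sketch of a pop.uptime-style helper (not the real script).
# REMOTE is an assumed knob: the command used to run something on a node.
REMOTE=${REMOTE:-rsh}

# Print the node names pop01..pop14, one per line.
pop_nodes() {
    i=1
    while [ "$i" -le 14 ]; do
        printf 'pop%02d\n' "$i"
        i=$((i + 1))
    done
}

# Print "<node> <uptime output>" for each node in turn.
# stdin is redirected so the remote command cannot swallow the loop's input.
pop_uptime() {
    for node in $(pop_nodes); do
        printf '%s ' "$node"
        $REMOTE "$node" uptime </dev/null || echo "(no response)"
    done
}
```

Calling pop_uptime after sourcing this file would print one line per
node. A node that cannot be reached shows "(no response)" instead of
hanging up the rest of the listing, assuming the remote command itself
returns with an error.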