HP Z800 problem



Bok
12-24-16, 12:05 PM
I picked up an HP Z800 off eBay last week: 2x X5670s, 48 GB RAM and 2x 160 GB drives.

Plugged it in and installed my go-to, CentOS 7. All good; got BOINC installed and running. Next day it locked up, and I could not get it to reboot into Linux. Reinstalled CentOS and again it was up and running. Next day it locked up again.

So I disabled the RAID, reinstalled CentOS 7 again yesterday, got BOINC up and running and left a 'top' window open. Well, today it locked up again and, very bizarrely, top shows it had been up for 1 day and 0 min. Exactly 1 day. I suspect the others were too, but I don't know for sure. Temperatures weren't too bad.

top - 10:32:50 up 1 day, 0 min, 3 users, load average: 24.82, 24.80, 24.84
Tasks: 416 total, 2 running, 414 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 99.9 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 s
KiB Mem: 49277756 total, 3696072 used, 45581684 free, 2728 buffers
KiB Swap: 24780796 total, 0 used, 24780796 free. 1571584 cached Mem

Going to install a different distro, but this is very odd. Any ideas??

zombie67
12-24-16, 06:32 PM
Exactly 24 hours every time? Sounds like it is somehow related to scheduled activities. Something in cron? Or maybe checking for updates?
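
A rough sketch of how to check for anything on a daily schedule. It assumes a stock CentOS 7 layout (cron directories under /etc, plus systemd timers); the function name list_scheduled is made up for illustration.

```shell
#!/bin/sh
# Sketch: show the usual places a once-a-day job could live.
# Assumes stock CentOS 7 paths; list_scheduled is an illustrative name.
list_scheduled() {
    echo "== root crontab =="
    crontab -l 2>/dev/null || echo "(none)"

    echo "== system cron directories =="
    for d in /etc/cron.d /etc/cron.hourly /etc/cron.daily; do
        if [ -d "$d" ]; then
            ls -1 "$d"
        fi
    done

    echo "== systemd timers =="
    systemctl list-timers --all --no-pager 2>/dev/null \
        || echo "(systemctl not available)"
}

list_scheduled
```

Run it as root so that 'crontab -l' shows root's own crontab. Anything listed with a roughly daily interval would be worth a closer look.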

scole of TSBT
12-24-16, 09:23 PM
Maybe something like this?
https://www.reddit.com/r/linuxadmin/comments/392jlw/centos_7_server_freezing_everyday/

Bok
12-26-16, 10:52 AM
I've re-installed again and it's been up for 22hrs now, will see what 24hrs brings. I have two terminals open with 'top' running in one and tail -f /var/log/messages in the other.

There are some errors in there that I haven't quite figured out yet. Every 10 minutes there is a systemd-logind failure:

Dec 26 10:50:01 hpz800-1 systemd: Created slice user-0.slice.
Dec 26 10:50:01 hpz800-1 systemd: Starting user-0.slice.
Dec 26 10:50:01 hpz800-1 systemd: Started Session 159 of user root.
Dec 26 10:50:01 hpz800-1 systemd: Starting Session 159 of user root.
Dec 26 10:50:01 hpz800-1 systemd: Removed slice user-0.slice.
Dec 26 10:50:01 hpz800-1 systemd: Stopping user-0.slice.
Dec 26 10:50:01 hpz800-1 systemd-logind: Failed to remove runtime directory /run/user/0: Device or resource busy
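
As an alternative to watching an open 'top' window, the uptime could be written to disk once a minute so the last entry survives a lockup. This is only a sketch: the log path and the function names fmt_uptime and log_uptime are assumptions, and it relies on /proc/uptime being readable.

```shell
#!/bin/sh
# Sketch: append a timestamped uptime line to a file so the last entry
# shows how long the box had been up when it hung.
# LOG path and function names are illustrative assumptions.
LOG="${LOG:-/var/tmp/uptime.log}"

# Turn whole seconds into "Nd Nh Nm".
fmt_uptime() {
    s=$1
    echo "$((s / 86400))d $((s % 86400 / 3600))h $((s % 3600 / 60))m"
}

# Read seconds since boot from /proc/uptime (first field, fraction dropped)
# and append one line to the log.
log_uptime() {
    read -r up _ < /proc/uptime
    echo "$(date '+%F %T') up $(fmt_uptime "${up%.*}")" >> "$LOG"
}
```

To use it, source the file and run something like: while log_uptime; do sync; sleep 60; done. After the next hang, the last line of the log gives the uptime at the time of the lockup to within a minute.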

Bok
12-26-16, 12:50 PM
And it has locked up... I did apply that khugepaged fix too. This is just truly bizarre. 1 day and 2mins of uptime. Nothing more in the syslog. No cron tasks running.

top - 12:45:49 up 1 day, 2 min, 4 users, load average: 0.00, 0.01, 0.05
Tasks: 310 total, 1 running, 309 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 s
KiB Mem : 49277716 total, 45547228 free, 753644 used, 2976844 buff/cache
KiB Swap: 15626236 total, 15626236 free, 0 used. 48058280 avail Mem


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15167 root      20   0  157840   2452   1568 R   0.3  0.0   2:10.93 top
    1 root      20   0   46084   6508   3904 S   0.0  0.0   0:08.16 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.04 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.37 ksoftirq+
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/+
    6 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kworker/+
    8 root      rt   0       0      0      0 S   0.0  0.0   0:00.14 migratio+
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
   10 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/0
   11 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/1
   12 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/2
   13 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/3
   14 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/4
   15 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/5
   16 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/6
   17 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/7
   18 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/8
   19 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/9

scole of TSBT
12-26-16, 02:07 PM
Is there anything in the BIOS related to components, like video, NIC or SATA, going to sleep, or being able to enable or disable that stuff?

Bok
12-26-16, 04:27 PM
I'll take another look in the BIOS. I had reset it to defaults prior to this install.

Oddly enough it's still responding on the attached monitor, waiting for a login, but it's almost as if the <enter> key is stuck as it lets you type one letter only. The clock still shows on the screen though, up to date.

Bok
12-26-16, 04:46 PM
Hmmm. Brought the GUI down to the command line (Ctrl+Alt+F2) and could log in. It had lost its network settings somehow. I stopped and disabled NetworkManager, then restarted network, and can log back in remotely.
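
For anyone following along, the NetworkManager steps above would look roughly like this on CentOS 7. It is a sketch, not a record of the exact commands typed: the run and nm_off names and the APPLY switch are made up for illustration, and it assumes the legacy network service from initscripts is installed. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the steps described above on CentOS 7 (needs root to apply).
# Prints the commands by default; set APPLY=1 to actually run them.
# run, nm_off and APPLY are illustrative names, not part of any tool.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

nm_off() {
    run systemctl stop NetworkManager      # stop it now
    run systemctl disable NetworkManager   # keep it from starting at boot
    run systemctl restart network          # bring interfaces back up via the legacy service
}

nm_off
```

Run once as-is to see what would be executed, then again with APPLY=1 as root from the local console, since the network drops briefly while the service restarts.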

I guess I'll leave it another day and see if it stays up.

MindCrime
01-05-17, 05:58 PM
Ever get this worked out? If it's still acting up I was going to suggest another OS and see if that could help you decide between hardware and software.

Bok
01-05-17, 08:52 PM
Yes, looks like NetworkManager is the culprit on CentOS 7. It's been running fine since.

I had tried a few other OSes but couldn't get the ones I had available to even load for some reason.

I was even about to go back to Gentoo at one stage, which was my go-to choice back in the day. These days I'm lazy and stick with the RH derivatives.