free-dc: Drive crash



zombie67
07-17-14, 12:15 PM
Just FYI. Bok is aware, may not be able to address until tomorrow.

Mumps
07-17-14, 12:22 PM
Is that another one of those SSD drives?

Maxwell
07-17-14, 12:23 PM
Cool. Fortunately for me, I'll be obsessively tracking my WUProp hours for the next couple days rather than my stats on Free-DC, so this is convenient. And that's what matters - my convenience. :p

Bok
07-17-14, 12:35 PM
Managed to step out of my meeting.

Yes, it's an SSD. It's been fairly stable for a year or so since I implemented a round-robin approach. I have hot spares thanks to donations from last year (3 of them), but I'm in Dallas through tomorrow. I may be able to create the database copy on the existing 3rd SSD, though. I'll try that tonight.

Bok
07-17-14, 12:45 PM
I just hotspotted my laptop so I could log on. It might be worse: the other main SSD is also not responding now. That might point to the motherboard, as it seems too coincidental that both would go down at the same time. Trying a remote reboot.

Bok
07-17-14, 08:28 PM
UPDATE. Bizarrely, I remote rebooted the box and remounted the drives and they seem fine. fsck shows no issues. I've run the stats update on one side of the database and it's running on the other right now.

Not sure if this worries me more, but once this finishes I'll switch them back on.

DrPop
07-18-14, 01:08 AM
BOK, thank you for your efforts as always. If the server ends up needing anything, I think I can safely say we have your back. Please just keep us posted.

zombie67
07-18-14, 01:12 AM
Friendly reminder for our team: there is a donate button at the bottom of the left-hand column of all the stats pages.

Duke of Buckingham
07-18-14, 05:02 AM
I would like to help by posting the Donate link and image, but I don't know how to do that. If one of you can help me with this, I will post it on TeAm Stones of The Day.

Thank you.

Duke of Buckingham
07-18-14, 09:49 AM
Originally posted by Duke of Buckingham:
> I would like to help by posting the Donate link and image, but I don't know how to do that. If one of you can help me with this, I will post it on TeAm Stones of The Day.
>
> Thank you.

Just a note to say that I really don't know if it is possible or legal to add such a post on the thread.

Bok
07-21-14, 04:30 PM
Well, it just happened again, but at least I'm home this time.

I'm still puzzled as to what it is though. Rebooted and the drives again are just fine.

This is the log excerpt.


Jul 20 03:47:02 dbase kernel: imklog 4.6.2, log source = /proc/kmsg started.
Jul 20 03:47:02 dbase rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="1380" x-info="http://www.rsyslog.com"] (re)start
Jul 21 15:21:54 dbase kernel: ata8.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
Jul 21 15:21:54 dbase kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jul 21 15:21:54 dbase kernel: ata8.00: cmd 61/08:00:87:03:53/00:00:0c:00:00/40 tag 0 ncq 4096 out
Jul 21 15:21:54 dbase kernel: res 40/00:04:3f:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jul 21 15:21:54 dbase kernel: ata8.00: status: { DRDY }
Jul 21 15:21:54 dbase kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jul 21 15:21:54 dbase kernel: ata8.00: cmd 61/08:08:07:d5:52/00:00:0c:00:00/40 tag 1 ncq 4096 out
Jul 21 15:21:54 dbase kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 21 15:21:54 dbase kernel: ata8.00: status: { DRDY }
Jul 21 15:21:54 dbase kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jul 21 15:21:54 dbase kernel: ata8.00: cmd 61/08:10:4f:e5:52/00:00:0c:00:00/40 tag 2 ncq 4096 out
Jul 21 15:21:54 dbase kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 21 15:21:54 dbase kernel: ata8.00: status: { DRDY }
Jul 21 15:21:54 dbase kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jul 21 15:21:54 dbase kernel: ata8.00: cmd 61/08:18:9f:dc:52/00:00:0c:00:00/40 tag 3 ncq 4096 out
Jul 21 15:21:54 dbase kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 21 15:21:54 dbase kernel: ata8.00: status: { DRDY }
Jul 21 15:21:54 dbase kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jul 21 15:21:54 dbase kernel: ata8.00: cmd 61/08:20:f7:ec:52/00:00:0c:00:00/40 tag 4 ncq 4096 out
Jul 21 15:21:54 dbase kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 21 15:21:54 dbase kernel: ata8.00: status: { DRDY }
Jul 21 15:21:54 dbase kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jul 21 15:21:54 dbase kernel: ata8.00: cmd 61/08:28:a7:e4:52/00:00:0c:00:00/40 tag 5 ncq 4096 out
Jul 21 15:21:54 dbase kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 21 15:21:54 dbase kernel: ata8.00: status: { DRDY }
Jul 21 15:21:54 dbase kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jul 21 15:21:54 dbase kernel: ata8.00: cmd 61/08:30:1f:d9:52/00:00:0c:00:00/40 tag 6 ncq 4096 out
Jul 21 15:21:54 dbase kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 21 15:21:54 dbase kernel: ata8.00: status: { DRDY }
Jul 21 15:21:54 dbase kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jul 21 15:21:54 dbase kernel: ata8.00: cmd 61/10:38:bf:e4:52/00:00:0c:00:00/40 tag 7 ncq 8192 out
Jul 21 15:21:54 dbase kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 21 15:21:54 dbase kernel: ata8.00: status: { DRDY }
Jul 21 15:21:54 dbase kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jul 21 15:21:54 dbase kernel: ata8.00: cmd 61/08:40:4f:d7:52/00:00:0c:00:00/40 tag 8 ncq 4096 out
Jul 21 15:21:54 dbase kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 21 15:21:54 dbase kernel: ata8.00: status: { DRDY }
Jul 21 15:21:54 dbase kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jul 21 15:21:54 dbase kernel: ata8.00: cmd 61/08:48:67:da:52/00:00:0c:00:00/40 tag 9 ncq 4096 out
Jul 21 15:21:54 dbase kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
<snip>
Jul 21 15:22:41 dbase kernel: ata8.00: device reported invalid CHS sector 0
Jul 21 15:22:41 dbase kernel: ata8: hard resetting link
Jul 21 15:22:41 dbase kernel: ata8: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 21 15:22:41 dbase kernel: ata8: EH complete
Jul 21 15:22:41 dbase kernel: sd 7:0:0:0: [sdd] Unhandled error code
Jul 21 15:22:41 dbase kernel: sd 7:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 21 15:22:41 dbase kernel: sd 7:0:0:0: [sdd] CDB: Write(10): 2a 00 0c 52 fd 8f 00 00 08 00
Jul 21 15:22:41 dbase kernel: end_request: I/O error, dev sdd, sector 206765455
Jul 21 15:22:41 dbase kernel: Buffer I/O error on device sdd1, logical block 25845674
Jul 21 15:22:41 dbase kernel: lost page write due to I/O error on sdd1
Jul 21 15:22:41 dbase kernel: sd 7:0:0:0: [sdd] Unhandled error code
Jul 21 15:22:41 dbase kernel: sd 7:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 21 15:22:41 dbase kernel: sd 7:0:0:0: [sdd] CDB: Write(10): 2a 00 0c 52 db 6f 00 00 08 00
Jul 21 15:22:41 dbase kernel: end_request: I/O error, dev sdd, sector 206756719
Jul 21 15:22:41 dbase kernel: Buffer I/O error on device sdd1, logical block 25844582
Jul 21 15:22:41 dbase kernel: lost page write due to I/O error on sdd1
Jul 21 15:22:41 dbase kernel: sd 7:0:0:0: [sdd] Unhandled error code
Jul 21 15:22:41 dbase kernel: sd 7:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 21 15:22:41 dbase kernel: sd 7:0:0:0: [sdd] CDB: Write(10): 2a 00 0c 52 b2 bf 00 00 08 00
Jul 21 15:22:41 dbase kernel: end_request: I/O error, dev sdd, sector 206746303
Jul 21 15:22:41 dbase kernel: Buffer I/O error on device sdd1, logical block 25843280
Jul 21 15:22:41 dbase kernel: lost page write due to I/O error on sdd1
Jul 21 15:22:41 dbase kernel: sd 7:0:0:0: [sdd] Unhandled error code
Jul 21 15:22:41 dbase kernel: sd 7:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 21 15:22:41 dbase kernel: sd 7:0:0:0: [sdd] CDB: Write(10): 2a 00 0c 52 da 57 00 00 08 00
Jul 21 15:22:41 dbase kernel: end_request: I/O error, dev sdd, sector 206756439
Jul 21 15:22:41 dbase kernel: Buffer I/O error on device sdd1, logical block 25844547


Lots of that, then similar for sdc1.

I'm going to at least put in the other 2 drives I have spare right now.
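
For anyone reading the excerpt above, the numbers in it can be cross-checked by hand. The sketch below (Python, not something posted in the thread) decodes one of the logged Write(10) CDBs and one of the ata8.00 cmd lines. It assumes the standard SCSI WRITE(10) layout and the field order libata uses when it prints a failed taskfile; the function names are made up for illustration. The partition start of 63 sectors and the 4096-byte filesystem block are inferred from the logged sector and logical block numbers, not stated anywhere in the thread.

```python
SECTOR_BYTES = 512
FS_BLOCK_BYTES = 4096   # assumption: inferred from the "ncq 4096" write size
PARTITION_START = 63    # assumption: inferred from the logged sector/block pairs


def decode_write10(cdb_hex):
    """Return (lba, sector_count) from a WRITE(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    sectors = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, sectors


def decode_ncq_cmd(cmd_field):
    """Return (command, lba, sector_count, tag) from a libata 'cmd' field.

    Field order assumed: command/feature:nsect:lbal:lbam:lbah/
    hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device.
    """
    command, low, high, _device = cmd_field.split("/")
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in high.split(":")
    )
    # 48-bit LBA, most significant byte first.
    lba = int.from_bytes(
        bytes([hob_lbah, hob_lbam, hob_lbal, lbah, lbam, lbal]), "big"
    )
    sectors = (hob_feature << 8) | feature  # NCQ keeps the count in 'feature'
    tag = nsect >> 3                        # NCQ keeps the tag in nsect bits 7:3
    return int(command, 16), lba, sectors, tag


def fs_block(sector):
    """Map an absolute disk sector to a logical block on the partition."""
    return (sector - PARTITION_START) // (FS_BLOCK_BYTES // SECTOR_BYTES)


lba, sectors = decode_write10("2a 00 0c 52 fd 8f 00 00 08 00")
print(lba, sectors, fs_block(lba))  # 206765455 8 25845674

command, lba7, sectors7, tag7 = decode_ncq_cmd("61/10:38:bf:e4:52/00:00:0c:00:00/40")
print(hex(command), lba7, sectors7 * SECTOR_BYTES, tag7)
```

The decoded CDB address matches the sector in the end_request line that follows it, and the tag and byte count decoded from the cmd line match the "tag 7 ncq 8192" the kernel printed, so the log is at least internally consistent. That says nothing about the root cause; it only confirms which blocks the lost writes were aimed at.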

STE\/E
07-21-14, 04:46 PM
Again it looks like, Bok's on it though ... :)

Bok
07-21-14, 04:55 PM
Running through some tests, drives appear totally fine, smartctl shows no errors on any of the drives.

Got to be the board/controllers, right?

*EDIT* Could it possibly be a PSU issue?

Bryan
07-21-14, 05:08 PM
Originally posted by Bok:
> Running through some tests, drives appear totally fine, smartctl shows no errors on any of the drives.
>
> Got to be the board/controllers, right?
>
> *EDIT* Could it possibly be a PSU issue?

Aren't PSUs always guilty until proven innocent? :D

Mumps
07-21-14, 05:17 PM
Personally I'd be most suspicious of the controller. Possibly drivers for same. Do you keep the Linux patches current? May be a problem that's been seen by others or fixed. (Or a regression if you recently updated.)

I did have one BOINC/Crunching only system that kept doing something similar. It would declare the drive Read-Only after a flurry of I/O errors. I replaced the hard drive multiple times with ones I had on hand, yet the problem kept recurring. Only when I finally went from HD to SSD did the problem go away. (Knock on wood.) Didn't even help to try moving to a different SATA port. But still using the same SATA controller for the SSD without issue now for many months.

Bok
07-21-14, 05:30 PM
I only really update if there is anything that needs serious fixing. These have been running just fine since the last round of failures almost a year ago, though. I just yanked the machine out and it certainly needs a good clean, so I will do that at the same time I add two spare drives in tonight.

myshortpencil
07-21-14, 07:45 PM
Don't know if it will help, but when Free-DC starts going down, the stats that stop working first for me are the country stats from Portugal. The table won't load, yet individual and team stats do.

Bok
07-21-14, 08:02 PM
Originally posted by myshortpencil:
> Don't know if it will help, but when Free-DC starts going down, the stats that stop working first for me are the country stats from Portugal. The table won't load, yet individual and team stats do.

The pages are cached on the web server for a short time, so that's likely a manifestation of that only.
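
A short-lived page cache would produce exactly that pattern: pages rendered recently keep loading while the database is down, and anything not in the cache fails straight away. The sketch below is a generic time-to-live cache in Python, not Free-DC's actual web server code; the class name, the 60-second lifetime, and the page keys are all hypothetical.

```python
import time


class PageCache:
    """Keep rendered pages for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (time stored, page)

    def get(self, key, render):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: the database is not touched
        page = render()      # raises if the database is unavailable
        self._entries[key] = (now, page)
        return page


def render_ok():
    return "team stats table"


def render_db_down():
    raise ConnectionError("database unavailable")


fake_now = [0.0]  # hand-driven clock so the example needs no real waiting
cache = PageCache(ttl_seconds=60, clock=lambda: fake_now[0])

cache.get("team", render_ok)               # rendered while the database is up
fake_now[0] = 30.0
print(cache.get("team", render_db_down))   # served from cache despite the outage

try:
    cache.get("country/portugal", render_db_down)  # never cached, so it fails
except ConnectionError as exc:
    print("uncached page failed:", exc)
```

Once the cached copy passes its lifetime, the team page fails as well, which would fit with different pages dropping out at different times.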

Bryan
07-21-14, 08:26 PM
There is one minor problem in the reporting. If you look at Combined Team stats it is missing position/team #10. I noticed that a week ago and forgot about it, but it is still a problem :D I think that position is held by Taiwan.

Bok
07-21-14, 08:55 PM
Originally posted by Bryan:
> There is one minor problem in the reporting. If you look at Combined Team stats it is missing position/team #10. I noticed that a week ago and forgot about it, but it is still a problem :D I think that position is held by Taiwan.

Ah, good spot. This is a bug. It's in a secondary table which I use for the number of projects a team has done: a simple one with just teamname and count, recreated often from the main data.

BUT, and it's something I don't really like about MySQL, the default is for fields to be case-insensitive, and this is one of them. There is a BOINC@Taiwan team in 10th place and there is also a BOINC@TAIWAN. The latter was taking precedence and being created in this table; the former was not, as it appeared to be the same. The SQL for listing the top teams joins to this table and didn't find an entry for BOINC@Taiwan, so it did not display it.

Fixed, so it should show up fairly soon (as long as the server holds up!)
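
The effect described here can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, made so that it runs anywhere), with SQLite's NOCASE collation playing the role of MySQL's case-insensitive default. The table, column, and project names are hypothetical; the thread shows neither the real schema nor the actual fix.

```python
import sqlite3

TEAMS = ("BOINC@Taiwan", "BOINC@TAIWAN")


def build_counts(collation):
    """Count projects per team name, grouping under the given collation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE team_project (teamname TEXT, project TEXT)")
    conn.executemany(
        "INSERT INTO team_project VALUES (?, ?)",
        [
            (TEAMS[0], "project_a"),
            (TEAMS[0], "project_b"),
            (TEAMS[1], "project_a"),
        ],
    )
    rows = conn.execute(
        "SELECT teamname, COUNT(*) FROM team_project "
        f"GROUP BY teamname COLLATE {collation}"
    ).fetchall()
    conn.close()
    return dict(rows)


insensitive = build_counts("NOCASE")  # the two spellings collapse into one row
sensitive = build_counts("BINARY")    # each spelling keeps its own row

# An exact-match lookup (standing in for the join that came up empty)
# finds only whichever spelling ended up in the collapsed table.
missing = [team for team in TEAMS if team not in insensitive]
print(len(insensitive), len(sensitive), len(missing))  # 1 2 1
```

Which of the two spellings survives the case-insensitive grouping is up to the database engine, so one team can silently shadow the other until the comparison is made case-sensitive.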

Duke of Buckingham
07-22-14, 02:46 AM
The team forum and Free-DC take an eternity to load, at least for me.

Fire$torm
07-22-14, 03:26 AM
Is the controller on-board or expansion bus type? If it's an on-board unit, via the Northbridge, it could be a heat issue, especially if it's X58 chipset or earlier. If the controller is bus type, then it could be component failure. I'd also look over PSU connectors that feed the MB. A visual inspection might reveal oxidized connectors.

DrPop
07-22-14, 11:17 AM
BOK, did you change anything yesterday evening or last night? For some reason the pages are loading way faster for me this morning. :-bd

E-30
07-22-14, 11:18 AM
Yeah, seems better now

Bok
07-22-14, 11:21 AM
I'm monitoring. The only thing I did was to move one of the databases across to a different drive. No errors so far, but then I got none from Thursday evening through yesterday afternoon either, so we shall see...

Duke of Buckingham
07-23-14, 07:08 AM
Originally posted by Fire$torm:
> Is the controller on-board or expansion bus type? If it's an on-board unit, via the Northbridge, it could be a heat issue, especially if it's X58 chipset or earlier. If the controller is bus type, then it could be component failure. I'd also look over PSU connectors that feed the MB. A visual inspection might reveal oxidized connectors.

I did a cleanup on my Mint install and lost some of the controllers. I am trying to fix it all, and things are getting better now. Sometimes the computer seems lost on some Internet pages. The MB and PSU are about 6 months old, so I don't think the problem is there, but I looked anyway. Heat may have some influence, since I was using hyperthreading and overclocking the computer.

Aha, the CPU is an i7 with an on-board GPU. I need to check whether there is any problem with the drivers.

Thanks F$.