I have a Linux system (kernel 2.6.32.16) with two 500 GB disks
configured as mirrored disks (raid1) using the md driver.
That has previously worked flawlessly.
However, if I now copy a large amount of data (e.g., 200 files of
about 5 MB each) to that md disk, the copying stalls several times for
a long while (on the order of a minute), and then suddenly resumes.
In syslog I found several occurrences of the following:
>Sep 29 19:27:02 nuser kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
>Sep 29 19:27:02 nuser kernel: ata2.00: port_status 0x20200000
>Sep 29 19:27:02 nuser kernel: ata2.00: failed command: FLUSH CACHE EXT
>Sep 29 19:27:02 nuser kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>Sep 29 19:27:02 nuser kernel: ata2.00: status: { DRDY ERR }
>Sep 29 19:27:02 nuser kernel: ata2.00: error: { ABRT }
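To see how often this happens and on which port, the "failed command" lines can be counted per ATA device. The following is only a minimal sketch: the regular expression is written against the log format quoted above, and the function name and sample list are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: ata2.00: failed command: FLUSH CACHE EXT"
FAILED_CMD = re.compile(r"kernel: (ata\d+\.\d+): failed command: (.+?)\s*$")

def count_failed_commands(lines):
    """Return a Counter keyed by (ATA device, command name)."""
    counts = Counter()
    for line in lines:
        match = FAILED_CMD.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

# Three of the lines quoted above, used as sample input.
sample = [
    "Sep 29 19:27:02 nuser kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0",
    "Sep 29 19:27:02 nuser kernel: ata2.00: failed command: FLUSH CACHE EXT",
    "Sep 29 19:27:02 nuser kernel: ata2.00: error: { ABRT }",
]
for (device, command), n in count_failed_commands(sample).items():
    print(device, command, n)
```

In practice the lines would come from the syslog file rather than from a list. If every failure is reported on the same ataN.MM device, that narrows the search to one port and the drive attached to it.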
But the md driver still has both drives active in the raid array,
e2fsck finds no problems, a raid consistency check finds no
differences between the two disks, and the 200 files compare equal to
the ones they were copied from. So I suspect that the driver has
eventually succeeded in writing the data correctly to both disks.
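For reference, a raid consistency check of this kind can be driven through the md sysfs files `sync_action` and `mismatch_cnt`. The sketch below assumes the array is called md0 (the real name is not given here); the function names are hypothetical, and starting a check requires root.

```python
from pathlib import Path

def start_check(md_name="md0", sysfs_root="/sys/block"):
    """Ask md to start a read-only consistency check (needs root)."""
    path = Path(sysfs_root) / md_name / "md" / "sync_action"
    path.write_text("check\n")

def read_mismatch_count(md_name="md0", sysfs_root="/sys/block"):
    """Return mismatch_cnt; 0 means the last check found no differences."""
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())

# Typical use (md0 is an assumed array name):
#   start_check("md0")
#   ... wait until sync_action reads "idle" again ...
#   print(read_mismatch_count("md0"))
```

The `sysfs_root` parameter exists only so the functions can be tried against a scratch directory instead of the live system.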
When there is no intensive disk writing, there seem to be no problems
at all.
But why does this happen? Is it a problem with disk 2 or with the
SATA controller (a Promise PCI card)? Or with something else, such as
the power supply?
The strange thing is that this system has been running without
problems for a long time; I replaced the power supply (due to a
failing fan) in February, but since then there have been no hardware
changes or kernel changes or problems until this started a couple of
weeks ago.
Any help would be greatly appreciated. At the moment I cannot figure
out whether I need to replace disk2 or the SATA controller or the power
supply or something else.
--
Jesper Dybdal, Denmark.
http://www.dybdal.dk (in Danish).
Posted by jdunetnospam, 10/5/2011 8:39:14 PM
On 2011-10-05, Jesper Dybdal <jdunetnospam@u10.dybdal.dk> wrote:
> I have a Linux system (kernel 2.6.32.16) with two 500 GB disks
> configured as mirrored disks (raid1) using the md driver.
>
> That has previously worked flawlessly.
>
> However, if I now copy a large amount of data (e.g., 200 files of
> about 5 MB each) to that md disk, the copying stops several times for
> a long while (in the order of a minute), and then suddenly proceeds.
>
> In syslog I found several occurrences of the following:
>
>>Sep 29 19:27:02 nuser kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
>>Sep 29 19:27:02 nuser kernel: ata2.00: port_status 0x20200000
>>Sep 29 19:27:02 nuser kernel: ata2.00: failed command: FLUSH CACHE EXT
>>Sep 29 19:27:02 nuser kernel: ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>>Sep 29 19:27:02 nuser kernel: ata2.00: status: { DRDY ERR }
>>Sep 29 19:27:02 nuser kernel: ata2.00: error: { ABRT }
>
> But the md driver still has both drives active in the raid array,
> e2fsck finds no problems, a raid consistency check finds no
> differences between the two disks, and the 200 files compare equal to
> the ones they were copied from. So I suspect that the driver has
> eventually succeeded in writing the data correctly to both disks.
>
> When there is no intensive disk writing, there seem to be no problems
> at all.
>
> But why does this happen? Is it a problem with disk 2 or with the
> SATA controller (a Promise PCI card)? Or with something else, such as
> the power supply?
It looks suspiciously like a drive issue. I notice that you don't state
whether you used the drive vendor's testing utilities to test the drive;
that'd be the next logical step to take. Short of doing that, you might
see what smartmontools says about the drives. If you do suspect the
controller, you'd want to test the other components, especially the
drives, hanging off of a different controller or in a different machine
altogether.
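The smartmontools checks could be scripted roughly as follows. This is a sketch only: /dev/sdb is an assumed device name, and the helper names are made up. The smartctl options used are -H (overall health), -l error (error log) and -t long (start an extended self-test).

```python
import subprocess

def smartctl_commands(device):
    """Build the smartctl invocations for a basic drive check."""
    return [
        ["smartctl", "-H", device],           # overall health assessment
        ["smartctl", "-l", "error", device],  # logged ATA errors
        ["smartctl", "-t", "long", device],   # start an extended self-test
    ]

def run_checks(device):
    """Run each command and return (argv, exit status, output) tuples."""
    results = []
    for argv in smartctl_commands(device):
        # check=False: smartctl uses its exit status as a bit mask and
        # returns non-zero when, for example, the error log is not empty.
        proc = subprocess.run(argv, capture_output=True, text=True, check=False)
        results.append((argv, proc.returncode, proc.stdout))
    return results

# Typical use (needs root; /dev/sdb is an assumed device name):
#   for argv, status, output in run_checks("/dev/sdb"):
#       print(" ".join(argv), "->", status)
#       print(output)
```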
--keith
--
kkeller-usenet@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information
Posted by kkeller-usenet, 10/5/2011 9:58:59 PM
Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> wrote:
>It looks suspiciously like a drive issue. I notice that you don't state
>whether you used the drive vendor's testing utilities to test the drive;
>that'd be the next logical step to take.
Thanks for your response.
You're right - I will do that.
>Short of doing that, you might
>see what smartmontools says about the drives.
I've now upgraded my (obviously too old) smartctl, and it now does say
something. It has logged 16 errors on that drive. They all look like
this:
>Error 16 occurred at disk power-on lifetime: 29142 hours (1214 days + 6 hours)
> When the command that caused the error occurred, the device was active or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 04 51 00 38 df f7 a7
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
> -- -- -- -- -- -- -- -- ---------------- --------------------
> ea 00 00 00 00 00 26 08 20:49:17.284 FLUSH CACHE EXT
> 35 00 48 df 83 02 00 08 20:49:17.281 WRITE DMA EXT
> ca 00 08 97 01 e4 09 08 20:49:17.263 WRITE DMA
> ca 00 08 27 01 e4 09 08 20:49:17.262 WRITE DMA
> ca 00 08 ff 40 ca 09 08 20:49:17.262 WRITE DMA
One possibly interesting point is that all 5 errors for which SMART
has kept the details have the same values of CL, CH, and DH (location on
the disk). On the other hand, if it were just a bad sector, I would
have expected a new sector to have been assigned to replace it.
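The register dump can be decoded bit by bit, which shows how it lines up with the "{ DRDY ERR }" and "{ ABRT }" lines in syslog. The sketch below uses the standard ATA status and error bit names. Treating SN, CL, CH and the low nibble of DH as a 28-bit LBA is an assumption: FLUSH CACHE EXT is a 48-bit command, so this summary log may not hold the full address.

```python
# ATA status register bits.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF",
               0x10: "DSC", 0x08: "DRQ", 0x01: "ERR"}
# ATA error register bits (subset).
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

def decode_bits(value, names):
    """Return the names of all bits that are set in value."""
    return [name for bit, name in names.items() if value & bit]

def lba28(sn, cl, ch, dh):
    """Combine the registers into a 28-bit LBA (assumed layout)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Values from the error entry above: ER=04 ST=51 SN=38 CL=df CH=f7 DH=a7
print("status:", decode_bits(0x51, STATUS_BITS))
print("error: ", decode_bits(0x04, ERROR_BITS))
print("lba28: ", hex(lba28(0x38, 0xDF, 0xF7, 0xA7)))
```

The status value includes DSC as well as DRDY and ERR, while the kernel message lists only DRDY and ERR; the ABRT bit in the error register matches the kernel's "{ ABRT }" line.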
In case anybody happens to know about problems with that specific disk
type, here are the details:
>Model Family: Western Digital RE2 Serial ATA
>Device Model: WDC WD5001ABYS-01YNA0
>Serial Number: WD-WCAS85509795
>LU WWN Device Id: 5 0014ee 256763869
>Firmware Version: 59.01D01
--
Jesper Dybdal, Denmark.
http://www.dybdal.dk (in Danish).
Posted by jdunetnospam, 10/6/2011 5:51:00 PM
I wrote:
>>Error 16 occurred at disk power-on lifetime: 29142 hours (1214 days + 6 hours)
>> When the command that caused the error occurred, the device was active or idle.
>>
>> After command completion occurred, registers were:
>> ER ST SC SN CL CH DH
>> -- -- -- -- -- -- --
>> 04 51 00 38 df f7 a7
>>
>> Commands leading to the command that caused the error were:
>> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
>> -- -- -- -- -- -- -- -- ---------------- --------------------
>> ea 00 00 00 00 00 26 08 20:49:17.284 FLUSH CACHE EXT
>> 35 00 48 df 83 02 00 08 20:49:17.281 WRITE DMA EXT
>> ca 00 08 97 01 e4 09 08 20:49:17.263 WRITE DMA
>> ca 00 08 27 01 e4 09 08 20:49:17.262 WRITE DMA
>> ca 00 08 ff 40 ca 09 08 20:49:17.262 WRITE DMA
I have now replaced that disk, and the problem seems to have
disappeared. So it was a problem with the disk.
--
Jesper Dybdal, Denmark.
http://www.dybdal.dk (in Danish).
Posted by jdunetnospam, 10/16/2011 9:40:56 PM