David Noriega
2010-Aug-12 16:58 UTC
[Lustre-discuss] Getting weird disk errors, no apparent impact
We just set up a Lustre system, and all looks good, but there is this
nagging error that keeps floating about. When I reboot any of the nodes, be
it an OSS or the MDS, I get this:

[root@meta1 ~]# dmesg | grep sdc
sdc : very big device. try to use READ CAPACITY(16).
SCSI device sdc: 4878622720 512-byte hdwr sectors (2497855 MB)
sdc: Write Protect is off
sdc: Mode Sense: 77 00 10 08
SCSI device sdc: drive cache: write back w/ FUA
sdc : very big device. try to use READ CAPACITY(16).
SCSI device sdc: 4878622720 512-byte hdwr sectors (2497855 MB)
sdc: Write Protect is off
sdc: Mode Sense: 77 00 10 08
SCSI device sdc: drive cache: write back w/ FUA
sdc:end_request: I/O error, dev sdc, sector 0
Buffer I/O error on device sdc, logical block 0
end_request: I/O error, dev sdc, sector 0

This doesn't seem to affect anything. fdisk -l doesn't even report the
device. The same thing (though of course with different block devices, sdd,
sde, on the OSSs) happens on all the nodes.

If I run pvdisplay or lvdisplay, I get this:
/dev/sdc: read failed after 0 of 4096 at 0: Input/output error

Any ideas?
David
--
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters
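A few read-only checks can show whether the failing /dev/sdc is merely one
leg of a dm-multipath map rather than the device LVM and Lustre actually
use. This is a sketch, assuming device-mapper multipath is running; the
device name sdc is taken from the dmesg output above and will differ per
host:

multipath -ll                        # list each multipath map and the sd* paths behind it
ls -l /dev/mapper/                   # the dm devices that LVM/Lustre should sit on
grep -E 'sdc|dm-' /proc/partitions   # confirm dm devices exist alongside the raw paths

If sdc shows up only as a path inside a map (or not at all), boot-time I/O
errors against the raw sd device are usually harmless noise from probing
the passive controller path.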
Wojciech Turek
2010-Aug-13 09:31 UTC
[Lustre-discuss] Getting weird disk errors, no apparent impact
Hi David,

I have seen similar errors given out by some storage arrays. They were
caused by arrays exporting volumes via more than a single path without the
multipath driver installed or configured properly. Sometimes the array
controllers require a special driver to be installed on the Linux host (for
example the RDAC mpp driver) to properly present and handle the configured
volumes in the OS. What sort of disk RAID array are you using?

Best regards,

Wojciech

On 12 August 2010 17:58, David Noriega <tsk133 at my.utsa.edu> wrote:
> We just set up a Lustre system, and all looks good, but there is this
> nagging error that keeps floating about. When I reboot any of the nodes,
> be it an OSS or the MDS, I get this:
> [...]
> Any ideas?
> David

--
Wojciech Turek

Senior System Architect

High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
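One quick way to test the multiple-path theory (a sketch using the
RHEL/CentOS 5 scsi_id syntax that also appears in the multipath.conf quoted
below; the sd names are examples): if two sd devices report the same WWID,
they are two paths to the same LUN and should only be accessed through
dm-multipath or the vendor RDAC/mpp driver, never directly.

# Print the SCSI WWID of each candidate block device; identical IDs
# mean the same LUN is visible down more than one path.
for d in sdb sdc sdd sde; do
    echo -n "$d: "
    /sbin/scsi_id -g -u -s /block/$d
done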
David Noriega
2010-Aug-13 15:51 UTC
[Lustre-discuss] Getting weird disk errors, no apparent impact
We have three Sun StorageTek 2150 arrays, one connected to the metadata
server and two cross-connected to the two data storage nodes. They are
connected via Fibre Channel using the qla2xxx driver that comes with CentOS
5.5. The multipath daemon has the following config:

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            "/sbin/mpath_prio_rdac /dev/%n"
        path_checker            rdac
        rr_min_io               100
        max_fds                 8192
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
}

Commented out from the multipath.conf file:

blacklist {
        devnode "*"
}

On Fri, Aug 13, 2010 at 4:31 AM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> Hi David,
>
> I have seen similar errors given out by some storage arrays. They were
> caused by arrays exporting volumes via more than a single path without
> the multipath driver installed or configured properly.
> [...]
> What sort of disk RAID array are you using?
>
> Best regards,
>
> Wojciech

--
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters
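If that configuration is changed (for example the path_checker suggested
later in this thread), the maps need to be rebuilt before the change takes
effect. A minimal sketch, assuming no multipath devices are currently
mounted or otherwise in use by Lustre:

service multipathd reload   # re-read /etc/multipath.conf
multipath -F                # flush the existing, unused maps
multipath -v2               # recreate the maps with the new settings
multipath -ll               # verify paths, priority groups and path states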
LaoTsao
2010-Aug-13 16:05 UTC
[Lustre-discuss] Getting weird disk errors, no apparent impact
You mean the STK 2540? IIRC one can download the drivers from the
Oracle/Sun site.

------- Original message -------
> From: David Noriega <tsk133 at my.utsa.edu>
> To: lustre-discuss at lists.lustre.org
> Sent: 13.8.'10, 11:51
>
> We have three Sun StorageTek 2150 arrays, one connected to the metadata
> server and two cross-connected to the two data storage nodes. They are
> connected via Fibre Channel using the qla2xxx driver that comes with
> CentOS 5.5. The multipath daemon has the following config:
> [...]
Wojciech Turek
2010-Aug-13 22:33 UTC
[Lustre-discuss] Getting weird disk errors, no apparent impact
Hi,

I don't think you should use the rdac path checker in your multipath.conf.
I would suggest using the tur path checker instead:

path_checker            tur

Best regards,

Wojciech

On 13 August 2010 16:51, David Noriega <tsk133 at my.utsa.edu> wrote:
> We have three Sun StorageTek 2150 arrays, one connected to the metadata
> server and two cross-connected to the two data storage nodes. They are
> connected via Fibre Channel using the qla2xxx driver that comes with
> CentOS 5.5. The multipath daemon has the following config:
> [...]

--
Wojciech Turek

Senior System Architect

High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
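If you try that, the change is this single line in the defaults section of
/etc/multipath.conf, replacing path_checker rdac (sketch only; check what
your array vendor recommends, then reload multipathd and rebuild the maps
as sketched earlier in the thread):

        path_checker            tur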
LaoTsao 老曹
2010-Aug-13 22:37 UTC
[Lustre-discuss] Getting weird disk errors, no apparent impact
https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_SMI-Site/en_US/-/USD/ViewProductDetail-Start?ProductRef=RDACDVR60002500-09.03.0BC02.0013-LX-G-F at CDS-CDS_SMI

rdac

On 8/13/2010 12:05 PM, LaoTsao wrote:
> You mean the STK 2540? IIRC one can download the drivers from the
> Oracle/Sun site.
>
> ------- Original message -------
>> From: David Noriega <tsk133 at my.utsa.edu>
>> To: lustre-discuss at lists.lustre.org
>> Sent: 13.8.'10, 11:51
>>
>> We have three Sun StorageTek 2150 arrays, one connected to the metadata
>> server and two cross-connected to the two data storage nodes.
>> [...]
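The RDAC/mpp driver ships as a source tarball. Roughly, and only as a
sketch from memory -- the authoritative steps are in the README inside the
tarball and vary by driver and kernel version, and the filename below is a
placeholder:

# Unpack the source, then build and install the mppUpper/mppVhba modules.
tar xzf rdac-LINUX-*-source.tar.gz
cd linuxrdac-*
make clean && make && make install
# 'make install' builds a new initrd (mpp-`uname -r`.img); add a boot
# entry in grub.conf that uses it, reboot, then check that the mpp
# modules and virtual devices are present:
lsmod | grep mpp
ls /proc/mpp/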
Andrea Pieretti
2010-Aug-17 11:07 UTC
[Lustre-discuss] Same Directory but different files on different clients
Hi,
we have the following problem. In a cluster of ca. 300 nodes sharing space
via Lustre 1.8.1.1, we noticed the following weird behaviour.

On one client (let's call it Fred) a directory showed several files (we
issued lfs getstripe on them, an md5sum, a du -sh, etc.).
On several other clients (let's call them Barneys) the same directory was
empty.

We created a new file in this directory on Fred, but none of the Barneys
showed it. We then created a new file in this directory from one of the
Barneys, and on Fred everything disappeared.

It seems like a synchronization problem, but the REAL question is: why,
instead of the files on Fred being shown on all the nodes, did the Barneys
win the match and delete the data?

Nothing has been reported in the log files of the MDS or of the OSTs.

Best regards

Andrea
--
+-------------------------------------------------------
+ Andrea Pieretti
+ C.A.S.P.U.R.,
+ Via dei Tizii, 6b
+ 00185 Roma ITALY
+ tel: +39-06-44486712
+ mob: +39-328-4280841
+ fax: +39-06-4957083
+ e-mail: a.pieretti at caspur.it
+ http://www.caspur.it
+-------------------------------------------------------
+
+ "Nitwit! Blubber! Oddment! Tweak!"
+
+ Albus Percival Wulfric Brian Dumbledore
+-------------------------------------------------------
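A minimal comparison (not from the thread; the directory path below is a
placeholder) that can be run on Fred and on one of the Barneys to check
whether the two clients even resolve the path to the same object, and
whether stale client-side cache is involved:

# On each client, for the problem directory:
ls -lid /lustre/work/problemdir       # inode number as seen by this client
lfs getstripe /lustre/work/problemdir
# Force the client to drop cached dentries/pages and look the path up
# again from the MDS before re-checking:
sync
echo 3 > /proc/sys/vm/drop_caches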