David Noriega
2010-Aug-12 16:58 UTC
[Lustre-discuss] Getting weird disk errors, no apparent impact
We just set up a Lustre system, and all looks good, but there is this
nagging error that keeps floating about. When I reboot any of the nodes, be
it an OSS or the MDS, I get this:

[root@meta1 ~]# dmesg | grep sdc
sdc : very big device. try to use READ CAPACITY(16).
SCSI device sdc: 4878622720 512-byte hdwr sectors (2497855 MB)
sdc: Write Protect is off
sdc: Mode Sense: 77 00 10 08
SCSI device sdc: drive cache: write back w/ FUA
sdc : very big device. try to use READ CAPACITY(16).
SCSI device sdc: 4878622720 512-byte hdwr sectors (2497855 MB)
sdc: Write Protect is off
sdc: Mode Sense: 77 00 10 08
SCSI device sdc: drive cache: write back w/ FUA
sdc:end_request: I/O error, dev sdc, sector 0
Buffer I/O error on device sdc, logical block 0
end_request: I/O error, dev sdc, sector 0

This doesn't seem to affect anything. fdisk -l doesn't even report the
device. The same thing (though of course with different block devices, sdd,
sde, on the OSSs) happens on all the nodes.

If I run pvdisplay or lvdisplay, I get this:
/dev/sdc: read failed after 0 of 4096 at 0: Input/output error

Any ideas?
David
--
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters
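A few read-only checks can show whether the failing /dev/sdc is merely one
leg of a dm-multipath map rather than the device LVM and Lustre actually
use. This is a sketch, assuming device-mapper multipath is running; the
device name sdc is taken from the dmesg output above and will differ per
host:

multipath -ll                        # list each multipath map and the sd* paths behind it
ls -l /dev/mapper/                   # the dm devices that LVM/Lustre should sit on
grep -E 'sdc|dm-' /proc/partitions   # confirm dm devices exist alongside the raw paths

If sdc shows up only as a path inside a map (or not at all), boot-time I/O
errors against the raw sd device are usually harmless noise from probing
the passive controller path.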
Wojciech Turek
2010-Aug-13 09:31 UTC
[Lustre-discuss] Getting weird disk errors, no apparent impact
Hi David,

I have seen similar errors given out by some storage arrays. They were
caused by arrays exporting volumes via more than a single path without the
multipath driver installed or configured properly. Sometimes the array
controllers require a special driver to be installed on the Linux host (for
example the RDAC mpp driver) to properly present and handle the configured
volumes in the OS. What sort of disk RAID array are you using?

Best regards,

Wojciech

On 12 August 2010 17:58, David Noriega <tsk133 at my.utsa.edu> wrote:
> We just set up a Lustre system, and all looks good, but there is this
> nagging error that keeps floating about. When I reboot any of the nodes,
> be it an OSS or the MDS, I get this:
> [...]
> Any ideas?
> David

--
Wojciech Turek

Senior System Architect

High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
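One quick way to test the multiple-path theory (a sketch using the
RHEL/CentOS 5 scsi_id syntax that also appears in the multipath.conf quoted
below; the sd names are examples): if two sd devices report the same WWID,
they are two paths to the same LUN and should only be accessed through
dm-multipath or the vendor RDAC/mpp driver, never directly.

# Print the SCSI WWID of each candidate block device; identical IDs
# mean the same LUN is visible down more than one path.
for d in sdb sdc sdd sde; do
    echo -n "$d: "
    /sbin/scsi_id -g -u -s /block/$d
done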
David Noriega
2010-Aug-13 15:51 UTC
[Lustre-discuss] Getting weird disk errors, no apparent impact
We have three Sun StorageTek 2150 arrays, one connected to the metadata
server and two cross-connected to the two data storage nodes. They are
connected via Fibre Channel using the qla2xxx driver that comes with CentOS
5.5. The multipath daemon has the following config:

defaults {
        udev_dir                /dev
        polling_interval        10
        selector                "round-robin 0"
        path_grouping_policy    multibus
        getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout            "/sbin/mpath_prio_rdac /dev/%n"
        path_checker            rdac
        rr_min_io               100
        max_fds                 8192
        rr_weight               priorities
        failback                immediate
        no_path_retry           fail
        user_friendly_names     yes
}

Commented out from the multipath.conf file:

blacklist {
        devnode "*"
}

On Fri, Aug 13, 2010 at 4:31 AM, Wojciech Turek <wjt27 at cam.ac.uk> wrote:
> Hi David,
>
> I have seen similar errors given out by some storage arrays. They were
> caused by arrays exporting volumes via more than a single path without
> the multipath driver installed or configured properly.
> [...]
> What sort of disk RAID array are you using?
>
> Best regards,
>
> Wojciech

--
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters
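If that configuration is changed (for example the path_checker suggested
later in this thread), the maps need to be rebuilt before the change takes
effect. A minimal sketch, assuming no multipath devices are currently
mounted or otherwise in use by Lustre:

service multipathd reload   # re-read /etc/multipath.conf
multipath -F                # flush the existing, unused maps
multipath -v2               # recreate the maps with the new settings
multipath -ll               # verify paths, priority groups and path states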
LaoTsao
2010-Aug-13 16:05 UTC
[Lustre-discuss] Getting weird disk errors, no apparent impact
You mean the STK 2540? IIRC one can download the drivers from the
Oracle/Sun site.

------- Original message -------
> From: David Noriega <tsk133 at my.utsa.edu>
> To: lustre-discuss at lists.lustre.org
> Sent: 13.8.'10, 11:51
>
> We have three Sun StorageTek 2150 arrays, one connected to the metadata
> server and two cross-connected to the two data storage nodes. They are
> connected via Fibre Channel using the qla2xxx driver that comes with
> CentOS 5.5. The multipath daemon has the following config:
> [...]
Wojciech Turek
2010-Aug-13 22:33 UTC
[Lustre-discuss] Getting weird disk errors, no apparent impact
Hi,

I don't think you should use the rdac path checker in your multipath.conf.
I would suggest using the tur path checker instead:

path_checker            tur

Best regards,

Wojciech

On 13 August 2010 16:51, David Noriega <tsk133 at my.utsa.edu> wrote:
> We have three Sun StorageTek 2150 arrays, one connected to the metadata
> server and two cross-connected to the two data storage nodes. They are
> connected via Fibre Channel using the qla2xxx driver that comes with
> CentOS 5.5. The multipath daemon has the following config:
> [...]

--
Wojciech Turek

Senior System Architect

High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
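If you try that, the change is this single line in the defaults section of
/etc/multipath.conf, replacing path_checker rdac (sketch only; check what
your array vendor recommends, then reload multipathd and rebuild the maps
as sketched earlier in the thread):

        path_checker            tur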
LaoTsao 老曹
2010-Aug-13 22:37 UTC
[Lustre-discuss] Getting weird disk errors, no apparent impact
https://cds.sun.com/is-bin/INTERSHOP.enfinity/WFS/CDS-CDS_SMI-Site/en_US/-/USD/ViewProductDetail-Start?ProductRef=RDACDVR60002500-09.03.0BC02.0013-LX-G-F at CDS-CDS_SMI

rdac

On 8/13/2010 12:05 PM, LaoTsao wrote:
> You mean the STK 2540? IIRC one can download the drivers from the
> Oracle/Sun site.
>
> ------- Original message -------
>> From: David Noriega <tsk133 at my.utsa.edu>
>> To: lustre-discuss at lists.lustre.org
>> Sent: 13.8.'10, 11:51
>>
>> We have three Sun StorageTek 2150 arrays, one connected to the metadata
>> server and two cross-connected to the two data storage nodes.
>> [...]
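The RDAC/mpp driver ships as a source tarball. Roughly, and only as a
sketch from memory -- the authoritative steps are in the README inside the
tarball and vary by driver and kernel version, and the filename below is a
placeholder:

# Unpack the source, then build and install the mppUpper/mppVhba modules.
tar xzf rdac-LINUX-*-source.tar.gz
cd linuxrdac-*
make clean && make && make install
# 'make install' builds a new initrd (mpp-`uname -r`.img); add a boot
# entry in grub.conf that uses it, reboot, then check that the mpp
# modules and virtual devices are present:
lsmod | grep mpp
ls /proc/mpp/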
Andrea Pieretti
2010-Aug-17 11:07 UTC
[Lustre-discuss] Same Directory but different files on different clients
Hi,
we have the following problem. In a cluster of ca. 300 nodes sharing space
via Lustre 1.8.1.1, we noticed the following weird behaviour.

On one client (let's call it Fred) a directory showed several files (we
issued lfs getstripe on them, an md5sum, a du -sh, etc.).
On several other clients (let's call them Barneys) the same directory was
empty.

We created a new file in this directory on Fred, but none of the Barneys
showed it. We then created a new file in this directory from one of the
Barneys, and on Fred everything disappeared.

It seems like a synchronization problem, but the REAL question is: why,
instead of the files on Fred being shown on all the nodes, did the Barneys
win the match and delete the data?

Nothing has been reported in the log files of the MDS or of the OSTs.

Best regards

Andrea
--
+-------------------------------------------------------
+ Andrea Pieretti
+ C.A.S.P.U.R.,
+ Via dei Tizii, 6b
+ 00185 Roma ITALY
+ tel: +39-06-44486712
+ mob: +39-328-4280841
+ fax: +39-06-4957083
+ e-mail: a.pieretti at caspur.it
+ http://www.caspur.it
+-------------------------------------------------------
+
+ "Nitwit! Blubber! Oddment! Tweak!"
+
+ Albus Percival Wulfric Brian Dumbledore
+-------------------------------------------------------
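A minimal comparison (not from the thread; the directory path below is a
placeholder) that can be run on Fred and on one of the Barneys to check
whether the two clients even resolve the path to the same object, and
whether stale client-side cache is involved:

# On each client, for the problem directory:
ls -lid /lustre/work/problemdir       # inode number as seen by this client
lfs getstripe /lustre/work/problemdir
# Force the client to drop cached dentries/pages and look the path up
# again from the MDS before re-checking:
sync
echo 3 > /proc/sys/vm/drop_caches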