Hi,

I am a neophyte trying to benchmark new hardware for a Lustre setup.
I got the Lustre IOkit noarch RPM and installed it onto the OSS.  I
edited the file /usr/bin/sgpdd-survey (attached) to match the
description of our OSS (RAM is 16 GB, sample new hardware detected at
/dev/sdg).

My initial run of sgpdd-survey produced the following errors:

[root@oss4 ~]# sgpdd-survey
Can't find SG device for /dev/sdg1, testing for partition
/usr/bin/sgpdd-survey: line 105: CAPACITY trying 0x200: syntax error in
expression (error token is "trying 0x200")
/usr/bin/sgpdd-survey: line 106: [: ==: unary operator expected
Tue Jul 22 14:46:48 EDT 2008 sgpdd-survey on /dev/sdg1 from oss4.crew.local
/usr/bin/sgpdd-survey: line 134: rsz*1024/bs: division by 0 (error token is "s")

System info:

[root@oss4 ~]# uname -a
Linux oss4.crew.local 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1 SMP Sun Feb 17 08:38:44 EST 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@oss4 ~]# cat /etc/redhat-release
CentOS release 5 (Final)

The physical device is there.  I did format it into two 7 TB partitions
just to give it an early test.  Do I have to un-format it for
sgpdd-survey?

[root@oss4 ~]# parted /dev/sdg
GNU Parted 1.8.1
Using /dev/sdg
Welcome to GNU Parted!  Type 'help' to view a list of commands.
(parted) print

Model: LSI MegaRAID 8888ELP (scsi)
Disk /dev/sdg: 14.0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name  Flags
 1      17.4kB  7000GB  7000GB  ext3         sdg1
 2      7000GB  14.0TB  6986GB  ext3         sdg2

(parted) q
Information: Don't forget to update /etc/fstab, if necessary.

Any advice for testing a new system with sgpdd-survey?

Advice appreciated!
megan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: OSS4.sgpdd-survey.sh
Type: application/x-sh
Size: 6081 bytes
URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20080722/942035ed/attachment-0001.sh
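For readers following along: the edits megan describes go in the tunable block near the top of /usr/bin/sgpdd-survey.  The sketch below shows roughly what that block looks like in lustre-iokit of this vintage; only scsidevs and the rsz/crg/thr ranges are confirmed later in this thread, so the other variable names and the size value are assumptions to be checked against the installed script.

  # Sketch of the tunable section of /usr/bin/sgpdd-survey (illustrative only;
  # verify the names against your installed copy of lustre-iokit).
  scsidevs="/dev/sdg"      # block device(s) to survey, space-separated
  size=8192                # data to transfer per device, in MB (example value)
  rszlo=1024 rszhi=1024    # record size in kB; 1024 kB matches the 1 MB Lustre RPC
  crglo=1    crghi=64      # number of concurrent regions, low/high
  thrlo=1    thrhi=64      # number of threads, low/high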
I think it wants a raw device, not a partition.  Also, I noticed you're
using two 7 TB partitions on a single 14 TB LUN.  You're going to really
hurt your performance by doing it that way.  Just my $.02.

On Tue, Jul 22, 2008 at 3:03 PM, Ms. Megan Larko <dobsonunit at gmail.com> wrote:
> Hi,
>
> I am a neophyte trying to benchmark new hardware for a Lustre setup.
> I got the Lustre IOkit noarch RPM and installed it onto the OSS.  I
> edited the file /usr/bin/sgpdd-survey (attached) to match the
> description of our OSS (RAM is 16 GB, sample new hardware detected at
> /dev/sdg).
> [...]
> Any advice for testing a new system with sgpdd-survey?
>
> Advice appreciated!
> megan
On Tue, 2008-07-22 at 15:03 -0400, Ms. Megan Larko wrote:
> Hi,

Hi,

> [root@oss4 ~]# sgpdd-survey
> Can't find SG device for /dev/sdg1, testing for partition

Did you load the sg module?

b.
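Checking for the sg driver and its device mappings is straightforward; these are stock commands (sg_map comes from the sg3_utils package) and are shown here only as a quick checklist:

  # Is the SCSI generic (sg) driver loaded?  Load it if not.
  lsmod | grep -w sg || modprobe sg

  # Which /dev/sgN node maps to which /dev/sdX block device?
  sg_map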
> On Tue, Jul 22, 2008 at 3:03 PM, Ms. Megan Larko <dobsonunit at gmail.com> wrote:

I received info for a fix.  Comments in-line.  (megan)

>> My initial running of sgpdd-survey produced the following errors:
>> [root@oss4 ~]# sgpdd-survey
>> Can't find SG device for /dev/sdg1, testing for partition
>> /usr/bin/sgpdd-survey: line 105: CAPACITY trying 0x200: syntax error in
>> expression (error token is "trying 0x200")
>> /usr/bin/sgpdd-survey: line 106: [: ==: unary operator expected
>> Tue Jul 22 14:46:48 EDT 2008 sgpdd-survey on /dev/sdg1 from oss4.crew.local
>> /usr/bin/sgpdd-survey: line 134: rsz*1024/bs: division by 0 (error token is "s")

The fix to run /usr/bin/sgpdd-survey was to add an argument to the two
instances of sg_readcap that appear in the script.

In /usr/bin/sgpdd-survey:

ORIG Ex: bs=$((`sg_readcap -b ${devs[0]} | awk '{print $2}'`))
NEW Ex:  bs=$((`sg_readcap -b -16 ${devs[0]} | awk '{print $2}'`))

This can be tested on the command line:

[root@oss4 ~]# sg_readcap -b -16 /dev/sg5

I have not yet gotten any further with my benchmark testing.  (Other
fires to be extinguished today.)

Thanks to Cliff White and Atul Vidwansa for the solution information.

Later,
megan
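The same change can be applied mechanically; the sed one-liner below is a sketch (back the script up first).  The -16 option makes sg_readcap issue READ CAPACITY(16) instead of READ CAPACITY(10), whose 32-bit block count tops out at 2 TB with 512-byte sectors, which is why the original calls choke on LUNs this large:

  cp /usr/bin/sgpdd-survey /usr/bin/sgpdd-survey.orig
  sed -i 's/sg_readcap -b /sg_readcap -b -16 /g' /usr/bin/sgpdd-survey

  # Sanity check against one of the sg devices:
  sg_readcap -b -16 /dev/sg5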
On Jul 24, 2008  11:56 -0400, Aaron Knister wrote:
> I think it wants a raw device, not a partition.  Also, I noticed you're
> using two 7 TB partitions on a single 14 TB LUN.  You're going to really
> hurt your performance by doing it that way.  Just my $.02.

In fact, Lustre filesystems larger than 8 TB are known to have problems
and should not be used.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Jul 24, 2008  14:52 -0400, Ms. Megan Larko wrote:
> Okay.  So I have a JBOD with 16 1 TB Hitachi Ultrastar SATA drives in
> it connected to the OSS via an LSI 8888ELP controller card (with
> battery back-up).  The JBOD is passed to the OSS as raw space.  My
> partitions cannot exceed 8 TB, but splitting into two 7 TB will hurt
> performance....

The ideal layout would be to have two RAID-5 arrays with 8 data + 1 parity
disks using 64kB or 128kB RAID chunk size, but you are short two disks...

You may also consider using RAID-6 with 6 data + 2 parity disks.  Having
the RAID stripe width be 15+1 is probably quite bad because a single
1MB RPC will always need to recalculate the parity for that IO.

> What do I do for Lustre set-up?  I thought the fewer partitions the
> better because one has less "overhead" space.  Do I put them out as
> 16 single 1 TB partitions?  That seems like extra work for a file
> system to track.

Please see section 10.1 in the Lustre manual for more tips:
http://manual.lustre.org/manual/LustreManual16_HTML/RAID.html

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
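The reasoning behind those numbers can be checked with simple arithmetic: Lustre issues 1 MB RPCs, so you want (number of data disks) x (chunk size) to come out to exactly 1 MB, letting each RPC fill a whole RAID stripe and the controller compute parity once instead of doing a read-modify-write.  A trivial check, using the figures suggested above:

  data_disks=8; chunk_kb=128
  echo "full stripe = $((data_disks * chunk_kb)) kB (want 1024 kB = one RPC)"
  # 8 data disks * 128 kB chunk = 1024 kB: a 1 MB RPC covers one full stripe.
  # With 15 data disks, no power-of-two chunk size gives 1024 kB, so every
  # 1 MB RPC forces a partial-stripe write and a parity recalculation.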
Hello,

I'm back to working on benchmarking the hardware prior to the Lustre
filesystem installation.

I have used the LSI 8888ELP WebBIOS utility to destroy the previous
RAID configuration on one of our 16-bay JBOD units.  A new set of RAID
arrays was set up for benchmarking purposes: one RAID-6 with its parity
disks, one RAID-5 with its parity disk, and one single spindle passed
directly to the OSS.  The boot of CentOS 5
(2.6.18-53.1.13.el5_lustre.1.6.4.3smp) detects these new block devices
as /dev/sdh, /dev/sdi and /dev/sdj respectively.  The "sg_map" command
detects the raw units.  The lustre-iokit is still unable to find them.
The module "sg" is indeed loaded, as is megaraid_sas for the LSI
8888ELP card.  Can lustre-iokit actually benchmark anything on an LSI
8888ELP card?  Should I use another tool like IOzone?

Thanks,
megan

From dmesg:

sd 2:2:1:0: Attached scsi generic sg12 type 0
  Vendor: LSI       Model: MegaRAID 8888ELP  Rev: 1.20
  Type:   Direct-Access                      ANSI SCSI revision: 05
sdg : very big device. try to use READ CAPACITY(16).
SCSI device sdg: 11707023360 512-byte hdwr sectors (5993996 MB)
sdg: Write Protect is off
sdg: Mode Sense: 1f 00 00 08
SCSI device sdg: drive cache: write back, no read (daft)
 sdg: unknown partition table
sd 2:2:2:0: Attached scsi disk sdg
sd 2:2:2:0: Attached scsi generic sg13 type 0
  Vendor: LSI       Model: MegaRAID 8888ELP  Rev: 1.20
  Type:   Direct-Access                      ANSI SCSI revision: 05
sdh : very big device. try to use READ CAPACITY(16).
SCSI device sdh: 11707023360 512-byte hdwr sectors (5993996 MB)
sdh: Write Protect is off
sdh: Mode Sense: 1f 00 00 08
SCSI device sdh: drive cache: write back, no read (daft)
 sdh: unknown partition table
sd 2:2:3:0: Attached scsi disk sdh
sd 2:2:3:0: Attached scsi generic sg14 type 0
  Vendor: LSI       Model: MegaRAID 8888ELP  Rev: 1.20
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdi: 1951170560 512-byte hdwr sectors (998999 MB)
sdi: Write Protect is off
sdi: Mode Sense: 1f 00 00 08
SCSI device sdi: drive cache: write back, no read (daft)
SCSI device sdi: 1951170560 512-byte hdwr sectors (998999 MB)
sdi: Write Protect is off
sdi: Mode Sense: 1f 00 00 08
SCSI device sdi: drive cache: write back, no read (daft)
 sdi: unknown partition table
sd 2:2:4:0: Attached scsi disk sdi
sd 2:2:4:0: Attached scsi generic sg15 type 0

[root@oss4 ~]# sg_map
/dev/sg0  /dev/sda
/dev/sg1  /dev/scd0
/dev/sg2
/dev/sg3
/dev/sg4
/dev/sg5  /dev/sdb
/dev/sg6  /dev/sdc
/dev/sg7  /dev/sdd
/dev/sg8
/dev/sg9
/dev/sg10
/dev/sg11  /dev/sde
/dev/sg12  /dev/sdf
/dev/sg13  /dev/sdg
/dev/sg14  /dev/sdh
/dev/sg15  /dev/sdi
/dev/sg16  /dev/sdj
/dev/sg17  /dev/sdk

Lustre-IOkit sgpdd-survey:
scsidevs=/dev/sg16

Result of sgpdd-survey:
[root@oss4 log]# sgpdd-survey
Can't find SG device for /dev/sg16, testing for partition
Can't find SG device /dev/sg1.
Do you have the sg module configured for your kernel?
[root@oss4 log]# lsmod | grep sg
sg                     70056  0
scsi_mod              187192  11 ib_iser,libiscsi,scsi_transport_iscsi,ib_srp,sr_mod,libata,megaraid_sas,sg,3w_9xxx,usb_storage,sd_mod

This is still using the "-16" option to each of the two sg_readcap calls:
sg_readcap -b -16

********************************************************************
On Thu, Jul 24, 2008 at 3:45 PM, Andreas Dilger <adilger at sun.com> wrote:
> On Jul 24, 2008 14:52 -0400, Ms. Megan Larko wrote:
>> Okay.  So I have a JBOD with 16 1 TB Hitachi Ultrastar SATA drives in
>> it connected to the OSS via an LSI 8888ELP controller card (with
>> battery back-up).  The JBOD is passed to the OSS as raw space.
> [...]
> Please see section 10.1 in the Lustre manual for more tips:
> http://manual.lustre.org/manual/LustreManual16_HTML/RAID.html
>
> Cheers, Andreas
Hi,

Additional info.

If I use "scsidevs=/dev/sdj" in /usr/bin/sgpdd-survey in place of the
/dev/sg16, I receive the following result:

Tue Jul 29 15:40:47 EDT 2008 sgpdd-survey on /dev/sdj from oss4.crew.local
total_size 17487872K rsz 1024 crg  1 thr  4 write 1 failed read 1 failed
total_size 17487872K rsz 1024 crg  1 thr  8 write 1 failed read 1 failed
total_size 17487872K rsz 1024 crg  1 thr 16 write 1 failed read 1 failed
total_size 17487872K rsz 1024 crg  2 thr  4 write 2 failed read 2 failed
total_size 17487872K rsz 1024 crg  2 thr  8 write 2 failed read 2 failed
total_size 17487872K rsz 1024 crg  2 thr 16 write 2 failed read 2 failed
total_size 17487872K rsz 1024 crg  2 thr 32 write 2 failed read 2 failed
total_size 17487872K rsz 1024 crg  4 thr  4 write 4 failed read 4 failed
total_size 17487872K rsz 1024 crg  4 thr  8 write 4 failed read 4 failed
total_size 17487872K rsz 1024 crg  4 thr 16 write 4 failed read 4 failed
total_size 17487872K rsz 1024 crg  4 thr 32 write 4 failed read 4 failed
total_size 17487872K rsz 1024 crg  4 thr 64 write 4 failed read 4 failed
total_size 17487872K rsz 1024 crg  8 thr  8 write 8 failed read 8 failed
total_size 17487872K rsz 1024 crg  8 thr 16 write 8 failed read 8 failed
total_size 17487872K rsz 1024 crg  8 thr 32 write 8 failed read 8 failed
total_size 17487872K rsz 1024 crg  8 thr 64 write 8 failed read 8 failed
total_size 17487872K rsz 1024 crg 16 thr 16 write 16 failed read 16 failed
total_size 17487872K rsz 1024 crg 16 thr 32 write 16 failed read 16 failed
total_size 17487872K rsz 1024 crg 16 thr 64 write 16 failed read 16 failed
total_size 17487872K rsz 1024 crg 32 thr 32 write 32 failed read 32 failed
total_size 17487872K rsz 1024 crg 32 thr 64 write 32 failed read 32 failed
total_size 17487872K rsz 1024 crg 64 thr 64 write 64 failed read 64 failed

All writes and reads fail, but it indicates that it found the device....

megan

On Tue, Jul 29, 2008 at 3:38 PM, Ms. Megan Larko <dobsonunit at gmail.com> wrote:
> Hello,
>
> I'm back to working on benchmarking the hardware prior to the Lustre
> filesystem installation.
> [...]
> Can lustre-iokit actually benchmark anything on an LSI 8888ELP card?
> Should I use another tool like IOzone?
>
> Thanks,
> megan
On Tue, 2008-07-29 at 15:38 -0400, Ms. Megan Larko wrote:
> Hello,

Hi,

> Lustre-IOkit sgpdd-survey:
> scsidevs=/dev/sg16

Per your followup message, you specify the block devices you want to
test, not their SG-mapped devices.  sgpdd-survey will figure out what
devices they are mapped to automatically.

b.
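Concretely, applied to the block devices from the earlier dmesg/sg_map listing, the relevant line in /usr/bin/sgpdd-survey would look something like the sketch below (the device names are the ones reported in this thread; a space-separated list lets the survey cover several LUNs in one run):

  # Name the block devices; sgpdd-survey finds the matching /dev/sgN itself.
  scsidevs="/dev/sdh /dev/sdi /dev/sdj"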
On Tue, 2008-07-29 at 15:43 -0400, Ms. Megan Larko wrote:
> Hi,
>
> Additional info.
>
> If I use "scsidevs=/dev/sdj" in /usr/bin/sgpdd-survey in place of the
> /dev/sg16

Yes, this is the correct syntax.

> I receive the following result:
> Tue Jul 29 15:40:47 EDT 2008 sgpdd-survey on /dev/sdj from oss4.crew.local
> total_size 17487872K rsz 1024 crg  1 thr  4 write 1 failed read 1 failed
> [...]
> total_size 17487872K rsz 1024 crg 64 thr 64 write 64 failed read 64 failed
>
> All writes and reads fail, but it indicates that it found the device....

Indeed.  So the question is, why are the reads and writes failing.

Do you have any files in /tmp named:

/tmp/sgpdd_survey_$(date)_$(uname -n).detail

If so, can you paste one here?

Alternatively you can try using sgp_dd to read a device.  The following
should work:

# sgp_dd if=/dev/sg16 of=/dev/null count=10 bs=512 time=1

and paste the result here.

b.
From megan:  Comments in-line.

On Wed, Jul 30 2008 7:12 am, "Brian J. Murrell" wrote:
> On Tue, 2008-07-29 at 15:43 -0400, Ms. Megan Larko wrote:
>> If I use "scsidevs=/dev/sdj" in /usr/bin/sgpdd-survey in place of the
>> /dev/sg16
>
> Yes, this is the correct syntax.
>
>> All writes and reads fail, but it indicates that it found the device....
>
> Indeed.  So the question is, why are the reads and writes failing.
>
> Do you have any files in /tmp named:
>
> /tmp/sgpdd_survey_$(date)_$(uname -n).detail
>
> If so, can you paste one here?

megan:  I am attaching the file from
/tmp/sgpdd_survey_2008-07-29 at 15:40_oss4.crew.local.detail
The complaint seems to be that the memory cannot be accessed.

> Alternatively you can try using sgp_dd to read a device.  The following
> should work:
>
> # sgp_dd if=/dev/sg16 of=/dev/null count=10 bs=512 time=1
>
> and paste the result here.

megan:  Pasting result--

[root@oss4 ~]# sgp_dd of=/dev/sg16 if=/dev/null count=10 bs=512 time=1
time to transfer data was 0.000121 secs
remaining block count=10
0+0 records in
0+0 records out

Note that a "cat /proc/meminfo" shows 16 GB RAM on the machine oss4.

[root@oss4 ~]# cat /proc/meminfo
MemTotal:     16439328 kB
MemFree:      16101332 kB
Buffers:         32260 kB
Cached:         205820 kB
---snip---

BTW, I am running iozone v. 3.283 on the OS drive, a RAID-6 JBOD disk
formatted ext3, and one of our existing Lustre disks, and the Lustre
system is doing well under iozone.

Thanks,
megan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sgpdd_survey_2008-07-29 at 15:40_oss4.crew.local.detail
Type: application/octet-stream
Size: 33332 bytes
URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20080731/a61eb44d/attachment-0001.obj
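One detail worth flagging in the pasted command: it has of=/dev/sg16 and if=/dev/null, i.e. it copies from /dev/null to the device rather than reading from it, which is why it returns immediately with "0+0 records in / 0+0 records out".  The read test being asked for would be:

  # Read ten 512-byte blocks from the sg device and discard them.
  sgp_dd if=/dev/sg16 of=/dev/null count=10 bs=512 time=1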
On Thu, 2008-07-31 at 14:12 -0400, Ms. Megan Larko wrote:
> megan:  I am attaching the file from
> /tmp/sgpdd_survey_2008-07-29 at 15:40_oss4.crew.local.detail
> The complaint seems to be that the memory cannot be accessed.

Allocated, not accessed:

==============> total_size 17487872K rsz 1024 crg 1 thr 4
=====> write
sg starting out command at "sgp_dd.c":843: Cannot allocate memory
=====> read
sg starting in command at "sgp_dd.c":784: Cannot allocate memory
==============> total_size 17487872K rsz 1024 crg 1 thr 8
=====> write
sg starting out command at "sgp_dd.c":843: Cannot allocate memory
=====> read
sg starting in command at "sgp_dd.c":784: Cannot allocate memory
==============> total_size 17487872K rsz 1024 crg 1 thr 16
=====> write
sg starting out command at "sgp_dd.c":843: Cannot allocate memory
=====> read
sg starting in command at "sgp_dd.c":784: Cannot allocate memory
==============> total_size 17487872K rsz 1024 crg 2 thr 4
=====> write
sg starting out command at "sgp_dd.c":843: Cannot allocate memory
sg starting out command at "sgp_dd.c":843: Cannot allocate memory
=====> read
sg starting in command at "sgp_dd.c":784: Cannot allocate memory
sg starting in command at "sgp_dd.c":784: Cannot allocate memory

So for whatever reason sgp_dd can't allocate memory.

> megan:  Pasting result--
> [root@oss4 ~]# sgp_dd of=/dev/sg16 if=/dev/null count=10 bs=512 time=1
> time to transfer data was 0.000121 secs
> remaining block count=10
> 0+0 records in
> 0+0 records out

Hrm.  The result doesn't look right.

> Note that a "cat /proc/meminfo" shows 16 GB RAM on the machine oss4.
> [root@oss4 ~]# cat /proc/meminfo
> MemTotal:     16439328 kB
> MemFree:      16101332 kB
> Buffers:         32260 kB
> Cached:         205820 kB

I don't really know why you'd be getting those errors then.  Buggy
version of sgp_dd maybe?  Buggy something else?

> BTW, I am running iozone v. 3.283 on the OS drive, a RAID-6 JBOD disk
> formatted ext3, and one of our existing Lustre disks, and the Lustre
> system is doing well under iozone.

Good.

b.
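On the "Cannot allocate memory" failures: one inexpensive thing to check with sgp_dd-based tools is the sg driver's per-open reserved buffer, which defaults to 32 kB and is much smaller than the 1024 kB record size the survey uses.  Whether that is the cause here is only a guess, but the knobs below are standard sg driver parameters and are cheap to try:

  # Current default reserved buffer (bytes) for newly opened sg devices:
  cat /proc/scsi/sg/def_reserved_size

  # Raise it to 1 MB to match rsz=1024 kB, either on the fly...
  echo 1048576 > /proc/scsi/sg/def_reserved_size
  # ...or by reloading the module with the parameter set:
  modprobe -r sg && modprobe sg def_reserved_size=1048576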