I've got a funky problem, and was wondering if anyone could help get me started with using dtrace to find the problem.

The setup:
Solaris 10 with all of the patches up to about a month ago.
Veritas Foundation Suite 4.1 MP1
8 TB SCSI Mirrored Disk on one array, SAN attached w/ 4x2G HBA's
30 TB SATA RAID-5 Disk on another array, also SAN attached w/ separate 2x2G HBA's

The problem:
We have a script which moves oracle files from the mirrored disk to the RAID-5 disk as the data ages. At the end of the day, the script is essentially doing a cp -p /expensivedisk/file1 /cheapdisk/file2. After the cp we do checksums on the files and compare for corruption. These are usually 2GB files, and about 1 in 500 get corrupted, where a contiguous 2K chunk of data gets zero'd in the destination file.

I'm wondering if I could use dtrace to narrow down the source of the corruption from somewhere between the disk, the hba, the driver, the OS, or veritas.

Thanks.
Carisdad
2006-Apr-11 17:54 UTC
[dtrace-discuss] Re: Disk corruption problem, where to start
Carisdad wrote:
> These are usually 2GB files, and about 1 in 500 get corrupted, where a
> contiguous 2K chunk of data gets zero'd in the destination file.

I actually mis-stated the corruption. It's a contiguous 48k (4 oracle blocks vs 4 disk blocks) of zero'd data.

Thanks again.
Andy Rumer
2006-Apr-11 19:37 UTC
[dtrace-discuss] Re: Disk corruption problem, where to start
> I actually mis-stated the corruption. It's a contiguous 48k (4 oracle
> blocks vs 4 disk blocks) of zero'd data.

And because I haven't been sleeping, make that 64K (4x16K oracle blocks) not 48K. Sheesh, I'm not really this incompetent, I promise.

> _______________________________________________
> dtrace-discuss mailing list
> dtrace-discuss at opensolaris.org

This message posted from opensolaris.org
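
Since the exact size of the zeroed region took a few tries to pin down, it may help to measure it directly before tracing anything. The following is a minimal sketch in Python, not something from the original posts: it compares a source file with its copy and reports each range where the copy holds zero bytes but the source does not. The function name, the 512-byte block granularity, and the 64K alignment check are all assumptions chosen for illustration.

```python
import sys


def find_zeroed_runs(src_path, dst_path, block=512, chunk_blocks=2048):
    """Return (offset, length) tuples for ranges where dst is zero-filled
    but differs from src. Granularity is `block` bytes (an assumption)."""
    zero = bytes(block)
    runs = []  # each entry is [offset, length]; adjacent blocks are merged
    offset = 0
    with open(src_path, "rb") as src, open(dst_path, "rb") as dst:
        while True:
            s = src.read(block * chunk_blocks)
            d = dst.read(block * chunk_blocks)
            if not s and not d:
                break
            if s != d:
                # Only scan block by block when the chunk actually differs.
                for i in range(0, max(len(s), len(d)), block):
                    sb, db = s[i:i + block], d[i:i + block]
                    if db and sb != db and db == zero[:len(db)]:
                        pos = offset + i
                        if runs and runs[-1][0] + runs[-1][1] == pos:
                            runs[-1][1] += len(db)  # extend the previous run
                        else:
                            runs.append([pos, len(db)])
            offset += max(len(s), len(d))
    return [(o, n) for o, n in runs]


if __name__ == "__main__":
    # Usage (paths are the examples from the thread):
    #   python zeroed_runs.py /expensivedisk/file1 /cheapdisk/file2
    for off, length in find_zeroed_runs(sys.argv[1], sys.argv[2]):
        aligned = "yes" if off % 65536 == 0 else "no"
        print(f"offset={off} length={length} 64K-aligned={aligned}")
```

Knowing whether the zeroed ranges always start on the same kind of boundary (a 16K Oracle block, a 64K multiple, or something else) may hint at which layer is dropping the write, though the script by itself cannot identify that layer.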
Wee Yeh Tan
2006-Apr-14 05:22 UTC
[dtrace-discuss] Re: Disk corruption problem, where to start
Andy,

I'd start by looking at who else could have written to the files in question during the entire copy/verify period. Brendan's rwsnoop will be a good start. Check out:
<http://users.tpg.com.au/adsln4yb/dtrace.html#DTraceToolkit>

--
Just me,
Wire ...

On 4/12/06, Andy Rumer <carisdad at gmail.com> wrote:
> And because I haven't been sleeping, make that 64K (4x16K oracle
> blocks) not 48K.