On 2020-04-29 00:40, Jeremy Allison via samba wrote:
> On Mon, Apr 27, 2020 at 11:21:35PM +0200, A L wrote:
>> I set up the following test case:
>> * Linux 5.7-rc3 (with the patch from previous mail)
>> * samba-4.12.1
>> * gcc-9.3.0
>> * liburing-0.6
>> * glibc-2.30-r8
>>
>> ================================
>> Test 1)
>> Copy 10 10GB files.
>> 1) ddrescue -s 10G -v -f /dev/urandom 0.bin
>> 2) for((i=1;i<=10;i+=1)); do cp --reflink=always 0.bin $i.bin; done
>> 3) sha256sum *.bin > sha256sum.txt
>> 4) Windows 10, file explorer, copy the 10 files to a local disk D:\test\
>> 5) Verify local files in D:\test with sha256sum
>> 6) sha256sum was correct.
>> 7) redid step 4 and 5. Now sha256sum was wrong, but all 10 files had the
>> same (but wrong) csum!
>>
>>
>> ================================
>> Test 2)
>> Copy 1000 10MB files.
>> 1) ddrescue -s 10M -v -f /dev/urandom 0.bin
>> 2) for((i=1;i<=1000;i+=1)); do cp --reflink=always 0.bin $i.bin; done
>> 3) sha256sum *.bin > sha256sum.txt
>> 4) Windows 10, file explorer, copy all 1000 files to a local disk D:\test\
>> 5) Verify local files in D:\test with sha256sum
> I just tried to reproduce this using
> Samba master on Ubuntu 19.10 kernel 5.3.0-51-generic
> liburing-dev:0.4-2.
>
> I only tried with 100 files, and fetched
> them using smbclient "mget", and the results
> were always the same - identical sha256sum
> hashes on all files.
>
> We're going to need more info to track this
> down in your environment I'm afraid.
>

I'll do some more testing with locally mounted samba using the cifs module and also the smbclient tool.

Does mget use multiple concurrent threads? I notice that when I copy from Windows explorer (and using FastCopy), I have about 10 I/O threads with smbd, according to iotop.
On 2020-04-30 09:08, A L via samba wrote:
>
> On 2020-04-29 00:40, Jeremy Allison via samba wrote:
>> On Mon, Apr 27, 2020 at 11:21:35PM +0200, A L wrote:
>>> I set up the following test case:
>>> * Linux 5.7-rc3 (with the patch from previous mail)
>>> * samba-4.12.1
>>> * gcc-9.3.0
>>> * liburing-0.6
>>> * glibc-2.30-r8
>>>
>>> ================================
>>> Test 1)
>>> Copy 10 10GB files.
>>> 1) ddrescue -s 10G -v -f /dev/urandom 0.bin
>>> 2) for((i=1;i<=10;i+=1)); do cp --reflink=always 0.bin $i.bin; done
>>> 3) sha256sum *.bin > sha256sum.txt
>>> 4) Windows 10, file explorer, copy the 10 files to a local disk
>>> D:\test\
>>> 5) Verify local files in D:\test with sha256sum
>>> 6) sha256sum was correct.
>>> 7) redid step 4 and 5. Now sha256sum was wrong, but all 10 files had
>>> the
>>> same (but wrong) csum!
>>>
>>>
>>> ================================
>>> Test 2)
>>> Copy 1000 10MB files.
>>> 1) ddrescue -s 10M -v -f /dev/urandom 0.bin
>>> 2) for((i=1;i<=1000;i+=1)); do cp --reflink=always 0.bin $i.bin; done
>>> 3) sha256sum *.bin > sha256sum.txt
>>> 4) Windows 10, file explorer, copy all 1000 files to a local disk
>>> D:\test\
>>> 5) Verify local files in D:\test with sha256sum
>> I just tried to reproduce this using
>> Samba master on Ubuntu 19.10 kernel 5.3.0-51-generic
>> liburing-dev:0.4-2.
>>
>> I only tried with 100 files, and fetched
>> them using smbclient "mget", and the results
>> were always the same - identical sha256sum
>> hashes on all files.
>>
>> We're going to need more info to track this
>> down in your environment I'm afraid.
>>
> I'll do some more testing with locally mounted samba using the cifs
> module and also the smbclient tool.
>
> Does mget use multiple concurrent threads? I notice that when I copy
> from Windows explorer (and using FastCopy), I have about 10 I/O
> threads with smbd, according to iotop.
>

So I did some more tests. smbclient mget does not copy in the same way Windows Explorer does. When copying in Windows Explorer, there are many concurrent threads used to transfer the files. With smbclient mget there are no corruptions, both locally and over the network from another Linux machine.

I analysed the difference between a correct file and a corrupt file. At position 0x7A0000 the corrupt file started to contain only binary zeroes. At position 0x800000 the zeroes ended and correct data continued. To me it sounds like some wrong memory is copied somehow.

These two files show the difference as seen in a hex editor.
https://paste.tnonline.net/files/MO1FJvDOG6E8_smb_1
https://paste.tnonline.net/files/Rglite4KWmU8_smb_2

I will redo the tests with different Windows clients and see if that shows different results.
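
For anyone who wants to repeat the good-versus-corrupt comparison without a hex editor, here is a minimal Python sketch. It is not the tool used for the analysis above. The function name `corrupt_ranges` and the 4096-byte block size are assumptions made for illustration. It compares the two files block by block, merges adjacent differing blocks, and notes whether the bad copy holds only zero bytes in each differing range.

```python
BLOCK = 4096  # assumed comparison granularity (4 KiB); not taken from the thread


def corrupt_ranges(good_path, bad_path, block=BLOCK):
    """Return a list of (start, end, all_zero) tuples, end exclusive.

    Each tuple covers a run of adjacent blocks in which bad_path differs
    from good_path. all_zero is True when every differing block of the
    bad file consists solely of 0x00 bytes.
    """
    ranges = []  # each entry is [start, end, all_zero] until converted
    offset = 0
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        while True:
            a = good.read(block)
            b = bad.read(block)
            if not a and not b:
                break  # both files exhausted
            step = max(len(a), len(b))
            if a != b:
                zero = len(b) > 0 and not any(b)
                if ranges and ranges[-1][1] == offset:
                    # Extend the previous run: this block is adjacent to it.
                    ranges[-1][1] = offset + step
                    ranges[-1][2] = ranges[-1][2] and zero
                else:
                    ranges.append([offset, offset + step, zero])
            offset += step
    return [tuple(r) for r in ranges]
```

A call such as `corrupt_ranges("0.bin", "copy_of_0.bin")` (the second file name is hypothetical) returns the differing ranges, which can then be printed in hex and compared across several corrupt copies.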
On Thu, Apr 30, 2020 at 10:25:49AM +0200, A L wrote:
> So I did some more tests. smbclient mget does not copy in the same way
> Windows Explorer does. When copying in Windows Explorer, there are many
> multiple concurrent threads used to transfer the files. With smbclient mget
> there are no corruptions, both locally and over the network from another
> Linux machine.

Client threads can't be seen over the network :-). What matters is concurrent requests, and smbclient certainly does issue concurrent requests up until the credit window size of the server (or 256, whichever is smaller).

What would be really interesting is a comparative wireshark trace of the request pattern issued by the threads on a Windows client vs. the request pattern issued by smbclient.
On Thu, Apr 30, 2020 at 10:25:49AM +0200, A L wrote:
> So I did some more tests. smbclient mget does not copy in the same way
> Windows Explorer does. When copying in Windows Explorer, there are many
> multiple concurrent threads used to transfer the files. With smbclient mget
> there are no corruptions, both locally and over the network from another
> Linux machine.
>
> I analysed the difference between a correct file and a corrupt file.
> At position 0x7A0000 the corrupt file started to contain only binary zero.
> At position 0x800000 the zeroes ended and correct data continued. To me it
> sound like some wrong memory is copied somehow.
>
> These two files shows the difference as shown in a hex-editor.
> https://paste.tnonline.net/files/MO1FJvDOG6E8_smb_1
> https://paste.tnonline.net/files/Rglite4KWmU8_smb_2

Is it always the same area in the file that is corrupt? The fact that it's on a 4K page-aligned boundary is interesting.

If you can correlate it, I'd love to see the SMB2 traffic on the wire that corresponds to the corrupted data write/read.
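
The alignment observation can be checked with a little arithmetic on the two offsets reported in the quoted message. The sketch below is only an illustration: the helper name `describe_range` is made up here, and the 4096-byte page size is an assumption matching the "4K" mentioned above.

```python
PAGE_SIZE = 4096  # assumed 4 KiB page size, as in the "4K" remark above


def describe_range(start, end, page_size=PAGE_SIZE):
    """Summarise a byte range [start, end) in terms of size and page alignment."""
    length = end - start
    return {
        "length_bytes": length,
        "length_kib": length / 1024,
        "start_aligned": start % page_size == 0,
        "end_aligned": end % page_size == 0,
        "pages": length // page_size,
    }


# Offsets taken from the report: zeroes from 0x7A0000 up to 0x800000.
info = describe_range(0x7A0000, 0x800000)
print(info)
```

With those offsets, the zeroed region is 0x60000 bytes (384 KiB, or 96 pages of 4 KiB), both ends are page-aligned, and the region ends exactly at the 8 MiB mark. Running the same check on other corrupt copies would show whether the area is always the same.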