Lionel Cons
2025-Apr-04 21:59 UTC
Support for transferring sparse files via scp/sftp correctly?
On Fri, 4 Apr 2025 at 07:07, Ron Frederick <ronf at timeheart.net> wrote:
>
> On Apr 3, 2025, at 6:02 PM, Darren Tucker <dtucker at dtucker.net> wrote:
> > On Sat, 29 Mar 2025 at 16:14, Ron Frederick <ronf at timeheart.net> wrote:
> >> [...]
> >> If you don't get all of the requested ranges in a single request, additional requests can be sent starting at just past the end of the last range previously returned.
> >>
> >> What do you think?
> >
> > That seems like it'd work well for things with SEEK_HOLE or equivalent, although there's always the chance of the underlying file changing between mapping it out and doing the transfer.
>
> Since my last message, I've also implemented support for this in Windows, which has a DeviceIOControl called FSCTL_QUERY_ALLOCATED_RANGES that returns an array of offset and length values, within a given range in a file (also specified by offset and length). So, it's almost a direct mapping to the extension I proposed. I basically have three different versions of a request_ranges() function (Windows, systems with SEEK_DATA/SEEK_HOLE, and a dummy implementation for all other platforms which just returns the full range passed in).
>
> The risk of missing data due to file changes is no different than what could happen if you were reading data sequentially and something did a write to the source file after you had already copied that part of the file.
>
> >
> > Damien pointed out that it's possible to do a reasonable but not perfect sparse file support by memcmp'ing your existing file buffer with a block of zeros and skipping the write if it matches. OpenBSD's cp(1) does this (look for "skipholes"): https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/bin/cp/utils.c?annotate=HEAD.

This should not be done. Either a system has SEEK_DATA/SEEK_HOLE, Win32 (Windows&ReactOS) FSCTL_QUERY_ALLOCATED_RANGES, or just copy all bytes.

The misunderstanding is that sequences of 0x00 bytes are automatically holes. That is not true. Holes represent ranges of "no data", and only for backwards compatibility read as 0x00 bytes. Valid data ranges can contain long sequences of 0x00 bytes, therefore PLEASE don't invent extra holes in sparse files just because they are sequences of 0x00 bytes.

Lionel
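
[Editorial sketch, not part of the original message.] The SEEK_DATA/SEEK_HOLE variant of the request_ranges() function discussed above could look roughly like the Python below. This is a minimal sketch under assumptions: the signature, the list of (offset, length) tuples it returns, and the fallback policy are illustrative choices, not the actual implementation from the thread. Where the platform lacks SEEK_DATA/SEEK_HOLE, or the seek fails for any reason other than "no more data", it returns the full requested range, in the spirit of the "dummy implementation" described in the quoted text.

```python
import errno
import os


def request_ranges(fd, offset, length):
    """Return [(offset, length), ...] for data ranges in [offset, offset + length).

    Hypothetical helper: name and return shape are assumptions for illustration.
    Note that os.lseek() moves the file position of fd as a side effect.
    """
    if length <= 0:
        return []
    end = offset + length
    # Fallback for platforms without SEEK_DATA/SEEK_HOLE: treat everything as data.
    if not (hasattr(os, "SEEK_DATA") and hasattr(os, "SEEK_HOLE")):
        return [(offset, length)]

    ranges = []
    pos = offset
    try:
        while pos < end:
            try:
                data = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # no data at or after pos (only a trailing hole is left)
                raise
            if data >= end:
                break
            # The end of the file counts as an implicit hole, so this terminates.
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            stop = min(hole, end)
            ranges.append((data, stop - data))
            pos = stop
    except OSError:
        # Unsupported file type or filesystem: report the whole range as data.
        return [(offset, length)]
    return ranges
```

A caller would read and send only the returned ranges. Bytes inside a returned range are copied as-is, including any runs of 0x00, which is the behaviour argued for in the message above. A filesystem may also report ranges larger than what was actually written.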
Ron Frederick
2025-Apr-04 23:42 UTC
Support for transferring sparse files via scp/sftp correctly?
Hi Lionel,

On Apr 4, 2025, at 2:59 PM, Lionel Cons <lionelcons1972 at gmail.com> wrote:
>> Damien pointed out that it's possible to do a reasonable but not perfect sparse file support by memcmp'ing your existing file buffer with a block of zeros and skipping the write if it matches. OpenBSD's cp(1) does this (look for "skipholes"): https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/bin/cp/utils.c?annotate=HEAD.
>
> This should not be done. Either a system has SEEK_DATA/SEEK_HOLE,
> Win32 (Windows&ReactOS) FSCTL_QUERY_ALLOCATED_RANGES, or just copy all
> bytes.
>
> The misunderstanding is that sequences of 0x00 bytes are automatically
> holes. That is not true. Holes represent ranges of "no data", and only
> for backwards compatibility read as 0x00 bytes. Valid data ranges can
> contain long sequences of 0x00 bytes, therefore PLEASE don't invent
> extra holes in sparse files just because they are sequences of 0x00
> bytes.

My current implementation matches what you describe, copying all the ranges marked as containing data regardless of the content. However, I am curious about what the concern would be. From a pure data reading perspective, the data should be identical and the "extra" holes which get created would allow the file to take up less space. Are you saying there are applications that actually make decisions based on the returned ranges from FSCTL_QUERY_ALLOCATED_RANGES (or SEEK_DATA/SEEK_HOLE) and behave differently on "no data" vs. null bytes? How would such code deal with the fact that the filesystem sometimes allocates a data range larger than the requested range of a write?

I currently have an argument for whether a copy will be sparse or not. I could imagine a separate argument to control this null-matching behavior (probably defaulting to off). Would that address your concern?

I don't know if this extra processing is really worth the trouble, but there are some cases where it might be valuable given the way sparse file range allocation works, at least on macOS. In experiments I've run, data ranges on macOS can be as small as 16 KB, but if you write to two different ranges within a 16 _megabyte_ region of a file, macOS will allocate a single data range that covers both of the ranges actually written plus all of the bytes in between them (showing up now as one big range with the middle filled with null bytes). It could be argued that this region between the two ranges is a "false" data range that really should have remained a hole. Code looking for null bytes could avoid having to read and forward potentially tens of megabytes in each of these "false" data ranges out over SFTP.

I haven't looked closely at Windows to see if it has a similar behavior, but I did see that its allocated data ranges tend to be at least 64 KB in size even if the actual writes are smaller than that. That's not as bad as the macOS case, but there could be some savings there.

-- 
Ron Frederick
ronf at timeheart.net
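
[Editorial sketch, not part of the original message.] The optional null-matching pass described above, the Python analogue of comparing a buffer against a block of zeros, might be sketched as follows. Everything here is an assumption for illustration: the function name, the 16 KB default block size (chosen only because it matches the smallest macOS range mentioned above), and the choice to merge adjacent non-zero blocks. It is meant as an opt-in step, since it converts zero-filled data into holes at the destination, which is exactly the behaviour objected to earlier in the thread.

```python
def nonzero_subranges(buf, base_offset=0, block_size=16 * 1024):
    """Split one data range into the sub-ranges whose blocks are not all zero.

    buf is the content of a data range that starts at base_offset in the file.
    Returns [(offset, length), ...] with adjacent non-zero blocks merged.
    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    zeros = bytes(block_size)  # reference block of 0x00 bytes to compare against
    ranges = []
    for start in range(0, len(buf), block_size):
        block = buf[start:start + block_size]
        if block == zeros[:len(block)]:
            continue  # all-zero block: skip it, leaving a hole at the destination
        off = base_offset + start
        if ranges and ranges[-1][0] + ranges[-1][1] == off:
            # Extend the previous range instead of starting a new one.
            prev_off, prev_len = ranges[-1]
            ranges[-1] = (prev_off, prev_len + len(block))
        else:
            ranges.append((off, len(block)))
    return ranges
```

For the macOS case above, a 16 MB data range containing only 16 KB of real data at each end would reduce to two 16 KB sub-ranges, so the zero-filled middle would not have to be sent over SFTP. The source file still has to be read locally to find the zeros.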
Darren Tucker
2025-Apr-05 10:08 UTC
Support for transferring sparse files via scp/sftp correctly?
On Sat, 5 Apr 2025 at 09:07, Lionel Cons <lionelcons1972 at gmail.com> wrote:
> On Fri, 4 Apr 2025 at 07:07, Ron Frederick <ronf at timeheart.net> wrote:
> >
> > On Apr 3, 2025, at 6:02 PM, Darren Tucker <dtucker at dtucker.net> wrote:
> [...]
> > > Damien pointed out that it's possible to do a reasonable but not perfect sparse file support by memcmp'ing your existing file buffer with a block of zeros and skipping the write if it matches. OpenBSD's cp(1) does this (look for "skipholes"):
> > > https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/bin/cp/utils.c?annotate=HEAD.
>
> This should not be done. Either a system has SEEK_DATA/SEEK_HOLE,
> Win32 (Windows&ReactOS) FSCTL_QUERY_ALLOCATED_RANGES, or just copy all
> bytes.

If there's a protocol extension I'd like for it to be able to support other use cases, not just the one you care about.

-- 
Darren Tucker (dtucker at dtucker.net)
GPG key 11EAA6FA / A86E 3E07 5B19 5880 E860 37F4 9357 ECEF 11EA A6FA
Good judgement comes with experience. Unfortunately, the experience usually comes from bad judgement.