Hello,
We found that fsync and fdatasync syscalls sometimes don't sync
data in an ext3 file system under the following conditions.
1. Kernel version is 2.6.6 or later (including 2.6.8.1 and 2.6.9-rc2).
2. Ext3's journalling mode is "data=journal".
3. Create a 1 MB file, then unmount and remount the file system.
4. lseek to a random position within the file, write 8192 bytes
data, and fsync or fdatasync.
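
Steps 3 and 4 can be sketched in user space roughly as follows.
This is not the "diskio" tool, only a minimal illustration; the
function name timed_sync_after_write() and the fill pattern are
assumptions made for the example. It expects the file to exist
already (created and remounted as in step 3).

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define WRITE_SIZE 8192

/*
 * Hypothetical helper: write WRITE_SIZE bytes at a random offset
 * inside an existing file of size file_size, then time fsync()
 * (or fdatasync() when use_fdatasync is non-zero).
 * Returns the sync time in microseconds, or -1 on any error.
 */
long timed_sync_after_write(const char *path, off_t file_size,
                            int use_fdatasync)
{
    char buf[WRITE_SIZE];
    struct timeval start, end;
    long elapsed = -1;
    off_t pos;
    int fd, rc;

    if (file_size < WRITE_SIZE)
        return -1;

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    memset(buf, 0xA5, sizeof(buf));

    /* Pick an offset so the whole write stays within the file. */
    pos = (off_t)(rand() % (file_size - WRITE_SIZE + 1));

    if (lseek(fd, pos, SEEK_SET) == (off_t)-1)
        goto out;
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        goto out;

    /* Time only the sync call, not the write. */
    gettimeofday(&start, NULL);
    rc = use_fdatasync ? fdatasync(fd) : fsync(fd);
    gettimeofday(&end, NULL);
    if (rc != 0)
        goto out;

    elapsed = (end.tv_sec - start.tv_sec) * 1000000L
            + (end.tv_usec - start.tv_usec);
out:
    close(fd);
    return elapsed;
}
```

Calling timed_sync_after_write(path, 1024 * 1024, 0) repeatedly on
a data=journal mount and looking for implausibly small return
values would be one way to watch for the behaviour described here.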
We presume the data was not written to the disk before the fsync
or fdatasync syscall returned, based on the following evidence:
1. The response time of fsync() and fdatasync() was extremely
short.
We used the "diskio" tool, which can be downloaded from the OSDL
page (http://developer.osdl.jp/projects/doubt/). It showed a
response time under 10 microseconds. That is too short for the
data to have been transferred over the IDE and PCI buses!
2. The IDE write routine ide_start_dma() was not called, even
though DMA was enabled.
We inserted print messages into sys_write(), sys_fsync() and
ide_start_dma() with the attached patch. Sometimes the
"ide_start_dma: ..." message did not appear between "write: in
..." and "fsync: out ...".
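
The timing argument in item 1 can be checked with simple
arithmetic. The sketch below assumes the nominal peak rates of
100 MB/s for ATA100 and about 133 MB/s for a 32-bit/33 MHz PCI
bus; those figures and the helper name min_transfer_us() are
assumptions for illustration, not measurements from our system.

```c
/*
 * Lower bound, in microseconds, on the time needed to move `bytes`
 * over a link whose peak rate is `peak_bytes_per_sec`. Command
 * overhead, seek time and rotational latency are ignored; they
 * can only make the real time longer.
 */
double min_transfer_us(double bytes, double peak_bytes_per_sec)
{
    return bytes / peak_bytes_per_sec * 1e6;
}

/* Assumed nominal peak rates (not measured values). */
#define ATA100_BYTES_PER_SEC 100e6
#define PCI33_BYTES_PER_SEC  133e6
```

With these assumptions, min_transfer_us(8192, ATA100_BYTES_PER_SEC)
is about 82 microseconds and the PCI figure is about 62
microseconds, so a sync that returns in under 10 microseconds
cannot have moved the 8192 bytes across either bus.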
The problem has occurred since 2.6.5-bk1, which includes the patch
"[PATCH] ext3 fsync() and fdatasync() speedup". We found that
removing the part of that patch which modifies ext3_sync_file()
solves the problem. Perhaps i_state is not set to I_DIRTY when the
related page cache is dirty. Is that right?
The attached file is a tarball (tar + bzip2) containing the
following files. The patches are against kernel 2.6.8.1 (and also
apply to 2.6.9-rc2); the results were produced with kernel 2.6.8.1
as well.
- fsync.c.patch: patch for solving the problem
- kernel.printk.patch: patch for showing the problem by printks
- kernel.printk.log: printks from the kernel with the printk patch
- results/*: output from "diskio" program
(result for fsync, and O_SYNC for comparison)
- kernel-2.6.8.1.config: .config file which we used to build kernel
We used the following hardware to produce the data:
CPU: Intel Pentium 4 3GHz Hyper Threading, L2 cache 512KB
RAM: 1GBytes
IDE controller: on-board ICH5 (ATA100)
HDD: E-IDE (7200rpm, 8MB cache, ATA133) x 2
(one for system and the other for writing test data)
(The problem was also reproduced with a SCSI system)
Regards,
Seiji
--
Seiji Kihara
Open Source Software Computing Project,
NTT Cyber Space Laboratories, Yokosuka, JAPAN
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patches+results.tar.bz2
Type: application/octet-stream
Size: 80725 bytes
Desc: not available
URL:
<http://listman.redhat.com/archives/ext3-users/attachments/20040916/869218f6/attachment.obj>

Seiji Kihara <kihara.seiji at lab.ntt.co.jp> wrote:
>
> We found that fsync and fdatasync syscalls sometimes don't sync
> data in an ext3 file system under the following conditions.
>
> 1. Kernel version is 2.6.6 or later (including 2.6.8.1 and 2.6.9-rc2).
> 2. Ext3's journalling mode is "data=journal".
> 3. Create a file (whose size is 1Mbytes) and execute umount/mount.
> 4. lseek to a random position within the file, write 8192 bytes
> data, and fsync or fdatasync.
>
> We presume the data was not written to the corresponding disk
> before returning from fsync or fdatasync syscall on the evidence
> as follows:
>
> 1. The response time of fsync() and fdatasync() was extremely
> short.
>
> We use the "diskio" tool, which is downloadable from OSDL page
> (http://developer.osdl.jp/projects/doubt/). The program showed
> that the response time was under 10 microseconds. This time
> cannot be achieved with data transfer on IDE and PCI bus!
>
> 2. The IDE writing routine ide_start_dma() was not called under
> DMA enabled.
>
> We inserted the print messages in the sys_write(), sys_fsync()
> and ide_start_dma() by the attached patch. Sometimes the
> "ide_start_dma: ..." message was not shown between "write: in
> ..." and "fsync: out ...".
>
> The problem was occurred since 2.6.5-bk1, which includes the patch
> "[PATCH] ext3 fsync() and fdatasync() speedup". We found that the
> problem was solved by deleting the part of the patch which
> modifies ext3_sync_file(). Maybe, i_state is not correctly set to
> I_DIRTY when the related page cache is dirty (is it true?)

I forgot about this one.

> Attached file is tarball (tar + bzip2) which contains following
> files. The patches are for 2.6.8.1-kernel (applicable to
> 2.6.9-rc2), and the results were also produced with
> 2.6.8.1-kernel.

We really don't need a 100k tarball to communicate a three line
patch :(

Yes, the I_DIRTY test is bogus because data pages are not marked
dirty at write() time when the filesystem is mounted in
data=journal mode.

However your patch will disable the above optimisation for
data=writeback and data=ordered modes as well. I don't think
that's necessary?

How about this?

--- 25/fs/ext3/fsync.c~ext3-journal-data-fsync-fix	Thu Sep 16 14:47:21 2004
+++ 25-akpm/fs/ext3/fsync.c	Thu Sep 16 14:47:33 2004
@@ -49,10 +49,6 @@ int ext3_sync_file(struct file * file, s
 
 	J_ASSERT(ext3_journal_current_handle() == 0);
 
-	smp_mb();		/* prepare for lockless i_state read */
-	if (!(inode->i_state & I_DIRTY))
-		goto out;
-
 	/*
 	 * data=writeback:
 	 *  The caller's filemap_fdatawrite()/wait will sync the data.
@@ -76,6 +72,10 @@ int ext3_sync_file(struct file * file, s
 		goto out;
 	}
 
+	smp_mb();		/* prepare for lockless i_state read */
+	if (!(inode->i_state & I_DIRTY))
+		goto out;
+
 	/*
 	 * The VFS has written the file data.  If the inode is unaltered
 	 * then we need not start a commit.
_
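
The effect of moving the I_DIRTY test can be illustrated with a
toy user-space model. This is not kernel code: the enum names and
both functions are hypothetical, and the assumption that the block
ending in "goto out; }" just before the moved test is the
data=journal branch that forces a journal commit is inferred from
the hunk context and the explanation above, not shown in the
patch itself.

```c
/* Hypothetical model of the three ext3 data journalling modes. */
enum data_mode { DATA_WRITEBACK, DATA_ORDERED, DATA_JOURNAL };

/* Hypothetical outcomes of the modelled ext3_sync_file(). */
enum sync_action {
    ACTION_NOTHING,       /* early exit, nothing committed */
    ACTION_FORCE_COMMIT,  /* data=journal: commit the journal */
    ACTION_SYNC_INODE     /* other modes: write the dirty inode */
};

/* Before the fix: the I_DIRTY test runs first, in every mode. */
enum sync_action sync_action_before_fix(enum data_mode mode,
                                        int inode_dirty)
{
    if (!inode_dirty)
        return ACTION_NOTHING;  /* wrongly skips data=journal */
    if (mode == DATA_JOURNAL)
        return ACTION_FORCE_COMMIT;
    return ACTION_SYNC_INODE;
}

/* After the fix: data=journal is handled before the I_DIRTY test. */
enum sync_action sync_action_after_fix(enum data_mode mode,
                                       int inode_dirty)
{
    if (mode == DATA_JOURNAL)
        return ACTION_FORCE_COMMIT;
    if (!inode_dirty)
        return ACTION_NOTHING;  /* optimisation kept elsewhere */
    return ACTION_SYNC_INODE;
}
```

In this model a data=journal write that leaves the inode looking
clean is skipped before the fix and committed after it, while
data=writeback and data=ordered keep the early exit in both
versions.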