Damian Menscher
2005-Feb-16 21:02 UTC
mke2fs options for very large filesystems (and corruption!)
[sorry if this isn't threaded right... I just subscribed]

Theodore Ts'o wrote:
> There are two reasons for the reserve.  One is to reserve space on
> the partition containing /var and /etc for log files, etc.  The other
> is to avoid the performance degradation when the last 5-10% of the
> disk space is used.  (BSD actually reserves 10% by default.)  Given
> that the cost of a 200 GB disk is under $100 USD, the cost of
> reserving 10 GB is approximately $5.00 USD, or approximately the cost
> of a large cup of coffee at Starbucks.
>
> Why does the filesystem performance degrade?  Because as the last
> blocks of a filesystem are consumed, the probability of filesystem
> fragmentation goes up dramatically.  Exactly when this happens
> depends on the write patterns of the workload and the distribution of
> the file sizes on the filesystem, but on average this tends to happen
> when the utilization rises to 90-95%.

I'm working on something similar -- we're hoping to create a single
3.5TB ext3 filesystem.  For us, 5% is 175GB.  Since we're using
hot-swap SCSI disks instead of IDE, that comes out to well over $100,
so we'll be going with 1-2%.  Even then we're leaving several
gigabytes in reserve, which will probably be fine.  When we fill the
disk, I expect we'll be more concerned with having no free space and
less with its performance.

More importantly, I have some questions about whether ext3 can
realistically handle a filesystem this large.  fdisk created a DOS
partition table, which can't describe a partition this big, so I used
parted to create a 3.5TB GPT partition instead.  We were then able to
mkfs and mount the filesystem.
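For reference, the sequence looks roughly like this (illustrative
rather than an exact transcript -- the partition end value is
approximate, and the 1% reserve shown is what we're planning on; the
device name matches the logs below):

  # GPT label, since a DOS/MBR partition table can't describe a
  # partition this large
  parted /dev/sdb mklabel gpt
  parted /dev/sdb mkpart primary 0 3500GB   # units/syntax vary by parted version

  # ext3, with the reserved-block percentage lowered from the default 5% to 1%
  mke2fs -j -m 1 /dev/sdb1

  # (the reserve can also be changed later, without reformatting)
  # tune2fs -m 1 /dev/sdb1

  mount /dev/sdb1 /newprojects              # mount point illustrative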
As a test, I filled it with

  cp 1_gig_file 1_gig_file.###     (where ### ranged from 000 to 999)

and

  dd if=/dev/zero of=bigfile bs=10M

I expected the dd to fail once bigfile reached 2TB, but it actually
filled the rest of the filesystem, ending up as a 2.7TB file.

The next bit is slightly muddled:

  09:50  deleted the bigfile (this took a LONG time)
  10:04  pulled a drive on the raid array (hardware RAID5)
  10:12  errors started appearing in the logs, and the filesystem
         remounted itself read-only

Here are some sample errors:

Feb 16 10:12:05 hera kernel: EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in datazone - block = 1027000215, count = 1
Feb 16 10:12:05 hera kernel: Aborting journal on device sdb1.
Feb 16 10:12:05 hera kernel: EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in datazone - block = 3696983906, count = 1
Feb 16 10:12:05 hera kernel: EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in datazone - block = 2229194010, count = 1
Feb 16 10:12:05 hera kernel: EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in datazone - block = 1172112249, count = 1
Feb 16 10:12:05 hera kernel: EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in datazone - block = 3315908307, count = 1
Feb 16 10:12:05 hera kernel: ext3_free_blocks_sb: aborting transaction: Journal has aborted in __ext3_journal_get_undo_access<2>EXT3-fs error (device sdb1) in ext3_free_blocks_sb: Journal has aborted
...
Feb 16 10:12:10 hera kernel: EXT3-fs error (device sdb1) in ext3_delete_inode: Journal has aborted
Feb 16 10:12:10 hera kernel: __journal_remove_journal_head: freeing b_committed_data
Feb 16 10:12:10 hera last message repeated 5 times
Feb 16 10:12:10 hera kernel: ext3_abort called.
Feb 16 10:12:10 hera kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
Feb 16 10:12:10 hera kernel: Remounting filesystem read-only

My best guess of what happened here is that "bigfile" grew past the
maximum file size (it was a 2.7TB file, but the maximum file size for
ext3 is 2.0TB).  So when we deleted it, the delete probably freed data
blocks belonging to lots of other files.  Or at least tried to...
maybe these logs indicate the failure.

In any case, when we ran fsck, it had plenty of complaints.  Here's a
sample session:

[root@hera /]# fsck /dev/sdb1
fsck 1.35 (28-Feb-2004)
e2fsck 1.35 (28-Feb-2004)
/dev/sdb1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 17072 has illegal block(s). Clear<y>? yes
Illegal block #69645 (3666359908) in inode 17072. CLEARED.
Illegal block #69646 (3602993593) in inode 17072. CLEARED.
Illegal block #69647 (3034171662) in inode 17072. CLEARED.
Illegal block #69649 (1280104119) in inode 17072. CLEARED.
Illegal block #69650 (1548422666) in inode 17072. CLEARED.
Illegal block #69651 (2791173364) in inode 17072. CLEARED.
Illegal block #69652 (2279495497) in inode 17072. CLEARED.
Illegal block #69653 (3744042344) in inode 17072. CLEARED.
Illegal block #69655 (2639235121) in inode 17072. CLEARED.
Illegal block #69656 (3190320101) in inode 17072. CLEARED.
Illegal block #69657 (1758759856) in inode 17072. CLEARED.
Too many illegal blocks in inode 17072.
Clear inode<y>? yes
Inode 17124 has illegal block(s). Clear<y>? yes

Note that for each inode we cleared, we lost one of the 1GB test
files.  For example, check out this directory listing:

[root@hera newprojects]# ls -lart | more
total 835484832
?--------- ? ? ? ? ? rand_gig.797
?--------- ? ? ? ? ? rand_gig.788
?--------- ? ? ? ? ? rand_gig.786
?--------- ? ? ? ? ? rand_gig.785
?--------- ? ? ? ? ? rand_gig.733
?--------- ? ? ? ? ? rand_gig.653
?--------- ? ? ? ? ? rand_gig.628
?--------- ? ? ? ? ? rand_gig.627
?--------- ? ? ? ? ? rand_gig.621
?--------- ? ? ? ? ? rand_gig.583
?--------- ? ? ? ? ? rand_gig.577
?--------- ? ? ? ? ? rand_gig.559
?--------- ? ? ? ? ? rand_gig.405
?--------- ? ? ? ? ? rand_gig.393
drwxr-xr-x 3 root root 4096 Feb 14 15:29 ..
drwx------ 2 root root 16384 Feb 15 15:25 lost+found
-rw-r--r-- 1 root root 1073741824 Feb 15 16:37 rand_gig.3
-rw-r--r-- 1 root root 1073741824 Feb 15 16:41 rand_gig.4
-rw-r--r-- 1 root root 1073741824 Feb 15 16:44 rand_gig.5
...

We never let fsck finish completely, though perhaps it would have just
moved all of these to lost+found.

Anyway, there are obviously some serious problems with this setup, and
I suspect the problem is on the software side.  I'd appreciate hearing
any thoughts people might have on this.  Are we safe if we just avoid
creating files larger than 2TB in the future, or did something else go
wrong?  If creating a >2TB file really is possible at the user level,
and really can trash a filesystem, then this would appear to be a
serious bug.

Damian Menscher
-- 
-=#| Physics Grad Student & SysAdmin @ U Illinois Urbana-Champaign |#=-
-=#| 488 LLP, 1110 W. Green St, Urbana, IL 61801 Ofc:(217)333-0038 |#=-
-=#| 4602 Beckman, VMIL/MS, Imaging Technology Group:(217)244-3074 |#=-
-=#| <menscher@uiuc.edu> www.uiuc.edu/~menscher/ Fax:(217)333-9819 |#=-
-=#| The above opinions are not necessarily those of my employers. |#=-
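P.S.  A quick sanity check on the per-file limit, for anyone who wants
to compare: the ext3 limit depends on the filesystem block size, which
can be read out of the superblock (command is illustrative, run
against our device):

  tune2fs -l /dev/sdb1 | grep 'Block size'

With 4KB blocks the commonly quoted per-file limit is the 2TB figure I
mentioned above; smaller block sizes give considerably lower limits.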