Hello everyone,

I wasn't planning on releasing v0.12 yet, and it was supposed to have some
initial support for multiple devices.  But, I have made a number of
performance fixes and small bug fixes, and I wanted to get them out there
before the (destabilizing) work on multiple-devices took over.

So, here's v0.12.  It comes with a shiny new disk format (sorry), but the gain
is dramatically better random writes to existing files.  In testing here, the
random write phase of tiobench went from 1MB/s to 30MB/s.  The fix was to
change the way back references for file extents were hashed.

Other changes:

Insert and delete multiple items at once in the btree where possible.  Back
references added more tree balances, and it showed up in a few benchmarks.
With v0.12, backrefs have no real impact on performance.

Optimize bio end_io routines.  Btrfs was spending way too much CPU time in the
bio end_io routines, leading to lock contention and other problems.

Optimize read ahead during transaction commit.  The old code was trying to
read far too much at once, which made the end_io problems really stand out.

mount -o ssd option, which clusters file data writes together regardless of
the directory the files belong to.  There are a number of other performance
tweaks for SSD, aimed at clustering metadata and data writes to better take
advantage of the hardware.

mount -o max_inline=size option, to override the default max inline file data
size (default is 8k).  Any value up to the leaf size is allowed (default 16k).

Simple -ENOSPC handling.  Emphasis on simple, but it prevents accidentally
filling the disk most of the time.  With enough threads/procs banging on
things, you can still easily crash the box.

-chris
On Sunday 10 February 2008, David Miller wrote:
> From: Chris Mason <chris.mason@oracle.com>
> Date: Wed, 6 Feb 2008 12:00:13 -0500
>
> This function never returns an error, so the simplest fix was to
> return the hash value which avoids all of the issues.  In attempting
> other schemes to fix this, I found it very difficult to give gcc
> a packed attribute for that "u64 *" argument other than to create
> some new pseudo structure which would have been ugly.
>

Many thanks, I clearly didn't put enough thought into the unaligned access
problems.

> Similar code lives in the btrfs kernel code too, I'll try to get a
> partition at least mounted and working minimally and if successful
> I'll send you patches for that too.

The kernel is actually worse, because the set/get macros are more complex.
Some live in ctree.h like in the progs, but the nasty ones live in
struct-funcs.c

-chris
On Tuesday 12 February 2008, David Miller wrote:
> From: Chris Mason <chris.mason@oracle.com>
> Date: Wed, 6 Feb 2008 12:00:13 -0500
>
> > So, here's v0.12.
>
> Any page size larger than 4K will not work with btrfs.  All of the
> extent stuff assumes that PAGE_SIZE <= sectorsize.

Yeah, there is definitely clean up to do in that area.

>
> I confirmed this by forcing mkfs.btrfs to use an 8K sectorsize on
> sparc64 and I was finally able to successfully mount a partition.

Nice

>
> With 4K there are zeros in the root tree node header, because its
> extent's location on disk is at a sub-PAGE_SIZE multiple and the
> extent code doesn't handle that.
>
> You really need to start validating this stuff on other platforms.
> Something that isn't little endian and something that doesn't use 4K
> pages.  I'm sure you have some powerpc parts around somewhere. :)

Grin, I think around v0.4 I grabbed a ppc box for a day and got things
working.  There has been some churn since then...

My first prio is the newest set of disk format changes, and then I'll sit down
and work on stability on a bunch of arches.

>
> Anyways, here is a patch for the kernel bits which fixes most of the
> unaligned accesses on sparc64.

Many thanks, I'll try these out here and push them into the tree.

-chris
From: Chris Mason <chris.mason@oracle.com>
Date: Wed, 6 Feb 2008 12:00:13 -0500

> So, here's v0.12.

I couldn't even make a filesystem on sparc64 without the following
patch.

The first problem is that these SETGET macros lose typing information.
The compiler therefore can't see the 'packed' attribute, and the code
takes unaligned access SIGBUS signals on sparc64 when trying to
dereference the member.

The next problem is a similar issue in btrfs_name_hash().  This gets
passed things like &key.offset, which is a member of a packed structure.
Having lost this packedness information, btrfs_name_hash() performs a
potentially unaligned memory access, again resulting in a SIGBUS.

This function never returns an error, so the simplest fix was to
return the hash value which avoids all of the issues.  In attempting
other schemes to fix this, I found it very difficult to give gcc
a packed attribute for that "u64 *" argument other than to create
some new pseudo structure which would have been ugly.

Similar code lives in the btrfs kernel code too, I'll try to get a
partition at least mounted and working minimally and if successful
I'll send you patches for that too.

diff -up --recursive --new-file vanilla/btrfs-progs-0.12/ctree.h btrfs-progs-0.12/ctree.h
--- vanilla/btrfs-progs-0.12/ctree.h	2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/ctree.h	2008-02-10 16:53:24.000000000 -0800
@@ -451,18 +451,16 @@ static inline void btrfs_set_##name(stru
 static inline u##bits btrfs_##name(struct extent_buffer *eb, \
 				   type *s) \
 { \
-	unsigned long offset = (unsigned long)s + \
-				offsetof(type, member); \
-	__le##bits *tmp = (__le##bits *)(eb->data + offset); \
-	return le##bits##_to_cpu(*tmp); \
+	unsigned long offset = (unsigned long)s; \
+	type *p = (type *) (eb->data + offset); \
+	return le##bits##_to_cpu(p->member); \
 } \
 static inline void btrfs_set_##name(struct extent_buffer *eb, \
 				    type *s, u##bits val) \
 { \
-	unsigned long offset = (unsigned long)s + \
-				offsetof(type, member); \
-	__le##bits *tmp = (__le##bits *)(eb->data + offset); \
-	*tmp = cpu_to_le##bits(val); \
+	unsigned long offset = (unsigned long)s; \
+	type *p = (type *) (eb->data + offset); \
+	p->member = cpu_to_le##bits(val); \
 }
 
 #define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits) \
diff -up --recursive --new-file vanilla/btrfs-progs-0.12/dir-item.c btrfs-progs-0.12/dir-item.c
--- vanilla/btrfs-progs-0.12/dir-item.c	2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/dir-item.c	2008-02-10 17:03:34.000000000 -0800
@@ -71,8 +71,7 @@ int btrfs_insert_xattr_item(struct btrfs
 
 	key.objectid = dir;
 	btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
-	ret = btrfs_name_hash(name, name_len, &key.offset);
-	BUG_ON(ret);
+	key.offset = btrfs_name_hash(name, name_len);
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -122,8 +121,7 @@ int btrfs_insert_dir_item(struct btrfs_t
 
 	key.objectid = dir;
 	btrfs_set_key_type(&key, BTRFS_DIR_ITEM_KEY);
-	ret = btrfs_name_hash(name, name_len, &key.offset);
-	BUG_ON(ret);
+	key.offset = btrfs_name_hash(name, name_len);
 	path = btrfs_alloc_path();
 	data_size = sizeof(*dir_item) + name_len;
 	dir_item = insert_with_overflow(trans, root, path, &key, data_size,
@@ -196,8 +194,7 @@ struct btrfs_dir_item *btrfs_lookup_dir_
 
 	key.objectid = dir;
 	btrfs_set_key_type(&key, BTRFS_DIR_ITEM_KEY);
-	ret = btrfs_name_hash(name, name_len, &key.offset);
-	BUG_ON(ret);
+	key.offset = btrfs_name_hash(name, name_len);
 
 	ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
 	if (ret < 0)
@@ -258,8 +255,7 @@ struct btrfs_dir_item *btrfs_lookup_xatt
 
 	key.objectid = dir;
 	btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
-	ret = btrfs_name_hash(name, name_len, &key.offset);
-	BUG_ON(ret);
+	key.offset = btrfs_name_hash(name, name_len);
 	ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
 	if (ret < 0)
 		return ERR_PTR(ret);
diff -up --recursive --new-file vanilla/btrfs-progs-0.12/dir-test.c btrfs-progs-0.12/dir-test.c
--- vanilla/btrfs-progs-0.12/dir-test.c	2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/dir-test.c	2008-02-10 17:03:49.000000000 -0800
@@ -129,8 +129,8 @@ error:
 				    struct btrfs_dir_item);
 		found = (char *)(di + 1);
 		found_len = btrfs_dir_name_len(di);
-		btrfs_name_hash(buf, strlen(buf), &myhash);
-		btrfs_name_hash(found, found_len, &foundhash);
+		myhash = btrfs_name_hash(buf, strlen(buf));
+		foundhash = btrfs_name_hash(found, found_len);
 		if (myhash != foundhash)
 			goto fatal_release;
 		btrfs_release_path(root, &path);
diff -up --recursive --new-file vanilla/btrfs-progs-0.12/hash.c btrfs-progs-0.12/hash.c
--- vanilla/btrfs-progs-0.12/hash.c	2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/hash.c	2008-02-10 17:02:36.000000000 -0800
@@ -75,12 +75,13 @@ static void str2hashbuf(const char *msg,
 		*buf++ = pad;
 }
 
-int btrfs_name_hash(const char *name, int len, u64 *hash_result)
+u64 btrfs_name_hash(const char *name, int len)
 {
 	__u32 hash;
 	__u32 minor_hash = 0;
 	const char *p;
 	__u32 in[8], buf[2];
+	u64 hash_result;
 
 	/* Initialize the default seed for the hash checksum functions */
 	buf[0] = 0x67452301;
@@ -97,8 +98,8 @@ int btrfs_name_hash(const char *name, in
 	}
 	hash = buf[0];
 	minor_hash = buf[1];
-	*hash_result = buf[0];
-	*hash_result <<= 32;
-	*hash_result |= buf[1];
-	return 0;
+	hash_result = buf[0];
+	hash_result <<= 32;
+	hash_result |= buf[1];
+	return hash_result;
 }
diff -up --recursive --new-file vanilla/btrfs-progs-0.12/hash.h btrfs-progs-0.12/hash.h
--- vanilla/btrfs-progs-0.12/hash.h	2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/hash.h	2008-02-10 17:02:21.000000000 -0800
@@ -18,5 +18,5 @@
 
 #ifndef __HASH__
 #define __HASH__
-int btrfs_name_hash(const char *name, int len, u64 *hash_result);
+u64 btrfs_name_hash(const char *name, int len);
 #endif
diff -up --recursive --new-file vanilla/btrfs-progs-0.12/hasher.c btrfs-progs-0.12/hasher.c
--- vanilla/btrfs-progs-0.12/hasher.c	2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/hasher.c	2008-02-10 17:04:04.000000000 -0800
@@ -35,8 +35,7 @@ int main() {
 			continue;
 		if (line[strlen(line)-1] == '\n')
 			line[strlen(line)-1] = '\0';
-		ret = btrfs_name_hash(line, strlen(line), &result);
-		BUG_ON(ret);
+		result = btrfs_name_hash(line, strlen(line));
 		printf("hash returns %llu\n", (unsigned long long)result);
 	}
 	return 0;
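
For readers following along, the alignment problem this patch addresses can be
reproduced in a few lines of userspace C.  This is only a sketch under
assumptions: struct demo_item and the demo_* functions are made-up names, not
btrfs code, and the little-endian conversion done by the real macros is left
out.  It relies on the GCC/Clang packed and may_alias attributes.  The memcpy
variant is a common portable alternative and is not what the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical on-disk record (not a real btrfs structure).  Because the
 * struct is packed, 'offset' sits at byte 1 and is never naturally aligned.
 * may_alias lets us overlay the struct on a plain byte buffer under GCC's
 * aliasing rules (the Linux kernel instead builds with -fno-strict-aliasing).
 */
struct demo_item {
	uint8_t  type;
	uint64_t offset;
} __attribute__((packed, may_alias));

/*
 * Typed access, in the spirit of the patched macros: the compiler sees a
 * member of a packed struct and emits loads that are safe when unaligned.
 * Casting to a bare "uint64_t *" instead would drop that information and
 * can trap (SIGBUS) on strict-alignment CPUs such as sparc64.
 */
static uint64_t demo_get_offset(const unsigned char *buf, size_t pos)
{
	const struct demo_item *p;

	assert(buf != NULL);
	p = (const struct demo_item *)(buf + pos);
	return p->offset;
}

static void demo_set_offset(unsigned char *buf, size_t pos, uint64_t val)
{
	struct demo_item *p;

	assert(buf != NULL);
	p = (struct demo_item *)(buf + pos);
	p->offset = val;
}

/* Portable alternative: copy the bytes out instead of dereferencing. */
static uint64_t demo_get_offset_memcpy(const unsigned char *buf, size_t pos)
{
	uint64_t v;

	assert(buf != NULL);
	memcpy(&v, buf + pos + offsetof(struct demo_item, offset), sizeof(v));
	return v;
}
```

Writing a value with demo_set_offset() at an odd buffer position and reading
it back with either getter should give the same value on any architecture,
which is the property the original macros lacked.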
Filesystems like ext2 put their superblock 1 block into the partition
in order to avoid overwriting disk labels and other uglies.  UFS does
this too, as do several others.  One of the few exceptions I've been
able to find is XFS.

This is a real issue on sparc, where the default Sun disk labels
created use an initial partition where block zero aliases the disk
label.  It took me a few iterations before I figured out why every
btrfs make would zero out my disk label :-/
The CRC32C implementation in the btrfs progs is different from the one
in the kernel, so obviously nothing can possibly work on big-endian.

This is getting less and less fun by the minute, I simply wanted to
test btrfs on Niagara :-/

Here is a patch to fix that:

--- vanilla/btrfs-progs-0.12/crc32c.c	2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/crc32c.c	2008-02-12 01:19:33.000000000 -0800
@@ -91,13 +91,11 @@ static const u32 crc32c_table[256] = {
  * crc using table.
  */
 
-u32 crc32c_le(u32 seed, unsigned char const *data, size_t length)
+u32 crc32c_le(u32 crc, unsigned char const *data, size_t length)
 {
-	u32 crc = (__force __u32)(cpu_to_le32(seed));
-
 	while (length--)
 		crc =
 			crc32c_table[(crc ^ *data++) & 0xFFL] ^ (crc >> 8);
 
-	return le32_to_cpu((__force __le32)crc);
+	return crc;
 }
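
As a note on why the byte swaps can simply be dropped: a byte-at-a-time table
lookup only shifts and XORs the value of crc, so the result does not depend on
host byte order.  Below is a minimal sketch, assuming the reflected Castagnoli
polynomial 0x82F63B78 and building the table at run time instead of using the
static crc32c_table from the progs.  demo_crc32c and demo_crc32c_init are
made-up names.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected form of the CRC-32C (Castagnoli) polynomial. */
#define DEMO_CRC32C_POLY 0x82F63B78u

static uint32_t demo_crc32c_table[256];
static int demo_crc32c_table_ready;

/* Build the 256-entry lookup table, one bit at a time per entry. */
static void demo_crc32c_init(void)
{
	uint32_t i;
	int k;

	for (i = 0; i < 256; i++) {
		uint32_t c = i;

		for (k = 0; k < 8; k++)
			c = (c & 1) ? (c >> 1) ^ DEMO_CRC32C_POLY : c >> 1;
		demo_crc32c_table[i] = c;
	}
	demo_crc32c_table_ready = 1;
}

/*
 * Same loop shape as the patched crc32c_le(): the seed is used as-is and
 * the running value is returned as-is.  No cpu_to_le32()/le32_to_cpu() is
 * needed because only arithmetic on the integer value is involved.
 */
static uint32_t demo_crc32c(uint32_t crc, const unsigned char *data,
			    size_t length)
{
	if (!demo_crc32c_table_ready)
		demo_crc32c_init();

	while (length--)
		crc = demo_crc32c_table[(crc ^ *data++) & 0xFF] ^ (crc >> 8);

	return crc;
}
```

With a seed of 0xFFFFFFFF and a final inversion, the ASCII string "123456789"
yields the standard CRC-32C check value 0xE3069283 on both little-endian and
big-endian hosts.  Byte order only matters later, when the 32-bit result is
stored in an on-disk field.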
From: Chris Mason <chris.mason@oracle.com>
Date: Wed, 6 Feb 2008 12:00:13 -0500

> So, here's v0.12.

Any page size larger than 4K will not work with btrfs.  All of the
extent stuff assumes that PAGE_SIZE <= sectorsize.

I confirmed this by forcing mkfs.btrfs to use an 8K sectorsize on
sparc64 and I was finally able to successfully mount a partition.

With 4K there are zeros in the root tree node header, because its
extent's location on disk is at a sub-PAGE_SIZE multiple and the
extent code doesn't handle that.

You really need to start validating this stuff on other platforms.
Something that isn't little endian and something that doesn't use 4K
pages.  I'm sure you have some powerpc parts around somewhere. :)

Anyways, here is a patch for the kernel bits which fixes most of the
unaligned accesses on sparc64.

diff -u --recursive --new-file vanilla/btrfs-0.12/ctree.h btrfs-0.12/ctree.h
--- vanilla/btrfs-0.12/ctree.h	2008-02-06 08:37:39.000000000 -0800
+++ btrfs-0.12/ctree.h	2008-02-10 17:17:49.000000000 -0800
@@ -495,22 +495,17 @@
 #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits) \
 static inline u##bits btrfs_##name(struct extent_buffer *eb) \
 { \
-	char *kaddr = kmap_atomic(eb->first_page, KM_USER0); \
-	unsigned long offset = offsetof(type, member); \
-	u##bits res; \
-	__le##bits *tmp = (__le##bits *)(kaddr + offset); \
-	res = le##bits##_to_cpu(*tmp); \
-	kunmap_atomic(kaddr, KM_USER0); \
+	type *p = kmap_atomic(eb->first_page, KM_USER0); \
+	u##bits res = le##bits##_to_cpu(p->member); \
+	kunmap_atomic(p, KM_USER0); \
 	return res; \
 } \
 static inline void btrfs_set_##name(struct extent_buffer *eb, \
 				    u##bits val) \
 { \
-	char *kaddr = kmap_atomic(eb->first_page, KM_USER0); \
-	unsigned long offset = offsetof(type, member); \
-	__le##bits *tmp = (__le##bits *)(kaddr + offset); \
-	*tmp = cpu_to_le##bits(val); \
-	kunmap_atomic(kaddr, KM_USER0); \
+	type *p = kmap_atomic(eb->first_page, KM_USER0); \
+	p->member = cpu_to_le##bits(val); \
+	kunmap_atomic(p, KM_USER0); \
 }
 
 #define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits) \
diff -u --recursive --new-file vanilla/btrfs-0.12/dir-item.c btrfs-0.12/dir-item.c
--- vanilla/btrfs-0.12/dir-item.c	2008-02-06 08:37:39.000000000 -0800
+++ btrfs-0.12/dir-item.c	2008-02-10 17:20:00.000000000 -0800
@@ -71,8 +71,7 @@
 
 	key.objectid = dir;
 	btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
-	ret = btrfs_name_hash(name, name_len, &key.offset);
-	BUG_ON(ret);
+	key.offset = btrfs_name_hash(name, name_len);
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -125,8 +124,7 @@
 
 	key.objectid = dir;
 	btrfs_set_key_type(&key, BTRFS_DIR_ITEM_KEY);
-	ret = btrfs_name_hash(name, name_len, &key.offset);
-	BUG_ON(ret);
+	key.offset = btrfs_name_hash(name, name_len);
 	path = btrfs_alloc_path();
 	data_size = sizeof(*dir_item) + name_len;
 	dir_item = insert_with_overflow(trans, root, path, &key, data_size,
@@ -199,8 +197,7 @@
 
 	key.objectid = dir;
 	btrfs_set_key_type(&key, BTRFS_DIR_ITEM_KEY);
-	ret = btrfs_name_hash(name, name_len, &key.offset);
-	BUG_ON(ret);
+	key.offset = btrfs_name_hash(name, name_len);
 
 	ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
 	if (ret < 0)
@@ -261,8 +258,7 @@
 
 	key.objectid = dir;
 	btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
-	ret = btrfs_name_hash(name, name_len, &key.offset);
-	BUG_ON(ret);
+	key.offset = btrfs_name_hash(name, name_len);
 	ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
 	if (ret < 0)
 		return ERR_PTR(ret);
diff -u --recursive --new-file vanilla/btrfs-0.12/hash.c btrfs-0.12/hash.c
--- vanilla/btrfs-0.12/hash.c	2008-02-06 08:37:39.000000000 -0800
+++ btrfs-0.12/hash.c	2008-02-10 17:19:19.000000000 -0800
@@ -76,19 +76,18 @@
 		*buf++ = pad;
 }
 
-int btrfs_name_hash(const char *name, int len, u64 *hash_result)
+u64 btrfs_name_hash(const char *name, int len)
 {
 	__u32 hash;
 	__u32 minor_hash = 0;
 	const char *p;
 	__u32 in[8], buf[2];
+	u64 hash_result;
 
 	if (len == 1 && *name == '.') {
-		*hash_result = 1;
-		return 0;
+		return 1;
 	} else if (len == 2 && name[0] == '.' && name[1] == '.') {
-		*hash_result = 2;
-		return 0;
+		return 2;
 	}
 
 	/* Initialize the default seed for the hash checksum functions */
@@ -106,8 +105,8 @@
 	}
 	hash = buf[0];
 	minor_hash = buf[1];
-	*hash_result = buf[0];
-	*hash_result <<= 32;
-	*hash_result |= buf[1];
-	return 0;
+	hash_result = buf[0];
+	hash_result <<= 32;
+	hash_result |= buf[1];
+	return hash_result;
 }
diff -u --recursive --new-file vanilla/btrfs-0.12/hash.h btrfs-0.12/hash.h
--- vanilla/btrfs-0.12/hash.h	2008-02-06 08:37:39.000000000 -0800
+++ btrfs-0.12/hash.h	2008-02-10 17:19:25.000000000 -0800
@@ -18,5 +18,5 @@
 
 #ifndef __HASH__
 #define __HASH__
-int btrfs_name_hash(const char *name, int len, u64 *hash_result);
+u64 btrfs_name_hash(const char *name, int len);
 #endif
diff -u --recursive --new-file vanilla/btrfs-0.12/struct-funcs.c btrfs-0.12/struct-funcs.c
--- vanilla/btrfs-0.12/struct-funcs.c	2008-02-06 08:37:39.000000000 -0800
+++ btrfs-0.12/struct-funcs.c	2008-02-11 22:50:46.000000000 -0800
@@ -21,16 +21,15 @@
 u##bits btrfs_##name(struct extent_buffer *eb, \
 				   type *s) \
 { \
-	unsigned long offset = (unsigned long)s + \
-				offsetof(type, member); \
-	__le##bits *tmp; \
+	unsigned long part_offset = (unsigned long)s; \
+	unsigned long offset = part_offset + offsetof(type, member); \
+	type *p; \
 	/* ugly, but we want the fast path here */ \
 	if (eb->map_token && offset >= eb->map_start && \
 	    offset + sizeof(((type *)0)->member) <= eb->map_start + \
 	    eb->map_len) { \
-		tmp = (__le##bits *)(eb->kaddr + offset - \
-				     eb->map_start); \
-		return le##bits##_to_cpu(*tmp); \
+		p = (type *)(eb->kaddr + part_offset - eb->map_start); \
+		return le##bits##_to_cpu(p->member); \
 	} \
 	{ \
 		int err; \
@@ -48,8 +47,8 @@
 			read_eb_member(eb, s, type, member, &res); \
 			return le##bits##_to_cpu(res); \
 		} \
-		tmp = (__le##bits *)(kaddr + offset - map_start); \
-		res = le##bits##_to_cpu(*tmp); \
+		p = (type *)(kaddr + part_offset - map_start); \
+		res = le##bits##_to_cpu(p->member); \
 		if (unmap_on_exit) \
 			unmap_extent_buffer(eb, map_token, KM_USER1); \
 		return res; \
@@ -58,16 +57,15 @@
 void btrfs_set_##name(struct extent_buffer *eb, \
 				    type *s, u##bits val) \
 { \
-	unsigned long offset = (unsigned long)s + \
-				offsetof(type, member); \
-	__le##bits *tmp; \
+	unsigned long part_offset = (unsigned long)s; \
+	unsigned long offset = part_offset + offsetof(type, member); \
+	type *p; \
 	/* ugly, but we want the fast path here */ \
 	if (eb->map_token && offset >= eb->map_start && \
 	    offset + sizeof(((type *)0)->member) <= eb->map_start + \
 	    eb->map_len) { \
-		tmp = (__le##bits *)(eb->kaddr + offset - \
-				     eb->map_start); \
-		*tmp = cpu_to_le##bits(val); \
+		p = (type *)(eb->kaddr + part_offset - eb->map_start); \
+		p->member = cpu_to_le##bits(val); \
 		return; \
 	} \
 	{ \
@@ -86,8 +84,8 @@
 			write_eb_member(eb, s, type, member, &val); \
 			return; \
 		} \
-		tmp = (__le##bits *)(kaddr + offset - map_start); \
-		*tmp = cpu_to_le##bits(val); \
+		p = (type *)(kaddr + part_offset - map_start); \
+		p->member = cpu_to_le##bits(val); \
 		if (unmap_on_exit) \
 			unmap_extent_buffer(eb, map_token, KM_USER1); \
 	} \
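
The btrfs_name_hash() interface change in this patch can also be illustrated
in isolation.  The following is a rough sketch, not the btrfs code: struct
demo_key, DEMO_DIR_ITEM_TYPE, demo_name_hash() and demo_make_dir_key() are
invented for the example, and the hash body is a stand-in (64-bit FNV-1a), not
the hash btrfs uses.  Only the "." and ".." special cases mirror the kernel
hunk above.

```c
#include <stdint.h>

/* Hypothetical type tag; the value is arbitrary. */
#define DEMO_DIR_ITEM_TYPE 1

/* Hypothetical packed key: 'offset' lands at byte 9, so it is misaligned. */
struct demo_key {
	uint64_t objectid;
	uint8_t  type;
	uint64_t offset;
} __attribute__((packed));

/*
 * Returns the hash by value.  The caller never has to hand out a plain
 * "uint64_t *" pointing into a packed struct, which is where the packed
 * information was being lost.
 */
static uint64_t demo_name_hash(const char *name, int len)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV-1a offset basis */
	int i;

	/* Same special cases as the kernel version in the patch. */
	if (len == 1 && name[0] == '.')
		return 1;
	if (len == 2 && name[0] == '.' && name[1] == '.')
		return 2;

	for (i = 0; i < len; i++) {
		h ^= (unsigned char)name[i];
		h *= 0x100000001b3ULL;		/* FNV-1a 64-bit prime */
	}
	return h;
}

static struct demo_key demo_make_dir_key(uint64_t dir, const char *name,
					 int len)
{
	struct demo_key key;

	key.objectid = dir;
	key.type = DEMO_DIR_ITEM_TYPE;
	/* Assignment to a packed member: the compiler emits a safe store. */
	key.offset = demo_name_hash(name, len);
	return key;
}
```

Because the assignment happens where the compiler can still see that
key.offset belongs to a packed struct, the store is generated safely and
there is no error code left to check with BUG_ON().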