ocfs2 currently uses the Journaled Block Device (JBD) for its journaling. This is a very stable and tested codebase. However, JBD is limited by architecture to 32bit block numbers. This means an ocfs2 filesystem is limited to 2^32 blocks. With a 4K blocksize, that's 16TB. People want larger volumes. Fortunately, there is now JBD2. JBD2 adds 64bit block number support and some other features. JBD2 is backwards compatible, so it can be used with existing filesystems. The new features won't be accessed until the ocfs2-tools turn them on. Before we support JBD2, however, we need to ensure that inode blocks are allocated below that 32bit boundary. stat(2) cannot handle larger inode numbers on 32bit machines. With that limit in place, we can fully support JBD2 in the kernel filesystem. Like XFS, we provide the 'inode64' option to remove that limit. A config option is provided (OCFS2_COMPAT_JBD) to force ocfs2 to use the old JBD. This is for the paranoid while we test JBD2. The kernel code is available on the 'jbd2' branch of my git repository. View: http://oss.oracle.com/git/?p=jlbec/linux-2.6.git;a=shortlog;h=jbd2 Pull: git pull git://oss.oracle.com/git/jlbec/linux-2.6.git jbd2
Joel Becker
2008-Sep-04 03:03 UTC
[Ocfs2-devel] [PATCH 1/3] ocfs2: Limit inode allocation to 32bits.
ocfs2 inode numbers are block numbers. For any filesystem with less than 2^32 blocks, this is not a problem. However, when ocfs2 starts using JDB2, it will be able to support filesystems with more than 2^32 blocks. This would result in inode numbers higher than 2^32. The problem is that stat(2) can't handle those numbers on 32bit machines. The simple solution is to have ocfs2 allocate all inodes below that boundary. The suballoc code is changed to honor an optional block limit. Only the inode suballocator sets that limit - all other allocations stay unlimited. The biggest trick is to grow the inode suballocator beneath that limit. There's no point in allocating block groups that are above the limit, then rejecting their elements later on. We want to prevent the inode allocator from ever having block groups above the limit. This involves a little gyration with the local alloc code. If the local alloc window is above the limit, it signals the caller to try the global bitmap but does not disable the local alloc file (which can be used for other allocations). Signed-off-by: Joel Becker <joel.becker at oracle.com> --- fs/ocfs2/localalloc.c | 55 +++++++++++++++++++++++++++++++ fs/ocfs2/suballoc.c | 86 +++++++++++++++++++++++++++++++++++++++--------- fs/ocfs2/suballoc.h | 11 ++++-- 3 files changed, 132 insertions(+), 20 deletions(-) diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c index 28e492e..4b15885 100644 --- a/fs/ocfs2/localalloc.c +++ b/fs/ocfs2/localalloc.c @@ -453,6 +453,46 @@ out: return status; } +/* Check to see if the local alloc window is within ac->ac_max_block */ +static int ocfs2_local_alloc_in_range(struct inode *inode, + struct ocfs2_alloc_context *ac, + u32 bits_wanted) +{ + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); + struct ocfs2_dinode *alloc; + struct ocfs2_local_alloc *la; + int start; + u64 block_off; + + if (!ac->ac_max_block) + return 1; + + alloc = (struct ocfs2_dinode *) osb->local_alloc_bh->b_data; + la = OCFS2_LOCAL_ALLOC(alloc); + + start = ocfs2_local_alloc_find_clear_bits(osb, alloc, bits_wanted); + if (start == -1) { + mlog_errno(-ENOSPC); + return 0; + } + + /* + * Converting (bm_off + start + bits_wanted) to blocks gives us + * the blkno just past our actual allocation. This is perfect + * to compare with ac_max_block. + */ + block_off = ocfs2_clusters_to_blocks(inode->i_sb, + le32_to_cpu(la->la_bm_off) + + start + bits_wanted); + mlog(0, "Checking %llu against %llu\n", + (unsigned long long)block_off, + (unsigned long long)ac->ac_max_block); + if (block_off > ac->ac_max_block) + return 0; + + return 1; +} + /* * make sure we've got at least bitswanted contiguous bits in the * local alloc. You lose them when you drop i_mutex. @@ -524,6 +564,21 @@ int ocfs2_reserve_local_alloc_bits(struct ocfs2_super *osb, } } + if (ac->ac_max_block) + mlog(0, "Calling in_range for max block %llu\n", + (unsigned long long)ac->ac_max_block); + + if (!ocfs2_local_alloc_in_range(local_alloc_inode, ac, + bits_wanted)) { + /* + * The window is outside ac->ac_max_block. + * This errno tells the caller to keep localalloc enabled + * but to get the allocation from the main bitmap. + */ + status = -EFBIG; + goto bail; + } + ac->ac_inode = local_alloc_inode; /* We should never use localalloc from another slot */ ac->ac_alloc_slot = osb->slot_num; diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c index d2d278f..65e2bd5 100644 --- a/fs/ocfs2/suballoc.c +++ b/fs/ocfs2/suballoc.c @@ -62,15 +62,18 @@ static int ocfs2_block_group_fill(handle_t *handle, struct ocfs2_chain_list *cl); static int ocfs2_block_group_alloc(struct ocfs2_super *osb, struct inode *alloc_inode, - struct buffer_head *bh); + struct buffer_head *bh, + u64 max_block); static int ocfs2_cluster_group_search(struct inode *inode, struct buffer_head *group_bh, u32 bits_wanted, u32 min_bits, + u64 max_block, u16 *bit_off, u16 *bits_found); static int ocfs2_block_group_search(struct inode *inode, struct buffer_head *group_bh, u32 bits_wanted, u32 min_bits, + u64 max_block, u16 *bit_off, u16 *bits_found); static int ocfs2_claim_suballoc_bits(struct ocfs2_super *osb, struct ocfs2_alloc_context *ac, @@ -110,6 +113,9 @@ static inline void ocfs2_block_to_cluster_group(struct inode *inode, u64 data_blkno, u64 *bg_blkno, u16 *bg_bit_off); +static int ocfs2_reserve_clusters_with_limit(struct ocfs2_super *osb, + u32 bits_wanted, u64 max_block, + struct ocfs2_alloc_context **ac); static void ocfs2_free_ac_resource(struct ocfs2_alloc_context *ac) { @@ -276,7 +282,8 @@ static inline u16 ocfs2_find_smallest_chain(struct ocfs2_chain_list *cl) */ static int ocfs2_block_group_alloc(struct ocfs2_super *osb, struct inode *alloc_inode, - struct buffer_head *bh) + struct buffer_head *bh, + u64 max_block) { int status, credits; struct ocfs2_dinode *fe = (struct ocfs2_dinode *) bh->b_data; @@ -294,9 +301,9 @@ static int ocfs2_block_group_alloc(struct ocfs2_super *osb, mlog_entry_void(); cl = &fe->id2.i_chain; - status = ocfs2_reserve_clusters(osb, - le16_to_cpu(cl->cl_cpg), - &ac); + status = ocfs2_reserve_clusters_with_limit(osb, + le16_to_cpu(cl->cl_cpg), + max_block, &ac); if (status < 0) { if (status != -ENOSPC) mlog_errno(status); @@ -469,7 +476,8 @@ static int ocfs2_reserve_suballoc_bits(struct ocfs2_super *osb, goto bail; } - status = ocfs2_block_group_alloc(osb, alloc_inode, bh); + status = ocfs2_block_group_alloc(osb, alloc_inode, bh, + ac->ac_max_block); if (status < 0) { if (status != -ENOSPC) mlog_errno(status); @@ -582,6 +590,13 @@ int ocfs2_reserve_new_inode(struct ocfs2_super *osb, (*ac)->ac_group_search = ocfs2_block_group_search; /* + * stat(2) can't handle i_ino > 32bits, so we tell the + * lower levels not to allocate us a block group past that + * limit. + */ + (*ac)->ac_max_block = (u32)~0U; + + /* * slot is set when we successfully steal inode from other nodes. * It is reset in 3 places: * 1. when we flush the truncate log @@ -661,9 +676,9 @@ bail: /* Callers don't need to care which bitmap (local alloc or main) to * use so we figure it out for them, but unfortunately this clutters * things a bit. */ -int ocfs2_reserve_clusters(struct ocfs2_super *osb, - u32 bits_wanted, - struct ocfs2_alloc_context **ac) +static int ocfs2_reserve_clusters_with_limit(struct ocfs2_super *osb, + u32 bits_wanted, u64 max_block, + struct ocfs2_alloc_context **ac) { int status; @@ -677,16 +692,14 @@ int ocfs2_reserve_clusters(struct ocfs2_super *osb, } (*ac)->ac_bits_wanted = bits_wanted; + (*ac)->ac_max_block = max_block; status = -ENOSPC; if (ocfs2_alloc_should_use_local(osb, bits_wanted)) { status = ocfs2_reserve_local_alloc_bits(osb, bits_wanted, *ac); - if ((status < 0) && (status != -ENOSPC)) { - mlog_errno(status); - goto bail; - } else if (status == -ENOSPC) { + if (status == -ENOSPC) { /* reserve_local_bits will return enospc with * the local alloc inode still locked, so we * can change this safely here. */ @@ -695,6 +708,14 @@ int ocfs2_reserve_clusters(struct ocfs2_super *osb, * can clean up what's left of the local * allocation */ osb->local_alloc_state = OCFS2_LA_DISABLED; + } else if (status == -EFBIG) { + /* The local alloc window is outside ac_max_block. + * use the main bitmap, but don't disable + * local alloc. */ + status = -ENOSPC; + } else if (status < 0) { + mlog_errno(status); + goto bail; } } @@ -718,6 +739,13 @@ bail: return status; } +int ocfs2_reserve_clusters(struct ocfs2_super *osb, + u32 bits_wanted, + struct ocfs2_alloc_context **ac) +{ + return ocfs2_reserve_clusters_with_limit(osb, bits_wanted, 0, ac); +} + /* * More or less lifted from ext3. I'll leave their description below: * @@ -1000,10 +1028,12 @@ static inline int ocfs2_block_group_reasonably_empty(struct ocfs2_group_desc *bg static int ocfs2_cluster_group_search(struct inode *inode, struct buffer_head *group_bh, u32 bits_wanted, u32 min_bits, + u64 max_block, u16 *bit_off, u16 *bits_found) { int search = -ENOSPC; int ret; + u64 blkoff; struct ocfs2_group_desc *gd = (struct ocfs2_group_desc *) group_bh->b_data; u16 tmp_off, tmp_found; unsigned int max_bits, gd_cluster_off; @@ -1037,6 +1067,17 @@ static int ocfs2_cluster_group_search(struct inode *inode, if (ret) return ret; + if (max_block) { + blkoff = ocfs2_clusters_to_blocks(inode->i_sb, + gd_cluster_off + + tmp_off + tmp_found); + mlog(ML_NOTICE, "Checking %llu against %llu\n", + (unsigned long long)blkoff, + (unsigned long long)max_block); + if (blkoff > max_block) + return -ENOSPC; + } + /* ocfs2_block_group_find_clear_bits() might * return success, but we still want to return * -ENOSPC unless it found the minimum number @@ -1054,19 +1095,31 @@ static int ocfs2_cluster_group_search(struct inode *inode, static int ocfs2_block_group_search(struct inode *inode, struct buffer_head *group_bh, u32 bits_wanted, u32 min_bits, + u64 max_block, u16 *bit_off, u16 *bits_found) { int ret = -ENOSPC; + u64 blkoff; struct ocfs2_group_desc *bg = (struct ocfs2_group_desc *) group_bh->b_data; BUG_ON(min_bits != 1); BUG_ON(ocfs2_is_cluster_bitmap(inode)); - if (bg->bg_free_bits_count) + if (bg->bg_free_bits_count) { ret = ocfs2_block_group_find_clear_bits(OCFS2_SB(inode->i_sb), group_bh, bits_wanted, le16_to_cpu(bg->bg_bits), bit_off, bits_found); + if (!ret && max_block) { + blkoff = le64_to_cpu(bg->bg_blkno) + *bit_off + + *bits_found; + mlog(0, "Checking %llu against %llu\n", + (unsigned long long)blkoff, + (unsigned long long)max_block); + if (blkoff > max_block) + ret = -ENOSPC; + } + } return ret; } @@ -1131,7 +1184,7 @@ static int ocfs2_search_one_group(struct ocfs2_alloc_context *ac, } ret = ac->ac_group_search(alloc_inode, group_bh, bits_wanted, min_bits, - bit_off, &found); + ac->ac_max_block, bit_off, &found); if (ret < 0) { if (ret != -ENOSPC) mlog_errno(ret); @@ -1204,7 +1257,8 @@ static int ocfs2_search_chain(struct ocfs2_alloc_context *ac, /* for now, the chain search is a bit simplistic. We just use * the 1st group with any empty bits. */ while ((status = ac->ac_group_search(alloc_inode, group_bh, - bits_wanted, min_bits, bit_off, + bits_wanted, min_bits, + ac->ac_max_block, bit_off, &tmp_bits)) == -ENOSPC) { if (!bg->bg_next_group) break; diff --git a/fs/ocfs2/suballoc.h b/fs/ocfs2/suballoc.h index 544c600..2e9c032 100644 --- a/fs/ocfs2/suballoc.h +++ b/fs/ocfs2/suballoc.h @@ -28,10 +28,11 @@ typedef int (group_search_t)(struct inode *, struct buffer_head *, - u32, - u32, - u16 *, - u16 *); + u32, /* bits_wanted */ + u32, /* min_bits */ + u64, /* max_block */ + u16 *, /* *bit_off */ + u16 *); /* *bits_found */ struct ocfs2_alloc_context { struct inode *ac_inode; /* which bitmap are we allocating from? */ @@ -51,6 +52,8 @@ struct ocfs2_alloc_context { group_search_t *ac_group_search; u64 ac_last_group; + u64 ac_max_block; /* Highest block number to allocate. 0 is + is the same as ~0 - unlimited */ }; void ocfs2_free_alloc_context(struct ocfs2_alloc_context *ac); -- 1.5.6.3
Joel Becker
2008-Sep-04 03:03 UTC
[Ocfs2-devel] [PATCH 2/3] ocfs2: Add the 'inode64' mount option.
Now that ocfs2 limits inode numbers to 32bits, add a mount option to disable the limit. This parallels XFS. 64bit systems can handle the larger inode numbers. Signed-off-by: Joel Becker <joel.becker at oracle.com> --- fs/ocfs2/ocfs2.h | 1 + fs/ocfs2/suballoc.c | 5 +++-- fs/ocfs2/super.c | 17 +++++++++++++++++ 3 files changed, 21 insertions(+), 2 deletions(-) diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index 7f625f2..ca89e65 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -184,6 +184,7 @@ enum ocfs2_mount_options OCFS2_MOUNT_ERRORS_PANIC = 1 << 3, /* Panic on errors */ OCFS2_MOUNT_DATA_WRITEBACK = 1 << 4, /* No data ordering */ OCFS2_MOUNT_LOCALFLOCKS = 1 << 5, /* No cluster aware user file locks */ + OCFS2_MOUNT_INODE64 = 1 << 6, /* Allow inode numbers > 2^32 */ }; #define OCFS2_OSB_SOFT_RO 0x0001 diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c index 65e2bd5..b783806 100644 --- a/fs/ocfs2/suballoc.c +++ b/fs/ocfs2/suballoc.c @@ -592,9 +592,10 @@ int ocfs2_reserve_new_inode(struct ocfs2_super *osb, /* * stat(2) can't handle i_ino > 32bits, so we tell the * lower levels not to allocate us a block group past that - * limit. + * limit. The 'inode64' mount option avoids this behavior. */ - (*ac)->ac_max_block = (u32)~0U; + if (!(osb->s_mount_opt & OCFS2_MOUNT_INODE64)) + (*ac)->ac_max_block = (u32)~0U; /* * slot is set when we successfully steal inode from other nodes. diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 88255d3..260da3a 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -154,6 +154,7 @@ enum { Opt_localalloc, Opt_localflocks, Opt_stack, + Opt_inode64, Opt_err, }; @@ -173,6 +174,7 @@ static match_table_t tokens = { {Opt_localalloc, "localalloc=%d"}, {Opt_localflocks, "localflocks"}, {Opt_stack, "cluster_stack=%s"}, + {Opt_inode64, "inode64"}, {Opt_err, NULL} }; @@ -406,6 +408,15 @@ static int ocfs2_remount(struct super_block *sb, int *flags, char *data) goto out; } + /* Probably don't want this on remount; it might + * mess with other nodes */ + if (!(osb->s_mount_opt & OCFS2_MOUNT_INODE64) && + (parsed_options.mount_opt & OCFS2_MOUNT_INODE64)) { + ret = -EINVAL; + mlog(ML_ERROR, "Cannot enable inode64 on remount\n"); + goto out; + } + /* We're going to/from readonly mode. */ if ((*flags & MS_RDONLY) != (sb->s_flags & MS_RDONLY)) { /* Lock here so the check of HARD_RO and the potential @@ -918,6 +929,9 @@ static int ocfs2_parse_options(struct super_block *sb, OCFS2_STACK_LABEL_LEN); mopt->cluster_stack[OCFS2_STACK_LABEL_LEN] = '\0'; break; + case Opt_inode64: + mopt->mount_opt |= OCFS2_MOUNT_INODE64; + break; default: mlog(ML_ERROR, "Unrecognized mount option \"%s\" " @@ -980,6 +994,9 @@ static int ocfs2_show_options(struct seq_file *s, struct vfsmount *mnt) seq_printf(s, ",cluster_stack=%.*s", OCFS2_STACK_LABEL_LEN, osb->osb_cluster_stack); + if (opts & OCFS2_MOUNT_INODE64) + seq_printf(s, ",inode64"); + return 0; } -- 1.5.6.3
ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is limiting our maximum filesystem size. It's a pretty trivial change. Most functions are just renamed. The only functional change is moving to Jan's inode-based ordered data mode. It's better, too. Because JBD2 reads and writes JBD journals, this is compatible with any existing filesystem. It can even interact with JBD-based ocfs2 as long as the journal is formated for JBD. We provide a compatibility option so that paranoid people can still use JBD for the time being. This will go away shortly. Signed-off-by: Joel Becker <joel.becker at oracle.com> --- fs/Kconfig | 40 +++++++++++++-------- fs/ocfs2/alloc.c | 28 ++++++--------- fs/ocfs2/aops.c | 21 ++++++++--- fs/ocfs2/file.c | 14 +++++-- fs/ocfs2/inode.c | 5 +++ fs/ocfs2/inode.h | 1 + fs/ocfs2/journal.c | 72 ++++++++++++++++++++------------------ fs/ocfs2/journal.h | 25 +++++++++++-- fs/ocfs2/ocfs2.h | 7 +++- fs/ocfs2/ocfs2_jbd_compat.h | 82 +++++++++++++++++++++++++++++++++++++++++++ fs/ocfs2/super.c | 10 +++-- fs/ocfs2/uptodate.c | 6 +++- 12 files changed, 227 insertions(+), 84 deletions(-) create mode 100644 fs/ocfs2/ocfs2_jbd_compat.h diff --git a/fs/Kconfig b/fs/Kconfig index abccb5d..e651a36 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -206,17 +206,16 @@ config JBD tristate help This is a generic journalling layer for block devices. It is - currently used by the ext3 and OCFS2 file systems, but it could - also be used to add journal support to other file systems or block + currently used by the ext3 file system, but it could also be + used to add journal support to other file systems or block devices such as RAID or LVM. - If you are using the ext3 or OCFS2 file systems, you need to - say Y here. If you are not using ext3 OCFS2 then you will probably - want to say N. + If you are using the ext3 file system, you need to say Y here. + If you are not using ext3 then you will probably want to say N. To compile this device as a module, choose M here: the module will be - called jbd. If you are compiling ext3 or OCFS2 into the kernel, - you cannot compile this code as a module. + called jbd. If you are compiling ext3 into the kernel, you + cannot compile this code as a module. config JBD_DEBUG bool "JBD (ext3) debugging support" @@ -240,16 +239,17 @@ config JBD2 help This is a generic journaling layer for block devices that support both 32-bit and 64-bit block numbers. It is currently used by - the ext4dev/ext4 filesystem, but it could also be used to add - journal support to other file systems or block devices such - as RAID or LVM. + the ext4dev/ext4 and OCFS2 filesystems, but it could also be + used to add journal support to other file systems or block + devices such as RAID or LVM. - If you are using ext4dev/ext4, you need to say Y here. If you are not - using ext4dev/ext4 then you will probably want to say N. + If you are using ext4dev/ext4 or OCFS2, you need to say Y here. + If you are not using ext4dev/ext4 or OCFS2 then you will + probably want to say N. To compile this device as a module, choose M here. The module will be - called jbd2. If you are compiling ext4dev/ext4 into the kernel, - you cannot compile this code as a module. + called jbd2. If you are compiling ext4dev/ext4 or OCFS2 into the + kernel, you cannot compile this code as a module. config JBD2_DEBUG bool "JBD2 (ext4dev/ext4) debugging support" @@ -426,7 +426,7 @@ config OCFS2_FS tristate "OCFS2 file system support" depends on NET && SYSFS select CONFIGFS_FS - select JBD + select JBD2 select CRC32 help OCFS2 is a general purpose extent based shared disk cluster file @@ -497,6 +497,16 @@ config OCFS2_DEBUG_FS this option for debugging only as it is likely to decrease performance of the filesystem. +config OCFS2_COMPAT_JBD + bool "Use JBD for compatibility" + depends on OCFS2_FS + default n + select JBD + help + The ocfs2 filesystem now uses JBD2 for its journalling. JBD2 + is backwards compatible with JBD. It is safe to say N here. + However, if you really want to use the original JBD, say Y here. + endif # BLOCK config DNOTIFY diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index 10bfb46..134c56d 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -6008,20 +6008,13 @@ bail: return status; } -static int ocfs2_writeback_zero_func(handle_t *handle, struct buffer_head *bh) +static int ocfs2_zero_func(handle_t *handle, struct buffer_head *bh) { set_buffer_uptodate(bh); mark_buffer_dirty(bh); return 0; } -static int ocfs2_ordered_zero_func(handle_t *handle, struct buffer_head *bh) -{ - set_buffer_uptodate(bh); - mark_buffer_dirty(bh); - return ocfs2_journal_dirty_data(handle, bh); -} - static void ocfs2_map_and_dirty_page(struct inode *inode, handle_t *handle, unsigned int from, unsigned int to, struct page *page, int zero, u64 *phys) @@ -6040,17 +6033,18 @@ static void ocfs2_map_and_dirty_page(struct inode *inode, handle_t *handle, * here if they aren't - ocfs2_map_page_blocks() * might've skipped some */ - if (ocfs2_should_order_data(inode)) { - ret = walk_page_buffers(handle, - page_buffers(page), - from, to, &partial, - ocfs2_ordered_zero_func); - if (ret < 0) - mlog_errno(ret); - } else { + ret = walk_page_buffers(handle, page_buffers(page), + from, to, &partial, + ocfs2_zero_func); + if (ret < 0) + mlog_errno(ret); + else if (ocfs2_should_order_data(inode)) { + ret = ocfs2_jbd2_file_inode(handle, inode); +#ifdef CONFIG_OCFS2_COMPAT_JBD ret = walk_page_buffers(handle, page_buffers(page), from, to, &partial, - ocfs2_writeback_zero_func); + ocfs2_journal_dirty_data); +#endif if (ret < 0) mlog_errno(ret); } diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 506c24f..28bb7f0 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -485,11 +485,14 @@ handle_t *ocfs2_start_walk_page_trans(struct inode *inode, } if (ocfs2_should_order_data(inode)) { + ret = ocfs2_jbd2_file_inode(handle, inode); +#ifdef CONFIG_OCFS2_COMPAT_JBD ret = walk_page_buffers(handle, page_buffers(page), from, to, NULL, ocfs2_journal_dirty_data); - if (ret < 0) +#endif + if (ret < 0) mlog_errno(ret); } out: @@ -669,7 +672,7 @@ static void ocfs2_invalidatepage(struct page *page, unsigned long offset) { journal_t *journal = OCFS2_SB(page->mapping->host->i_sb)->journal->j_journal; - journal_invalidatepage(journal, page, offset); + jbd2_journal_invalidatepage(journal, page, offset); } static int ocfs2_releasepage(struct page *page, gfp_t wait) @@ -678,7 +681,7 @@ static int ocfs2_releasepage(struct page *page, gfp_t wait) if (!page_has_buffers(page)) return 0; - return journal_try_to_free_buffers(journal, page, wait); + return jbd2_journal_try_to_free_buffers(journal, page, wait); } static ssize_t ocfs2_direct_IO(int rw, @@ -1074,11 +1077,15 @@ static void ocfs2_write_failure(struct inode *inode, tmppage = wc->w_pages[i]; if (page_has_buffers(tmppage)) { - if (ocfs2_should_order_data(inode)) + if (ocfs2_should_order_data(inode)) { + ocfs2_jbd2_file_inode(wc->w_handle, inode); +#ifdef CONFIG_OCFS2_COMPAT_JBD walk_page_buffers(wc->w_handle, page_buffers(tmppage), from, to, NULL, ocfs2_journal_dirty_data); +#endif + } block_commit_write(tmppage, from, to); } @@ -1905,11 +1912,15 @@ int ocfs2_write_end_nolock(struct address_space *mapping, } if (page_has_buffers(tmppage)) { - if (ocfs2_should_order_data(inode)) + if (ocfs2_should_order_data(inode)) { + ocfs2_jbd2_file_inode(wc->w_handle, inode); +#ifdef CONFIG_OCFS2_COMPAT_JBD walk_page_buffers(wc->w_handle, page_buffers(tmppage), from, to, NULL, ocfs2_journal_dirty_data); +#endif + } block_commit_write(tmppage, from, to); } } diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index ec2ed15..2f2217a 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -184,7 +184,7 @@ static int ocfs2_sync_file(struct file *file, goto bail; journal = osb->journal->j_journal; - err = journal_force_commit(journal); + err = jbd2_journal_force_commit(journal); bail: mlog_exit(err); @@ -1096,9 +1096,15 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr *attr) goto bail_unlock; } - if (i_size_read(inode) > attr->ia_size) + if (i_size_read(inode) > attr->ia_size) { + if (ocfs2_should_order_data(inode)) { + status = ocfs2_begin_ordered_truncate(inode, + attr->ia_size); + if (status) + goto bail_unlock; + } status = ocfs2_truncate_file(inode, bh, attr->ia_size); - else + } else status = ocfs2_extend_file(inode, bh, attr->ia_size); if (status < 0) { if (status != -ENOSPC) @@ -2040,7 +2046,7 @@ out_dio: */ if (old_size != i_size_read(inode) || old_clusters != OCFS2_I(inode)->ip_clusters) { - ret = journal_force_commit(osb->journal->j_journal); + ret = jbd2_journal_force_commit(osb->journal->j_journal); if (ret < 0) written = ret; } diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c index 7e9e4c7..ce8ec09 100644 --- a/fs/ocfs2/inode.c +++ b/fs/ocfs2/inode.c @@ -909,6 +909,9 @@ void ocfs2_delete_inode(struct inode *inode) goto bail; } + if (ocfs2_should_order_data(inode)) + ocfs2_begin_ordered_truncate(inode, 0); + if (!ocfs2_inode_is_valid_to_delete(inode)) { /* It's probably not necessary to truncate_inode_pages * here but we do it for safety anyway (it will most @@ -1081,6 +1084,8 @@ void ocfs2_clear_inode(struct inode *inode) oi->ip_last_trans = 0; oi->ip_dir_start_lookup = 0; oi->ip_blkno = 0ULL; + jbd2_journal_release_jbd_inode(OCFS2_SB(inode->i_sb)->journal->j_journal, + &oi->ip_jinode); bail: mlog_exit_void(); diff --git a/fs/ocfs2/inode.h b/fs/ocfs2/inode.h index 390a855..3b2eb2a 100644 --- a/fs/ocfs2/inode.h +++ b/fs/ocfs2/inode.h @@ -68,6 +68,7 @@ struct ocfs2_inode_info struct ocfs2_extent_map ip_extent_map; struct inode vfs_inode; + struct jbd2_inode ip_jinode; }; /* diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index c47bc2a..373d943 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -215,9 +215,9 @@ static int ocfs2_commit_cache(struct ocfs2_super *osb) goto finally; } - journal_lock_updates(journal->j_journal); - status = journal_flush(journal->j_journal); - journal_unlock_updates(journal->j_journal); + jbd2_journal_lock_updates(journal->j_journal); + status = jbd2_journal_flush(journal->j_journal); + jbd2_journal_unlock_updates(journal->j_journal); if (status < 0) { up_write(&journal->j_trans_barrier); mlog_errno(status); @@ -264,7 +264,7 @@ handle_t *ocfs2_start_trans(struct ocfs2_super *osb, int max_buffs) down_read(&osb->journal->j_trans_barrier); - handle = journal_start(journal, max_buffs); + handle = jbd2_journal_start(journal, max_buffs); if (IS_ERR(handle)) { up_read(&osb->journal->j_trans_barrier); @@ -290,7 +290,7 @@ int ocfs2_commit_trans(struct ocfs2_super *osb, BUG_ON(!handle); - ret = journal_stop(handle); + ret = jbd2_journal_stop(handle); if (ret < 0) mlog_errno(ret); @@ -304,7 +304,7 @@ int ocfs2_commit_trans(struct ocfs2_super *osb, * transaction. extend_trans will either extend the current handle by * nblocks, or commit it and start a new one with nblocks credits. * - * This might call journal_restart() which will commit dirty buffers + * This might call jbd2_journal_restart() which will commit dirty buffers * and then restart the transaction. Before calling * ocfs2_extend_trans(), any changed blocks should have been * dirtied. After calling it, all blocks which need to be changed must @@ -332,7 +332,7 @@ int ocfs2_extend_trans(handle_t *handle, int nblocks) #ifdef CONFIG_OCFS2_DEBUG_FS status = 1; #else - status = journal_extend(handle, nblocks); + status = jbd2_journal_extend(handle, nblocks); if (status < 0) { mlog_errno(status); goto bail; @@ -340,8 +340,10 @@ int ocfs2_extend_trans(handle_t *handle, int nblocks) #endif if (status > 0) { - mlog(0, "journal_extend failed, trying journal_restart\n"); - status = journal_restart(handle, nblocks); + mlog(0, + "jbd2_journal_extend failed, trying " + "jbd2_journal_restart\n"); + status = jbd2_journal_restart(handle, nblocks); if (status < 0) { mlog_errno(status); goto bail; @@ -393,11 +395,11 @@ int ocfs2_journal_access(handle_t *handle, switch (type) { case OCFS2_JOURNAL_ACCESS_CREATE: case OCFS2_JOURNAL_ACCESS_WRITE: - status = journal_get_write_access(handle, bh); + status = jbd2_journal_get_write_access(handle, bh); break; case OCFS2_JOURNAL_ACCESS_UNDO: - status = journal_get_undo_access(handle, bh); + status = jbd2_journal_get_undo_access(handle, bh); break; default: @@ -422,7 +424,7 @@ int ocfs2_journal_dirty(handle_t *handle, mlog_entry("(bh->b_blocknr=%llu)\n", (unsigned long long)bh->b_blocknr); - status = journal_dirty_metadata(handle, bh); + status = jbd2_journal_dirty_metadata(handle, bh); if (status < 0) mlog(ML_ERROR, "Could not dirty metadata buffer. " "(bh->b_blocknr=%llu)\n", @@ -432,6 +434,7 @@ int ocfs2_journal_dirty(handle_t *handle, return status; } +#ifdef CONFIG_OCFS2_COMPAT_JBD int ocfs2_journal_dirty_data(handle_t *handle, struct buffer_head *bh) { @@ -443,8 +446,9 @@ int ocfs2_journal_dirty_data(handle_t *handle, return err; } +#endif -#define OCFS2_DEFAULT_COMMIT_INTERVAL (HZ * JBD_DEFAULT_MAX_COMMIT_AGE) +#define OCFS2_DEFAULT_COMMIT_INTERVAL (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE) void ocfs2_set_journal_params(struct ocfs2_super *osb) { @@ -457,9 +461,9 @@ void ocfs2_set_journal_params(struct ocfs2_super *osb) spin_lock(&journal->j_state_lock); journal->j_commit_interval = commit_interval; if (osb->s_mount_opt & OCFS2_MOUNT_BARRIER) - journal->j_flags |= JFS_BARRIER; + journal->j_flags |= JBD2_BARRIER; else - journal->j_flags &= ~JFS_BARRIER; + journal->j_flags &= ~JBD2_BARRIER; spin_unlock(&journal->j_state_lock); } @@ -524,14 +528,14 @@ int ocfs2_journal_init(struct ocfs2_journal *journal, int *dirty) mlog(0, "inode->ip_clusters = %u\n", OCFS2_I(inode)->ip_clusters); /* call the kernels journal init function now */ - j_journal = journal_init_inode(inode); + j_journal = jbd2_journal_init_inode(inode); if (j_journal == NULL) { mlog(ML_ERROR, "Linux journal layer error\n"); status = -EINVAL; goto done; } - mlog(0, "Returned from journal_init_inode\n"); + mlog(0, "Returned from jbd2_journal_init_inode\n"); mlog(0, "j_journal->j_maxlen = %u\n", j_journal->j_maxlen); *dirty = (le32_to_cpu(di->id1.journal1.ij_flags) & @@ -639,7 +643,7 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb) if (journal->j_state != OCFS2_JOURNAL_LOADED) goto done; - /* need to inc inode use count as journal_destroy will iput. */ + /* need to inc inode use count - jbd2_journal_destroy will iput. */ if (!igrab(inode)) BUG(); @@ -668,9 +672,9 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb) BUG_ON(atomic_read(&(osb->journal->j_num_trans)) != 0); if (ocfs2_mount_local(osb)) { - journal_lock_updates(journal->j_journal); - status = journal_flush(journal->j_journal); - journal_unlock_updates(journal->j_journal); + jbd2_journal_lock_updates(journal->j_journal); + status = jbd2_journal_flush(journal->j_journal); + jbd2_journal_unlock_updates(journal->j_journal); if (status < 0) mlog_errno(status); } @@ -686,7 +690,7 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb) } /* Shutdown the kernel journal system */ - journal_destroy(journal->j_journal); + jbd2_journal_destroy(journal->j_journal); OCFS2_I(inode)->ip_open_count--; @@ -711,15 +715,15 @@ static void ocfs2_clear_journal_error(struct super_block *sb, { int olderr; - olderr = journal_errno(journal); + olderr = jbd2_journal_errno(journal); if (olderr) { mlog(ML_ERROR, "File system error %d recorded in " "journal %u.\n", olderr, slot); mlog(ML_ERROR, "File system on device %s needs checking.\n", sb->s_id); - journal_ack_err(journal); - journal_clear_err(journal); + jbd2_journal_ack_err(journal); + jbd2_journal_clear_err(journal); } } @@ -734,7 +738,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replayed) osb = journal->j_osb; - status = journal_load(journal->j_journal); + status = jbd2_journal_load(journal->j_journal); if (status < 0) { mlog(ML_ERROR, "Failed to load journal!\n"); goto done; @@ -778,7 +782,7 @@ int ocfs2_journal_wipe(struct ocfs2_journal *journal, int full) BUG_ON(!journal); - status = journal_wipe(journal->j_journal, full); + status = jbd2_journal_wipe(journal->j_journal, full); if (status < 0) { mlog_errno(status); goto bail; @@ -1229,19 +1233,19 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, } mlog(0, "calling journal_init_inode\n"); - journal = journal_init_inode(inode); + journal = jbd2_journal_init_inode(inode); if (journal == NULL) { mlog(ML_ERROR, "Linux journal layer error\n"); status = -EIO; goto done; } - status = journal_load(journal); + status = jbd2_journal_load(journal); if (status < 0) { mlog_errno(status); if (!igrab(inode)) BUG(); - journal_destroy(journal); + jbd2_journal_destroy(journal); goto done; } @@ -1249,9 +1253,9 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, /* wipe the journal */ mlog(0, "flushing the journal.\n"); - journal_lock_updates(journal); - status = journal_flush(journal); - journal_unlock_updates(journal); + jbd2_journal_lock_updates(journal); + status = jbd2_journal_flush(journal); + jbd2_journal_unlock_updates(journal); if (status < 0) mlog_errno(status); @@ -1272,7 +1276,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, if (!igrab(inode)) BUG(); - journal_destroy(journal); + jbd2_journal_destroy(journal); done: /* drop the lock on this nodes journal */ diff --git a/fs/ocfs2/journal.h b/fs/ocfs2/journal.h index 2178ebf..401cc65 100644 --- a/fs/ocfs2/journal.h +++ b/fs/ocfs2/journal.h @@ -27,7 +27,12 @@ #define OCFS2_JOURNAL_H #include <linux/fs.h> -#include <linux/jbd.h> +#ifndef CONFIG_OCFS2_COMPAT_JBD +# include <linux/jbd2.h> +#else +# include <linux/jbd.h> +# include "ocfs2_jbd_compat.h" +#endif enum ocfs2_journal_state { OCFS2_JOURNAL_FREE = 0, @@ -215,8 +220,8 @@ static inline void ocfs2_checkpoint_inode(struct inode *inode) * buffer. Will have to call ocfs2_journal_dirty once * we've actually dirtied it. Type is one of . or . * ocfs2_journal_dirty - Mark a journalled buffer as having dirty data. - * ocfs2_journal_dirty_data - Indicate that a data buffer should go out before - * the current handle commits. + * ocfs2_jbd2_file_inode - Mark an inode so that its data goes out before + * the current handle commits. */ /* You must always start_trans with a number of buffs > 0, but it's @@ -268,8 +273,10 @@ int ocfs2_journal_access(handle_t *handle, */ int ocfs2_journal_dirty(handle_t *handle, struct buffer_head *bh); +#ifdef CONFIG_OCFS2_COMPAT_JBD int ocfs2_journal_dirty_data(handle_t *handle, struct buffer_head *bh); +#endif /* * Credit Macros: @@ -415,4 +422,16 @@ static inline int ocfs2_calc_tree_trunc_credits(struct super_block *sb, return credits; } +static inline int ocfs2_jbd2_file_inode(handle_t *handle, struct inode *inode) +{ + return jbd2_journal_file_inode(handle, &OCFS2_I(inode)->ip_jinode); +} + +static inline int ocfs2_begin_ordered_truncate(struct inode *inode, + loff_t new_size) +{ + return jbd2_journal_begin_ordered_truncate(&OCFS2_I(inode)->ip_jinode, + new_size); +} + #endif /* OCFS2_JOURNAL_H */ diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index ca89e65..2b13519 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -34,7 +34,12 @@ #include <linux/workqueue.h> #include <linux/kref.h> #include <linux/mutex.h> -#include <linux/jbd.h> +#ifndef CONFIG_OCFS2_COMPAT_JBD +# include <linux/jbd2.h> +#else +# include <linux/jbd.h> +# include "ocfs2_jbd_compat.h" +#endif /* For union ocfs2_dlm_lksb */ #include "stackglue.h" diff --git a/fs/ocfs2/ocfs2_jbd_compat.h b/fs/ocfs2/ocfs2_jbd_compat.h new file mode 100644 index 0000000..b91c78f --- /dev/null +++ b/fs/ocfs2/ocfs2_jbd_compat.h @@ -0,0 +1,82 @@ +/* -*- mode: c; c-basic-offset: 8; -*- + * vim: noexpandtab sw=8 ts=8 sts=0: + * + * ocfs2_jbd_compat.h + * + * Compatibility defines for JBD. + * + * Copyright (C) 2008 Oracle. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License version 2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ + +#ifndef OCFS2_JBD_COMPAT_H +#define OCFS2_JBD_COMPAT_H + +#ifndef CONFIG_OCFS2_COMPAT_JBD +# error Should not have been included +#endif + +struct jbd2_inode { + unsigned int dummy; +}; + +#define JBD2_BARRIER JFS_BARRIER +#define JBD2_DEFAULT_MAX_COMMIT_AGE JBD_DEFAULT_MAX_COMMIT_AGE + +#define jbd2_journal_ack_err journal_ack_err +#define jbd2_journal_clear_err journal_clear_err +#define jbd2_journal_destroy journal_destroy +#define jbd2_journal_dirty_metadata journal_dirty_metadata +#define jbd2_journal_errno journal_errno +#define jbd2_journal_extend journal_extend +#define jbd2_journal_flush journal_flush +#define jbd2_journal_force_commit journal_force_commit +#define jbd2_journal_get_write_access journal_get_write_access +#define jbd2_journal_get_undo_access journal_get_undo_access +#define jbd2_journal_init_inode journal_init_inode +#define jbd2_journal_invalidatepage journal_invalidatepage +#define jbd2_journal_load journal_load +#define jbd2_journal_lock_updates journal_lock_updates +#define jbd2_journal_restart journal_restart +#define jbd2_journal_start journal_start +#define jbd2_journal_start_commit journal_start_commit +#define jbd2_journal_stop journal_stop +#define jbd2_journal_try_to_free_buffers journal_try_to_free_buffers +#define jbd2_journal_unlock_updates journal_unlock_updates +#define jbd2_journal_wipe journal_wipe +#define jbd2_log_wait_commit log_wait_commit + +static inline int jbd2_journal_file_inode(handle_t *handle, + struct jbd2_inode *inode) +{ + return 0; +} + +static inline int jbd2_journal_begin_ordered_truncate(struct jbd2_inode *inode, + loff_t new_size) +{ + return 0; +} + +static inline void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, + struct inode *inode) +{ + return; +} + +static inline void jbd2_journal_release_jbd_inode(journal_t *journal, + struct jbd2_inode *jinode) +{ + return; +} + + +#endif /* OCFS2_JBD_COMPAT_H */ diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 260da3a..cf1fae5 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -207,10 +207,11 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait) ocfs2_schedule_truncate_log_flush(osb, 0); } - if (journal_start_commit(OCFS2_SB(sb)->journal->j_journal, &target)) { + if (jbd2_journal_start_commit(OCFS2_SB(sb)->journal->j_journal, + &target)) { if (wait) - log_wait_commit(OCFS2_SB(sb)->journal->j_journal, - target); + jbd2_log_wait_commit(OCFS2_SB(sb)->journal->j_journal, + target); } return 0; } @@ -327,6 +328,7 @@ static struct inode *ocfs2_alloc_inode(struct super_block *sb) if (!oi) return NULL; + jbd2_journal_init_jbd_inode(&oi->ip_jinode, &oi->vfs_inode); return &oi->vfs_inode; } @@ -884,7 +886,7 @@ static int ocfs2_parse_options(struct super_block *sb, if (option < 0) return 0; if (option == 0) - option = JBD_DEFAULT_MAX_COMMIT_AGE; + option = JBD2_DEFAULT_MAX_COMMIT_AGE; mopt->commit_interval = HZ * option; break; case Opt_localalloc: diff --git a/fs/ocfs2/uptodate.c b/fs/ocfs2/uptodate.c index 4da8851..fc3a078 100644 --- a/fs/ocfs2/uptodate.c +++ b/fs/ocfs2/uptodate.c @@ -53,7 +53,11 @@ #include <linux/highmem.h> #include <linux/buffer_head.h> #include <linux/rbtree.h> -#include <linux/jbd.h> +#ifndef CONFIG_OCFS2_COMPAT_JBD +# include <linux/jbd2.h> +#else +# include <linux/jbd.h> +#endif #define MLOG_MASK_PREFIX ML_UPTODATE -- 1.5.6.3
On Wed, Sep 03, 2008 at 08:03:38PM -0700, Joel Becker wrote:> ocfs2 currently uses the Journaled Block Device (JBD) for its > journaling. This is a very stable and tested codebase. However, JBD > is limited by architecture to 32bit block numbers. This means an ocfs2 > filesystem is limited to 2^32 blocks. With a 4K blocksize, that's 16TB. > People want larger volumes. > > Fortunately, there is now JBD2. JBD2 adds 64bit block number support > and some other features. JBD2 is backwards compatible, so it can be > used with existing filesystems. The new features won't be accessed > until the ocfs2-tools turn them on. > > Before we support JBD2, however, we need to ensure that inode blocks are > allocated below that 32bit boundary. stat(2) cannot handle larger inode > numbers on 32bit machines. With that limit in place, we can fully > support JBD2 in the kernel filesystem. Like XFS, we provide the > 'inode64' option to remove that limit. > > A config option is provided (OCFS2_COMPAT_JBD) to force ocfs2 to use the > old JBD. This is for the paranoid while we test JBD2. > > The kernel code is available on the 'jbd2' branch of my git repository. > > View: > http://oss.oracle.com/git/?p=jlbec/linux-2.6.git;a=shortlog;h=jbd2 > Pull: > git pull git://oss.oracle.com/git/jlbec/linux-2.6.git jbd2Ok, these look great. A couple of notes: - I removed an ML_NOTICE in the 1st patch - I added information to Documentation/filesystems/ocfs2.txt in the 2nd patch - Unsurprisingly, the last patch conflicted with lots of things: CONFLICT (content): Merge conflict in fs/ocfs2/ocfs2.h CONFLICT (content): Merge conflict in fs/ocfs2/suballoc.c CONFLICT (content): Merge conflict in fs/ocfs2/super.c They were fairly easy to fix - please double check my work. I created a branch 'feature_merge' in which I pulled in xattrs, dyn_la, and then jbd2 (purposfully keeping it on top). I'll just pull that into linux-next for now, since linux-next is throwaway and I don't want to have to re-merge every time I create it. --Mark -- Mark Fasheh
Apparently Analagous Threads
- [PATCH 1/3] ocfs2: Optimize inode allocation by remembering last group.
- [PATCH 0/3] ocfs2: Inode Allocation Strategy Improvement.v2
- [PATCH 0/3] ocfs2-1.4: Backport inode alloc from mainline.
- [RFC] metadata alloc fix in machines which has PAGE_SIZE > CLUSTER_SIZE
- [git patches] Ocfs2 updates for 2.6.30