Hi, when running fsstress test on an almost full filesystem we observe the following errors: 1163.522931] (4774,1):ocfs2_query_inode_wipe:898 ERROR: bug expression: !(di->i_flags & cpu_to_le32(OCFS2_ORPHANED_FL)) [ 1163.522938] (4774,1):ocfs2_query_inode_wipe:898 ERROR: Inode 77233 (on-disk 77233) not orphaned! Disk flags 0x1, inode flags 0x0 This is caused by the fact that we succeed in allocating inode in ocfs2_mknod_locked but later fail to allocate block for symlink data or directory data because of ENOSPC. So we set i_nlink to 0 and by doing iput() we continue through standard inode deletion path but the inode is not orphaned and thus the error check is triggered. Now this isn't trivial to fix (at least AFAICS) so I wanted to share my thoughts before investing too much time in writing the patch which would be then rejected. The easiest solution would be to always create inodes in the orphan directory (we even have a function ocfs2_create_inode_in_orphan for this). The downside this has would be that I expect we would start contending on orphan dir i_mutex quite early and thus fs scalability would suffer a lot. Also there's some additional IO and CPU cost involved... Adding inode to orphan dir after we find out we cannot finish allocation is IMHO no-go. Because the filesystem is close to ENOSPC, we even don't have to have a block to extend orphan directory to accomodate new directory entry. Also adding to orphan directory has to happen outside of a transaction (due to lock ordering) but we have a transaction already started and cannot stop it without adding a link to the inode somewhere (otherwise we would leak the inode in case of crash). The last idea I have is that we could "undo" the inode allocation and other operations we did in the transaction so far. But looking at the code it would get nasty quickly - all the xattr handling which gets inode locks, starts & stops transactions, etc... Any other ideas? What would make things much easier would be if orphan handling was more lightweight like it is e.g. in ext3 / ext4 - there we have just linked list of orphaned inodes and so if we decide an inode needs to be orphaned, we just have to modify the superblock (orphan list head) and the inode (to point at the current orphan list head)... In OCFS2 we could have a per-slot lists like this but a change like this would probably be an overkill for the above bug so it would make sence only if there would be other benefits from this. Honza -- Jan Kara <jack at suse.cz> SUSE Labs, CR
On Tue, Apr 20, 2010 at 08:00:54PM +0200, Jan Kara wrote:> the following errors: > 1163.522931] (4774,1):ocfs2_query_inode_wipe:898 ERROR: bug expression: > !(di->i_flags & cpu_to_le32(OCFS2_ORPHANED_FL)) > [ 1163.522938] (4774,1):ocfs2_query_inode_wipe:898 ERROR: Inode 77233 > (on-disk 77233) not orphaned! Disk flags 0x1, inode flags 0x0See the thread starting "Subject: [Ocfs2-devel] [PATCH] ocfs2: alloc orphaned inode in ocfs2_symlink". I'll sum up here.> The easiest solution would be to always create inodes in the orphan > directory (we even have a function ocfs2_create_inode_in_orphan for this). > The downside this has would be that I expect we would start contending on > orphan dir i_mutex quite early and thus fs scalability would suffer a lot. > Also there's some additional IO and CPU cost involved...Yeah, this is a non-starter.> The last idea I have is that we could "undo" the inode allocation and > other operations we did in the transaction so far. But looking at the code > it would get nasty quickly - all the xattr handling which gets inode locks, > starts & stops transactions, etc...This is the "best" solution, but it requires some thought and care. We'd love to get here someday.> Any other ideas? What would make things much easier would be if orphan > handling was more lightweight like it is e.g. in ext3 / ext4 - there we > have just linked list of orphaned inodes and so if we decide an inode needs > to be orphaned, we just have to modify the superblock (orphan list head) > and the inode (to point at the current orphan list head)... In OCFS2 we > could have a per-slot lists like this but a change like this would probably > be an overkill for the above bug so it would make sence only if there would > be other benefits from this.We're not going to change our orphan storage, either. This still needs locking in the cluster, and that's just a pain. Near the end of the referenced thread, Mark told Dong Yang to implement the OCFS2_INODE_SKIP_ORPHAN_DIR flag. This flag merely lets delete_inode know that we never orphaned the sucker. delete_inode can do the rest of its work without triggering the above warning. Joel -- Life's Little Instruction Book #237 "Seek out the good in people." Joel Becker Principal Software Developer Oracle E-mail: joel.becker at oracle.com Phone: (650) 506-8127
On Tue, Apr 20, 2010 at 08:00:54PM +0200, Jan Kara wrote:> Hi, > > when running fsstress test on an almost full filesystem we observe > the following errors: > 1163.522931] (4774,1):ocfs2_query_inode_wipe:898 ERROR: bug expression: > !(di->i_flags & cpu_to_le32(OCFS2_ORPHANED_FL)) > [ 1163.522938] (4774,1):ocfs2_query_inode_wipe:898 ERROR: Inode 77233 > (on-disk 77233) not orphaned! Disk flags 0x1, inode flags 0x0 > > This is caused by the fact that we succeed in allocating inode in > ocfs2_mknod_locked but later fail to allocate block for symlink data > or directory data because of ENOSPC. So we set i_nlink to 0 and > by doing iput() we continue through standard inode deletion path but the > inode is not orphaned and thus the error check is triggered. > > Now this isn't trivial to fix (at least AFAICS) so I wanted to share my > thoughts before investing too much time in writing the patch which would > be then rejected. > > The easiest solution would be to always create inodes in the orphan > directory (we even have a function ocfs2_create_inode_in_orphan for this). > The downside this has would be that I expect we would start contending on > orphan dir i_mutex quite early and thus fs scalability would suffer a lot. > Also there's some additional IO and CPU cost involved... > > Adding inode to orphan dir after we find out we cannot finish allocation > is IMHO no-go. Because the filesystem is close to ENOSPC, we even don't > have to have a block to extend orphan directory to accomodate new directory > entry. Also adding to orphan directory has to happen outside of a > transaction (due to lock ordering) but we have a transaction already > started and cannot stop it without adding a link to the inode somewhere > (otherwise we would leak the inode in case of crash). > > The last idea I have is that we could "undo" the inode allocation and > other operations we did in the transaction so far. But looking at the code > it would get nasty quickly - all the xattr handling which gets inode locks, > starts & stops transactions, etc... > > Any other ideas? What would make things much easier would be if orphan > handling was more lightweight like it is e.g. in ext3 / ext4 - there we > have just linked list of orphaned inodes and so if we decide an inode needs > to be orphaned, we just have to modify the superblock (orphan list head) > and the inode (to point at the current orphan list head)... In OCFS2 we > could have a per-slot lists like this but a change like this would probably > be an overkill for the above bug so it would make sence only if there would > be other benefits from this.Flag the failed inodes in memory (see ocfs2_inode_info->ip_flags) and special case it in ocfs2_delete_inode(). That's the easiest way to go about this without impacting performance. Then our window for leaking blocks is only if the box takes a reboot (or crash) right before we hit delete_inode(). That's an acceptable risk (we're talking a very small number of blocks and a very small window of time). --Mark -- Mark Fasheh