In light of prior issues with lock contention caused by writing Apache logs to shared files, I have started storing them locally on each node. I wrote a script that combines them nightly, before the statistics generator kicks off the previous day's traffic analysis.

This script, which uses logresolvemerge.pl, writes its output back to the shared volume for easy reference later. I figured this would not cause problems, since it is a large amount of sequential writes from a single node at an off-peak time. However, the merger has been hanging with high CPU usage.

I'm pretty sure I'm running into the famous "free space fragmentation" problem, but I wanted to confirm that this is the case, or see if there is additional troubleshooting I can do.

Here's the disk, with plenty of overall free space:

Filesystem          1K-blocks     Used Available Use% Mounted on
/dev/mapper/mpath1  209725440 85311460 124413980  41% /san/live-websites

While the merge was using 100% of a CPU core, the merged file was not growing and there was little actual I/O to the shared volume. I ran strace to see what the process was doing and got this:

# strace -p 16844
Process 16844 attached - interrupt to quit
read(3, "1\" 200 936 \"http://www.industria"..., 4096) = 4096
write(1, ".NET CLR 1.1.4322; .NET CLR 2.0."..., 4096) = -1 ENOSPC (No space left on device)
read(4, "oration&locationName=South+Jerse"..., 4096) = 4096
write(1, "ivers=8&ngPipelines=600&kvtl230="..., 4096) = -1 ENOSPC (No space left on device)
read(4, "1\" 200 936 \"http://www.industria"..., 4096) = 4096
write(1, "gan+Boulevard&locationCSZ=Salem%"..., 4096) = -1 ENOSPC (No space left on device)
read(3, "HTTP/1.0\" 200 4096 \"-\" \"WinampMP"..., 4096) = 4096
write(1, "elta=.375&zoomlevel=6&label=Sout"..., 4096) = -1 ENOSPC (No space left on device)
read(4, "HTTP/1.0\" 200 4096 \"-\" \"WinampMP"..., 4096) = 4096
write(1, "ident/4.0; .NET CLR 1.1.4322; .N"..., 4096) = -1 ENOSPC (No space left on device)
read(3, "0 36516 \"-\" \"Mozilla/5.0 (compat"..., 4096) = 4096

Now I'm really worried about cluster stability, since other routine writes might start failing soon. I know the typical workaround is to reduce the number of node slots, but I don't have any excess slots to spare. Are there any other tricks to reduce free space fragmentation?
Quite a bit of work is ongoing on this front. I'll list all that work in another email.

Meanwhile make a bz with the stat_sysdir output. We'll need that to determine the best way forward.

David Johle wrote:
> [...]
> Now I'm really worried about the cluster stability from other routine
> writes that might fail soon. I know the typical workaround is to
> reduce the node slots, but I don't have any excess slots to
> spare. Are there any other tricks to improve/reduce freespace fragmentation?
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
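
In the strace output from the original message, the merger appears to keep reading input and retrying writes after every write(2) has already returned ENOSPC, which would explain the busy CPU with no growth in the output file. This does nothing about the fragmentation itself, but a small wrapper that stops at the first ENOSPC would at least make the nightly job fail visibly instead of spinning. A minimal Python sketch follows; the function names (copy_until_enospc, fd_writer, read_chunks) are made up for illustration and are not part of logresolvemerge.pl or OCFS2.

```python
import errno
import os


def copy_until_enospc(chunks, write):
    """Pass each chunk to write(); stop at the first ENOSPC.

    Returns (bytes_written, hit_enospc). Any other OSError propagates.
    """
    written = 0
    for chunk in chunks:
        try:
            write(chunk)
        except OSError as exc:
            if exc.errno == errno.ENOSPC:
                return written, True
            raise
        written += len(chunk)
    return written, False


def fd_writer(fd):
    """Return a write(chunk) callable that writes the whole chunk to fd.

    os.write() may write fewer bytes than requested, so loop until done.
    Writing unbuffered means ENOSPC surfaces at the failing write rather
    than later at flush or close time.
    """
    def _write(chunk):
        view = memoryview(chunk)
        while view:
            n = os.write(fd, view)
            view = view[n:]
    return _write


def read_chunks(stream, size=4096):
    """Yield chunks of up to `size` bytes from a binary stream until EOF."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            return
        yield chunk
```

One possible way to use it (an assumption about the setup, not the script described above): pipe the output of the merge into a small program that opens the destination on the shared volume with os.open(), calls copy_until_enospc(read_chunks(sys.stdin.buffer), fd_writer(fd)), and exits non-zero when the second return value is True so that cron reports the failure.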
Just realized I never replied to this, but I had created the bugzilla report a little while back. It can be found at:

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1237

At 08:54 PM 3/24/2010, Sunil Mushran wrote:
>Quite a bit of work is ongoing on this front. I'll list all that work
>in another email.
>
>Meanwhile make a bz with the stat_sysdir output. We'll need that
>to determine the best way forward.