Hi,

we are porting lustre-1.4.9 to 2.6.20 (actually Goswin has done most of the
work) and now we have a problem that is not easy for us to solve. Maybe
some of you with better kernel knowledge could give us some hints?

Symptom: On our test system we can write about 1.6GB and then suddenly the
write process gets stuck.

Analysis: The problem occurs when the page cache is written back. It goes
into something like an endless loop in mm/page-writeback.c:
generic_writepages(). I'm not sure whether it is really an entirely endless
loop, since I see almost the same problem when writing only 1MB and then
calling 'sync'. This sync won't return for a rather long time, but
finally it does return. So for the 1.6GB it might be that I just had to be
more patient (it didn't finish overnight, though, and there my patience
ended).

In generic_writepages() it doesn't call writepage() due to this:

	if (PageWriteback(page) ||
			!clear_page_dirty_for_io(page)) {
		unlock_page(page);
		continue;
	}

Actually, clear_page_dirty_for_io() returns 0. This is due to
TestClearPageDirty(page) in this function, within "if (mapping ...)",
returning 0.

This is the point where we don't understand what is actually going on, and
any help in understanding it is highly appreciated.

Just for fun and education I already forced the call of writepage(), but it
rather soon returned from ll_writepage() without actually doing anything
useful. So it still stayed in the endless loop in generic_writepages().

Thanks in advance for any help,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH / transtec AG / ttec
Hi Bernd,

can you check this issue with the PG_writeback patch applied? The patch is
available at BZ bug 11710. You can also see similar work for lustre 1.6 in
BZ bug 11647 and in other bugs about the patchless client.

On Wed, 2007-03-14 at 20:57, Bernd Schubert wrote:
> Symptom: On our test system we can write about 1.6GB and then suddenly the
> write process gets stuck.
>
> [...]
>
> Actually, clear_page_dirty_for_io() will return 0. This is due to
> TestClearPageDirty(page) in this function, within "if (mapping ...)"
> returning 0.
>
> [...]

--
Alexey Lyashkov <shadow@clusterfs.com>
Beaver team
Hi Alex,

On Wednesday 14 March 2007 21:11, Alexey Lyashkov wrote:
> Hi Bernd,
>
> can you check this issue with PG_writeback patch applied?
> patch available at BZ bug 11710. also you can see similar work for
> lustre 1.6 from BZ bug 11647 and other about patchless client.

thanks a lot for your help, after applying the PG_writeback patch it works
like a charm.

Thanks again,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH / transtec AG / ttec
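
For readers following the thread, the skip condition quoted from generic_writepages() can be modelled in plain userspace C. This is only a sketch under stated assumptions: struct model_page and every model_* function are hypothetical stand-ins invented here, not kernel code, and the sketch says nothing about what the PG_writeback patch actually changes. It only illustrates the observation in the first message: a page whose dirty flag is already clear, or whose writeback flag is set, takes the "continue" branch and never reaches writepage().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the two page flags discussed. */
struct model_page {
    bool dirty;      /* models PG_dirty */
    bool writeback;  /* models PG_writeback */
};

/* Models TestClearPageDirty(): clear the flag and return its old value. */
static bool model_test_clear_dirty(struct model_page *p)
{
    bool was_dirty = p->dirty;
    p->dirty = false;
    return was_dirty;
}

/* Models the quoted check; true means writepage() would be called. */
static bool model_would_write(struct model_page *p)
{
    if (p->writeback || !model_test_clear_dirty(p))
        return false;  /* the "unlock_page(page); continue;" branch */
    return true;
}

/* Walks the pages like the quoted loop and counts writepage() calls. */
static size_t model_count_written(struct model_page *pages, size_t n)
{
    size_t written = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (model_would_write(&pages[i]))
            written++;
    }
    return written;
}
```

Calling model_count_written() on three pages (one with the dirty flag clear, one dirty, one dirty and under writeback) counts a single write: only the dirty page that is not under writeback passes the check. The page under writeback keeps its dirty flag, because the || short-circuits before the flag is tested.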