Dear all,

I set up a 2.0-alpha2 system and planned to populate it with 100 million
files. While populating it, however, the MDS ran out of memory, the OOM
killer kicked in, killed some processes, and it all ended in a kernel panic.

So I reset the MDS and remounted the MDT. After around 30 seconds (with no
client access yet), the memory got eaten up again, reproducing the very
same scenario mentioned above.

If I unmount the MDT "in time", the memory is freed up (so I am pretty
sure it's Lustre and not something else).

I had already seen this with 2.0-alpha1, hence I upgraded to 2.0-alpha2.
When using 2.0-alpha1, the system had around 10 million files and was not
being accessed at all when this behavior showed up.

The system I am using for my tests has 1 MDS, 1 client and 3 OSSs. The MDS
has only 2 GB of memory, but this should only impact performance, not
stability, right?

Any comments welcome; I am also happy to provide more details.

Cheers,
 Arne
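P.S. In case anyone wants to reproduce the "in time" unmount: I poll
MemFree and unmount before the OOM killer fires. Only a sketch; the mount
point and the threshold below are my own arbitrary choices:

  #!/bin/bash
  # Watch free memory after mounting the MDT and unmount before the OOM
  # killer has a chance to run.
  MDT_MOUNT=/mnt/mdt      # illustrative mount point
  THRESHOLD_KB=100000     # arbitrary safety margin (~100 MB)

  while true; do
      free_kb=$(awk '/^MemFree:/ { print $2 }' /proc/meminfo)
      echo "$(date +%T)  MemFree: ${free_kb} kB"
      if [ "$free_kb" -lt "$THRESHOLD_KB" ]; then
          echo "Memory nearly exhausted, unmounting ${MDT_MOUNT}"
          umount "$MDT_MOUNT"
          break
      fi
      sleep 5
  done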
On Tue, Jun 09, 2009 at 02:36:37PM +0200, Arne Wiebalck wrote:
> Dear all,
>
> I set up a 2.0-alpha2 system and planned to populate it with
> 100 million files. While populating it, however, the MDS ran
> out of memory, the OOM killer kicked in, killed some processes,
> and it all ended in a kernel panic.
>
> So I reset the MDS and remounted the MDT. After around
> 30 seconds (with no client access yet), the memory got eaten up
> again, reproducing the very same scenario mentioned above.
>
> If I unmount the MDT "in time", the memory is freed up (so
> I am pretty sure it's Lustre and not something else).
>
> I had already seen this with 2.0-alpha1, hence I upgraded
> to 2.0-alpha2. When using 2.0-alpha1, the system had around
> 10 million files and was not being accessed at all when this
> behavior showed up.
>
> The system I am using for my tests has 1 MDS, 1 client and 3
> OSSs. The MDS has only 2 GB of memory, but this should only
> impact performance, not stability, right?
>
> Any comments welcome; I am also happy to provide more details.

Please show us:
  /proc/meminfo /proc/slabinfo
  /proc/sys/lnet/memused /proc/sys/lustre/memused* /proc/sys/lustre/pagesused*

preferably at around the OOM.

It'd also be helpful to get a debug dump of memory allocations:
  1. echo malloc > /proc/sys/lnet/debug
  2. at around the OOM, lctl dk > malloc.dk

How many clients were there? How were they connected to the MDS?

Isaac
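P.S. If it helps, something along these lines should grab all of the above
in one go. An untested sketch; the output directory is arbitrary:

  #!/bin/bash
  # Snapshot the requested memory statistics into one timestamped directory.
  OUT=/tmp/lustre-mem-$(date +%s)      # illustrative output location
  mkdir -p "$OUT"

  echo malloc > /proc/sys/lnet/debug   # enable allocation tracing

  cp /proc/meminfo /proc/slabinfo "$OUT"/
  cat /proc/sys/lnet/memused       > "$OUT"/lnet-memused
  cat /proc/sys/lustre/memused*    > "$OUT"/lustre-memused    2>/dev/null
  cat /proc/sys/lustre/pagesused*  > "$OUT"/lustre-pagesused  2>/dev/null

  # Run this part at around the OOM to dump the debug log:
  lctl dk > "$OUT"/malloc.dk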
On Wed, Jun 10, 2009 at 11:30:39AM +0200, Arne Wiebalck wrote:
>>> ......
>>> Any comments welcome, I am also happy to provide more details.
>>
>> Please show us:
>>   /proc/meminfo /proc/slabinfo
>>   /proc/sys/lnet/memused /proc/sys/lustre/memused* /proc/sys/lustre/pagesused*
>>
>> preferably at around the OOM.
>>
>> It'd also be helpful to get a debug dump of memory allocations:
>>   1. echo malloc > /proc/sys/lnet/debug
>>   2. at around the OOM, lctl dk > malloc.dk
>>
>> How many clients were there? How were they connected to the MDS?
>>
>> Isaac
>
> Isaac,
>
> Please find the requested info attached.

From meminfo, it looked like slab consumed most of the memory:

  MemTotal:      2058932 kB
  MemFree:          7640 kB
  Slab:          1976324 kB

From slabinfo, the biggest offenders are:

  ldiskfs_inode_cache  994608  994608  944  4  1 : tunables  54  27 8 : slabdata 248652 248652 0
  size-256            2988210 2988210  256 15  1 : tunables 120  60 8 : slabdata 199214 199214 0

That is roughly 940M and 765M, respectively. LNet seemed innocent, less
than 1M:

  /proc/sys/lnet/memused
  453172

Lustre accounted for around 600M at most:

  /proc/sys/lustre/memused*
  604295420
  485604764

The ldiskfs_inode_cache slab looked fishy to me, but it's above my head.

The malloc dk dump at around the OOM is attached.

Thanks,
Isaac

[Attachment: malloc-1244620938.dk.gz]
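P.S. The totals above are just num_objs * objsize from the slabinfo lines.
A quick sanity check, for anyone who wants to repeat it:

  # Columns in /proc/slabinfo: name, active_objs, num_objs, objsize, ...
  # so $3 * $4 is the memory pinned by each cache.
  awk '$1 == "ldiskfs_inode_cache" || $1 == "size-256" \
       { printf "%-20s %4.0f MB\n", $1, $3 * $4 / 1e6 }' /proc/slabinfo

  # With the numbers above:
  #   994608 * 944 = 938909952 bytes, ~939 MB
  #  2988210 * 256 = 765021760 bytes, ~765 MB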