Guillaume Demillecamps
2009-Jul-22 09:45 UTC
[Lustre-discuss] Lustre client memory usage very high
Hello people,

Lustre 1.8.0 on all servers / clients involved in this. OS is SLES 10 SP2 with an un-patched kernel on the clients. I have, however, put the same kernel revision (downloaded from suse.com) on the clients as the version used in the Lustre-patched MGS/MDS/OSS servers. The file system is only several GBs, with ~500000 files. All inter-connections are through TCP.

We have some "manual" replication of an active Lustre file system to a passive Lustre file system. We have "sync" clients that basically just mount both file systems and run large sync jobs from the active Lustre to the passive Lustre. So far, so good (apart from it being quite a slow process). However, my issue is that Lustre's memory usage rises so high that rsync cannot get enough RAM to finish its job before kswapd kicks in and slows things down drastically.

Up to now, I have succeeded in fine-tuning things using the following steps in my rsync script:

########
umount /opt/lustre_a
umount /opt/lustre_z
mount /opt/lustre_a
mount /opt/lustre_z
for i in `ls /proc/fs/lustre/osc/*/max_dirty_mb`; do echo 4 > $i ; done
for i in `ls /proc/fs/lustre/ldlm/namespaces/*/lru_max_age`; do echo 30 > $i ; done
for i in `ls /proc/fs/lustre/llite/*/max_cached_mb`; do echo 64 > $i ; done
echo 64 > /proc/sys/lustre/max_dirty_mb
lctl set_param ldlm.namespaces.*osc*.lru_size=100
sysctl -w lnet.debug=0
########

What I still don't understand is that even when I put a max limit of a few MB on the read cache (max_cached_mb / max_dirty_mb) and set the write cache (lru_max_age -- is that correct?) to a very small value, the counter in /proc/sys/lustre/memused still sky-rockets to several GBs. As soon as I un-mount the file systems, it drops. The memused number, however, will not decrease even if the client remains idle for several days with no I/O from/to any Lustre file systems. Note that cutting the rsync work into smaller but more numerous jobs is not helping -- unless I start un-mounting and re-mounting the Lustre file systems between each job (which is nevertheless what I may have to plan for if there is no further parameter which would help me)!

Any help/guidance/hint/... is very much appreciated.

Thank you,

Guillaume Demillecamps
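As a rough illustration of that last fallback, here is a minimal sketch of splitting the sync into per-directory batches with a remount in between. The mount points are the ones from the script above; the per-directory split and the rsync flags are assumptions and untested:

######## sketch: batched rsync with a remount between jobs
#!/bin/bash
# /opt/lustre_a (active) and /opt/lustre_z (passive) are the mounts from
# the script above; splitting by top-level directory is an assumption.
for dir in /opt/lustre_a/*/; do
    name=$(basename "$dir")
    rsync -a --delete "/opt/lustre_a/$name/" "/opt/lustre_z/$name/"
    # Cycle the mounts so the client releases its cached inodes/locks
    # before the next batch starts (relies on fstab entries, as above).
    umount /opt/lustre_a /opt/lustre_z
    mount /opt/lustre_a
    mount /opt/lustre_z
done
########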
Andreas Dilger
2009-Jul-29 22:46 UTC
[Lustre-discuss] Lustre client memory usage very high
On Jul 22, 2009 11:45 +0200, Guillaume Demillecamps wrote:
> Lustre 1.8.0 on all servers / clients involved in this. OS is SLES 10
> SP2 with an un-patched kernel on the clients. I have, however, put the same
> kernel revision downloaded from suse.com on the clients as the version
> used in the Lustre-patched MGS/MDS/OSS servers. The file system is only
> several GBs, with ~500000 files. All inter-connections are through TCP.
>
> We have some "manual" replication of an active Lustre file system to a
> passive Lustre file system. We have "sync" clients that basically just
> mount both file systems and run large sync jobs from the active Lustre
> to the passive Lustre. So far, so good (apart from it being quite a slow
> process). However, my issue is that Lustre's memory usage rises so high
> that rsync cannot get enough RAM to finish its job before kswapd kicks
> in and slows things down drastically.
> Up to now, I have succeeded in fine-tuning things using the following
> steps in my rsync script:
> ########
> umount /opt/lustre_a
> umount /opt/lustre_z
> mount /opt/lustre_a
> mount /opt/lustre_z
> for i in `ls /proc/fs/lustre/osc/*/max_dirty_mb`; do echo 4 > $i ; done
> for i in `ls /proc/fs/lustre/ldlm/namespaces/*/lru_max_age`; do echo 30 > $i ; done
> for i in `ls /proc/fs/lustre/llite/*/max_cached_mb`; do echo 64 > $i ; done
> echo 64 > /proc/sys/lustre/max_dirty_mb

Note that you can do these more easily with

    lctl set_param osc.*.max_dirty_mb=4
    lctl set_param ldlm.namespaces.*.lru_max_age=30
    lctl set_param llite.*.max_cached_mb=64
    lctl set_param max_dirty_mb=64

> lctl set_param ldlm.namespaces.*osc*.lru_size=100
> sysctl -w lnet.debug=0

This can also be "lctl set_param debug=0".

> What I still don't understand is that even when putting a max limit of
> a few MB of read-cache (max_cached_mb / max_dirty_mb) and putting the
> write-cache (lru_max_age -- is that correct?) to a very limited number,
> it still sky-rockets to several GBs in /proc/sys/lustre/memused.

Can you please check /proc/slabinfo to see what kind of memory is being
allocated the most?  The max_cached_mb/max_dirty_mb are only limits on
the cached/dirty data pages, and not for metadata structures.  Also,
in 30s I expect you can have a LOT of inodes traversed, so that might
be your problem, and even then lock cancellation does not necessarily
force the kernel dentry/inode out of memory.

Getting total lock counts would also help:

    lctl get_param ldlm.namespaces.*.resource_count

You might be able to tweak some of the "normal" (not Lustre-specific)
/proc parameters to flush the inodes from cache more quickly, or increase
the rate at which kswapd is trying to flush unused inodes.

> And as soon as I un-mount the disks, it drops. The memused number however
> will not decrease even if the client remains idle for several days
> with no I/O from/to any Lustre file systems. Note that cutting the
> rsync jobs into smaller but more numerous jobs is not helping.

There is a test program called "memhog" that could force memory to be
flushed between jobs, but that is a sub-standard solution.

> Unless
> I'd start un-mounting and re-mounting the Lustre file systems between
> each job (which is nevertheless what I may have to plan if there is no
> further parameter which would help me)!

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
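As a starting point for the slabinfo check and the VM tuning suggested above, a minimal sketch; the column positions assume the slabinfo 2.1 layout shown later in the thread, and vm.vfs_cache_pressure=200 is only an illustrative value to experiment with, not a recommendation from this thread:

######## sketch: biggest slab consumers + faster inode reclaim
# List the 20 caches using the most memory (num_objs * objsize);
# NR > 2 skips the two slabinfo header lines.
awk 'NR > 2 { printf "%-24s %10.0f KB\n", $1, $3 * $4 / 1024 }' /proc/slabinfo | sort -k2 -rn | head -20

# Ask the VM to reclaim dentries/inodes more aggressively than the
# default of 100; 200 is only an example value and needs testing.
sysctl -w vm.vfs_cache_pressure=200
########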
Guillaume Demillecamps
2009-Jul-30 07:52 UTC
[Lustre-discuss] Lustre client memory usage very high
Hello,

First of all, thank you for your time. You can find attached the information you asked for. If you can keep on spending some more of your time on this... your help is greatly appreciated!

Best regards,

Guillaume Demillecamps

----- Message from adilger at sun.com ---------
   Date: Wed, 29 Jul 2009 16:46:27 -0600
   From: Andreas Dilger <adilger at sun.com>
Subject: Re: [Lustre-discuss] Lustre client memory usage very high
     To: Guillaume Demillecamps <guillaume at multipurpose.be>
     Cc: lustre-discuss at lists.lustre.org

> On Jul 22, 2009 11:45 +0200, Guillaume Demillecamps wrote:
>> Lustre 1.8.0 on all servers / clients involved in this. OS is SLES 10
>> SP2 with an un-patched kernel on the clients. I have, however, put the same
>> kernel revision downloaded from suse.com on the clients as the version
>> used in the Lustre-patched MGS/MDS/OSS servers. The file system is only
>> several GBs, with ~500000 files. All inter-connections are through TCP.
>>
>> We have some "manual" replication of an active Lustre file system to a
>> passive Lustre file system. We have "sync" clients that basically just
>> mount both file systems and run large sync jobs from the active Lustre
>> to the passive Lustre. So far, so good (apart from it being quite a slow
>> process). However, my issue is that Lustre's memory usage rises so high
>> that rsync cannot get enough RAM to finish its job before kswapd kicks
>> in and slows things down drastically.
>> Up to now, I have succeeded in fine-tuning things using the following
>> steps in my rsync script:
>> ########
>> umount /opt/lustre_a
>> umount /opt/lustre_z
>> mount /opt/lustre_a
>> mount /opt/lustre_z
>> for i in `ls /proc/fs/lustre/osc/*/max_dirty_mb`; do echo 4 > $i ; done
>> for i in `ls /proc/fs/lustre/ldlm/namespaces/*/lru_max_age`; do echo 30 > $i ; done
>> for i in `ls /proc/fs/lustre/llite/*/max_cached_mb`; do echo 64 > $i ; done
>> echo 64 > /proc/sys/lustre/max_dirty_mb
>
> Note that you can do these more easily with
>
>     lctl set_param osc.*.max_dirty_mb=4
>     lctl set_param ldlm.namespaces.*.lru_max_age=30
>     lctl set_param llite.*.max_cached_mb=64
>     lctl set_param max_dirty_mb=64
>
>> lctl set_param ldlm.namespaces.*osc*.lru_size=100
>> sysctl -w lnet.debug=0
>
> This can also be "lctl set_param debug=0".
>
>> What I still don't understand is that even when putting a max limit of
>> a few MB of read-cache (max_cached_mb / max_dirty_mb) and putting the
>> write-cache (lru_max_age -- is that correct?) to a very limited number,
>> it still sky-rockets to several GBs in /proc/sys/lustre/memused.
>
> Can you please check /proc/slabinfo to see what kind of memory is being
> allocated the most?  The max_cached_mb/max_dirty_mb are only limits on
> the cached/dirty data pages, and not for metadata structures.  Also,
> in 30s I expect you can have a LOT of inodes traversed, so that might
> be your problem, and even then lock cancellation does not necessarily
> force the kernel dentry/inode out of memory.
>
> Getting total lock counts would also help:
>
>     lctl get_param ldlm.namespaces.*.resource_count
>
> You might be able to tweak some of the "normal" (not Lustre-specific)
> /proc parameters to flush the inodes from cache more quickly, or increase
> the rate at which kswapd is trying to flush unused inodes.
>
>> And as soon as I un-mount the disks, it drops. The memused number however
>> will not decrease even if the client remains idle for several days
>> with no I/O from/to any Lustre file systems. Note that cutting the
>> rsync jobs into smaller but more numerous jobs is not helping.
>
> There is a test program called "memhog" that could force memory to be
> flushed between jobs, but that is a sub-standard solution.
>
>> Unless
>> I'd start un-mounting and re-mounting the Lustre file systems between
>> each job (which is nevertheless what I may have to plan if there is no
>> further parameter which would help me)!
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

----- End of message from adilger at sun.com -----

-------------- next part --------------
=======================================================
BEESPBESXSNC08:~ # cat /proc/sys/lustre/memused
1338178530
=======================================================
BEESPBESXSNC08:~ # cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
ll_async_page 0 0 320 12 1 : tunables 54 27 8 : slabdata 0 0 0
ll_file_data 20 20 192 20 1 : tunables 120 60 8 : slabdata 1 1 0
lustre_inode_cache 385652 385652 960 4 1 : tunables 54 27 8 : slabdata 96413 96413 0
lov_oinfo 2929548 2929548 320 12 1 : tunables 54 27 8 : slabdata 244129 244129 0
osc_quota_info 0 0 32 112 1 : tunables 120 60 8 : slabdata 0 0 0
ll_qunit_cache 0 0 112 34 1 : tunables 120 60 8 : slabdata 0 0 0
llcd_cache 0 0 3952 1 1 : tunables 24 12 8 : slabdata 0 0 0
ptlrpc_cbdatas 0 0 32 112 1 : tunables 120 60 8 : slabdata 0 0 0
interval_node 89964 209790 128 30 1 : tunables 120 60 8 : slabdata 6993 6993 480
ldlm_locks 136262 254424 512 8 1 : tunables 54 27 8 : slabdata 31803 31803 216
ldlm_resources 136183 256120 384 10 1 : tunables 54 27 8 : slabdata 25612 25612 216
ll_import_cache 0 0 984 4 1 : tunables 54 27 8 : slabdata 0 0 0
ll_obdo_cache 16 19 208 19 1 : tunables 120 60 8 : slabdata 1 1 0
ll_obd_dev_cache 22 22 5600 1 2 : tunables 8 4 0 : slabdata 22 22 0
obd_lvfs_ctxt_cache 0 0 88 44 1 : tunables 120 60 8 : slabdata 0 0 0
ip_fib_alias 23 59 64 59 1 : tunables 120 60 8 : slabdata 1 1 0
ip_fib_hash 23 59 64 59 1 : tunables 120 60 8 : slabdata 1 1 0
dm_events 16 92 40 92 1 : tunables 120 60 8 : slabdata 1 1 0
dm_tio 768 864 24 144 1 : tunables 120 60 8 : slabdata 6 6 0
dm_io 768 828 40 92 1 : tunables 120 60 8 : slabdata 9 9 0
ext3_inode_cache 8981 8985 800 5 1 : tunables 54 27 8 : slabdata 1797 1797 0
ext3_xattr 0 0 88 44 1 : tunables 120 60 8 : slabdata 0 0 0
journal_handle 0 0 24 144 1 : tunables 120 60 8 : slabdata 0 0 0
journal_head 15 80 96 40 1 : tunables 120 60 8 : slabdata 2 2 0
revoke_table 8 202 16 202 1 : tunables 120 60 8 : slabdata 1 1 0
revoke_record 0 0 32 112 1 : tunables 120 60 8 : slabdata 0 0 0
scsi_cmd_cache 1 10 384 10 1 : tunables 54 27 8 : slabdata 1 1 0
sgpool-256 32 32 8192 1 2 : tunables 8 4 0 : slabdata 32 32 0
sgpool-128 32 32 4096 1 1 : tunables 24 12 8 : slabdata 32 32 0
sgpool-64 32 32 2048 2 1 : tunables 24 12 8 : slabdata 16 16 0
sgpool-32 32 32 1024 4 1 : tunables 54 27 8 : slabdata 8 8 0
sgpool-16 32 32 512 8 1 : tunables 54 27 8 : slabdata 4 4 0
sgpool-8 32 45 256 15 1 : tunables 120 60 8 : slabdata 3 3 0
scsi_io_context 0 0 112 34 1 : tunables 120 60 8 : slabdata 0 0 0
UNIX 33 33 704 11 2 : tunables 54 27 8 : slabdata 3 3 0
ip_mrt_cache 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
tcp_bind_bucket 7 112 32 112 1 : tunables 120 60 8 : slabdata 1 1 0
inet_peer_cache 1 30 128 30 1 : tunables 120 60 8 : slabdata 1 1 0
secpath_cache 0 0 192 20 1 : tunables 120 60 8 : slabdata 0 0 0
xfrm_dst_cache 0 0 384 10 1 : tunables 54 27 8 : slabdata 0 0 0
ip_dst_cache 59 70 384 10 1 : tunables 54 27 8 : slabdata 7 7 0
arp_cache 13 15 256 15 1 : tunables 120 60 8 : slabdata 1 1 0
RAW 2 5 768 5 1 : tunables 54 27 8 : slabdata 1 1 0
UDP 5 5 768 5 1 : tunables 54 27 8 : slabdata 1 1 0
tw_sock_TCP 0 0 192 20 1 : tunables 120 60 8 : slabdata 0 0 0
request_sock_TCP 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
TCP 41 45 1536 5 2 : tunables 24 12 8 : slabdata 9 9 0
flow_cache 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
cfq_ioc_pool 18 23 168 23 1 : tunables 120 60 8 : slabdata 1 1 0
cfq_pool 18 24 160 24 1 : tunables 120 60 8 : slabdata 1 1 0
crq_pool 16 44 88 44 1 : tunables 120 60 8 : slabdata 1 1 0
deadline_drq 0 0 96 40 1 : tunables 120 60 8 : slabdata 0 0 0
as_arq 0 0 112 34 1 : tunables 120 60 8 : slabdata 0 0 0
mqueue_inode_cache 1 4 896 4 1 : tunables 54 27 8 : slabdata 1 1 0
isofs_inode_cache 0 0 640 6 1 : tunables 54 27 8 : slabdata 0 0 0
minix_inode_cache 0 0 656 6 1 : tunables 54 27 8 : slabdata 0 0 0
hugetlbfs_inode_cache 1 6 608 6 1 : tunables 54 27 8 : slabdata 1 1 0
ext2_inode_cache 5 5 752 5 1 : tunables 54 27 8 : slabdata 1 1 0
ext2_xattr 0 0 88 44 1 : tunables 120 60 8 : slabdata 0 0 0
dnotify_cache 1 92 40 92 1 : tunables 120 60 8 : slabdata 1 1 0
dquot 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_pwq 0 0 72 53 1 : tunables 120 60 8 : slabdata 0 0 0
eventpoll_epi 0 0 192 20 1 : tunables 120 60 8 : slabdata 0 0 0
inotify_event_cache 0 0 40 92 1 : tunables 120 60 8 : slabdata 0 0 0
inotify_watch_cache 1 53 72 53 1 : tunables 120 60 8 : slabdata 1 1 0
kioctx 0 0 384 10 1 : tunables 54 27 8 : slabdata 0 0 0
kiocb 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
fasync_cache 0 0 24 144 1 : tunables 120 60 8 : slabdata 0 0 0
shmem_inode_cache 424 450 816 5 1 : tunables 54 27 8 : slabdata 90 90 0
posix_timers_cache 0 0 152 25 1 : tunables 120 60 8 : slabdata 0 0 0
uid_cache 3 59 64 59 1 : tunables 120 60 8 : slabdata 1 1 0
blkdev_ioc 17 67 56 67 1 : tunables 120 60 8 : slabdata 1 1 0
blkdev_queue 32 35 1608 5 2 : tunables 24 12 8 : slabdata 7 7 0
blkdev_requests 16 26 288 13 1 : tunables 54 27 8 : slabdata 2 2 0
biovec-(256) 268 268 4096 1 1 : tunables 24 12 8 : slabdata 268 268 0
biovec-128 280 280 2048 2 1 : tunables 24 12 8 : slabdata 140 140 0
biovec-64 304 304 1024 4 1 : tunables 54 27 8 : slabdata 76 76 0
biovec-16 304 315 256 15 1 : tunables 120 60 8 : slabdata 21 21 0
biovec-4 304 354 64 59 1 : tunables 120 60 8 : slabdata 6 6 0
biovec-1 304 808 16 202 1 : tunables 120 60 8 : slabdata 4 4 0
bio 304 390 128 30 1 : tunables 120 60 8 : slabdata 13 13 0
sock_inode_cache 90 90 704 5 1 : tunables 54 27 8 : slabdata 18 18 0
skbuff_fclone_cache 47 91 512 7 1 : tunables 54 27 8 : slabdata 12 13 0
skbuff_head_cache 346 405 256 15 1 : tunables 120 60 8 : slabdata 27 27 1
file_lock_cache 1 22 176 22 1 : tunables 120 60 8 : slabdata 1 1 0
acpi_operand 448 530 72 53 1 : tunables 120 60 8 : slabdata 10 10 0
acpi_parse_ext 0 0 64 59 1 : tunables 120 60 8 : slabdata 0 0 0
acpi_parse 0 0 40 92 1 : tunables 120 60 8 : slabdata 0 0 0
acpi_state 0 0 88 44 1 : tunables 120 60 8 : slabdata 0 0 0
delayacct_cache 142 236 64 59 1 : tunables 120 60 8 : slabdata 4 4 0
taskstats_cache 14 14 272 14 1 : tunables 54 27 8 : slabdata 1 1 0
proc_inode_cache 454 462 624 6 1 : tunables 54 27 8 : slabdata 77 77 0
sigqueue 16 24 160 24 1 : tunables 120 60 8 : slabdata 1 1 0
radix_tree_node 7938 8001 536 7 1 : tunables 54 27 8 : slabdata 1143 1143 18
bdev_cache 29 32 832 4 1 : tunables 54 27 8 : slabdata 8 8 0
sysfs_dir_cache 3079 3120 80 48 1 : tunables 120 60 8 : slabdata 65 65 0
mnt_cache 27 30 256 15 1 : tunables 120 60 8 : slabdata 2 2 0
inode_cache 693 828 592 6 1 : tunables 54 27 8 : slabdata 138 138 0
dentry_cache 68754 75791 208 19 1 : tunables 120 60 8 : slabdata 3989 3989 420
filp 510 510 256 15 1 : tunables 120 60 8 : slabdata 34 34 0
names_cache 3 3 4096 1 1 : tunables 24 12 8 : slabdata 3 3 0
idr_layer_cache 103 105 528 7 1 : tunables 54 27 8 : slabdata 15 15 0
buffer_head 16476 16500 88 44 1 : tunables 120 60 8 : slabdata 375 375 0
mm_struct 45 45 832 9 2 : tunables 54 27 8 : slabdata 5 5 0
vm_area_struct 1313 1323 184 21 1 : tunables 120 60 8 : slabdata 63 63 0
fs_cache 59 59 64 59 1 : tunables 120 60 8 : slabdata 1 1 0
files_cache 52 52 896 4 1 : tunables 54 27 8 : slabdata 13 13 0
signal_cache 85 85 768 5 1 : tunables 54 27 8 : slabdata 17 17 0
sighand_cache 87 87 2112 3 2 : tunables 24 12 8 : slabdata 29 29 0
task_struct 88 88 1888 2 1 : tunables 24 12 8 : slabdata 44 44 0
anon_vma 558 576 24 144 1 : tunables 120 60 8 : slabdata 4 4 0
shared_policy_node 0 0 56 67 1 : tunables 120 60 8 : slabdata 0 0 0
numa_policy 30 144 24 144 1 : tunables 120 60 8 : slabdata 1 1 0
size-131072(DMA) 2 2 131072 1 32 : tunables 8 4 0 : slabdata 2 2 0
size-131072 0 0 131072 1 32 : tunables 8 4 0 : slabdata 0 0 0
size-65536(DMA) 0 0 65536 1 16 : tunables 8 4 0 : slabdata 0 0 0
size-65536 1 1 65536 1 16 : tunables 8 4 0 : slabdata 1 1 0
size-32768(DMA) 2 2 32768 1 8 : tunables 8 4 0 : slabdata 2 2 0
size-32768 5 5 32768 1 8 : tunables 8 4 0 : slabdata 5 5 0
size-16384(DMA) 0 0 16384 1 4 : tunables 8 4 0 : slabdata 0 0 0
size-16384 9 9 16384 1 4 : tunables 8 4 0 : slabdata 9 9 0
size-8192(DMA) 0 0 8192 1 2 : tunables 8 4 0 : slabdata 0 0 0
size-8192 395 395 8192 1 2 : tunables 8 4 0 : slabdata 395 395 0
size-4096(DMA) 0 0 4096 1 1 : tunables 24 12 8 : slabdata 0 0 0
size-4096 122 122 4096 1 1 : tunables 24 12 8 : slabdata 122 122 0
size-2048(DMA) 0 0 2048 2 1 : tunables 24 12 8 : slabdata 0 0 0
size-2048 648 758 2048 2 1 : tunables 24 12 8 : slabdata 379 379 0
size-1024(DMA) 0 0 1024 4 1 : tunables 54 27 8 : slabdata 0 0 0
size-1024 1198 1560 1024 4 1 : tunables 54 27 8 : slabdata 390 390 81
size-512(DMA) 0 0 512 8 1 : tunables 54 27 8 : slabdata 0 0 0
size-512 691 880 512 8 1 : tunables 54 27 8 : slabdata 110 110 162
size-256(DMA) 0 0 256 15 1 : tunables 120 60 8 : slabdata 0 0 0
size-256 425400 425400 256 15 1 : tunables 120 60 8 : slabdata 28360 28360 0
size-128(DMA) 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
size-64(DMA) 0 0 64 59 1 : tunables 120 60 8 : slabdata 0 0 0
size-64 98679 267211 64 59 1 : tunables 120 60 8 : slabdata 4529 4529 480
size-32(DMA) 0 0 32 112 1 : tunables 120 60 8 : slabdata 0 0 0
size-128 44478 47610 128 30 1 : tunables 120 60 8 : slabdata 1587 1587 444
size-32 2742 3024 32 112 1 : tunables 120 60 8 : slabdata 27 27 480
kmem_cache 140 140 1664 2 1 : tunables 24 12 8 : slabdata 70 70 0
=======================================================
BEESPBESXSNC08:~ # lctl get_param ldlm.namespaces.*.resource_count
ldlm.namespaces.MGC172.16.0.54 at tcp.resource_count=1
ldlm.namespaces.MGC172.16.0.64 at tcp.resource_count=1
ldlm.namespaces.l.ap-a-MDT0000-mdc-ffff81001b243c00.resource_count=28862
ldlm.namespaces.l.ap-a-OST0000-osc-ffff81001b243c00.resource_count=100
ldlm.namespaces.l.ap-a-OST0001-osc-ffff81001b243c00.resource_count=23306
ldlm.namespaces.l.ap-a-OST0002-osc-ffff81001b243c00.resource_count=23278
ldlm.namespaces.l.ap-a-OST0003-osc-ffff81001b243c00.resource_count=100
ldlm.namespaces.l.ap-a-OST0004-osc-ffff81001b243c00.resource_count=100
ldlm.namespaces.l.ap-a-OST0005-osc-ffff81001b243c00.resource_count=18027
ldlm.namespaces.l.ap-a-OST0006-osc-ffff81001b243c00.resource_count=17212
ldlm.namespaces.l.ap-a-OST0007-osc-ffff81001b243c00.resource_count=100
ldlm.namespaces.l.ap-z-MDT0000-mdc-ffff81001ba73400.resource_count=10138
ldlm.namespaces.l.ap-z-OST0000-osc-ffff81001ba73400.resource_count=100
ldlm.namespaces.l.ap-z-OST0001-osc-ffff81001ba73400.resource_count=100
ldlm.namespaces.l.ap-z-OST0002-osc-ffff81001ba73400.resource_count=100
ldlm.namespaces.l.ap-z-OST0003-osc-ffff81001ba73400.resource_count=100
ldlm.namespaces.l.ap-z-OST0004-osc-ffff81001ba73400.resource_count=100
ldlm.namespaces.l.ap-z-OST0005-osc-ffff81001ba73400.resource_count=100
ldlm.namespaces.l.ap-z-OST0006-osc-ffff81001ba73400.resource_count=100
ldlm.namespaces.l.ap-z-OST0007-osc-ffff81001ba73400.resource_count=100
========================================================
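For turning the listing above into a single total, a small sketch; it assumes lctl get_param accepts -n to print values without the parameter names, which may not hold on every 1.8 build:

######## sketch: total lock resources across all namespaces
# Sum the per-namespace resource_count values printed above.
lctl get_param -n ldlm.namespaces.*.resource_count | awk '{ total += $1 } END { print total }'
########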
Andreas Dilger
2009-Jul-30 22:45 UTC
[Lustre-discuss] Lustre client memory usage very high
On Jul 30, 2009 09:52 +0200, Guillaume Demillecamps wrote:
>> On Jul 22, 2009 11:45 +0200, Guillaume Demillecamps wrote:
>>> Lustre 1.8.0 on all servers / clients involved in this. OS is SLES 10
>>> SP2 with an un-patched kernel on the clients. I have, however, put the same
>>> kernel revision downloaded from suse.com on the clients as the version
>>> used in the Lustre-patched MGS/MDS/OSS servers. The file system is only
>>> several GBs, with ~500000 files. All inter-connections are through TCP.
>>>
>>> We have some "manual" replication of an active Lustre file system to a
>>> passive Lustre file system. We have "sync" clients that basically just
>>> mount both file systems and run large sync jobs from the active Lustre
>>> to the passive Lustre. So far, so good (apart from it being quite a slow
>>> process). However, my issue is that Lustre's memory usage rises so high
>>> that rsync cannot get enough RAM to finish its job before kswapd kicks
>>> in and slows things down drastically.

> # name               <active> <total> <size> <obj/slab> : slabdata <active> <num>
> lustre_inode_cache     385652  385652    960      4     : slabdata  96413  96413
> lov_oinfo             2929548 2929548    320     12     : slabdata 244129 244129
> ldlm_locks             136262  254424    512      8     : slabdata  31803  31803
> ldlm_resources         136183  256120    384     10     : slabdata  25612  25612

This shows that we have 385k Lustre inodes, yet there are 2.9M "lov_oinfo"
structs (there should only be a single one per inode).  I'm not sure
why that is happening, but that is consuming about 1GB of RAM.  The 385k
inode count is reasonable, given you have 500k files, per above.  There
are 136k locks, which is also fine (probably so much lower than the inode
count because of your short lock expiry time).

So, it seems like a problem of some kind, and is probably deserving of
filing a bug.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
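Since the figure above comes straight from num_objs * objsize, here is a small sketch that reproduces it from /proc/slabinfo for the caches quoted (assuming the slabinfo 2.1 column layout); lov_oinfo works out to 2929548 * 320 bytes, roughly 894 MB, and lustre_inode_cache to 385652 * 960 bytes, roughly 353 MB:

######## sketch: memory used per slab cache (num_objs * objsize)
awk '$1 ~ /^(lustre_inode_cache|lov_oinfo|ldlm_locks|ldlm_resources)$/ { printf "%-20s %8.1f MB\n", $1, $3 * $4 / (1024*1024) }' /proc/slabinfo
########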
Guillaume Demillecamps
2009-Jul-31 06:56 UTC
[Lustre-discuss] Lustre client memory usage very high
Hello again,

Not sure whether it is worth noting, but if I use the following command, my memory is freed:

sync; echo 3 > /proc/sys/vm/drop_caches

What is surprising, though, is that the cache never expires on its own (at least it remains in memory for several days at the least).

Regards,

Guillaume Demillecamps

----- Message from adilger at sun.com ---------
   Date: Thu, 30 Jul 2009 16:45:47 -0600
   From: Andreas Dilger <adilger at sun.com>
Subject: Re: [Lustre-discuss] Lustre client memory usage very high
     To: Guillaume Demillecamps <guillaume at multipurpose.be>
     Cc: lustre-discuss at lists.lustre.org

> On Jul 30, 2009 09:52 +0200, Guillaume Demillecamps wrote:
>>> On Jul 22, 2009 11:45 +0200, Guillaume Demillecamps wrote:
>>>> Lustre 1.8.0 on all servers / clients involved in this. OS is SLES 10
>>>> SP2 with an un-patched kernel on the clients. I have, however, put the same
>>>> kernel revision downloaded from suse.com on the clients as the version
>>>> used in the Lustre-patched MGS/MDS/OSS servers. The file system is only
>>>> several GBs, with ~500000 files. All inter-connections are through TCP.
>>>>
>>>> We have some "manual" replication of an active Lustre file system to a
>>>> passive Lustre file system. We have "sync" clients that basically just
>>>> mount both file systems and run large sync jobs from the active Lustre
>>>> to the passive Lustre. So far, so good (apart from it being quite a slow
>>>> process). However, my issue is that Lustre's memory usage rises so high
>>>> that rsync cannot get enough RAM to finish its job before kswapd kicks
>>>> in and slows things down drastically.
>
>> # name               <active> <total> <size> <obj/slab> : slabdata <active> <num>
>> lustre_inode_cache     385652  385652    960      4     : slabdata  96413  96413
>> lov_oinfo             2929548 2929548    320     12     : slabdata 244129 244129
>> ldlm_locks             136262  254424    512      8     : slabdata  31803  31803
>> ldlm_resources         136183  256120    384     10     : slabdata  25612  25612
>
> This shows that we have 385k Lustre inodes, yet there are 2.9M "lov_oinfo"
> structs (there should only be a single one per inode).  I'm not sure
> why that is happening, but that is consuming about 1GB of RAM.  The 385k
> inode count is reasonable, given you have 500k files, per above.  There
> are 136k locks, which is also fine (probably so much lower than the inode
> count because of your short lock expiry time).
>
> So, it seems like a problem of some kind, and is probably deserving of
> filing a bug.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

----- End of message from adilger at sun.com -----
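A small sketch of using that same command between rsync batches instead of un-mounting and re-mounting; placing it inside a batch loop is an assumption, not something tested in this thread:

######## sketch: free the client cache between rsync batches
sync
echo 3 > /proc/sys/vm/drop_caches   # 3 = drop page cache plus dentries and inodes
########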
Guillaume Demillecamps
2009-Jul-31 07:15 UTC
[Lustre-discuss] 1.8: recurrent LBUGs on clients
Hello,

All servers and clients are running Lustre 1.8, on SLES 10 SP2. Clients use patchless kernels, with the same base revision as the patched server kernels. We recurrently encounter this error:

Server log:
-----------
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError: 22061:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 5606195: cookie 0x5ed7d8c3d1299f40 req at ffff810065a60400 x1308791892785337/t0 o35->4f104403-eb03-83be-2910-2fd7cc26087c at NET_0x20000c0a84410_UUID:0/0 lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc 0/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError: 22061:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-116) req at ffff810065a60400 x1308791892785337/t0 o35->4f104403-eb03-83be-2910-2fd7cc26087c at NET_0x20000c0a84410_UUID:0/0 lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc -116/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError: 22061:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 5606200: cookie 0x5ed7d8c3d129a361 req at ffff810071b28400 x1308791892785342/t0 o35->4f104403-eb03-83be-2910-2fd7cc26087c at NET_0x20000c0a84410_UUID:0/0 lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc 0/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError: 22061:0:(mds_open.c:1665:mds_close()) Skipped 4 previous similar messages
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError: 22061:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-116) req at ffff810071b28400 x1308791892785342/t0 o35->4f104403-eb03-83be-2910-2fd7cc26087c at NET_0x20000c0a84410_UUID:0/0 lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc -116/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError: 22061:0:(ldlm_lib.c:1826:target_send_reply_msg()) Skipped 4 previous similar messages

Client log:
-----------
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.0.55 at tcp. The mds_close operation failed with -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 13298:0:(file.c:114:ll_close_inode_openhandle()) inode 5606195 mdc close failed: rc = -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 13298:0:(file.c:114:ll_close_inode_openhandle()) Skipped 1 previous similar message
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 13298:0:(file.c:114:ll_close_inode_openhandle()) inode 5606155 mdc close failed: rc = -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 13298:0:(file.c:114:ll_close_inode_openhandle()) Skipped 3 previous similar messages
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 11-0: an error occurred while communicating with 172.16.0.55 at tcp. The mds_close operation failed with -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: Skipped 7 previous similar messages
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 13298:0:(ldlm_lock.c:602:ldlm_lock_decref_internal_nolock()) ASSERTION(lock->l_writers > 0) failed
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 13298:0:(ldlm_lock.c:602:ldlm_lock_decref_internal_nolock()) LBUG
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
Jul 30 06:11:47 BEESPDESXAPP06 kernel: Call Trace: <ffffffff88257aea>{:libcfs:lbug_with_loc+122}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff8825fe00>{:libcfs:tracefile_init+0} <ffffffff8835d566>{:ptlrpc:ldlm_lock_decref_internal_nolock+182}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff8838533b>{:ptlrpc:ldlm_process_flock_lock+4139}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff883864ef>{:ptlrpc:ldlm_flock_completion_ast+2111}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff8835f4a9>{:ptlrpc:ldlm_lock_enqueue+2169} <ffffffff88377ca0>{:ptlrpc:ldlm_cli_enqueue_fini+2624}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff88376fd3>{:ptlrpc:ldlm_prep_elc_req+755} <ffffffff8835bc0d>{:ptlrpc:ldlm_lock_create+2541}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff8012c668>{default_wake_function+0} <ffffffff88379ae2>{:ptlrpc:ldlm_cli_enqueue+1666}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff88523fcf>{:lustre:ll_file_flock+1407} <ffffffff88385cb0>{:ptlrpc:ldlm_flock_completion_ast+0}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff8019ae2e>{locks_remove_posix+132} <ffffffff80147fdc>{bit_waitqueue+56}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff80190241>{flush_old_exec+2729} <ffffffff80186fc1>{__fput+355}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff8018455b>{filp_close+84} <ffffffff801360b7>{put_files_struct+107}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff8010aecb>{sysret_signal+28} <ffffffff8013725c>{do_exit+684}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff80137995>{sys_exit_group+0} <ffffffff8014083c>{get_signal_to_deliver+1394}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff8010aecb>{sysret_signal+28} <ffffffff8010a19c>{do_signal+118}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff8012c668>{default_wake_function+0} <ffffffff8014b227>{do_futex+104}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff801743b2>{sys_mprotect+1742} <ffffffff8010aecb>{sysret_signal+28}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: <ffffffff8010b14f>{ptregscall_common+103}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: dumping log to /tmp/lustre-log.1248927107.13298
Jul 30 06:11:47 BEESPDESXAPP06 kernel: Fixing recursive fault but reboot is needed!

Then indeed a reboot of the client is required. What does this mean? Could it be related to sys.timeouts and/or ldlm_timeouts being too short?

Regards,

Guillaume Demillecamps
Hello!

On Jul 31, 2009, at 3:15 AM, Guillaume Demillecamps wrote:
> All servers and clients are running Lustre 1.8, on SLES 10 SP2. Clients
> use patchless kernels, with the same base revision as the patched
> server kernels.
> We recurrently encounter this error:

Chances are you are hitting bug 17046. There is a patch with a fix that will also be included in the 1.8.1 release.

Bye,

Oleg