Hi Eric,

I would love to trade notes with you through all this. The environment is:

- Latest Celeron 2.4GHz, 512MB RAM, 80GB Western Digital @ 7200rpm
- Red Hat 9
- Intel 1000 MT running the stock driver from the RH9 install
- Cheap 8-port D-Link gigabit switch (DGS-1008D)
- Lustre 1.0 rpm release

I run OpenSSI providing a two-node cluster as the client, and 2 OSSs with roughly a 70GB partition each, giving a 140GB Lustre partition in total. As a test, the MDS is running on an older Pentium III 733MHz with 256MB RAM on a 100Mbit NIC.

What I've noticed so far:

- Tuning the network performance according to SPECweb99 doesn't help.
- Tuning the network for a gigabit architecture according to many websites does not help.
- Tweaking the NIC's interrupt throttle rates, delays and buffers does not help.

All of this suggests that the network is currently not the bottleneck for a 2-node OSS setup, whether running a 64k, 512k, 1MB or 2MB LOV stripe.

What I DID stumble across late last night is that if, on the client, I mount, dismount, then mount again, I get good, flat performance across the spectrum using iozone. However, if I leave it for a while after the test and rerun it, I fall back into that familiar big dip in performance when the transfer block size hits 8192. Perhaps this is an anomaly already fixed in Lustre 1.2.x, but I'm not privy to that at the moment.

I've played with some of the proc files under /proc/fs/lustre/osc/OSC*/ (max_rpcs_in_flight, etc.), but they don't appear to make much of a difference yet.

I've also tested performance with bonnie++, and the difference between a Lustre mount and the raw drive is really in the random reads/writes and in the random and sequential file creation/deletion. This is where you MUST have a fast MDS and a fast NIC on the MDS, because the random reads/writes and file creation/deletion all make use of the metadata store on the MDS machine.

What have your experiences been? I'm curious to know what your progress has been so far.

Cheers,
Cuong.

On Wed, 22 Sep 2004 23:04:08 +0100, Eric Barton <eeb@bartonsoftware.com> wrote:
> Cuong,
>
> How wonderful. I'll plot the numbers with great interest.
>
> Can you tell me the architecture/network/os etc stuff?
>
> Have you got raw device performance graphs?
>
> We're working on smoothing out performance and beginning to get
> somewhere. Your work can only help.
>
> Cheers,
> Eric
>
> ---------------------------------------------------
> |Eric Barton       Barton Software                 |
> |9 York Gardens    Tel:    +44 (117) 330 1575      |
> |Clifton           Mobile: +44 (7909) 680 356      |
> |Bristol BS8 4LL   Fax:    call first              |
> |United Kingdom    E-Mail: eeb@bartonsoftware.com  |
> ---------------------------------------------------
>
> > -----Original Message-----
> > From: lustre-discuss-admin@lists.clusterfs.com
> > [mailto:lustre-discuss-admin@lists.clusterfs.com] On Behalf Of smee
> > Sent: Wednesday, September 22, 2004 5:15 AM
> > To: lustre-discuss@lists.clusterfs.com
> > Subject: [Lustre-discuss] performance tuning
> >
> > [...]
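The bonnie++ comparison described above can be reproduced with an invocation along these lines. This is only a sketch: the mount point, file size and user are placeholders, not the values actually used in these tests.

# Run bonnie++ against the Lustre mount (placeholders, not the original invocation)
# -d: directory to test in
# -s: file size in MB for the streaming tests (larger than RAM to limit cache effects)
# -n: number of small files, in multiples of 1024, for the create/stat/delete phases
# -u: user to run as when started as root
bonnie++ -d /mnt/lustre -s 1024 -n 16 -u nobody

The same invocation pointed at a local directory on an OSS gives the raw-drive baseline for comparison.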
I noticed that lustre.pdf makes mention of lmc --mountfsoptions <option>. Is that applicable to Lustre 1.2.x only? I'm using 1.0 and it gives me an error about unrecognised --mountfsoptions. I was trying to see if I could pass it options like "noatime,nodirtime".

Also, the same docs mention an lmc --add net --tcpbuf <size> option where the default size is 1MB, but when mounting the respective machines (mds, ost, client) in verbose mode, the debug output shows that send_mem and recv_mem are 8388608. So again, are the docs describing 1.2.x, or are they incorrect for this option?

Cheers,
Cuong

On Thu, 23 Sep 2004 11:54:09 +1000, smee <snotmee@gmail.com> wrote:
> Hi Eric,
>
> I would love to trade notes with you through all this.
> [...]
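One way to put those send_mem/recv_mem figures in context is to compare them with the kernel's generic socket-buffer settings. The paths below are the standard Linux 2.4 sysctls, not Lustre-specific tunables, so this is only a rough cross-check.

# Per-socket buffer maxima and defaults, in bytes
sysctl net.core.rmem_max net.core.wmem_max
sysctl net.core.rmem_default net.core.wmem_default
# TCP buffer ranges: min, default, max (bytes)
cat /proc/sys/net/ipv4/tcp_rmem /proc/sys/net/ipv4/tcp_wmem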
Hello Cuong--

On 9/23/2004 2:15, "smee" <snotmee@gmail.com> wrote:

> I noticed the lustre.pdf makes mention of lmc --mountfsoptions <option>.
> Is that applicable to Lustre 1.2.x only? I'm using 1.0 and it gives me
> an error about unrecognised --mountfsoptions.
> I was trying to see if I could pass it options like "noatime,nodirtime".

This option is supported by lmc in 1.2.x.

> Also, the same docs mention an lmc --add net --tcpbuf <size> option
> where the default size is 1MB, but when mounting the respective
> machines (mds, ost, client) in verbose mode, the debug output shows
> that send_mem and recv_mem are 8388608. So again, are the docs
> describing 1.2.x, or are they incorrect for this option?

This option is not yet supported.

It's important to realize that lustre.pdf often describes what we want, and not what is exactly implemented. There are several sections which describe features which are not yet built, or which are not included in our production releases.

-Phil
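For anyone already on 1.2.x who wants to try the option Phil mentions, the MDS line of a config script might look roughly like the following. This is a hypothetical fragment: the config file, node, label and device names are invented, only --mountfsoptions itself comes from the discussion above, and the exact syntax should be checked against the 1.2.x documentation.

# Hypothetical 1.2.x lmc fragment; config.xml, mds1 and /dev/hda3 are placeholders,
# and the usual --add node/net/lov/ost/mtpt lines of a full config script are omitted.
# Note: the ext3 mount option is spelled "nodiratime" (not "nodirtime").
lmc -m config.xml --add mds --node mds1 --mds mds1 --fstype ext3 \
    --dev /dev/hda3 --mountfsoptions "noatime,nodiratime"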
Hi,

I've got an MDS and several OSSs set up properly and am going through getting an idea of how to performance-tune things.

I've used iozone directly on each OSS itself, then compared that against iozone run on a Lustre mount. I'm getting poor performance from the Lustre mount compared to the raw readings from the drive itself. Even if I ignore the absolute figures, looking at the graphs plotted from the iozone readings, the graphs for writes look very different. On the raw drive graph you can see a uniform surface across the write-blocksize spectrum, whereas on the Lustre mount there are large valleys and troughs. This tells me that either the network is poorly tuned, or the Lustre component is poorly tuned, or a combination of both. When I say poorly tuned, I mean for my scenario, hardware-wise and environment-wise.

Sample of iozone output for the raw drive (file size in KB down the left, record size in KB across the top):

     KB       4       8      16      32      64     128     256     512    1024    2048    4096    8192   16384
     64   87066   90152   93167   85438   79692
    128   86192   98307  115324   95234   90582   83877
    256   90296   94847  103646   99259   96021   90811   84318
    512   90268   98840  102095   99668   92152   90762   91395   87268
   1024   90957   98775  100907   99242   92964   92044   91429   93133   91756
   2048   89859   98537  101552   99071   94121   92052   93251   92439   93192   91310
   4096   89362   99239  100857   98206   93614   92964   92571   92743   92739   92774   91310
   8192   88024   97479  101247   98256   92986   91263   91393   85121   92203   92530   92671   91435
  16384   89353   97054   99498   97740   92662   90705   91244   91147   90493   91031   91213   91273   90831
  32768   86754   92916   98786   96950   91569   89815   90547   90980   90949   92074   92071   92312   91819
  65536   89665   96852   99777   97932   92776   91595   91472   92044   91083   92173   92707   92505   92220
 131072   89967   97431   99183   98332   93049   91462   91087   92142   91978   92585   92550   92607   92601
 262144   86324   91915   96636   92946   88862   86880   87787   88801   89097   88852   88873   88662   89203

Sample of iozone output for the Lustre mount (same layout):

     KB       4       8      16      32      64     128     256     512    1024    2048    4096    8192   16384
     64   25974   26478   26891   27622   27245
    128   46060   47075   49325   49942   48339   47547
    256   63193   69698   72560   69699   72563   74943   69377
    512   89604   91904  101649  104959   97466   95150  100020   58274
   1024  108634  116973  123610  131516  120653  118463  125907  124031  118930
   2048   49844   51315   51991   54151   53158   52519   53311   52965   53704   53196
   4096   37631   38040   40073   38878   39707   38800   40329   39514   40555   39780   40255
   8192   33015   34176   35269   34960   35064   35048   35366   35718   35622   35759   35900   35795
  16384   31763   32851   33470   33434   33251   33886   33734   58253   37714   72057   34277   38210   54380
  32768   35487   57635  125766   92239   99827   90024  140139  145908  145645  149407  147299  148796  148365
  65536  127508  140998  153479  149216  146867  144750  147058  149344  150468  150944  151472  147819  150513
 131072  119698  135814  150307  148915  145023  143271  145509  148743  150515  151543  152708  154004  154315
 262144   47873   50045   51672   51218   51160   50838   50844   51929   51459   51596   51809   52068   51998

Are there any Lustre tunable variables that we can fiddle with to get better performance, or even just more uniform performance rather than large dips depending on block size, etc.?
I noticed, whilst going back over the mailing list archives, that there was mention of a few variables:

echo 0 > /proc/sys/portals/debug

# mds and client
for osc in /proc/fs/lustre/osc/OSC*/; do
    # increase number of concurrent requests to each OST
    if [ -f $osc/max_rpcs_in_flight ]; then
        echo 8 > $osc/max_rpcs_in_flight
    fi

    # increase client-side writeback cache
    if [ -f $osc/max_dirty_mb ]; then
        echo 64 > $osc/max_dirty_mb
    fi
done

# ost
for ost in /proc/fs/lustre/obdfilter/ost*/; do
    # only read-cache files up to 32MB, not larger files
    if [ -f $ost/readcache_max_filesize ]; then
        echo $((32*1024*1024)) > $ost/readcache_max_filesize
    fi
done

I'm just wondering if there are any more we can tweak?

Cheers,
Cuong.
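As a companion to the script above, a loop like the following prints the current value of each tunable it touches, using the same /proc paths; entries that don't exist on a given node are simply skipped.

# Show the current settings of the tunables used in the script above.
for f in /proc/sys/portals/debug \
         /proc/fs/lustre/osc/OSC*/max_rpcs_in_flight \
         /proc/fs/lustre/osc/OSC*/max_dirty_mb \
         /proc/fs/lustre/obdfilter/ost*/readcache_max_filesize; do
    [ -f "$f" ] && echo "$f = $(cat "$f")"
done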