Btrfs uses the expression below to calculate ra_pages:

    fs_info->bdi.ra_pages = max(fs_info->bdi.ra_pages,
                                4 * 1024 * 1024 / PAGE_CACHE_SIZE);

Is the max() a typo for min()? This makes the readahead size 4M by default,
which is too big.

I have a system with 16 CPUs, 6G of memory and 12 SATA disks. I created a
btrfs filesystem on each disk, so this isn't a RAID setup. The test is fio,
with 12 tasks accessing 12 files, one for each disk. The fio test is an mmap
sequential read. I measured the performance with different readahead sizes:

    ra size    io throughput
    4M         268288 k/s
    2M         367616 k/s
    1M         431104 k/s
    512K       474112 k/s
    256K       512000 k/s
    128K       538624 k/s

The 4M default readahead size performs poorly.

I also ran a sync sequential read test; the difference there isn't that big,
but the 4M case still shows about a 10% drop compared to the 512k case.

One might ask about the case where memory isn't tight. I tried a one-disk
setup with only one task. There, the 4M ra makes almost no difference
compared with the 128K ra. I guess the 128k default ra size for the backing
dev was carefully chosen to work with popular disks.

So my question is: why do we have a default 4M readahead size even in the
non-RAID case?

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
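As a rough illustration of the max() versus min() question above, the sketch below evaluates the quoted expression both ways. It is not the btrfs code itself: the function names are hypothetical, and the 4 KiB page size is an assumption, since the thread does not state the page size of the test machine.

```c
#include <assert.h>

/* Assumption for illustration only: 4 KiB pages. The thread does not
 * state the page size of the test machine. */
#define ASSUMED_PAGE_SIZE 4096UL

/* The 4M constant from the quoted expression, converted to pages. */
unsigned long ra_4m_in_pages(unsigned long page_size)
{
	assert(page_size > 0);
	return 4UL * 1024 * 1024 / page_size;
}

/* What the quoted expression does: readahead never drops below 4M. */
unsigned long ra_pages_max(unsigned long bdi_ra_pages, unsigned long page_size)
{
	unsigned long four_mb = ra_4m_in_pages(page_size);

	return bdi_ra_pages > four_mb ? bdi_ra_pages : four_mb;
}

/* The alternative being asked about: readahead never exceeds 4M. */
unsigned long ra_pages_min(unsigned long bdi_ra_pages, unsigned long page_size)
{
	unsigned long four_mb = ra_4m_in_pages(page_size);

	return bdi_ra_pages < four_mb ? bdi_ra_pages : four_mb;
}
```

Under the 4 KiB assumption, a 128K backing-device default is 32 pages and 4M is 1024 pages, so ra_pages_max(32, ASSUMED_PAGE_SIZE) gives 1024 while ra_pages_min(32, ASSUMED_PAGE_SIZE) leaves the value at 32.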
On Thu, Mar 18, 2010 at 09:42:57AM +0800, Shaohua Li wrote:
> Btrfs uses the expression below to calculate ra_pages:
>     fs_info->bdi.ra_pages = max(fs_info->bdi.ra_pages,
>                                 4 * 1024 * 1024 / PAGE_CACHE_SIZE);
> Is the max() a typo for min()? This makes the readahead size 4M by default,
> which is too big.

Looks like things have changed since I tuned that number. Fengguang has
been busy ;)

> I have a system with 16 CPUs, 6G of memory and 12 SATA disks. I created a
> btrfs filesystem on each disk, so this isn't a RAID setup. The test is fio,
> with 12 tasks accessing 12 files, one for each disk. The fio test is an mmap
> sequential read. I measured the performance with different readahead sizes:
>     ra size    io throughput
>     4M         268288 k/s
>     2M         367616 k/s
>     1M         431104 k/s
>     512K       474112 k/s
>     256K       512000 k/s
>     128K       538624 k/s
> The 4M default readahead size performs poorly.
> I also ran a sync sequential read test; the difference there isn't that big,
> but the 4M case still shows about a 10% drop compared to the 512k case.

I'm surprised the 4M is so much slower. At any rate, the larger size
was selected because btrfs checksumming means we need a bigger buffer to
keep the disks saturated. Were you on a fancy Intel box with hardware
crc32c enabled?

> One might ask about the case where memory isn't tight. I tried a one-disk
> setup with only one task. There, the 4M ra makes almost no difference
> compared with the 128K ra. I guess the 128k default ra size for the backing
> dev was carefully chosen to work with popular disks.
> So my question is: why do we have a default 4M readahead size even in the
> non-RAID case?

I'm happy to tune it down if lower numbers are more appropriate now,
thanks for trying this!

-chris
On Thu, Mar 18, 2010 at 08:53:13PM +0800, Chris Mason wrote:
> I'm surprised the 4M is so much slower. At any rate, the larger size
> was selected because btrfs checksumming means we need a bigger buffer to
> keep the disks saturated. Were you on a fancy Intel box with hardware
> crc32c enabled?

Yes, this machine supports the SSE4.2 instructions. Let me check the result
with checksumming disabled.

Thanks,
Shaohua
On Fri, Mar 19, 2010 at 08:59:48AM +0800, Shaohua Li wrote:
> Yes, this machine supports the SSE4.2 instructions. Let me check the result
> with checksumming disabled.

It looks like there is no big difference with checksumming disabled. I
reformatted the disks and reran the test:

    128k ra: 539648 k/s
    4m ra:   285696 k/s

Thanks,
Shaohua
On Fri, Mar 19 2010, Shaohua Li wrote:
> It looks like there is no big difference with checksumming disabled. I
> reformatted the disks and reran the test:
>     128k ra: 539648 k/s
>     4m ra:   285696 k/s

4MB is definitely a huge read-ahead size, but I do wonder why it would
perform that much worse than a 128KB window. If you narrow your test
down to a single disk (or something simpler, at least), how does 4MB
compare to 128KB? With 6GB of memory, you should not run into read-ahead
memory thrashing.

--
Jens Axboe
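The point about 6GB of memory can be checked with back-of-the-envelope arithmetic. The sketch below assumes exactly one readahead window per fio task and nothing else competing for memory. Both are simplifications not stated in the thread, and the function names are hypothetical.

```c
#include <assert.h>

/* Total readahead window memory in KiB, assuming one window per stream. */
unsigned long ra_total_kib(unsigned long streams, unsigned long window_kib)
{
	return streams * window_kib;
}

/* Share of system memory, in percent, taken by those windows. */
double ra_share_percent(unsigned long total_kib, unsigned long mem_kib)
{
	assert(mem_kib > 0);
	return 100.0 * (double)total_kib / (double)mem_kib;
}
```

For the 12-task run with 4M windows, ra_total_kib(12, 4096) is 49152 KiB (48 MiB), which is under 1% of 6 GiB (6291456 KiB). The kernel may keep more than one window's worth of pages in flight per stream, so treat this as a lower bound rather than a measurement. It only supports the expectation that window size alone should not exhaust memory here.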
On Fri, Mar 19, 2010 at 04:22:11PM +0800, Jens Axboe wrote:
> 4MB is definitely a huge read-ahead size, but I do wonder why it would
> perform that much worse than a 128KB window. If you narrow your test
> down to a single disk (or something simpler, at least), how does 4MB
> compare to 128KB? With 6GB of memory, you should not run into read-ahead
> memory thrashing.

Test data for a single disk (just one run so far):

    128k ra: 88513 k/s
    4m ra:   87630 k/s

So no big difference.

Thanks,
Shaohua
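To put the three 128K versus 4M comparisons reported so far on the same scale, the relative drop can be computed directly from the posted figures. This is a minimal sketch; the helper name is hypothetical.

```c
#include <assert.h>

/* Percentage by which `measured` falls short of `baseline`. */
double percent_drop(double baseline, double measured)
{
	assert(baseline > 0.0);
	return 100.0 * (baseline - measured) / baseline;
}
```

With the numbers from this thread, percent_drop(538624, 268288) is about 50% for the original 12-disk run, percent_drop(539648, 285696) is about 47% with checksumming disabled, and percent_drop(88513, 87630) is about 1% for the single-disk run.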
On Fri, Mar 19 2010, Shaohua Li wrote:
> Test data for a single disk (just one run so far):
>     128k ra: 88513 k/s
>     4m ra:   87630 k/s
> So no big difference.

That looks pretty much as expected. Unless you hit some sort of memory
thrashing, a huge read-ahead window should not cause a performance
degradation, at least not of this magnitude. I would expect performance
to reach a stable threshold once you have requests that are large enough
to utilize the full device bandwidth on their own, and then remain at
that plateau.

Any chance you could capture blktrace data for a run with 128KB and one
with 4MB so we could inspect the disk IO pattern?

--
Jens Axboe