Jewell, Paul
2019-Dec-12 15:52 UTC
[Gluster-users] Problems with gluster distributed mode and numpy memory mapped files
Hi All,

I am using gluster to share data between four development servers. They are just four machines, used primarily for compute, attached to a switch. Each one has a 10TB hardware RAID5 volume, for a total of about 40TB. Each machine is both a gluster client and a gluster server, mounted to "itself" (for example, machine1 mounts the gluster volume as machine1:gvfs). The volume is a standard "distributed" volume (just a JBOD of all the space, no replication). This works well in general.

Recently I have been developing an application for our research that makes use of numpy's "mmap" mode. A loop is iterated, and at the end of each iteration, each file in a list is opened and an array of data is written into one contiguous slice of that file, basically like this:

    for data in chunks:
        for i in range(10):
            new_coverage = np.load("coverage_file_%d" % i, mmap_mode='r+')
            new_coverage[(i * 10000):((i + 1) * 10000), :] = data

That is to say, a slice of about 50MB of data is written to each file, one at a time. On the first iteration the first ~50MB is written, on the second iteration the second ~50MB is written, and so on.

I have timed these iterations, and something very strange is happening. Here is example output from one iteration of the outer loop. The small numbers represent a reasonable amount of time to write this small amount of data; for some reason, on some files the write time is absolutely huge for the same amount of data:

Writing new coverage for majiq file 1 took 0.591965
Writing new coverage for majiq file 2 took 0.601540
Writing new coverage for majiq file 3 took 0.989093
Writing new coverage for majiq file 4 took 612.667724
Writing new coverage for majiq file 5 took 612.630379
Writing new coverage for majiq file 6 took 612.099524
Writing new coverage for majiq file 7 took 612.666857
Writing new coverage for majiq file 8 took 612.633965
Writing new coverage for majiq file 9 took 612.634483
Writing new coverage for majiq file 10 took 612.096660
Writing new coverage for majiq file 11 took 612.634655
Writing new coverage for majiq file 12 took 612.632943
Writing new coverage for majiq file 13 took 612.631088
Writing new coverage for majiq file 14 took 612.675435
Writing new coverage for majiq file 15 took 341.445473
Writing new coverage for majiq file 16 took 0.550073
Writing new coverage for majiq file 17 took 0.655051
Writing new coverage for majiq file 18 took 0.641738
Writing new coverage for majiq file 19 took 0.808618
Writing new coverage for majiq file 20 took 0.583891
Writing new coverage for majiq file 21 took 0.617392
Writing new coverage for majiq file 22 took 0.749011
Writing new coverage for majiq file 23 took 0.609862

At first I thought the fast files must be the ones sitting on the same disk as the server where the program is running, while the slow files had been distributed to the other machines, but it turns out that the slow group of files changes on every iteration of the outer loop. In top, the python process is stuck in D state throughout these writes.

I have tested my algorithm on my development machine, as well as on scratch space (a single SSD drive) on the same servers where I ran the very slow test above, and in both cases all of the writes take under 1 second.

It is an extremely strange problem that I have never seen before. Because all of the high values are almost exactly the same amount of time, it seems like some kind of timeout is happening. What would cause it to happen that often?

Thanks for reading!
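For reference, a minimal self-contained version of this write pattern looks roughly like the following. The file names, array sizes, and chunk counts here are placeholders rather than the real application's values; it only reproduces the open/memmap/write-one-slice-per-file pattern and times each write:

    # Rough, self-contained sketch of the write pattern described above.
    # File names, shapes, and counts are placeholders, not the real application's.
    import time
    import numpy as np

    n_files = 4            # placeholder; the real run writes 23 files per iteration
    n_chunks = 3           # placeholder number of outer iterations
    rows_per_chunk = 10000
    n_cols = 1250          # 10000 x 1250 float32 is ~50 MB per slice

    # Pre-create the .npy files so they can later be opened with mmap_mode='r+'.
    for f in range(n_files):
        np.save("coverage_file_%d.npy" % f,
                np.zeros((rows_per_chunk * n_chunks, n_cols), dtype=np.float32))

    for chunk_idx in range(n_chunks):
        data = np.random.rand(rows_per_chunk, n_cols).astype(np.float32)
        for f in range(n_files):
            start = time.time()
            # np.load with mmap_mode returns a numpy.memmap backed by the file.
            cov = np.load("coverage_file_%d.npy" % f, mmap_mode='r+')
            cov[chunk_idx * rows_per_chunk:(chunk_idx + 1) * rows_per_chunk, :] = data
            cov.flush()    # write the dirty pages back to the file
            del cov        # drop the mapping before moving on to the next file
            print("Writing new coverage for file %d took %f"
                  % (f + 1, time.time() - start))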
Jewell, Paul
2020-Jan-07 14:21 UTC
[Gluster-users] [External] Re: Problems with gluster distributed mode and numpy memory mapped files
Hi All,

Just as a follow-up on this: I changed the mount options line to look like:

    server:/gv0 /data glusterfs defaults,_netdev,direct-io-mode=disable 0 0

The change was adding "direct-io-mode=disable". After this and a complete reboot of the cluster, it appears the problem has been solved. So I would recommend using "direct-io-mode=disable" when working with numpy memory-mapped files.

Thanks,
-Paul
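For anyone who wants to test this before committing it to fstab, the same option can be passed to a one-off manual mount of the volume (using the same server/volume/mountpoint names as the fstab line above):

    mount -t glusterfs -o direct-io-mode=disable server:/gv0 /data

An already-mounted volume has to be unmounted and remounted (or, as above, the machines rebooted) for the new option to take effect.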