Jay Vyas
2013-May-31 15:05 UTC
[Gluster-users] local caching of file across global cluster
Is there any value in / a way to tell all the gluster nodes to make a file highly available, potentially at the cost of consistency (i.e. forget about locks for all files named XXXX and cache them on local disk)?

Scenario: Imagine I have a workflow that processes 1 million files, and I want to compare all 1 million files to all the words in, say, a set of ten files, each of which is 10MB.

It would be easy to cache those ten files (100MB of data) on every local gluster node, or even in memory for that matter.

Admittedly, I'm not an expert on disk caching, so maybe this is already done for us using heuristics, and it's just a matter of time before FUSE/the underlying filesystem/the Gluster mount figures out that a file is important and starts caching it in some magical sort of way.
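For concreteness, here is a minimal sketch of the kind of per-file job I have in mind (the paths, file layout, and word-matching logic here are hypothetical, just to make the access pattern concrete):

    # compare_one.py -- run once for each of the 1 million corpus files,
    # so every invocation re-opens the same ten reference files.
    import os
    import sys

    REFERENCE_DIR = "/mnt/gluster/reference"  # hypothetical mount path

    def load_reference_words():
        # Read the ~100MB reference set; this is the data we would like
        # served from a local cache on every node.
        words = set()
        for name in sorted(os.listdir(REFERENCE_DIR)):
            with open(os.path.join(REFERENCE_DIR, name)) as f:
                words.update(f.read().split())
        return words

    def main(corpus_file):
        words = load_reference_words()
        with open(corpus_file) as f:
            matches = sum(1 for w in f.read().split() if w in words)
        print(corpus_file, matches)

    if __name__ == "__main__":
        main(sys.argv[1])

--
Jay Vyas
http://jayunit100.blogspot.com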
Brian Foster
2013-May-31 20:37 UTC
[Gluster-users] local caching of file across global cluster
On 05/31/2013 11:05 AM, Jay Vyas wrote:
> Is there any value in / a way to tell all the gluster nodes to make a
> file highly available, potentially at the cost of consistency (i.e.
> forget about locks for all files named XXXX and cache them on local
> disk)?
>
> Scenario: Imagine I have a workflow that processes 1 million files,
> and I want to compare all 1 million files to all the words in, say, a
> set of ten files, each of which is 10MB.
>
> It would be easy to cache those ten files (100MB of data) on every
> local gluster node, or even in memory for that matter.
>
> Admittedly, I'm not an expert on disk caching, so maybe this is
> already done for us using heuristics, and it's just a matter of time
> before FUSE/the underlying filesystem/the Gluster mount figures out
> that a file is important and starts caching it in some magical sort
> of way.

I'm assuming that by "gluster node" you're referring to a gluster client. Given that, fuse already does this kind of read-only caching. The caveat to be aware of is that the default behavior includes an invalidate-on-open heuristic. So if your implementation involves repeated open() calls for your 10x10MB files (i.e., running a script for every source file you're checking against), you could be repeatedly reading/caching and then flushing the very data you want to retain. In that case, you might want to try the --fopen-keep-cache glusterfs (mount) option to bypass that behavior.

I suppose the subsequent question is whether the reads of the 1 million files push out that other 100MB, but I _think_ this is something the VM should get right over time (i.e., via repeated accesses of that 100MB set). That's probably something that warrants experimentation to verify, though.
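For reference, the option can be passed either through mount(8) or directly to the glusterfs client; something like the following (server, volume name, and mount point are placeholders, and this sketch is untested):

    mount -t glusterfs -o fopen-keep-cache server1:/myvol /mnt/gluster

    glusterfs --fopen-keep-cache --volfile-server=server1 \
        --volfile-id=myvol /mnt/gluster

With that option set, the page cache for a file is kept across open() calls rather than invalidated, so the ten reference files should stay cached between successive runs of your per-file script.

Brian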