Pranas Baliuka
2009-Aug-10 04:11 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
Dear Lustre experts/users,

I am looking for the optimal solution to this task:

Internet-scale applications must be designed to process high volumes of transactions. Describe a design for a system that must process on average 30,000 HTTP requests per second. For each request, the system must perform a lookup into a dictionary of 50 million words, using a key word passed in via the URL query string. Each response will consist of a string containing the definition of the word (10 KB or less).

My initial thought was using MySQL/Berkeley DB pointing to a SAN, but a lower-level solution would probably be more affordable. Could I use e.g. QFS storage via Java instead, without a DB server? Can the SAN be avoided and local HDDs joined into a Lustre system?

The task is hypothetical, but it would be nice to get feedback from experts in the specific technologies... Some ideas ;)

I've sent a similar request to the QFS forum and am really not sure which product would fit better. Both work as distributed file systems, and both sound like convenient storage for this particular task.

Thanks,
Pranas
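(For scale, a rough sizing follows directly from the figures above: 50,000,000 definitions x 10 KB gives about 500 GB of definition data -- the figure cited later in the thread -- while the key words alone, at roughly 10 bytes each, fit in about 500 MB; and 30,000 req/s x 10 KB per response implies a peak of roughly 300 MB/s of outbound bandwidth.)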
Jim McCusker
2009-Aug-11 15:05 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
We have had good performance using Lucene as a search engine in Java, backed by Lustre (mentioned in a previous email):

http://krauthammerlab.med.yale.edu/imagefinder

The images are in a hashed directory structure that provides O(1) access to the image file contents, and the search engine in turn serves as a flexible hash table that provides O(1) per-search-term access to keywords, metadata, and full text.

Lucene is available at http://lucene.apache.org and is a joy to work with.

Jim

On Mon, Aug 10, 2009 at 12:11 AM, Pranas Baliuka <pranas at orangecap.net> wrote:
> [snip: original request quoted in full above]

--
Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine
james.mccusker at yale.edu | (203) 785-6330
http://krauthammerlab.med.yale.edu
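A minimal sketch of the hashed-directory layout Jim describes, assuming an MD5 hash and a two-level hex fan-out; the class and method names are illustrative, not taken from the ImageFinder code:

    import java.io.File;
    import java.security.MessageDigest;

    public class HashedStore {
        // Map a key (an image ID, or a dictionary word) to a path such as
        // <root>/ab/cd/abcdef1234...  Two hex levels give 65,536 buckets,
        // which keeps every directory small while lookup stays O(1).
        public static File pathFor(File root, String key) {
            String hex = md5Hex(key);
            return new File(root, hex.substring(0, 2) + File.separator
                    + hex.substring(2, 4) + File.separator + hex);
        }

        private static String md5Hex(String s) {
            try {
                byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
                StringBuilder sb = new StringBuilder();
                for (byte b : d) sb.append(String.format("%02x", b));
                return sb.toString();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }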
David Pratt
2009-Aug-11 17:14 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
Hi Jim. That is pretty cool. I see there are more than 300,000 records at present. I am curious how this will work at much larger scale, since the RAM required to perform a search goes up substantially with Lucene as the number of documents grows. I have tended to look at sharding and parallel multisearch as a means of horizontally scaling Lucene by breaking the index into chunks. This approach is interesting, and I am just interested in how you anticipate scale and performance with document growth. Many thanks.

Regards,
David

On 11-Aug-09, at 12:05 PM, Jim McCusker wrote:
> [snip: Jim's reply, including the original request, quoted in full above]
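For reference, a sketch of the sharded parallel multisearch David mentions, assuming the Lucene 2.9-era API (ParallelMultiSearcher was the class for this at the time); the index paths and field name are illustrative:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;

    public class ShardedSearch {
        public static void main(String[] args) throws Exception {
            // One IndexSearcher per shard; the shards could live in
            // different Lustre directories or on different nodes.
            Searchable[] shards = new Searchable[] {
                new IndexSearcher(FSDirectory.open(new File("/lustre/index/shard0"))),
                new IndexSearcher(FSDirectory.open(new File("/lustre/index/shard1"))),
            };
            // ParallelMultiSearcher queries all shards concurrently and
            // merges the ranked results into one hit list.
            ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
            Query q = new QueryParser("text", new StandardAnalyzer()).parse(args[0]);
            TopDocs hits = searcher.search(q, 10);
            System.out.println(hits.totalHits + " hits across all shards");
        }
    }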
Jim McCusker
2009-Aug-11 17:40 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
On Tue, Aug 11, 2009 at 1:14 PM, David Pratt <fairwinds.dp at gmail.com> wrote:
> [snip: question about RAM requirements and sharding, quoted in full above]

We haven't had significant RAM requirements with the number of documents we have at the moment. Nutch is a more complete search solution that has support for parallel search, and I imagine there are other good ways of doing parallel search. Back when JXTA was still something, I used it to create parallel distributed search across people's desktops, with pretty good results. Combining the search results can end up taking some work, though.

Jim
--
Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine
james.mccusker at yale.edu | (203) 785-6330
http://krauthammerlab.med.yale.edu
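The result-combining work Jim mentions is essentially a k-way merge of per-shard hit lists. A minimal sketch in plain Java (the Hit record is hypothetical), with the caveat that scores from separate Lucene shards are only directly comparable if the shards share term statistics:

    import java.util.*;

    public class ResultMerger {
        public static class Hit {
            final String docId;
            final float score;
            Hit(String docId, float score) { this.docId = docId; this.score = score; }
        }

        // Merge per-shard result lists into a single top-n list using a
        // min-heap of size n, so memory stays O(n) regardless of shard count.
        public static List<Hit> mergeTopN(List<List<Hit>> perShard, int n) {
            PriorityQueue<Hit> heap = new PriorityQueue<Hit>(n, new Comparator<Hit>() {
                public int compare(Hit a, Hit b) { return Float.compare(a.score, b.score); }
            });
            for (List<Hit> shard : perShard)
                for (Hit h : shard) {
                    heap.offer(h);
                    if (heap.size() > n) heap.poll(); // evict the lowest score
                }
            List<Hit> top = new ArrayList<Hit>(heap);
            Collections.sort(top, new Comparator<Hit>() {
                public int compare(Hit a, Hit b) { return Float.compare(b.score, a.score); }
            });
            return top;
        }
    }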
David Pratt
2009-Aug-11 18:16 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
Hi Jim. Sure. Is the reason you are doing this on Lustre the fact that you already had a large clustered filesystem to work from, or is your Lustre cluster dedicated to your image project? I have been investigating Lustre as a smallish-scale filesystem to serve as a storage pool for virtual machines, for scalable storage of 10 TB+. Your use of Lustre is interesting to me since I also use Lucene, and a good amount of the data in the virtual machine disk images I will be storing is index data that I will be running parallel searches across.

Regards,
David

On 11-Aug-09, at 2:40 PM, Jim McCusker wrote:
> [snip: Jim's reply quoted in full above]
Jim McCusker
2009-Aug-11 18:23 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
On Tue, Aug 11, 2009 at 2:16 PM, David Pratt <fairwinds.dp at gmail.com> wrote:
> [snip: question about the cluster's purpose, quoted in full above]

We set up the cluster for general use by the lab. It supports a compute cluster, acts as a generic file server for collaborators, and serves home directories for all the members of our lab. It made the project possible and easy, but it wasn't set up specifically for it.

Jim
--
Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine
james.mccusker at yale.edu | (203) 785-6330
http://krauthammerlab.med.yale.edu
tao.a.wu at nokia.com
2009-Aug-12 14:17 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
If you serve 30,000 req/s, you will likely need a distributed in-memory cache. Things like Terracotta or Coherence may work well for your dataset (500 GB), although I haven't used either.

-Tao

________________________________
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of ext Pranas Baliuka
Sent: Monday, August 10, 2009 12:11 AM
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] Distributed Object storage lookup of small files

[snip: original request quoted in full above]
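A sketch of the cache-aside pattern Tao is suggesting. A ConcurrentHashMap stands in for the distributed cache (Terracotta and Coherence each have their own APIs), and the Store interface is hypothetical; the point is only the lookup flow:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class DefinitionService {
        /** Backing store: a database, or definition files on Lustre. */
        public interface Store { String fetchDefinition(String word); }

        // Stand-in for a distributed cache; with Terracotta or Coherence
        // this map would be shared across the web-server fleet.
        private final Map<String, String> cache = new ConcurrentHashMap<String, String>();
        private final Store store;

        public DefinitionService(Store store) { this.store = store; }

        public String lookup(String word) {
            String def = cache.get(word);              // 1. try the cache
            if (def == null) {
                def = store.fetchDefinition(word);     // 2. on a miss, hit the store
                if (def != null) cache.put(word, def); // 3. populate for next time
            }
            return def;
        }
    }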
Peter Grandi
2009-Aug-15 19:03 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
From Pranas Baliuka:

pranas> [snip: original request quoted in full above]

This looks to me like an attempt at cheating on a university assignment or a job interview challenge. Especially given that something like Lustre looks like ridiculous overkill for such a task (50M words, each on average 7-10 chars long => only a ~500 MB key table, and read-only too), posting the question here makes little sense.

But then I have seen a number of ignoramuses happy to use filesystems instead of in-core or storage databases (typical questions about having several million or a hundred million files, almost all of them less than a block long, or directories with hundreds of thousands or millions of files, especially on the XFS mailing list).
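For what it's worth, a sketch of the in-core approach Peter alludes to: hold the ~500 MB of keys in memory and map each word to an offset and length in one packed, read-only definitions file, instead of creating 50 million tiny files. All names and the file format here are illustrative:

    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;

    public class InCoreDictionary {
        // word -> {offset, length} into one packed definitions file. The
        // keys themselves are only ~500 MB; the ~500 GB of definitions
        // stay on disk and are read on demand.
        private final Map<String, long[]> index = new HashMap<String, long[]>();
        private final RandomAccessFile definitions;

        public InCoreDictionary(RandomAccessFile definitions) {
            this.definitions = definitions;
        }

        public void put(String word, long offset, long length) {
            index.put(word, new long[] { offset, length });
        }

        // synchronized because RandomAccessFile's seek/read pair is not
        // thread-safe; a real server would pool file handles instead.
        public synchronized String lookup(String word) throws Exception {
            long[] loc = index.get(word);
            if (loc == null) return null;
            byte[] buf = new byte[(int) loc[1]];
            definitions.seek(loc[0]);
            definitions.readFully(buf);
            return new String(buf, "UTF-8");
        }
    }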