Pranas Baliuka
2009-Aug-10 04:11 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
Dear Lustre experts/users,

I am looking for the optimal solution to this task:

Internet-scale applications must be designed to process high volumes of transactions. Describe a design for a system that must process on average 30,000 HTTP requests per second. For each request, the system must perform a lookup into a dictionary of 50 million words, using a key word passed in via the URL query string. Each response will consist of a string containing the definition of the word (10 KB or less).

My initial thought was using MySQL/Berkeley DB pointing to a SAN, but a lower-level solution would probably be more affordable. Could I use e.g. QFS storage via Java instead, without a DB server? Can the SAN be avoided and local HDDs joined into a Lustre system?

The task is hypothetical, but it would be nice to get feedback from experts in the specific technologies... Some ideas ;)

I've sent a similar request to the QFS forum and am really not sure which product would fit better. Both work as distributed file systems, and both sound like convenient storage for this particular task.

Thanks,
Pranas
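(For scale, a rough sizing follows directly from the figures above: 50,000,000 definitions x 10 KB gives about 500 GB of definition data -- the figure cited later in the thread -- while the key words alone, at roughly 10 bytes each, fit in about 500 MB; and 30,000 req/s x 10 KB per response implies a peak of roughly 300 MB/s of outbound bandwidth.)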
Jim McCusker
2009-Aug-11 15:05 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
We have had good performance using Lucene as a search engine in Java, backed by Lustre (mentioned in a previous email):

http://krauthammerlab.med.yale.edu/imagefinder

The images are in a hashed directory structure that provides O(1) access to the image file contents, and the search engine in turn serves as a flexible hash table that provides O(1) per-search-term access to keywords, metadata, and full text.

Lucene is available at http://lucene.apache.org and is a joy to work with.

Jim

On Mon, Aug 10, 2009 at 12:11 AM, Pranas Baliuka <pranas at orangecap.net> wrote:
> [snip: original request quoted in full above]

--
Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine
james.mccusker at yale.edu | (203) 785-6330
http://krauthammerlab.med.yale.edu
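A minimal sketch of the hashed-directory layout Jim describes, assuming an MD5 hash and a two-level hex fan-out; the class and method names are illustrative, not taken from the ImageFinder code:

    import java.io.File;
    import java.security.MessageDigest;

    public class HashedStore {
        // Map a key (an image ID, or a dictionary word) to a path such as
        // <root>/ab/cd/abcdef1234...  Two hex levels give 65,536 buckets,
        // which keeps every directory small while lookup stays O(1).
        public static File pathFor(File root, String key) {
            String hex = md5Hex(key);
            return new File(root, hex.substring(0, 2) + File.separator
                    + hex.substring(2, 4) + File.separator + hex);
        }

        private static String md5Hex(String s) {
            try {
                byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
                StringBuilder sb = new StringBuilder();
                for (byte b : d) sb.append(String.format("%02x", b));
                return sb.toString();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }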
David Pratt
2009-Aug-11 17:14 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
Hi Jim. That is pretty cool. I see there are more than 300,000 records at present. I am curious how this will work at much larger scale, since the RAM required to perform a search goes up substantially with Lucene as the number of documents grows. I have tended to look at sharding and parallel multisearch as a means of horizontally scaling Lucene by breaking the index into chunks. This approach is interesting, and I am just interested in how you anticipate scale and performance with document growth. Many thanks.

Regards,
David

On 11-Aug-09, at 12:05 PM, Jim McCusker wrote:
> [snip: Jim's reply, including the original request, quoted in full above]
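For reference, a sketch of the sharded parallel multisearch David mentions, assuming the Lucene 2.9-era API (ParallelMultiSearcher was the class for this at the time); the index paths and field name are illustrative:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;

    public class ShardedSearch {
        public static void main(String[] args) throws Exception {
            // One IndexSearcher per shard; the shards could live in
            // different Lustre directories or on different nodes.
            Searchable[] shards = new Searchable[] {
                new IndexSearcher(FSDirectory.open(new File("/lustre/index/shard0"))),
                new IndexSearcher(FSDirectory.open(new File("/lustre/index/shard1"))),
            };
            // ParallelMultiSearcher queries all shards concurrently and
            // merges the ranked results into one hit list.
            ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
            Query q = new QueryParser("text", new StandardAnalyzer()).parse(args[0]);
            TopDocs hits = searcher.search(q, 10);
            System.out.println(hits.totalHits + " hits across all shards");
        }
    }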
Jim McCusker
2009-Aug-11 17:40 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
On Tue, Aug 11, 2009 at 1:14 PM, David Pratt <fairwinds.dp at gmail.com> wrote:
> [snip: question about RAM requirements and sharding, quoted in full above]

We haven't had significant RAM requirements with the number of documents we have at the moment. Nutch is a more complete search solution that has support for parallel search, and I imagine there are other good ways of doing parallel search. Back when JXTA was still something, I used it to create parallel distributed search across people's desktops, with pretty good results. Combining the search results can end up taking some work, though.

Jim
--
Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine
james.mccusker at yale.edu | (203) 785-6330
http://krauthammerlab.med.yale.edu
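The result-combining work Jim mentions is essentially a k-way merge of per-shard hit lists. A minimal sketch in plain Java (the Hit record is hypothetical), with the caveat that scores from separate Lucene shards are only directly comparable if the shards share term statistics:

    import java.util.*;

    public class ResultMerger {
        public static class Hit {
            final String docId;
            final float score;
            Hit(String docId, float score) { this.docId = docId; this.score = score; }
        }

        // Merge per-shard result lists into a single top-n list using a
        // min-heap of size n, so memory stays O(n) regardless of shard count.
        public static List<Hit> mergeTopN(List<List<Hit>> perShard, int n) {
            PriorityQueue<Hit> heap = new PriorityQueue<Hit>(n, new Comparator<Hit>() {
                public int compare(Hit a, Hit b) { return Float.compare(a.score, b.score); }
            });
            for (List<Hit> shard : perShard)
                for (Hit h : shard) {
                    heap.offer(h);
                    if (heap.size() > n) heap.poll(); // evict the lowest score
                }
            List<Hit> top = new ArrayList<Hit>(heap);
            Collections.sort(top, new Comparator<Hit>() {
                public int compare(Hit a, Hit b) { return Float.compare(b.score, a.score); }
            });
            return top;
        }
    }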
David Pratt
2009-Aug-11 18:16 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
Hi Jim. Sure. Is the reason you are doing this on Lustre the fact that you already had a large clustered filesystem to work from, or is your Lustre cluster dedicated to your image project? I have been investigating Lustre as a smallish-scale filesystem to serve as a storage pool for virtual machines, for scalable storage of 10 TB+. Your use of Lustre is interesting to me since I also use Lucene, and a good amount of the data in the virtual machine disk images I will be storing is index data that I will be running parallel searches across.

Regards,
David

On 11-Aug-09, at 2:40 PM, Jim McCusker wrote:
> [snip: Jim's reply quoted in full above]
Jim McCusker
2009-Aug-11 18:23 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
On Tue, Aug 11, 2009 at 2:16 PM, David Pratt <fairwinds.dp at gmail.com> wrote:
> [snip: question about the cluster's purpose, quoted in full above]

We set up the cluster for general use by the lab. It supports a compute cluster, acts as a generic file server for collaborators, and serves home directories for all the members of our lab. It made the project possible and easy, but it wasn't set up specifically for it.

Jim
--
Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine
james.mccusker at yale.edu | (203) 785-6330
http://krauthammerlab.med.yale.edu
tao.a.wu at nokia.com
2009-Aug-12 14:17 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
If you serve 30,000 req/s, you will likely need a distributed in-memory cache. Things like Terracotta or Coherence may work well for your dataset (500 GB), although I haven't used either.

-Tao

________________________________
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of ext Pranas Baliuka
Sent: Monday, August 10, 2009 12:11 AM
To: lustre-discuss at lists.lustre.org
Subject: [Lustre-discuss] Distributed Object storage lookup of small files

[snip: original request quoted in full above]
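A sketch of the cache-aside pattern Tao is suggesting. A ConcurrentHashMap stands in for the distributed cache (Terracotta and Coherence each have their own APIs), and the Store interface is hypothetical; the point is only the lookup flow:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class DefinitionService {
        /** Backing store: a database, or definition files on Lustre. */
        public interface Store { String fetchDefinition(String word); }

        // Stand-in for a distributed cache; with Terracotta or Coherence
        // this map would be shared across the web-server fleet.
        private final Map<String, String> cache = new ConcurrentHashMap<String, String>();
        private final Store store;

        public DefinitionService(Store store) { this.store = store; }

        public String lookup(String word) {
            String def = cache.get(word);              // 1. try the cache
            if (def == null) {
                def = store.fetchDefinition(word);     // 2. on a miss, hit the store
                if (def != null) cache.put(word, def); // 3. populate for next time
            }
            return def;
        }
    }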
Peter Grandi
2009-Aug-15 19:03 UTC
[Lustre-discuss] Distributed Object storage lookup of small files
From Pranas Baliuka:

pranas> [snip: original request quoted in full above]

This looks to me like an attempt at cheating on a university assignment or a job interview challenge. Especially given that something like Lustre looks like ridiculous overkill for such a task (50M words, each on average 7-10 chars long => only a ~500 MB key table, and read-only too), posting the question here makes little sense.

But then I have seen a number of ignoramuses happy to use filesystems instead of in-core or storage databases (typical questions about having several million or a hundred million files, almost all of them less than a block long, or directories with hundreds of thousands or millions of files, especially on the XFS mailing list).
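For what it's worth, a sketch of the in-core approach Peter alludes to: hold the ~500 MB of keys in memory and map each word to an offset and length in one packed, read-only definitions file, instead of creating 50 million tiny files. All names and the file format here are illustrative:

    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;

    public class InCoreDictionary {
        // word -> {offset, length} into one packed definitions file. The
        // keys themselves are only ~500 MB; the ~500 GB of definitions
        // stay on disk and are read on demand.
        private final Map<String, long[]> index = new HashMap<String, long[]>();
        private final RandomAccessFile definitions;

        public InCoreDictionary(RandomAccessFile definitions) {
            this.definitions = definitions;
        }

        public void put(String word, long offset, long length) {
            index.put(word, new long[] { offset, length });
        }

        // synchronized because RandomAccessFile's seek/read pair is not
        // thread-safe; a real server would pool file handles instead.
        public synchronized String lookup(String word) throws Exception {
            long[] loc = index.get(word);
            if (loc == null) return null;
            byte[] buf = new byte[(int) loc[1]];
            definitions.seek(loc[0]);
            definitions.readFully(buf);
            return new String(buf, "UTF-8");
        }
    }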