thr3ads.net - Xapian discuss - [Xapian-discuss] Tag-based filesystem with xapian, advice? [Feb 2009]

Karel Marissens

2009-Feb-28 21:21 UTC

[Xapian-discuss] Tag-based filesystem with xapian, advice?

Hi.

For my thesis, I'm working on a combination of a hierarchical and
tag-based filesystem written in Python. I'm using FUSE to "write" the
filesystem. Now I am thinking about using Xapian but could use some
advice.

Before I go into my questions, I'll explain the idea of the system
(questions after the horizontal line). The idea is that there's a
(hidden) directory using a hierarchical filesystem. I then virtually
replicate this directory at a logical place (say, the home folder of a
user) using FUSE. It can be used exactly as a user normally would,
but I add extra functionality: tags. Every file can have several
tags (keywords) associated with it. An image of the Christmas tree,
for example, might be hierarchically located in
/photo's/2008/christmas and be tagged as "tree, christmas, photo,
2008".

How tags are added to files etc. is not important here. What is
important is that the association of a file with several tags will be
saved in a database, as this information needs to be searchable.

Every directory in the hierarchy will have a special folder: +FIND.
When one goes to /photo's/2008/+FIND, all tags associated with files
in the directory /photo's/2008, or any of its subdirectories, will be
visible as subdirectories. Opening such a subdirectory shows a list of
tags (in the form of subdirectories) that can be combined with it.
So /photo's/2008/+FIND/christmas will show all tags
associated with files in the directory /photo's/2008, or any of its
subdirectories, which are tagged as christmas.

At any moment, the user can go into the special subdirectory +FILES to
see a list of all the files that match the selection.
/photo's/2008/+FIND/christmas/tree/+FILES will thus show all files in
/photo's/2008, or any of its subdirectories, which are tagged as
christmas and tree.
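
As an illustration of the layout described above, here is a minimal Python sketch of how a virtual path could be split into the base directory, the selected tags, and a flag saying whether the file list is wanted. The function name `parse_virtual_path` and the returned tuple are hypothetical and only serve the example; nothing here depends on FUSE or Xapian.

```python
FIND = "+FIND"    # special folder that starts a tag selection
FILES = "+FILES"  # special folder that lists the matching files


def parse_virtual_path(path):
    """Split a virtual path into (base_dir, tags, list_files).

    base_dir   -- the real directory the selection is restricted to
    tags       -- the tags chosen after +FIND, in order
    list_files -- True if the path ends in +FILES
    """
    parts = [p for p in path.split("/") if p]
    if FIND not in parts:
        # An ordinary path: no tag selection at all.
        return "/" + "/".join(parts), [], False
    i = parts.index(FIND)
    base_dir = "/" + "/".join(parts[:i])
    rest = parts[i + 1:]
    list_files = bool(rest) and rest[-1] == FILES
    if list_files:
        rest = rest[:-1]
    return base_dir, rest, list_files
```

For example, `parse_virtual_path("/photo's/2008/+FIND/christmas/tree/+FILES")` yields the base directory `/photo's/2008`, the tags `christmas` and `tree`, and a true file-list flag.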

----------------------------------------------------------------------------------------------------

So, as I was searching for the best way to save all the needed
information in a database and retrieve it, I stumbled upon Xapian. I
read the information pages and the whole API, looked at the few
examples I could find and did some small tests. I will only use
boolean search functionality as I have no need to "guess" which file
is most relevant, I just need to show them all.

Now my first question is: what is the path of a file? The content of a
document? A term? A value? I need to be able to use the path when
searching as I need to be able to limit the file results to files in a
certain directory. Thus, for example, only files that have a path of
/photo's/2008/*. Or do I have to work with a relevance set or something?

I tried using the path as a tag itself, but when I do a query for
"/photo's/2008/*", it is automatically split into 2 separate terms, I
think? (A file tagged as 2008 also showed up, for example.)

My second question is: what is the easiest way to get a list of all the
tags associated with files in the result set? I want to have a list of
all tags associated with files in /photo's/2008. One method would be
to do a search for all files in /photo's/2008, or any subdirectory,
loop over all the results, and for each document, loop over the terms
associated with it and add these to a list.

My third question is: how can I get ALL results? get_mset() requires a
maximum number of results. Do I just set it to an extremely big number
and treat it as a safety limit that shouldn't be reached?

----------------------------------------------------------------------------------------------------

To sum it all up:
1) Where do I store the path of a file?
2) How do I get a list of all terms associated with documents in the
result set?
3) How do I get ALL results, not a limited number?

Thanks in advance for any advice!

Karel

Olly Betts

2009-Mar-02 05:27 UTC

[Xapian-discuss] Tag-based filesystem with xapian, advice?

On Sat, Feb 28, 2009 at 10:21:15PM +0100, Karel Marissens wrote:
> Now my first question is: what is the path of a file? The content of a
> document? A term? A value? I need to be able to use the path when
> searching as I need to be able to limit the file results to files in a
> certain directory. Thus, for example, only files that have a path of
> /photo's/2008/*. Or do I have to work with a relevance set or something?

I would put the path in the document data for reading when you get
results, and also index all the directories which the file is in as
terms (e.g. P/photo's and P/photo's/2008 for a file in
/photo's/2008).
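
A minimal Python sketch of generating those directory terms, assuming POSIX-style paths; the helper name `directory_terms` is hypothetical and the code does not touch Xapian itself:

```python
import posixpath

PATH_PREFIX = "P"  # term prefix for directory terms, as in the example above


def directory_terms(file_path):
    """Return one P-prefixed term per ancestor directory of file_path.

    The root directory itself gets no term, since every file is under it.
    """
    terms = []
    directory = posixpath.dirname(posixpath.normpath(file_path))
    while directory not in ("/", ""):
        terms.append(PATH_PREFIX + directory)
        directory = posixpath.dirname(directory)
    # Walked from the deepest directory upwards; return shallowest first.
    return list(reversed(terms))
```

Each returned string would then be added to the file's document as a term (for instance with `Document.add_term()`), and the full path stored with `Document.set_data()`.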

> I tried using the path as a tag itself, but when I do a query for
> "/photo's/2008/*", it is automatically split into 2 separate terms, I
> think? (A file tagged as 2008 also showed up, for example.)

I don't think you want to use QueryParser here - just build your Query
objects up by hand.

If you want to allow "free text queries" after +FIND, then you can parse
that part with QueryParser and then filter the result using the
appropriate "P"-prefixed term, e.g. in C++:

    Xapian::Query q = qp.parse_query(query_string);
    q = Xapian::Query(Xapian::Query::OP_FILTER, q, Xapian::Query("P/photo's"));

> My second question is: what is the easiest way to get a list of all the
> tags associated with files in the result set? I want to have a list of
> all tags associated with files in /photo's/2008. One method would be
> to do a search for all files in /photo's/2008, or any subdirectory,
> loop over all the results, and for each document, loop over the terms
> associated with it and add these to a list.

You can add all documents in the MSet to an RSet and use
Enquire::get_eset() to get a set of all the terms in all the documents.
That's not so different to what you describe, though Xapian does most
of the work for you, including eliminating duplicates.

If you just want the "tag" terms (and not P/photo's, etc), you can use
an "ExpandDecider" to only pick out those.
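
To make the filtering idea concrete, here is a small Python sketch of the manual approach from the question (union of terms over all matching documents) combined with the kind of test an ExpandDecider would apply. It works on plain lists of term strings rather than Xapian objects, and it assumes tags are indexed with a "K" prefix, which is an assumption for this example and not something stated in this thread; `is_tag_term` and `collect_tags` are hypothetical names.

```python
PATH_PREFIX = "P"  # directory terms, e.g. "P/photo's/2008"
TAG_PREFIX = "K"   # assumed prefix for tag terms, e.g. "Kchristmas"


def is_tag_term(term):
    """The decision an ExpandDecider would make: keep tag terms only."""
    return term.startswith(TAG_PREFIX)


def collect_tags(term_lists):
    """Return the sorted, de-duplicated tag names found in term_lists.

    term_lists holds one list of term strings per matching document.
    """
    tags = set()
    for terms in term_lists:
        for term in terms:
            if is_tag_term(term):
                tags.add(term[len(TAG_PREFIX):])  # strip the prefix
    return sorted(tags)
```

With the RSet/get_eset() route, the same `is_tag_term` check could presumably be called from a subclass of `xapian.ExpandDecider`, leaving the looping and de-duplication to Xapian.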

I'd suggest for efficiency that you might want to consider adding a
special case for "/" and use Database::allterms_begin() to iterate over
all the terms in the database.

> My third question is: how can I get ALL results? get_mset() requires a
> maximum number of results. Do I just set it to an extremely big number
> and treat it as a safety limit that shouldn't be reached?

If you can handle result sets of any size, just pass db.get_doccount() -
there can't be more matching documents than there are documents in the
database.
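
In the Python bindings that could look roughly like the sketch below. The wrapper name `all_matches` is hypothetical; it only relies on the `get_mset(first, maxitems)` and `get_doccount()` methods mentioned above, so it takes the enquire and database objects as arguments instead of importing Xapian itself.

```python
def all_matches(enquire, db):
    """Return an MSet holding every match for the query set on enquire.

    Start at offset 0 and allow as many results as there are documents
    in the database, which is an upper bound on the number of matches.
    """
    return enquire.get_mset(0, db.get_doccount())
```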

I'll add a note to the documentation comment for Enquire::get_mset() as
this has been asked a few times before.

Cheers,
    Olly
