Karel Marissens
2009-Feb-28 21:21 UTC
[Xapian-discuss] Tag-based filesystem with xapian, advice?
Hi.

For my thesis, I'm working on a combination of a hierarchical and tag-based filesystem written in Python. I'm using FUSE to "write" the filesystem. Now I am thinking about using Xapian but could use some advice. Before I go into my questions, I'll explain the idea of the system (questions after the horizontal line).

The idea is that there's a (hidden) directory using a hierarchical filesystem. I then virtually replicate this directory at a logical place (say, the home folder of a user) using FUSE. It can be used exactly as a user normally would, but I add extra functionality: tags. Every file can have several tags (keywords) associated with it. An image of the Christmas tree, for example, might be hierarchically located in /photo's/2008/christmas and be tagged as "tree, christmas, photo, 2008". How tags are added to files etc. is not important here. What is important is that the association of a file with its tags will be saved in a database, as this information needs to be searchable.

Every directory in the hierarchy will have a special folder: +FIND. When one goes to /photo's/2008/+FIND, all tags associated with files in the directory /photo's/2008, or any of its subdirectories, will be visible as subdirectories. Opening such a subdirectory shows a list of the tags (again in the form of subdirectories) that can be combined with it. So /photo's/2008/+FIND/christmas will show all tags associated with files in the directory /photo's/2008, or any of its subdirectories, which are tagged as christmas.

At any moment, the user can go into the special subdirectory +FILES to see a list of all the files that match the selection. /photo's/2008/+FIND/christmas/tree/+FILES will thus show all files in /photo's/2008, or any of its subdirectories, which are tagged as christmas and tree.
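As an illustration of the path scheme described above, here is a minimal Python sketch of how a virtual path could be split into its base directory, the selected tags, and a flag saying whether +FILES was requested. The function name `parse_virtual_path` and its return shape are assumptions made for this sketch, not part of the original design.

```python
FIND = "+FIND"
FILES = "+FILES"


def parse_virtual_path(path):
    """Split a virtual path into (base_directory, tags, want_files).

    Returns None for an ordinary path that contains no +FIND component.
    """
    parts = [p for p in path.split("/") if p]
    if FIND not in parts:
        return None
    i = parts.index(FIND)
    base = "/" + "/".join(parts[:i])      # "/" when +FIND sits at the root
    rest = parts[i + 1:]
    want_files = bool(rest) and rest[-1] == FILES
    if want_files:
        rest = rest[:-1]                  # what remains are the selected tags
    return base, rest, want_files
```

For /photo's/2008/+FIND/christmas/tree/+FILES this yields the base directory /photo's/2008, the tags christmas and tree, and a true +FILES flag, which is the information a search then needs.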
----------------------------------------------------------------------------------------------------

So, as I was searching for the best way to save all the needed information in a database and find it again, I stumbled upon Xapian. I read the information pages and the whole API, looked at the few examples I could find and did some small tests. I will only use boolean search functionality, as I have no need to "guess" which file is most relevant; I just need to show them all.

Now my 1st question is, what is the path of a file? The content of a document? A term? A value? I need to be able to use the path when searching, as I need to be able to limit the file results to files in a certain directory. Thus, for example, only files that have a path of /photo's/2008/*. Or do I have to work with a relevance set or something? I tried using the path as a tag itself, but when I do a query for "/photo's/2008/*", it is automatically translated into 2 separate terms, I think? (A file tagged as 2008 also showed up, for example.)

My 2nd question is, what is the easiest way to get a list of all the tags associated with files in the result set? I want to have a list of all tags associated with files in /photo's/2008. One method would be to do a search for all files in /photo's/2008, or any subdirectory, loop over all the results and, per document, loop over the terms associated with it and add these to a list.

My 3rd question is how I can get ALL results? get_mset() requires a maximum number of results. Do I just set it to an extremely big number and see it as a safety limit that shouldn't be reached?

----------------------------------------------------------------------------------------------------

To sum it all up:
1) Where do I store the path of a file?
2) How do I get a list of all terms associated with documents in the result set?
3) How do I get ALL results, not a limited number?

Thanks in advance for any advice!

Karel
Olly Betts
2009-Mar-02 05:27 UTC
[Xapian-discuss] Tag-based filesystem with xapian, advice?
On Sat, Feb 28, 2009 at 10:21:15PM +0100, Karel Marissens wrote:
> Now my 1st question is, what is the path of a file? The content of a
> document? A term? A value? I need to be able to use the path when
> searching as I need to be able to limit the file-results to files in a
> certain directory. Thus for example, only files that have a path of
> /photo's/2008/*. Or do I have to work with a relevance-set or something?

I would put the path in the document data for reading when you get results, and also index all the directories which the file is in as terms (e.g. P/photo's and P/photo's/2008 for a file in /photo's/2008).

> I tried using the path as a tag itself, but when I do a query for
> "/photo's/2008/*", it is automatically translated to 2 separated terms I
> think? (a file tagged as 2008 also showed up for example)

I don't think you want to use QueryParser here - just build your Query objects up by hand. If you want to allow "free text queries" after +FIND, then you can parse that part with QueryParser and then filter the result using the appropriate "P"-prefixed term, e.g. in C++:

    Xapian::Query q = qp.parse_query(query_string);
    q = Xapian::Query(Xapian::Query::OP_FILTER, q, Xapian::Query("P/photo's"));

> My 2nd question is, what is the easiest way to get a list of all the
> tags associated with files in the resultset? I want to have a list of
> all tags associated with files in /photo's/2008. One method would be
> to do a search for all files in /photo's/2008, or any subdirectory,
> loop all the results, and per document, loop the terms associated with
> it and add these to a list.

You can add all documents in the MSet to an RSet and use Enquire::get_eset() to get a set of all the terms in all the documents. That's not so different from what you describe, though Xapian does most of the work for you, including eliminating duplicates. If you just want the "tag" terms (and not P/photo's, etc.), you can use an "ExpandDecider" to only pick out those.
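Going back to the first answer above, the idea of indexing every containing directory as a "P"-prefixed term could be sketched in Python as follows. This is only an illustration: the helper name `directory_terms` is hypothetical, and the root directory is deliberately given no term, which matches the two example terms in the reply.

```python
import posixpath


def directory_terms(file_path, prefix="P"):
    """Return one prefixed term per directory that contains file_path.

    The root directory "/" gets no term of its own in this sketch.
    """
    directory = posixpath.dirname(posixpath.normpath(file_path))
    terms = []
    while directory not in ("/", ""):
        terms.append(prefix + directory)
        directory = posixpath.dirname(directory)   # step up one level
    terms.reverse()                                # shallowest directory first
    return terms
```

Each returned string would be added to the file's document as a term, so that restricting a search to a directory becomes a filter on a single exact term and no wildcard matching on the path is needed.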
I'd suggest for efficiency that you might want to consider adding a special case for "/" and using Database::allterms_begin() to iterate over all the terms in the database.

> My 3rd question is how I can get ALL results? Get_mset() requires a
> maximum amount of results. Do I just set it to an extremely big number
> and see it as a safety-limitation that shouldn't be reached?

If you can handle result sets of any size, just pass db.get_doccount() - there can't be more matching documents than there are documents in the database.

I'll add a note to the documentation comment for Enquire::get_mset() as this has been asked a few times before.

Cheers,
Olly