Sandri Francesco
2007-Apr-05 16:06 UTC
[Xapian-discuss] Particular Informations about Xapian
Greetings,

we are a group of students attending a course on Information Retrieval at the University of Padua. We would like some information about Xapian:

- the posting file data structure;
- whether there is any index compression, and of what type;
- what the standard ID schemes are (DOI, URI, PURL, etc.);
- whether Xapian builds any authority files;
- whether the system manages term polysemy (lexical ambiguity).

Thank you for the time you dedicate to us.

Dindo Stefano
Sandri Francesco
Scremin Thomas
Tono Raffaella
Vantini Marco
On Thu, Mar 01, 2007 at 11:16:53AM +0100, Sandri Francesco wrote:
> Greetings, we are a group of students attending a course on Information
> Retrieval at the University of Padua. We would like some information
> about Xapian:
> - the posting file data structure;

For the flint backend, see:

http://wiki.xapian.org/FlintBackend

The Btree tables used by flint are very similar to those in quartz (the keys are different, and the filenames use "." instead of "_"):

http://www.xapian.org/docs/quartzdesign.html

> - whether there is any index compression, and of what type;

Yes - see the above documentation.

> - what the standard ID schemes are (DOI, URI, PURL, etc.);

You can use whatever you like as an ID, provided it's not overly long (the limit is 240 bytes or so).

> - whether Xapian builds any authority files;

Not by itself, though you should be able to build and maintain authority files using Xapian.

> - whether the system manages term polysemy (lexical ambiguity);

Not directly. However, relevance feedback can be used to "steer" a query towards a particular meaning of a term with multiple meanings. For example, a search for "stock" can turn up investments, cookery, warehouses, and so on. If you mark a few documents in the results which are relevant, Xapian can suggest more terms (so for investments, it might suggest "shares" or "market", for cookery it might suggest "recipe", etc).

Alternatively, you can get Xapian to suggest terms based on the top N results, and then the user can pick from those terms.

Cheers,
Olly
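
To make the relevance feedback idea above concrete, here is a minimal, self-contained Python sketch of the "stock" example. It does not use Xapian and is not Xapian's actual term weighting: the function `suggest_terms`, the toy `docs` collection, and the scoring formula (number of relevant documents containing the term, multiplied by a simple inverse document frequency) are all assumptions made for illustration. In Xapian itself the corresponding step is, roughly, to put the documents marked relevant into an RSet and ask the Enquire object for an ESet of suggested terms.

```python
import math
from collections import Counter


def suggest_terms(docs, relevant_ids, query_terms, k=3):
    """Rank candidate expansion terms taken from documents marked relevant.

    Toy scoring (an assumption, not Xapian's formula): the number of
    relevant documents containing the term, times log((N + 1) / df).
    """
    n_docs = len(docs)

    # Document frequency of each term across the whole collection.
    df = Counter()
    for words in docs.values():
        df.update(set(words))

    # Number of relevant documents that contain each term.
    rel_df = Counter()
    for doc_id in relevant_ids:
        rel_df.update(set(docs[doc_id]))

    scores = {}
    for term, r in rel_df.items():
        if term in query_terms:
            continue  # do not suggest terms already in the query
        scores[term] = r * math.log((n_docs + 1) / df[term])

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical mini-collection in which "stock" has several meanings.
docs = {
    1: "stock market shares rise".split(),
    2: "stock shares fall as market dips".split(),
    3: "chicken stock recipe".split(),
    4: "vegetable stock soup recipe".split(),
    5: "warehouse stock inventory".split(),
}

# The user marks the two investment documents as relevant.
print(suggest_terms(docs, {1, 2}, {"stock"}, k=2))  # ['market', 'shares']

# The user marks the two cookery documents as relevant instead.
print(suggest_terms(docs, {3, 4}, {"stock"}, k=1))  # ['recipe']
```

Marking different documents as relevant changes which terms are suggested, which is how feedback steers an ambiguous query towards one meaning. The "top N results" variant mentioned above would pass the IDs of the first N hits as `relevant_ids` instead of user-selected ones.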