Felix Antonius Wilhelm Ostmann
2007-Jan-04 11:16 UTC
[Xapian-discuss] What kind of data in the datafield
We are building the next Google ... you know ;) But what should we save in the data field?

The whole content? The first 4,096 bytes of a document? The best 400 bytes of a document? Or nothing, and save the raw content to disk in a file named by the doc_id?

And the title, the timestamp and other stuff? Save them in a value, or in the data too? I am confused :(

Thanks :)

--
Kind regards,
Felix Antonius Wilhelm Ostmann

--------------------------------------------------
Websuche Search Technology GmbH & Co. KG
Martinistraße 3 - D-49080 Osnabrück - Germany
Tel.: +49 541 40666-0 - Fax: +49 541 40666-22
Email: info@websuche.de - Website: www.websuche.de
--------------------------------------------------
AG Osnabrück - HRA 200252 - VAT ID: DE814737310
General partner: Websuche Search Technology Verwaltungs GmbH - AG Osnabrück - HRB 200359
Managing director: Diplom Kaufmann Martin Steinkamp
--------------------------------------------------
Felix Antonius Wilhelm Ostmann wrote:
> we are building the next google ... you know ;) But, what should we save
> in the data-field?

It really depends what you want to do with the data. In general, you should save what you have a use for, and no more: obviously, the less you save, the smaller the database, and the faster you'll be able to access the data.

If you have the original data on disk, it's often useful just to save a URL/file path to the data. But, even in this case, if the data has to pass through an expensive parsing step to extract text, it may be useful to store a sample of the parsed text for display in the result list. You might even want to store the whole parsed text, and generate a summary based on the phrases relevant to the query.

> And the title, the timestamp and other stuff? save in a value or at the
> data too? I am confused :(

Save in the data if you want to display them, or use them in some other way, once you've got the document results. Note that if you're saving something like a timestamp in a value anyway (e.g., for sorting), you can just read the timestamp from the value when displaying the result list, so there's no need to duplicate this.

--
Richard
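To make the advice above concrete, here is a minimal sketch in plain Python of what the data and a value might hold. It deliberately does not import the Xapian bindings: the strings it builds are what you would pass to `Document.set_data()` and `Document.add_value()`. The field names (`path`, `title`, `sample`), the 400-byte sample size, the slot number and the fixed-width big-endian timestamp encoding are all assumptions made for illustration, not something Xapian requires.

```python
import struct

SAMPLE_BYTES = 400   # assumed sample size, borrowed from the question
TIMESTAMP_SLOT = 0   # hypothetical value slot number for the timestamp


def one_line(text):
    """Collapse all whitespace so a field always fits on one line."""
    return " ".join(text.split())


def make_data(path, title, parsed_text):
    """Build the document data: only what the result list needs to show."""
    raw = parsed_text.encode("utf-8")[:SAMPLE_BYTES]
    sample = raw.decode("utf-8", "ignore")  # drop a cut-off trailing character
    fields = {"path": path, "title": one_line(title), "sample": one_line(sample)}
    return "\n".join(f"{name}={value}" for name, value in fields.items())


def parse_data(data):
    """Split the stored data back into fields at display time."""
    return dict(line.split("=", 1) for line in data.split("\n"))


def encode_timestamp(ts):
    """Fixed-width big-endian bytes: byte order matches numeric order."""
    return struct.pack(">Q", ts)


def decode_timestamp(raw):
    """Read the timestamp back from the value instead of duplicating it."""
    return struct.unpack(">Q", raw)[0]
```

The timestamp lives only in the value: it can be used for sorting and decoded again when the result list is rendered, so the data does not need its own copy. For numeric values, Xapian's own `sortable_serialise()` is probably the better choice in real code; the `struct` encoding here is just a dependency-free stand-in.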
On Thu, Jan 04, 2007 at 12:15:32PM +0100, Felix Antonius Wilhelm Ostmann wrote:
> we are building the next google ... you know ;) But, what should we save
> in the data-field?
>
> the whole content? the first 4,096 bytes from one document? the best 400
> bytes from one document? or nothing and save the content raw to disk in a
> file named by the doc_id?
>
> And the title, the timestamp and other stuff? save in a value or at the
> data too? I am confused :(

It depends entirely on how you want to display the data. Google (I believe) keeps copies of everything, so you ideally want the source document somewhere.

I'd probably recommend having the Xapian document data contain some summary fields plus a key to the storage on disk (or, as you suggest, use the doc_id). That way, overview search results pages can be built without loading vast quantities of raw data and summarising them on the spot, while still giving you the opportunity to do more detailed work (full-document search result highlighting, for instance) when required.

Speaking of which, has anyone else noticed some sites doing search result highlighting when driven from natural search? Not sure I'm in favour of it (it strikes me that it could be done better as a browser extension), but it's still interesting.

J

--
James Aylett                                      xapian.org
james@tartarus.org                   uncertaintydivision.org
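The split described above (summary fields plus a key in the document data, raw content on disk) could look roughly like the following sketch. It is plain Python with no Xapian calls; the JSON layout, the `.raw` file suffix and all function names are hypothetical choices for illustration. The string returned by `make_summary_data()` is what would be stored as the document data.

```python
import json
from pathlib import Path


def store_raw(root, doc_id, raw_bytes):
    """Write the untouched source document to disk, named by its doc_id."""
    path = Path(root) / f"{doc_id}.raw"
    path.write_bytes(raw_bytes)
    return path


def make_summary_data(doc_id, title, summary):
    """Document data: small summary fields plus the key to the raw storage."""
    return json.dumps({"key": doc_id, "title": title, "summary": summary})


def render_result(data):
    """Build one result-list entry from the document data alone."""
    fields = json.loads(data)
    return f"{fields['title']}: {fields['summary']}"


def load_raw(root, data):
    """Fetch the full document only when detailed work is needed."""
    key = json.loads(data)["key"]
    return (Path(root) / f"{key}.raw").read_bytes()
```

A results page would call only `render_result()` for each hit; `load_raw()` is reserved for the occasional expensive step such as full-document highlighting.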