Hey readers,

Currently I have been working on an indexing system using scriptindex and a search tool using omega. The general idea behind the system is that it will index various sections of the site in one big omega database. For example, there is a news section and a review section. These are all to be stored in one big database.

First, my scriptindex script:

internal : boolean=Q unique=Q field=internal
titel : unhtml index=XC weight=3 index field=caption
topicstart : unhtml index=XA weight=2 index truncate=200 field=sample
bericht : unhtml index
type : boolean=XT
url : field=url
itemid : field=itemid
itemtype : field=itemtype
topicstarter : boolean=XS
poster : boolean=XP

Most of them are pretty easy, although the internal field needs a little explanation. I use a section ID for each section. I multiply this by 1 million and add the actual database row id to it. This way my article always has the same unique ID.

Now for my problems :D There are 2:

- Some of the records in the omega database simply do not return data. They do not contain document data, although I'm absolutely sure I do not deliver empty documents to scriptindex. Under what conditions is it possible for scriptindex to 'discard' a document?

- The second issue is indexing. The first time I index all the documents I get 14222 added documents. This is the correct number since it's a new database. When I want to re-index the database I just throw all the documents at the database again, and I'd expect to get 14222 updated documents (assuming no documents are added between the indexing runs). However, scriptindex reports two thirds of the documents as added and one third as updated. I want ALL the documents to be updated, not added. I thought adding the unique field would solve it. However, this is not the case.

If anybody can help me, please do so :>

Thx in advance,
Arjan Holscher

__________________________________
Yahoo! Mail Mobile
Take Yahoo! Mail with you! Check email on your mobile phone.
http://mobile.yahoo.com/learn/mail
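[Editor's sketch] The ID scheme described in the message above (section ID times one million, plus the row ID) can be written down as follows. This is only an illustration in Python: the function names and the example section number are hypothetical, and the range check is an addition that makes the "fewer than a million rows per section" assumption explicit.

```python
ROWS_PER_SECTION = 1_000_000  # the multiplier described in the post


def make_internal_id(section_id: int, row_id: int) -> int:
    """Combine a section ID and a database row ID into one number."""
    # The scheme only stays unique while row IDs fit below the multiplier.
    if not 0 <= row_id < ROWS_PER_SECTION:
        raise ValueError(f"row_id {row_id} does not fit in the scheme")
    return section_id * ROWS_PER_SECTION + row_id


def split_internal_id(internal: int) -> tuple[int, int]:
    """Recover (section_id, row_id) from a combined ID."""
    return divmod(internal, ROWS_PER_SECTION)


if __name__ == "__main__":
    # Hypothetical example: section 3, row 4217.
    print(make_internal_id(3, 4217))
    print(split_internal_id(3004217))
```

Because the combination is reversible, the same article always maps to the same value of the internal field, which is what unique=Q relies on.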
On Tue, May 10, 2005 at 01:31:26PM -0700, arjan holscher wrote:
> Most of them are pretty easy. Although the internal
> field needs a little explanation. I use a section ID
> for each section. I multiply this with 1 million and
> add the actual database row id to this. This way my
> article always has the same unique ID.

It might be cleaner to make internal <section id>:<row id> so you aren't assuming less than a million rows, but there's nothing actually wrong with your current scheme.

> - Some of the records in the omega database simply do
> not return data. They do not contain document data
> although I'm absolutely sure I do not deliver empty
> documents to scriptindex. Under what conditions is it
> possible for scriptindex to 'discard' a document.

I can't really see how it can.

If a term is too long, the implicit flush can fail, but that will exit scriptindex with an error. The "index" command doesn't allow this to happen, but "boolean" doesn't check (mostly because it's not clear what it should do if a term is too big - dropping the term doesn't really seem correct and dropping the document doesn't seem ideal either).

But that would mean the document failed to be added, not that it would be added with no document data.

Hmm, if the input has newlines in fields, are you escaping them as scriptindex expects?

> - The second issue is indexing. The first time I index
> all the documents I get 14222 added documents. This is
> the correct number since it's a new database.

And there are 14222 documents in the input?

Might be worth checking (with delve from xapian-examples) how many documents are in the database now.

> When I want to re-index the database I just throw all
> the documents again at the database and I'd expected
> to get 14222 updated documents (assuming no documents
> are added during the index periods). However
> scriptindex returns 2/3 of the total documents as
> added and 1/3 of the documents as updated.

And again here.

> However I want ALL the documents to be updated ... not
> added. I thought adding the unique field would solve
> it. However this is not the case.

As far as I can see, what you have should work...

Cheers,
Olly
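[Editor's sketch] On the question of escaping newlines: as far as I understand the Omega documentation, a scriptindex input record is a series of name=value lines, a line starting with "=" continues the previous field's value, and a blank line ends the record. The Python helper below is a hypothetical illustration of producing such input, not code from the thread. The function name and the example values are made up.

```python
def format_record(fields: dict[str, str]) -> str:
    """Serialise one record as name=value lines followed by a blank line.

    Assumption: continuation lines of a multi-line value start with '='.
    """
    lines = []
    for name, value in fields.items():
        # Normalise DOS and old Mac line endings so no stray \r survives.
        parts = value.replace("\r\n", "\n").replace("\r", "\n").split("\n")
        lines.append(f"{name}={parts[0]}")
        lines.extend(f"={part}" for part in parts[1:])
    # Exactly one blank line terminates the record.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    record = {"internal": "3004217", "bericht": "line one\nline two"}
    print(format_record(record), end="")
```

An empty line inside a value becomes a lone "=", so it cannot be mistaken for the blank line that separates records.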
--- Olly Betts <olly@survex.com> wrote:
> On Tue, May 10, 2005 at 01:31:26PM -0700, arjan holscher wrote:
> > Most of them are pretty easy. Although the internal
> > field needs a little explanation. I use a section ID
> > for each section. I multiply this with 1 million and
> > add the actual database row id to this. This way my
> > article always has the same unique ID.
>
> It might be cleaner to make internal <section id>:<row id> so you aren't
> assuming less than a million rows, but there's nothing actually wrong
> with your current scheme.

Actually I tried this first. However, it seems that this is not the case after all.

> > - Some of the records in the omega database simply do
> > not return data. They do not contain document data
> > although I'm absolutely sure I do not deliver empty
> > documents to scriptindex. Under what conditions is it
> > possible for scriptindex to 'discard' a document.
>
> I can't really see how it can.
>
> If a term is too long, the implicit flush can fail, but that will exit
> scriptindex with an error. The "index" command doesn't allow this to
> happen, but "boolean" doesn't check (mostly because it's not clear what
> it should do if a term is too big - dropping the term doesn't really
> seem correct and dropping the document doesn't seem ideal either).
>
> But that would mean the document failed to be added, not that it would
> be added with no document data.
>
> Hmm, if the input has newlines in fields, are you escaping them as
> scriptindex expects?

I escape newlines as expected, since the documents already in the database do contain spaces in their texts. Could it have anything to do with the fact that I pipe the whole buffer to scriptindex at once? I believe my buffer is several MB in size. Could it help if I split my buffer into pieces? Or is there no possibility that this will solve my problem?

> > - The second issue is indexing. The first time I index
> > all the documents I get 14222 added documents. This is
> > the correct number since it's a new database.
>
> And there are 14222 documents in the input?
>
> Might be worth checking (with delve from xapian-examples) how many
> documents are in the database now.

Delve isn't installed on the server omega is running on. However, I'll try to install it ;)

> > When I want to re-index the database I just throw all
> > the documents again at the database and I'd expected
> > to get 14222 updated documents (assuming no documents
> > are added during the index periods). However
> > scriptindex returns 2/3 of the total documents as
> > added and 1/3 of the documents as updated.
>
> And again here.
>
> > However I want ALL the documents to be updated ... not
> > added. I thought adding the unique field would solve
> > it. However this is not the case.
>
> As far as I can see, what you have should work...

So far it doesn't work as expected, and I hope that somebody here is able to work out a solution.

Thx in advance,
Arjan Holscher
On Wed, May 11, 2005 at 05:55:25AM -0700, arjan holscher wrote:
> Here is the data I feed to scriptindex. It looks okay
> to me.

OK, it has DOS/Windows end of lines which isn't a problem - scriptindex will remove a \r if there is one before the \n.

But some lines have multiple \r characters before the \n, not just one. Which is rather odd, but shouldn't actually cause problems except that boolean terms will include these extra characters!

Anyway, with the latest development version on Linux, I get 4284 records indexed. Updating adds one record and updates 4283, leaving 4285 in the database.

If I count the number of "internal=" lines, there are 4283 in the file, so one record apparently has no internal field, and it makes sense that it will be readded each time (because UNIQUE won't fire).

This doesn't match the 14222 you get, though the "third updated" sort of fits...

> I discovered something else in the database. Some
> records are only partially filled. Some fields are
> filled with something and some fields are plain empty.
> Maybe this will ring a bell in your head? :S

Are you running on Windows? That might be the important difference. Or perhaps it's a bug which was fixed since 0.8.5, although scriptindex.cc hasn't changed materially since then - just a few comment fixes.

Cheers,
Olly
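[Editor's sketch] The checks described in the message above (counting "internal=" lines, comparing that with the number of records, spotting lines that end in more than one \r) can be scripted. The Python function below is a hypothetical diagnostic, not part of Xapian or Omega. It treats a line as blank once all trailing \r characters are removed, which is a simplification and may differ from how scriptindex itself reads the file.

```python
def analyse_dump(data: bytes) -> dict[str, int]:
    """Count records, internal= lines and suspicious line endings."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # artefact of the final newline, not a real line

    stats = {"records": 0, "internal_lines": 0,
             "lines_with_extra_cr": 0, "stray_blank_lines": 0}
    in_record = False
    previous_blank = False
    for raw in lines:
        if raw.endswith(b"\r\r"):
            stats["lines_with_extra_cr"] += 1
        line = raw.rstrip(b"\r")
        if line == b"":
            # A second consecutive blank line separates nothing.
            if previous_blank:
                stats["stray_blank_lines"] += 1
            previous_blank = True
            in_record = False
            continue
        previous_blank = False
        if not in_record:
            stats["records"] += 1
            in_record = True
        if line.startswith(b"internal="):
            stats["internal_lines"] += 1
    return stats


if __name__ == "__main__":
    sample = (b"internal=1\r\ntitel=a\r\r\n\r\n"
              b"internal=2\r\ntitel=b\r\n\r\n\r\n")
    print(analyse_dump(sample))
```

If records and internal_lines differ, or stray_blank_lines is non-zero, the dump is worth cleaning before it is piped to scriptindex.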
On Wed, May 11, 2005 at 07:13:27AM -0700, arjan holscher wrote:
> --- Olly Betts <olly@survex.com> wrote:
> > But some lines have multiple \r characters before the \n,
> > not just one. Which is rather odd, but shouldn't actually cause
> > problems except that boolean terms will include these extra characters!
>
> So, it would be wise to get rid of these \r
> characters.

I'd say so. It perhaps makes sense for scriptindex to strip multiple trailing \r characters. You wouldn't expect them normally, but if they are there it probably makes sense to remove them.

> Are you sure that this doesn't cause the problem with the empty records?

I can't see how it could, and it doesn't for me.

> > Anyway, with the latest development version on Linux, I get 4284 records
> > indexed.
>
> Don't ask me why and how, but now I actually get 4k of
> documents added. However, some of the records are
> still empty. How is this possible?

When you say "4k" do you mean exactly 4000, or 4096, or the same "about 4k" that I got (i.e. 4284)?

> So, apart from those \r characters no strange material
> is contained within the data dump?

Hmmm. I wonder if the double blank lines between records are a problem. Or that coupled with the extra \r characters.

Running under valgrind, double blank lines cause us to look at character -1 of a string. I'll fix that. Maybe that's the cause of your blank records. It would explain why they seem to come and go...

> I have removed the \r characters and so far the issue
> seems fixed. I have 2 remaining issues now:
>
> - How do I sort a document by time. I could add a
> timestamp field which would contain a unix timestamp.

Put the timestamp in a document value (with Document::add_value()).

> Then one question remains, how do I sort it ascending
> or descending?

0.8.5 only allows sorting by value in one direction. 0.9.0 will add the ability to reverse sort (previously people have worked around this by storing <large number> - <timestamp> in another value). I'm pretty sure I'll get 0.9.0 released this week.

> - Furthermore, I can't find the internal which is
> empty. There has to be 1 document with an empty
> internal. However, if you could point me to it (since
> you have seem to found it :>) then i'd be glad.

I only deduced it must exist, since there's one more record than there are internal fields in the file!

Actually, looking again I bet it's because there's a double blank line at the end of the file. And if I run update and look at the last record (which is the one readded) it has no data and no terms, so that fits.

Cheers,
Olly
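[Editor's sketch] The workaround mentioned above, storing <large number> - <timestamp> in a second value, can be illustrated in Python. Everything here is an assumption for illustration: that values are compared as strings (hence the zero padding to a fixed width), the choice of 9999999999 as the large number, and the function names. The actual storing would still be done with Document::add_value() as described in the message.

```python
MAX_TIMESTAMP = 9_999_999_999  # hypothetical "large number", 10 digits
WIDTH = len(str(MAX_TIMESTAMP))


def ascending_key(timestamp: int) -> str:
    """Zero-padded timestamp: string order equals numeric order."""
    if not 0 <= timestamp <= MAX_TIMESTAMP:
        raise ValueError("timestamp out of range for this scheme")
    return f"{timestamp:0{WIDTH}d}"


def descending_key(timestamp: int) -> str:
    """Zero-padded (large number - timestamp): string order is reversed."""
    if not 0 <= timestamp <= MAX_TIMESTAMP:
        raise ValueError("timestamp out of range for this scheme")
    return f"{MAX_TIMESTAMP - timestamp:0{WIDTH}d}"


if __name__ == "__main__":
    stamps = [1115757326, 1115670686, 1115843966]
    print(sorted(stamps, key=ascending_key))
    print(sorted(stamps, key=descending_key))
```

Storing both keys in separate values lets a single-direction sort produce either order, depending on which value is used for sorting.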