thr3ads.net - Xapian discuss - [Xapian-discuss] Emtpy records & unique key... [May 2005]

arjan holscher

2005-May-18 16:10 UTC

[Xapian-discuss] Emtpy records & unique key...

Hey readers,

I have been working on an indexing system
using scriptindex and a search tool using omega. The
general idea behind the system is that it will index
various sections of the site in one big omega
database. For example, there is a news section and a
review section. These are all to be stored in one big
database.

First, my scriptindex script:

internal	:	boolean=Q unique=Q field=internal
titel		: 	unhtml index=XC weight=3 index field=caption
topicstart	:	unhtml index=XA weight=2 index truncate=200 field=sample
bericht		:	unhtml index
type		:	boolean=XT
url		:	field=url
itemid		:	field=itemid
itemtype	:	field=itemtype
topicstarter	:	boolean=XS
poster		:	boolean=XP

Most of them are pretty easy, although the internal
field needs a little explanation. I use a section ID
for each section. I multiply this by 1 million and
add the actual database row ID to it. This way my
article always has the same unique ID.

Now for my problems :D There are 2:

- Some of the records in the omega database simply do
not return data. They do not contain document data,
although I'm absolutely sure I do not deliver empty
documents to scriptindex. Under what conditions is it
possible for scriptindex to 'discard' a document?

- The second issue is indexing. The first time I index
all the documents I get 14222 added documents. This is
the correct number since it's a new database.

When I want to re-index the database I just throw all
the documents at the database again, and I expected
to get 14222 updated documents (assuming no documents
are added between the indexing runs). However,
scriptindex reports two thirds of the documents as
added and one third as updated.

However I want ALL the documents to be updated ... not
added. I thought adding the unique field would solve
it. However this is not the case.

If anybody can help me, please do so :>

Thx in advance, 

Arjan Holscher


		

Olly Betts

2005-May-18 16:10 UTC

On Tue, May 10, 2005 at 01:31:26PM -0700, arjan holscher wrote:
> Most of them are pretty easy. Although the internal
> field needs a little explanation. I use a section ID
> for each section. I multiply this with 1 million and
> add the actual database row id to this. This way my
> article always has the same unique ID.

It might be cleaner to make internal <section id>:<row id> so you aren't
assuming fewer than a million rows, but there's nothing actually wrong
with your current scheme.
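
As a rough illustration of the two ID schemes discussed above, here is a
minimal Python sketch. The function names and the ROWS_PER_SECTION constant
are hypothetical and not part of scriptindex; only the "multiply by one
million" rule and the <section id>:<row id> form come from the thread.

```python
ROWS_PER_SECTION = 1_000_000  # multiplier from the original post


def numeric_id(section_id: int, row_id: int) -> int:
    """Section ID times one million plus the row ID (the original scheme)."""
    if not 0 <= row_id < ROWS_PER_SECTION:
        # A larger row ID would collide with an ID from the next section.
        raise ValueError(f"row_id {row_id} does not fit below {ROWS_PER_SECTION}")
    return section_id * ROWS_PER_SECTION + row_id


def composite_id(section_id: int, row_id: int) -> str:
    """The <section id>:<row id> form, which has no row limit."""
    return f"{section_id}:{row_id}"


print(numeric_id(3, 42))    # 3000042
print(composite_id(3, 42))  # 3:42
```

The numeric form only stays unique while every row ID is below one million,
which is why the sketch rejects larger values instead of silently colliding.
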
> - Some of the records in the omega database simply do
> not return data. They do not contain document data
> although I'm absolutely sure I do not deliver empty
> documents to scriptindex. Under what conditions is it
> possible for scriptindex to 'discard' a document.
I can't really see how it can.

If a term is too long, the implicit flush can fail, but that will exit
scriptindex with an error.  The "index" command doesn't allow this to
happen, but "boolean" doesn't check (mostly because it's not clear what
it should do if a term is too big - dropping the term doesn't really
seem correct and dropping the document doesn't seem ideal either).

But that would mean the document failed to be added, not that it would
be added with no document data.

Hmm, if the input has newlines in fields, are you escaping them as
scriptindex expects?
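
For reference, a minimal Python sketch of writing one record in the
scriptindex input format. It assumes the documented convention that each
field is a "name=value" line, that a newline inside a value is escaped by
starting the continuation line with "=", and that a blank line ends the
record. The helper names are hypothetical.

```python
def format_field(name, value):
    """Return one field as scriptindex input, escaping embedded newlines."""
    # Normalise stray carriage returns first so they never reach the index.
    value = value.replace("\r\n", "\n").replace("\r", "\n")
    # Every continuation line of the value must begin with "=".
    return name + "=" + value.replace("\n", "\n=")


def format_record(fields):
    """Join the fields of one record and end it with a single blank line."""
    body = "\n".join(format_field(name, value) for name, value in fields.items())
    return body + "\n\n"


print(format_record({"internal": "3000042", "bericht": "line one\nline two"}), end="")
```

Because an empty line inside a value becomes a line holding just "=", the
escaped value can never be mistaken for a record separator.
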
> - The second issue is indexing. The first time I index
> all the documents I get 14222 added documents. This is
> the correct number since it's a new database.
And there are 14222 documents in the input?

Might be worth checking (with delve from xapian-examples) how many
documents are in the database now.
> When I want to re-index the database I just throw all
> the documents again at the database and I'd expected
> to get 14222 updated documents (assuming no documents
> are added during the index periods). However
> scriptindex returns 2/3th of the total documents as
> added and 1/3th of the documents as updated.
And again here.
> However I want ALL the documents to be updated ... not
> added. I thought adding the unique field would solve
> it. However this is not the case.
As far as I can see, what you have should work...

Cheers,
    Olly

arjan holscher

2005-May-18 16:10 UTC

--- Olly Betts <olly@survex.com> wrote:
> On Tue, May 10, 2005 at 01:31:26PM -0700, arjan holscher wrote:
> > Most of them are pretty easy. Although the internal
> > field needs a little explenation. I use a section ID
> > for each section. I multiply this with 1 million and
> > add the actual database row id to this. This way my
> > article always has the same unique ID.
>
> It might be cleaner to make internal <section id>:<row id> so you
> aren't assuming less than a million rows, but there's nothing
> actually wrong with your current scheme.

Actually I tried this first. However, it seems that
this is not the case after all.

> > - Some of the records in the omega database simply do
> > not return data. They do not contain document data
> > although I'm absolutely sure I do not deliver empty
> > documents to scriptindex. Under what conditions is it
> > possible for scriptindex to 'discard' a document.
>
> I can't really see how it can.
>
> If a term is too long, the implicit flush can fail, but that will
> exit scriptindex with an error.  The "index" command doesn't allow
> this to happen, but "boolean" doesn't check (mostly because it's not
> clear what it should do if a term is too big - dropping the term
> doesn't really seem correct and dropping the document doesn't seem
> ideal either).
>
> But that would mean the document failed to be added, not that it
> would be added with no document data.
>
> Hmm, if the input has newlines in fields, are you escaping them as
> scriptindex expects?

I escape newlines as expected, since the documents
already in the database do contain spaces in their
texts.

Could it have anything to do with the fact that I pipe
the whole buffer to scriptindex at once? I believe my
buffer is several MB in size. Would it help if I split
the buffer into pieces? Or is there no chance that
this will solve my problem?

> > - The second issue is indexing. The first time I index
> > all the documents I get 14222 added documents. This is
> > the correct number since it's a new database.
>
> And there are 14222 documents in the input?
>
> Might be worth checking (with delve from xapian-examples) how many
> documents are in the database now.

Delve isn't installed on the server omega is running
on. However I'll try to install it ;)

> > When I want to re-index the database I just throw all
> > the documents again at the database and I'd expected
> > to get 14222 updated documents (assuming no documents
> > are added during the index periods). However
> > scriptindex returns 2/3th of the total documents as
> > added and 1/3th of the documents as updated.
>
> And again here.
>
> > However I want ALL the documents to be updated ... not
> > added. I thought adding the unique field would solve
> > it. However this is not the case.
>
> As far as I can see, what you have should work...

So far, it doesn't work as expected and I hope that
somebody here is able to work out a working solution.

Thx in advance,

Arjan Holscher


		

Olly Betts

2005-May-18 16:11 UTC

On Wed, May 11, 2005 at 05:55:25AM -0700, arjan holscher wrote:
> Here is the data I feed to scriptindex. It looks okay
> to me.

OK, it has DOS/Windows end of lines which isn't a problem - scriptindex
will remove a \r if there is one before the \n.

But some lines have multiple \r characters before the \n,
not just one.  Which is rather odd, but shouldn't actually cause
problems except that boolean terms will include these extra characters!

Anyway, with the latest development version on Linux, I get 4284 records
indexed.  Updating adds one record and updates 4283, leaving 4285 in the
database.  If I count the number of "internal=" lines, there are 4283 in
the file, so one record apparently has no internal field, and it makes
sense that it will be readded each time (because UNIQUE won't fire).

This doesn't match the 14222 you get, though the "third updated" sort
of fits...
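
The kind of counting described above can be sketched in Python as follows.
This is only an illustrative checker, not how scriptindex itself parses its
input: the function name and report keys are hypothetical, and it assumes
records are separated by blank lines and that the unique field appears as a
line starting with "internal=".

```python
def check_dump(text, key="internal"):
    """Count records, key lines and suspicious lines in a scriptindex dump."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline
    report = {"records": 0, "key_lines": 0, "records_missing_key": 0,
              "extra_blank_lines": 0, "lines_with_multiple_cr": 0}
    in_record = has_key = False

    def close_record():
        report["records"] += 1
        if not has_key:
            report["records_missing_key"] += 1

    for raw in lines:
        if raw.endswith("\r\r"):
            report["lines_with_multiple_cr"] += 1
        line = raw.rstrip("\r")
        if line == "":
            if in_record:
                close_record()
                in_record = has_key = False
            else:
                # A blank line that does not end a record (doubled or leading).
                report["extra_blank_lines"] += 1
        else:
            in_record = True
            if line.startswith(key + "="):
                report["key_lines"] += 1
                has_key = True
    if in_record:
        close_record()
    return report


sample = "internal=1\ntitel=a\n\ninternal=2\r\r\ntitel=b\n\n\n"
print(check_dump(sample))
```

Pass the file contents without newline translation, for example
open(path, newline="").read(), so that the carriage returns are still
visible to the check.
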
> I discovered something else in the database. Some
> records are only partially filled. Some fields are
> filled with something and some fields are plain empty.
> Maybe this will ring a bell in your head? :S
Are you running on Windows?  That might be the important difference.

Or perhaps it's a bug which was fixed since 0.8.5, although
scriptindex.cc hasn't changed materially since then - just a few comment
fixes.

Cheers,
    Olly

Olly Betts

2005-May-18 16:11 UTC

On Wed, May 11, 2005 at 07:13:27AM -0700, arjan holscher wrote:
> --- Olly Betts <olly@survex.com> wrote:
> > But some lines have multiple \r characters before
> > the \n,
> > not just one.  Which is rather odd, but shouldn't
> > actually cause
> > problems except that boolean terms will include
> > these extra characters!
> 
> So, it would be wise to get rid of these \r
> characters.
I'd say so.  It perhaps makes sense for scriptindex to strip
multiple trailing \r characters.  You wouldn't expect them
normally, but if they are there it probably makes sense to
remove them.
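
Until scriptindex does this itself, the dump can be cleaned before it is
piped in. A minimal Python sketch, assuming the dump fits in memory; the
function name is hypothetical. It works on bytes because reading the file in
Python's default text mode would translate "\r\r\n" into two newlines and so
create blank lines.

```python
import re


def strip_trailing_cr(data: bytes) -> bytes:
    """Remove every run of carriage returns that sits at the end of a line."""
    # \r+ followed by \n (or by the end of the data) is dropped; a lone \r in
    # the middle of a line is left untouched.
    return re.sub(rb"\r+(?=\n|\Z)", b"", data)


dump = b"poster=jan\r\r\ntype=news\r\n"
print(strip_trailing_cr(dump))  # b'poster=jan\ntype=news\n'
```

Typical use would be to read the dump with open(path, "rb"), clean it, and
write the result to the standard input of scriptindex.
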
> Are you sure that this doesn't cause the problem with the empty
> records?
I can't see how it could, and it doesn't for me.
> > Anyway, with the latest development version on
> > Linux, I get 4284 records
> > indexed.
> 
> Don't ask me why and how, but now I actually get 4k of
> documents added. However, some of the records are
> still empty. How is this possible?
When you say "4k" do you mean exactly 4000, or 4096, or
the same "about 4k" that I got (i.e. 4284)?
> So, apart from those \r characters no strange material
> is contained within the data dump?
Hmmm.  I wonder if the double blank lines between records are a problem.
Or that coupled with the extra \r characters.

Running under valgrind, double blank lines cause us to look at character
-1 of a string.  I'll fix that.  Maybe that's the cause of your blank
records.  It would explain why they seem to come and go...
> I have removed the \r characters and so far the issue
> seems fixed. I have 2 remaining issues now:
> 
> - How do I sort a document by time. I could add a
> timestamp field which would contain a unix timestamp.
Put the timestamp in a document value (with Document::add_value()).
> Then one question remains, how do I sort it ascending
> or descending?
0.8.5 only allows sorting by value in one direction.  0.9.0 will add the
ability to reverse sort (previously people have worked around this by
storing <large number> - <timestamp> in another value).  I'm pretty sure
I'll get 0.9.0 released this week.
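
A small Python sketch of that workaround, under the assumption that values
are compared as strings, so the numbers need zero padding to sort correctly.
MAX_TIMESTAMP, WIDTH and the function names are hypothetical choices for
illustration; the resulting strings are what would be stored with
Document::add_value().

```python
MAX_TIMESTAMP = 9_999_999_999  # hypothetical "large number", ten digits
WIDTH = 10                     # pad to a fixed width so string order matches numeric order


def forward_key(timestamp: int) -> str:
    """Value that sorts oldest first when compared as a string."""
    if not 0 <= timestamp <= MAX_TIMESTAMP:
        raise ValueError("timestamp out of range")
    return f"{timestamp:0{WIDTH}d}"


def reverse_key(timestamp: int) -> str:
    """Value that sorts newest first: <large number> - <timestamp>."""
    if not 0 <= timestamp <= MAX_TIMESTAMP:
        raise ValueError("timestamp out of range")
    return f"{MAX_TIMESTAMP - timestamp:0{WIDTH}d}"


older, newer = 1115757086, 1116432600
print(forward_key(older) < forward_key(newer))  # True
print(reverse_key(newer) < reverse_key(older))  # True
```

Storing both keys in separate values gives ascending and descending order
with a sort that only works in one direction.
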
> - Furthermore, I can't find the internal which is
> empty. There has to be 1 document with an empty
> internal. However, if you could point me to it (since
> you seem to have found it :>) then I'd be glad.
I only deduced it must exist, since there's one more record than there
are internal fields in the file!

Actually, looking again I bet it's because there's a double blank line
at the end of the file.  And if I run update and look at the last record
(which is the one readded) it has no data and no terms, so that fits.
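
One way to avoid feeding such doubled blank lines to scriptindex is to
normalise the dump first. A minimal Python sketch, assuming line endings
have already been reduced to "\n" and that a blank line is a completely
empty line; the function name is hypothetical.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Leave exactly one blank line between records and none at either end."""
    # Three or more newlines in a row mean two or more blank lines.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Drop blank lines before the first record and after the last one.
    text = text.strip("\n")
    return text + "\n" if text else ""


dump = "internal=1\ntitel=a\n\n\ninternal=2\ntitel=b\n\n\n"
print(collapse_blank_lines(dump), end="")
```

With the trailing blank lines gone there is nothing left after the last
record that could be read as an extra, empty one.
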

Cheers,
    Olly
