thr3ads.net - Xapian discuss - [Xapian-discuss] stub-file and get

QE :: Felix Ostmann

2015-Mar-11 18:01 UTC

[Xapian-discuss] stub-file and get_doccount

Hello,

I switched from one big index to a stub file with many indexes and am
running into a problem.

I have a tool to fetch a random document via:

get_doccount
random id up to get_doccount
get_document with that id

After changing to a stub file this fails. Is there a nice way to get a
random document from a stub file?


Best regards,

Felix Ostmann

Olly Betts

2015-Mar-12 22:44 UTC

On Wed, Mar 11, 2015 at 07:01:48PM +0100, QE :: Felix Ostmann wrote:
> I switched from one big index to a stub file with many indexes and am
> running into a problem.
>
> I have a tool to fetch a random document via:
>
> get_doccount
> random id up to get_doccount
> get_document with that id
>
> After changing to a stub file this fails. Is there a nice way to get a
> random document from a stub file?

Note that the above only works with a single database if you've never
deleted any documents.

With multiple databases, the document ids are interleaved - see here for
details of how:

http://trac.xapian.org/wiki/FAQ/MultiDatabaseDocumentID

This is done so that the numbering is stable when documents are
added to the individual databases.

So unless all the databases have equal numbers of documents (or some
have one fewer and they are arranged suitably), you'll end up with gaps
in the numbering at the upper end.
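
To make the gaps concrete, here is a minimal sketch of the interleaving
arithmetic, assuming the round-robin scheme described in the FAQ above
(combined docid 1 is docid 1 of the first database, combined docid 2 is
docid 1 of the second, and so on). The function names are hypothetical
and are not part of the Xapian API.

```python
def split_docid(combined_docid, n_databases):
    """Map a combined docid to (sub-database index, local docid)."""
    if combined_docid < 1:
        raise ValueError("docids start at 1")
    shard = (combined_docid - 1) % n_databases
    local_docid = (combined_docid - 1) // n_databases + 1
    return shard, local_docid


def combine_docid(shard, local_docid, n_databases):
    """Inverse of split_docid: the combined docid for a local docid."""
    return (local_docid - 1) * n_databases + shard + 1


def combined_lastdocid(local_lastdocids):
    """Highest combined docid in use, given each sub-database's last docid."""
    n = len(local_lastdocids)
    return max(
        (combine_docid(i, last, n)
         for i, last in enumerate(local_lastdocids) if last > 0),
        default=0,
    )
```

With three sub-databases holding 5, 2 and 2 documents, the combined
document count is 9 but `combined_lastdocid([5, 2, 2])` is 13, so
combined ids such as 8 and 9 point at documents that do not exist.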

One option is to pick a random id up to get_lastdocid(), and retry if
DocNotFoundError is thrown.  That may be inefficient if get_lastdocid()
is much larger than get_doccount().
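
The retry approach could be sketched roughly as follows. This is only an
illustration: `db` is assumed to be any object offering get_lastdocid()
and get_document(), and the `not_found_error` and `max_tries` parameters
are hypothetical additions so that the sketch does not depend on a
particular binding. With the Python bindings you would presumably pass
xapian.DocNotFoundError.

```python
import random


def random_document_with_retry(db, not_found_error, rng=random, max_tries=1000):
    """Pick random docids up to get_lastdocid() until one exists."""
    last = db.get_lastdocid()
    if last == 0:
        raise LookupError("database is empty")
    for _ in range(max_tries):
        docid = rng.randint(1, last)  # inclusive on both ends
        try:
            return db.get_document(docid)
        except not_found_error:
            continue  # hit a gap in the numbering, try another id
    raise LookupError("no document found after %d tries" % max_tries)
```

Each attempt succeeds with probability get_doccount() / get_lastdocid(),
so on average it takes get_lastdocid() / get_doccount() attempts to find
a document.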

To avoid the exceptions, I think you'll need to pick a subdatabase and
then a document within that.  If you aren't fussy about how even the
random distribution is, you could pick 1 out of N subdatabases at
random, and then randomly pick a docid within that subdatabase.
Otherwise you'll want to pick the subdatabases with probability
proportional to the number of documents they contain.
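
A minimal sketch of the proportional pick, assuming you already know the
document count of each subdatabase and that no documents were ever
deleted from them, so that local docids 1..doccount are all in use. The
function name and the plain list of counts are hypothetical.

```python
import random


def pick_random_location(doccounts, rng=random):
    """Return (sub-database index, local docid), uniform over all documents."""
    if sum(doccounts) == 0:
        raise ValueError("all sub-databases are empty")
    # Choose a sub-database with probability proportional to its size...
    shard = rng.choices(range(len(doccounts)), weights=doccounts, k=1)[0]
    # ...then choose a document uniformly within it.
    local_docid = rng.randint(1, doccounts[shard])
    return shard, local_docid
```

Dropping the `weights` argument gives the simpler "1 out of N" variant,
which over-represents documents that live in small subdatabases.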

Cheers,
    Olly

QE :: Felix Ostmann

2015-Mar-13 18:09 UTC

OK, after a short brainstorm I implemented the following:

I don't modify my indexes, I only build new ones.

While generating them I save each database and its doccount (which is the
same as its lastdocid, since nothing is ever deleted) in an array in the
metadata. I also save the total doccount over all databases.

Now I can pick a random integer up to the total doccount and iterate over
the array: while the random integer is greater than the doccount of the
current database, I subtract that doccount from it and move on to the next
database.

Once the random integer is equal to or smaller than the doccount of the
current database, I open that database and call get_document with the
remaining integer.

Perfect random for me!
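
The walk over the per-database counts can be sketched like this. It is
an illustration only: the counts are passed in as a plain list rather
than read from the metadata, the function names are hypothetical, and it
relies on doccount being equal to lastdocid in every database.

```python
import random


def locate(n, doccounts):
    """Map n in 1..sum(doccounts) to (database index, local docid)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for index, count in enumerate(doccounts):
        if n <= count:
            return index, n  # n is now a docid inside this database
        n -= count  # skip this database's documents
    raise ValueError("n exceeds the total document count")


def random_location(doccounts, rng=random):
    """Pick a document uniformly at random across all databases."""
    total = sum(doccounts)
    if total == 0:
        raise ValueError("all databases are empty")
    return locate(rng.randint(1, total), doccounts)
```

For counts of 5, 2 and 2, the integers 1 to 5 land in the first
database, 6 and 7 in the second (as local docids 1 and 2), and 8 and 9
in the third.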

Thanks for your help!



Kind regards,
Felix Ostmann
