thr3ads.net - similar to: "128 bit Document IDs (Please don't hurt me)"

Displaying 20 results from an estimated 10000 matches similar to: "128 bit Document IDs (Please don't hurt me)"

2014 May 10

some trouble when devising skiplist

Hi, I have run into some trouble, which I describe in my journal: http://trac.xapian.org/wiki/GSoC2014/Posting%20list%20encoding%20improvements/Journal#May10 The corresponding code is in my git repository. Could you give me some help? Shangtong Zhang, Second Year Undergraduate, School of Computer Science, Fudan University, China.

2013 Jun 19

Compact databases and removing stale records at the same time

On Wed, Jun 19, 2013, at 03:49 PM, Olly Betts wrote:
> On Wed, Jun 19, 2013 at 01:29:16PM +1000, Bron Gondwana wrote:
> > The advantage of compact - it runs approximately 8 times as fast (we are CPU limited in each case - writing to tmpfs first, then rsyncing to the destination) and it takes approximately 75% of the space of a fresh database with maximum

2011 Mar 04

Example of work with mongo

Hi there, I'm a newbie here and I have a problem with the connection to MongoDB. The connection between Rails and MongoDB works, but I don't know how to print only one "column" from a document. If I try the following code: puts db["testCollection"].find_one().inspect then I get the entire BSON structure, as:

2017 Jun 05

Logging the click data

Hi James,
> ID: some identifier for each query
> QUERY: text of the query (when the query is run)
> URLs: every URL displayed (or alternatively, the Xapian docid; this might be easier)
> OFFSET: otherwise you'll have difficulty coping with result pages other than the first page (when this happens, the query ID should probably remain the same, and when you aggregate
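The four fields quoted above can be pictured as one log record per displayed result page. This is only a minimal sketch: the class name `QueryLogEntry`, the field names and the JSON-lines layout are assumptions made for illustration, not something the thread specifies.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class QueryLogEntry:
    """One displayed result page (hypothetical layout, not from the thread)."""
    query_id: str        # ID: stays the same across pages of one query
    query: str           # QUERY: text of the query as it was run
    docids: List[int]    # URLs, or here the Xapian docids that were displayed
    offset: int = 0      # OFFSET: rank of the first result on this page


def to_log_line(entry: QueryLogEntry) -> str:
    """Serialise an entry as one JSON line so pages can be aggregated later."""
    return json.dumps(asdict(entry), sort_keys=True)


first_page = QueryLogEntry("q1", "128 bit docid", [12, 7, 31], offset=0)
second_page = QueryLogEntry("q1", "128 bit docid", [4, 19, 2], offset=3)
print(to_log_line(first_page))
print(to_log_line(second_page))
```

Keeping the same `query_id` on both pages is what lets the two lines be grouped back into one query when the log is aggregated.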

2009 Jul 15

XAPIAN_FLUSH_THRESHOLD

I'm playing around with a machine that has 2 GB of memory, indexing about 5 GB of data with an average of 2 MB per document. The documents are plain text. I notice that omindex's memory footprint gets bigger and bigger, then the machine starts to swap and it all slows down to a crawl. With regard to export XAPIAN_FLUSH_THRESHOLD, I know the default is 10000. Am I right in saying that for my setup
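The numbers in this message can be worked through directly. The sketch below assumes the threshold counts documents (as the default of 10000 quoted above suggests) and that buffered memory grows roughly in proportion to document size; `docs_between_flushes`, the 512 MB budget and `overhead_factor` are hypothetical values chosen for illustration, not measured Xapian behaviour.

```python
import os

MB = 1024 ** 2
GB = 1024 ** 3

DEFAULT_THRESHOLD = 10000      # default quoted in the message above
corpus_bytes = 5 * GB          # "about 5 GB of data"
avg_doc_bytes = 2 * MB         # "average of 2 MB per document"

# Roughly how many documents the whole run contains.
total_docs = corpus_bytes // avg_doc_bytes


def docs_between_flushes(memory_budget_bytes, avg_doc_bytes, overhead_factor=1.0):
    """Documents to buffer before flushing, for a given memory budget.

    overhead_factor is a guess at in-memory size relative to raw text size.
    """
    per_doc = avg_doc_bytes * overhead_factor
    return max(1, int(memory_budget_bytes // per_doc))


# Hypothetical budget: a quarter of the machine's 2 GB.
threshold = docs_between_flushes(512 * MB, avg_doc_bytes)

# Only affects this process and any child processes it starts afterwards.
os.environ["XAPIAN_FLUSH_THRESHOLD"] = str(threshold)

print(total_docs, threshold)
```

Under these assumptions the whole corpus is about 2560 documents, fewer than the default threshold, so an automatic flush would not be triggered before the run ends; a much smaller value looks more appropriate for documents of this size.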

2012 Mar 31

Project: Posting list encoding improvements

Hi Xapianers: My name is Weixian Zhou, a Computer Science student at the University at Buffalo, State University of New York. I am interested in the projects on posting list encoding improvements and weighting schemes, and I have some questions about them. 1) After reading the comments in brass_postlist.cc, I am still not very clear about the detailed structure of the posting list. If you can provide some simple

2014 Apr 13

Adding an external library to Xapian

My code is not on GitHub. I am using the tarball as of now. The following is the error that occurred: http://pastebin.com/cVJrjUZX
On Sun, Apr 13, 2014 at 8:16 PM, James Aylett <james-xapian at tartarus.org> wrote:
> On 13 Apr 2014, at 15:37, Pallavi Gudipati <pallavigudipati at gmail.com> wrote:
> > A linker error is encountered even after following the above

2011 Aug 09

what is the fastest way to fetch results which are sorted by timestamp ?

What is the fastest way to fetch results which are sorted by timestamp? I want to use Xapian as my search engine, using add_boolean_term(something) and add_value(0,sortable_serialise(get_timestamp())) on each doc, then searching with enquire.set_weighting_scheme(xapian.BoolWeight()) and enquire.set_sort_by_value(0,True) to ensure that the results are sorted by the timestamp. This method is OK, but
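Sorting on a value slot works because the stored bytes compare in the same order as the numbers they encode. The sketch below is not Xapian's sortable_serialise format; it is a different, well-known order-preserving encoding of IEEE 754 doubles, used here only to illustrate the idea. The function name `sortable_key` is hypothetical.

```python
import struct


def sortable_key(value: float) -> bytes:
    """Encode a float so that byte-wise comparison matches numeric order.

    Illustration only: this is NOT the byte format Xapian produces.
    """
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    if bits & (1 << 63):
        bits ^= 0xFFFFFFFFFFFFFFFF   # negative: flip every bit
    else:
        bits |= 1 << 63              # non-negative: set the top bit
    return struct.pack(">Q", bits)


timestamps = [1312848000.0, -1.5, 0.0, 1312848000.5, 42.0]

# Sorting by the encoded bytes gives the same order as sorting the numbers.
oldest_first = sorted(timestamps, key=sortable_key)
# Reversed order, like passing True as the second argument of set_sort_by_value.
newest_first = sorted(timestamps, key=sortable_key, reverse=True)
print(oldest_first)
print(newest_first)
```

Because the comparison is a plain byte comparison, the engine never has to decode the value while sorting.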

2010 Oct 22

overlapping docids when searching on multiple databases?

Just a quick question - it seems to me that it's entirely possible to get overlapping docids when searching on multiple databases? For instance: open database1, add database2 to database1, search db1+db2. If docid 10 exists in both databases, is there any way of telling which database to retrieve the document from? /Per Jessen, Zürich
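As far as I understand it, Xapian avoids this collision by interleaving docids when databases are combined, so a combined docid encodes both the sub-database and the docid within it. The sketch below shows that arithmetic in plain Python; the function names `combine_docid` and `split_docid` are hypothetical helpers, not part of the Xapian API.

```python
def combine_docid(db_index: int, local_docid: int, n_databases: int) -> int:
    """Docid seen in the combined database (db_index is 0-based)."""
    return (local_docid - 1) * n_databases + db_index + 1


def split_docid(combined_docid: int, n_databases: int):
    """Recover (db_index, local_docid) from a combined docid."""
    if combined_docid < 1 or n_databases < 1:
        raise ValueError("docids and database counts start at 1")
    db_index = (combined_docid - 1) % n_databases
    local_docid = (combined_docid - 1) // n_databases + 1
    return db_index, local_docid


# Docid 10 existing in both of two databases maps to two different
# combined docids, so the results stay distinguishable.
in_first = combine_docid(0, 10, 2)
in_second = combine_docid(1, 10, 2)
print(in_first, in_second)
print(split_docid(in_first, 2), split_docid(in_second, 2))
```

With two databases the combined docids come out at roughly twice the local ones, which is worth keeping in mind when comparing them against docids from a single database.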

2018 Jan 03

Storing the documents text: data record or value ?

Hi, Following the Recoll snippets generation performance problem caused by the new positions list storage scheme in Xapian 1.4, I am experimenting with generating snippets from the complete document text stored in the index. This increases the index size much less than I would have expected (around 10-15% apparently with my home directory data), which is good news obviously. I have tried

2004 Dec 21

Search::Xapian add_database'd search results are odd?

Sorry if this is the wrong forum to discuss Search::Xapian issues; this just seems like the best place. Anyway, I've been testing out using $db->add_database() when searching, and it seems like the docids I'm getting out of it are incorrect, almost as though they're "double" what they should be (numerically)... the docids that exist should be around 950,000 and

2023 May 03

manual flushing thresholds for deletes?

On Wed, May 03, 2023 at 12:38:15PM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > This will also effectively ignore boolean terms, assuming you're giving them wdf of 0 (because $3 here is the collection frequency, which is sum(wdf(term)) over all documents).
> Should boolean terms be ignored when estimating flushing
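The definition quoted above, collection frequency as sum(wdf(term)) over all documents, can be checked on a toy index. This is a minimal sketch in plain Python; the `collection_frequencies` helper, the sample documents and the term names are made up for illustration.

```python
from collections import defaultdict


def collection_frequencies(docs):
    """Sum the wdf of each term over all documents.

    docs is an iterable of {term: wdf} mappings, one per document.
    """
    totals = defaultdict(int)
    for termlist in docs:
        for term, wdf in termlist.items():
            totals[term] += wdf
    return dict(totals)


# Two toy documents; the boolean filter term "Kmail" is indexed with wdf 0.
docs = [
    {"xapian": 3, "flush": 1, "Kmail": 0},
    {"xapian": 1, "Kmail": 0},
]
cf = collection_frequencies(docs)
print(cf)
```

A term indexed only with wdf 0 ends up with a collection frequency of 0, so any estimate weighted by collection frequency leaves it out, however many documents contain it.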

2017 Dec 18

How to get the serialise score returned in Xapian::KeyMaker->operator().

On Sat, Dec 16, 2017 at 10:11:40PM +0000, Olly Betts wrote:
> Unfortunately the sort key isn't currently exposed via the public API. It's available internally and it seems like it ought to be accessible but there's no accessor method for it - I can add one but that won't help for existing releases.
I've added MSetIterator::get_sort_key() to master in

2009 Jun 23

Indexing more than 15 billion documents

Hi, Sorry to follow up on an old thread, but I am wondering if there has been any work done on, or interest in, increasing the maximum document id beyond a 32bit limit?
Daniel
On Mon, Jun 18, 2007 at 04:11:54AM +0100, Olly Betts wrote:
> > In particular, there is currently a limit of 4 billion documents in a database, due to using a 32 bit type for document IDs, but I don't
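The "4 billion" limit and the 15 billion in the subject line can be compared with a little arithmetic. The sketch assumes an unsigned 32-bit docid type with 0 not used as a valid docid; the helper name `bits_needed` is hypothetical.

```python
# Largest docid an unsigned 32-bit type can hold (assuming 0 is not valid).
MAX_32BIT_DOCID = 2 ** 32 - 1


def bits_needed(n_documents: int) -> int:
    """Smallest unsigned integer width that can number n_documents."""
    return n_documents.bit_length()


target = 15_000_000_000   # "more than 15 billion documents"
print(MAX_32BIT_DOCID)    # the "4 billion" limit
print(target > MAX_32BIT_DOCID, bits_needed(target))
```

Fifteen billion documents need 34 bits, so a 64-bit docid type would already leave a great deal of headroom.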

2012 Jan 08

Testing document size preallocation.

https://gist.github.com/ad2accc5b4655753923d So here I am creating a database with no values for each small document and one with a bunch of blank values (uuid_blank). Once those are flushed then I reopen them and start replacing the documents of each with identical documents that have an identical large set of values. I am using replace_document and a specific document ID. Is there a specific

2023 Mar 27

manual flushing thresholds for deletes?

On Mon, Mar 27, 2023 at 11:22:09AM +0000, Eric Wong wrote:
> Olly Betts <olly at survex.com> wrote:
> > 10 seems too long. You want the mean word length weighted by frequency of occurrence. For English that's typically around 5 characters, which is 5 bytes. If we go for +1 that's:
> Actually, 10 may be too short in my case since there's a
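The "mean word length weighted by frequency of occurrence" mentioned above can be measured on a sample of one's own data instead of relying on the typical English figure. This is a minimal sketch: the `mean_word_length` helper and its very simple tokenising regex are assumptions, and a real indexer will split and stem terms differently.

```python
import re
from collections import Counter


def mean_word_length(text: str) -> float:
    """Mean word length in bytes, weighted by how often each word occurs."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    weighted = sum(len(word.encode("utf-8")) * n for word, n in counts.items())
    return weighted / total


sample = "the cat sat on the mat"
print(mean_word_length(sample))
```

Weighting by frequency is the same as averaging over every token in the text, which is why short, common words pull the figure down.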

2023 May 03

manual flushing thresholds for deletes?

Olly Betts <olly at survex.com> wrote:
> On Mon, Mar 27, 2023 at 11:22:09AM +0000, Eric Wong wrote:
> > Olly Betts <olly at survex.com> wrote:
> > > 10 seems too long. You want the mean word length weighted by frequency of occurrence. For English that's typically around 5 characters, which is 5 bytes. If we go for +1 that's:

2016 Mar 07

GSOC-2016 Project : Clustering of search results

On Mon, Mar 07, 2016 at 01:36:43AM +0530, Richhiey Thomas wrote:
> My questions are:
> 1) Can you direct me on how to convert this raw idea into a proposal in context to Xapian with more detail? What areas do I focus on?
Our GSoC guide has an application template <https://trac.xapian.org/wiki/GSoCApplicationTemplate> which you should use to structure your proposal. It has some

2010 Jan 30

Failure trying to update document.

Hi list. I have a specific document that does not handle updates sitting in the index. What can I do about that?
2010-01-30T13:58:07 Eval failure: Exception: No termlist for document 287376 at /usr/lib/perl5/Search/Xapian/Enquire.pm line 56.
2010-01-30T13:58:07 job failed. considering retry. is max_retries of 1000 >= failures of 1?
2010-01-30T13:58:07 job failed: Exception: No

2007 Jun 17

Flint failed to deliver indexing performance to Quartz.

Flint failed to deliver indexing performance to Quartz. I am proposing to remove Flint as the default database and put the Quartz database back as the default. The catch is not that the Flint database is smaller and faster during searches than the Quartz database, which is what the developers were concerned with when measuring, but that they neglected to measure performance when creating large indexes. The truth is that Flint
