Neville Burnell
2006-Sep-14 07:19 UTC
[Ferret-talk] Possiible Bug ? indexWriter#doc_count counts deleted docs after #commit
Hi David,

> Deleted documents don't get deleted until commit is called

Ok, but FYI, my experiments show that #commit doesn't affect #doc_count, even across ruby sessions.

On a different note, I'd like to request a variation of #add_document which returns the doc_id of the document added, as opposed to self.

I'm trying to track down an issue with a large test index [600MB, 500k docs] in which I need to update a document. The old document is deleted then added again, but doesn't show up in my searches.

A #doc_count on the writer before and after #add_document shows that the index is 1 document larger, but I still can't #search for the updated doc.

What do you think about having #add_document "yield" the doc_id if block_given?

Neville
David Balmain
2006-Sep-14 07:34 UTC
[Ferret-talk] Possiible Bug ? indexWriter#doc_count counts deleted docs after #commit
On 9/14/06, Neville Burnell <Neville.Burnell at bmsoft.com.au> wrote:
> Hi David,
>
> > Deleted documents don't get deleted until commit is called
>
> Ok, but FYI, my experiments show that #commit doesn't affect #doc_count,
> even across ruby sessions.

Sorry, I guess I wasn't very clear on that point. The deletes don't get committed until commit is called, which is why I don't have a num_docs method in IndexWriter: there is no way to reliably tell until commit is called. IndexWriter#doc_count is like IndexReader#max_doc. It tells you how many documents there are in the index, deleted or not.

> On a different note, I'd like to request a variation of #add_document
> which returns the doc_id of the document added, as opposed to self.
>
> I'm trying to track down an issue with a large test index [600MB, 500k
> docs] in which I need to update a document. The old document is deleted
> then added again, but doesn't show up in my searches.
>
> A #doc_count on the writer before and after #add_document shows that the
> index is 1 document larger, but I still cant #search for the updated
> doc.
>
> What do you think about having #add_document "yield" the doc_id if
> block_given?
>
> Neville

How about just using the doc_count method? Call it after you add the document and subtract one and you'll have the document ID of the last document added. Don't call it before you add the document, as a merge might happen when you add the document, possibly changing all document IDs when deletes are completely removed.

Cheers,
Dave
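For readers following along, the counting behaviour described above can be sketched with a toy model in plain Ruby. This is not Ferret's implementation and makes no Ferret calls: ToyWriter, delete(id) and live_count are hypothetical names invented for illustration, and the model assumes the document ID is simply the slot position in an array.

```ruby
# Toy model only. ToyWriter is hypothetical and does not use the Ferret API.
class ToyWriter
  def initialize
    @docs = []     # slot index doubles as the document ID
    @deleted = []  # IDs marked deleted but not yet merged away
  end

  # Returns self, as the thread says #add_document does.
  def add_document(doc)
    @docs << doc
    self
  end

  # Marks a slot as deleted; the slot itself stays in place.
  def delete(id)
    @deleted << id unless @deleted.include?(id)
    self
  end

  # Counts every slot, deleted or not (the max_doc-style count).
  def doc_count
    @docs.size
  end

  # Live documents only (hypothetical; shown purely for contrast).
  def live_count
    @docs.size - @deleted.size
  end
end

writer = ToyWriter.new
writer.add_document("a").add_document("b").add_document("c")
writer.delete(1)                    # doc_count is unchanged by the delete
writer.add_document("b, updated")
last_id = writer.doc_count - 1      # ID of the document just added
puts "doc_count=#{writer.doc_count} live=#{writer.live_count} last_id=#{last_id}"
```

In this model the delete leaves doc_count untouched, which matches the observation that started the thread, and doc_count - 1 taken right after the add gives the ID of the newest slot.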
David Balmain
2006-Sep-14 07:37 UTC
[Ferret-talk] Possiible Bug ? indexWriter#doc_count counts deleted docs after #commit
On 9/14/06, David Balmain <dbalmain.ml at gmail.com> wrote:
> How about just using the doc_count method. Call it after you add the
> document and subtract one and you'll have the document ID of the last
> document added. Don't call it before you add the document as a merge
> might happen when you add the document, possibly changing all document
> IDs when deletes are completely removed.
>
> Cheers,
> Dave

I should also mention the reason I wouldn't want to return the document ID from any IndexWriter method is that the document ID could become invalid when the next document is added (if a segment merge is triggered and deletes exist). At least when using an IndexReader, the document ID is valid for the life of the reader.
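To illustrate why an ID handed out by the writer could go stale, here is a minimal sketch in plain Ruby of IDs shifting when deleted slots are dropped. It assumes a merge simply compacts the surviving documents in order; toy_merge is a hypothetical helper, not a Ferret method, and real segment merging is more involved.

```ruby
# Hypothetical helper, not a Ferret call: compacts a list of documents by
# dropping the deleted slots. Returns the surviving docs and an
# old-ID => new-ID map for the documents that were kept.
def toy_merge(docs, deleted_ids)
  new_docs = []
  id_map = {}
  docs.each_with_index do |doc, old_id|
    next if deleted_ids.include?(old_id)
    id_map[old_id] = new_docs.size  # the next free slot becomes the new ID
    new_docs << doc
  end
  [new_docs, id_map]
end

docs = ["a", "b (old)", "c", "b (new)"]
merged, id_map = toy_merge(docs, [1])
id_map.each { |old_id, new_id| puts "#{old_id} -> #{new_id}: #{merged[new_id]}" }
```

Here "b (new)" held ID 3 before the compaction and ID 2 after it, so an ID captured at add time would point at the wrong slot once such a merge has run.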