thr3ads.net - Xapian discuss - [Xapian-discuss] Xapian and indexing text with layout [Apr 2006]

Lionel

2006-Apr-30 12:43 UTC

[Xapian-discuss] Xapian and indexing text with layout

Hi,

First, sorry if this question has been discussed before; I checked the
list archives but didn't find an answer. Second, I'm relatively new to
Xapian. I managed to install it, use the Python bindings, and play with
the examples, and it was all pretty straightforward.

I'm trying to index a large number of PDF documents, all from
publications like newspapers and magazines, where each PDF file has
multiple columns and a complex layout. That isn't a problem in itself: my
PDFs are properly structured by the OCR.
Using pdftotext -layout I get the text with the original layout from the
PDF file and pass it through stdin to the indexer; nothing complicated
there. For now I'm using the example indexer for Python (simpleindex.py),
and I'm getting confused by how it maps paragraphs to document IDs.

If I search for a term, it is found, but each paragraph is returned as a
separate document ID. That's no good, because in my application logic
each PDF is a single document.

If I remove the -layout option, pdftotext generates a fairly well
organized text file, converting each column to paragraphs. That is still
handy because it preserves the original word distances, so "NEAR"
searches won't be affected. But after indexing I have the same problem:
paragraphs are indexed as separate documents.

I know exactly where the code does that:

                    # At each point, find the next alnum character (i), then
                    # find the first non-alnum character after that (j).
                    # Find the first non-plusminus character after that (k),
                    # and if k is non-alnum (or is off the end of the para),
                    # set j=k.
                    # The term generation string is [i,j), so len = j-i
                    i = 0
                    while i < len(para):
                        i = find_p(para, i, p_alnum)
                        j = find_p(para, i, p_notalnum)
                        k = find_p(para, j, p_notplusminus)
                        if k == len(para) or not p_alnum(para[k]):
                            j = k
                        if (j - i) <= MAX_PROB_TERM_LENGTH and j > i:
                            term = stemmer.stem_word(string.lower(para[i:j]))
                            doc.add_posting(term, pos)
                            pos += 1
                        i = j
                    database.add_document(doc)


If I remove the part of the code that handles paragraphs, will that cause
any problems with the indexing? I'm really getting confused.

I definitely need each PDF to be represented as a single document in
Xapian; otherwise searches will return duplicate hits for the same file.

Any suggestions or ideas?


Thank you.

Olly Betts

2006-Apr-30 12:57 UTC

On Sun, Apr 30, 2006 at 12:43:08PM +0100, Lionel wrote:
> If I look for a term, it returns it ok, but each paragraph is returned as a
> separate documentID, that's is not good, due that for me and my application
> logic each PDF it's a single document itself.

That's only how simpleindex works - there's no requirement in Xapian
to produce a Document for each paragraph, you choose what you want to
make a "Document".  The intention is that simpleindex is a simple
dummy example meant to show how you might use the Xapian API so we want
to keep down the amount of code which identifies documents, etc.

Perhaps it would be clearer as an indexer which takes a list of
filenames on the command line - that would probably have a similar
amount of non-Xapian-related code and should more closely match what
many users are trying to do.
> If I remove part of the code to ignore paragraphs, it wont be any problem
> with the indexing?
That's fine, or you might find it easier to write your code from a clean
start, just using simpleindex to see how to call Xapian (that's really
how it's intended to be useful).

So you want to start a fresh xapian.Document for each PDF file, and only
call xapian.WritableDatabase.add_document when you've finished handling
that PDF file.
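
A minimal sketch of that approach, assuming each PDF has already been
converted to a text file with pdftotext and that a reasonably recent
version of the Xapian Python bindings is installed. The names postings()
and index_files() are made up for this illustration, the regex tokenizer
is a simplified stand-in for the scanning loop in simpleindex.py (no
stemming, no plus/minus handling), and the value of MAX_PROB_TERM_LENGTH
is an assumption:

```python
import re
import sys

MAX_PROB_TERM_LENGTH = 64  # assumed cap on term length; adjust as needed


def postings(text):
    """Return [(term, position), ...] for the text of one whole file.

    Simplified tokenizer: runs of ASCII letters/digits, lowercased, no
    stemming. Positions keep counting across paragraph breaks, so
    proximity between words in neighbouring paragraphs is preserved.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    terms = [w.lower() for w in words if len(w) <= MAX_PROB_TERM_LENGTH]
    return [(term, pos) for pos, term in enumerate(terms, 1)]


def index_files(db_path, paths):
    """Add exactly one Xapian document per input text file."""
    import xapian  # imported here so postings() is usable without Xapian

    database = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            text = handle.read()
        doc = xapian.Document()      # fresh Document for each file
        doc.set_data(path)           # keep the filename so a hit maps to its PDF
        for term, pos in postings(text):
            doc.add_posting(term, pos)
        database.add_document(doc)   # once per file, not once per paragraph
    database.commit()


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Hypothetical usage: python index_pdfs.py DBPATH file1.txt file2.txt ...
    index_files(sys.argv[1], sys.argv[2:])
```

The only structural change from the per-paragraph version is where the
Document is created and where add_document is called: both move out to
the per-file loop, so a search returns one hit per PDF.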

Cheers,
    Olly
