Folks,

I'm writing simply to say thanks for building such a great piece of software! I've been working with Xapian since the summer to construct an interface for searching the complete US patent collection back to 1836 (the US Patent Office only offers searches back to 1976). It's been a joy to use Xapian and I'm quite pleased with the search results we've received. Our initial test system is now online and allows searching of documents between 1836 and 1925 (about 1.5 million documents):

http://search.allpatents.org/

We're working with folks at HP Labs to perform the OCR extraction on the original page images which were scanned by the patent office. As the OCR effort continues we'll expand our collection to include all US patents.

Feel free to explore and make comments - if you have any thoughts on how we might improve the interface or the search indexing I'd be glad to chat! I may also write up a summary on how we've managed the text over image markup implementation with Xapian (it's nothing particularly fancy but may be of use to others none the less!)...

Thanks again for all your work!

Kevin Webb
Very cool stuff! I'm glad there are folks out there such as yourself working on interesting and helpful projects like this!

A quick note after looking over some searches:

* The highlighting is cool. You should use Xapian to tell you what the stemmed version of your query is, and index your word->{x,y} transformation on the stemmed words rather than on the explicit words, as it seems you're doing now. For example, searching for "computer" yields, as expected, results containing only the word "computing," yet the word "computing" is not highlighted.

* It might be worthwhile to have traditional text snippets next to the images for helping with quick "is this relevant?" look-throughs.

--Philip Neustrom
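A minimal sketch of the stemmed-index suggestion, for illustration only. It assumes the OCR output is available as (word, x, y) tuples, which the thread does not confirm, and `toy_stem` is a hypothetical stand-in for a real stemmer such as the one Xapian provides. The function names are made up for this example and are not taken from the allpatents.org code.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase, strip a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "ers", "er", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_highlight_index(ocr_words, stem=toy_stem):
    """Map each stemmed term to the (x, y) positions of every word that stems to it."""
    index = defaultdict(list)
    for word, x, y in ocr_words:
        index[stem(word)].append((x, y))
    return index


def highlight_positions(query, index, stem=toy_stem):
    """Return the positions to highlight for a query, matching on stems."""
    positions = []
    for term in re.findall(r"[A-Za-z]+", query):
        positions.extend(index.get(stem(term), []))
    return positions


# Assumed OCR output for one page: (word, x, y).
page = [("A", 10, 20), ("computing", 40, 20), ("engine", 120, 20), ("computes", 10, 40)]
index = build_highlight_index(page)
hits = highlight_positions("computer", index)
print(hits)
```

Because both the page words and the query terms pass through the same stemming function, a search for "computer" picks up the coordinates of "computing" and "computes" as well. Using the same stemmer at highlight time as at index time is the important part; which stemmer it is matters less.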
On Fri, Oct 27, 2006 at 06:17:59PM -0400, Kevin Webb wrote:
> Our initial test system is now online and allows searching of
> documents between 1836 and 1925 (about 1.5 million documents):

Is this intended to be a public test system? I ask because it doesn't seem to be linked from allpatents.org currently.

I can add it to our "current users" page if you like. A "powered by Xapian" reciprocal link is always appreciated (and helping more people discover Xapian helps us improve it faster so it benefits you as well!)

> Feel free to explore and make comments - if you have any thoughts on
> how we might improve the interface or the search indexing I'd be glad
> to chat! I may also write up a summary on how we've managed the text
> over image markup implementation with Xapian (it's nothing
> particularly fancy but may be of use to others none the less!)...

I think that would be interesting. I suspect most people can guess how it might be done, but there's nothing quite like the wisdom from actual implementation experience.

Does this highlighting rely on cookies to work? It doesn't work for me using Firefox 2. I declined your cookie and suspect that's why since the search terms don't seem to be passed to the results page by any other mechanism I can see. It might be nicer not to use a cookie here, since as it is I can't email someone a URL for a particular patent including the highlighting.

Cheers,
Olly
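One way to avoid the cookie, sketched here under assumptions: carry the search terms in the query string of the patent page's URL, so the link itself is enough to reproduce the highlighting. The base URL and the parameter names `id` and `q` are hypothetical and are not the site's actual scheme.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def patent_url(base, patent_id, query):
    """Build a shareable link that carries the search terms in the query string."""
    return f"{base}?{urlencode({'id': patent_id, 'q': query})}"


def terms_from_url(url):
    """Recover the search terms from such a link; empty list if none were passed."""
    params = parse_qs(urlsplit(url).query)
    return params.get("q", [""])[0].split()


# Hypothetical base URL and patent number, for illustration only.
url = patent_url("http://search.example.org/patent", "123456", "steam engine")
print(url)
print(terms_from_url(url))
```

The results page would then read the terms from the URL instead of from a cookie, so a link pasted into an email highlights the same words for the recipient.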