oscaruser@programmer.net
2006-Jul-30 07:47 UTC
[Xapian-discuss] How to search URL field (unique Q key)?
Folks,

I ran scriptindex on about 1800 documents, and partway through adding files to the flint db I terminated the program (^C). I don't know which documents were added and which remain, so I'd like to know whether there is a way to search the db on the URL field, which was defined as "url : field=url hash boolean=Q unique=Q" in the index script. Example as follows.

Thanks,
OSC

oscar@server:/svr/hda1/tmp/bsp0010$ quest -d ./db -m 1 'test'
277 [100%]
caption=TestProducts.com | National Safety Products, Inc. | 877-412-3600 | <-
sample=[ 0 Items $0.00 0 lbs ] | HOME | WHAT'S NEW? | SPECIALS | REVIEWS | CATALOG | FEATURES | ABOUT US | CONTACT INFO | Top ? Catalog ? Gas Leak Detectors ? NGD268 Log In | My Account | Cart Contents | Checkout
url=http://www.testproducts.com/safecart/product_info.php/products_id/411?osCsid=ce97a217e68568f590d675dc8feaf2f0

oscar@server:/svr/hda1/tmp/bsp0010$ quest -d ./db -m 1 'www.limitedtoo.com/detail/4592907/9/3/"/detail/4592907/6/clearance/","0'
553 [9%]
caption=Bhive Awards - Item Detail
sample=About Us|Testimonials|Policy|Contact Us|Track Order|Retrieve Saved Order Online Store >> All Occasion Awards >> Loving Cups Volume Discount Size: A: 13" Qty Price 1 - 3 $25.85 4 - 11 $23.50 12 - 24 $21.95 25 - UP $20.70 Size: B: 13-3/4" Qty Price 1
url=http://www.bhiveawards.com/store/item.asp?ITEM_ID=4215&DEPARTMENT_ID=138
oscar@server:/svr/hda1/tmp/bsp0010$
James Aylett
2006-Jul-30 09:58 UTC
[Xapian-discuss] How to search URL field (unique Q key)?
On Sat, Jul 29, 2006 at 10:46:46PM -0800, oscaruser@programmer.net wrote:

> I ran script index on about 1800 documents, and somewhere during the
> scriptindex process of adding files to the flint db, I terminated
> the program (^C). I don't know which docs were added, from those
> that are remaining, and wanted to know if there is a way to search
> the db based on the URL field which was defined as "url : field=url
> hash boolean=Q unique=Q" in the script index?

No idea if this will actually work as it's a boundary case, but how about omega with:

$setmap{prefix,url,Q}
$hitlist{url = $field{url}
}

and search it for url:*

The bit I'm uncertain about is whether wildcards will work like that or not. If not, write a python, php or whatever script using the bindings to list all terms in the database, skip_to "Q" and keep printing until they don't start with "Q" any more.

James

--
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org
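The "list all terms, skip to Q, stop when they no longer start with Q" idea might look roughly like the sketch below. It assumes the Xapian Python 3 bindings are installed and that terms come back as bytes; the function names strip_prefix and list_unique_ids are hypothetical, made up for illustration. It also assumes the URLs are short enough that the "hash" action left them unmodified, otherwise the tail of the term is a hash rather than the URL text.

```python
def strip_prefix(terms, prefix=b"Q"):
    """Yield each term that starts with `prefix`, with the prefix removed.

    `terms` is assumed to be in sorted byte order, as Xapian's term
    iterators are, so iteration can stop at the first term that sorts
    after the prefix range.
    """
    for term in terms:
        if term.startswith(prefix):
            yield term[len(prefix):]
        elif term > prefix:
            # Sorted input: nothing later can start with the prefix.
            break


def list_unique_ids(db_path, prefix="Q"):
    """Return the unique-key terms in the database, prefix removed.

    Hypothetical helper; imports xapian lazily so the pure function
    above can be used without the bindings installed.
    """
    import xapian

    db = xapian.Database(db_path)
    # allterms(prefix) starts the iteration at the first term with that
    # prefix, which plays the role of skip_to("Q").
    raw = (item.term for item in db.allterms(prefix))
    return [t.decode("utf-8", "replace")
            for t in strip_prefix(raw, prefix.encode("utf-8"))]
```

Calling list_unique_ids("./db") would then give the URLs already indexed, which can be compared against the full list of about 1800 to find the ones still missing.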
oscaruser@programmer.net
2006-Jul-31 18:51 UTC
[Xapian-discuss] How to search URL field (unique Q key)?
Hi James,

If I'm not mistaken, printing all of the Q terms basically dumps the database contents but does no searching. Since the URL field was set to unique, I would imagine there is a way to search against the key. scriptindex must determine whether a record already exists before updating or adding it to the index. That lookup is the functionality I want to use because, given the nature of the data structure and keys, searching should be very fast. I'll check the scriptindex sources to see if I can understand what's going on there.

Thanks,
-OSC
On Sat, Jul 29, 2006 at 10:46:46PM -0800, oscaruser@programmer.net wrote:

> I ran script index on about 1800 documents, and somewhere during the
> scriptindex process of adding files to the flint db, I terminated the
> program (^C). I don't know which docs were added, from those that are
> remaining, and wanted to know if there is a way to search the db based
> on the URL field which was defined as "url : field=url hash boolean=Q
> unique=Q" in the script index? Example as follows.

For 1800 documents, I'd just reindex rather than mess around. But you can easily check if a particular term is in a database using delve, e.g.:

    delve -t Qhttp://www.google.com/ ./db

If the URL is long (~240 characters or more) the "hash" will mean you can't just use the URL as is to produce this term.

Or you could just check the URL for the last document successfully added:

    delve ./db

Which will tell you how many documents there are - assuming there were no deletions, this will be the docid of the last document added and you can list its termlist like so:

    delve -r 1800 ./db

Cheers,
    Olly
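For checking many URLs at once, the same term lookup that "delve -t" does can be done through the bindings. The following is only a sketch: it assumes the Xapian Python bindings are installed, the helper names split_by_presence and check_urls are hypothetical, and it only works for URLs short enough that the "hash" action did not alter them (for longer ones the term is not simply "Q" plus the URL).

```python
def split_by_presence(urls, term_exists, prefix="Q"):
    """Split `urls` into (indexed, missing) lists.

    `term_exists` is any callable that takes a term and returns True if
    the database contains it, e.g. xapian.Database.term_exists.
    """
    indexed, missing = [], []
    for url in urls:
        # The unique key term is the boolean prefix followed by the URL.
        if term_exists(prefix + url):
            indexed.append(url)
        else:
            missing.append(url)
    return indexed, missing


def check_urls(db_path, urls):
    """Hypothetical wrapper: open the database and classify `urls`."""
    import xapian  # imported lazily; needs the Xapian bindings

    db = xapian.Database(db_path)
    return split_by_presence(urls, db.term_exists)
```

Something like check_urls("./db", all_urls) would then return the URLs that still need to be fed to scriptindex as the second element of the result.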