Hi all,

I am using QueryParser::add_boolean_prefix("url", "U") to restrict searches to documents that have a specific URL. When the input has a URL containing a space, how should it be quoted ?

For instance, the input url:"file:///some/where/a file" will generate the terms "U"file:///some/where/a" and "file", and url:'file:///some/where/a file' will generate "U'file:///some/where/a" and "file". I also found that url:'file:///some/where/a file.txt' will generate the terms "U'file:///some/where/a", "file" and "txt", and not just "U'file:///some/where/a", "file.txt". Of course url:file:///some/where/afile.txt only generates Ufile:///some/where/afile.txt.

This leads me to a second question. At indexing time, long URLs are hashed just like what omindex does with hash_long_term(). Because of this, the QueryParser will always generate the wrong term when its input has a filter on one of these long URLs. Would it be possible to have something like the following ?

    void Xapian::QueryParser::add_boolean_prefix(
        const std::string &field,
        const std::string &prefix,
        const TermTransformer *transform);

where transform is a functor that is passed the term the QueryParser extracted from the input for the given field and modifies that term before the QueryParser builds the query.

Fabrice
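
To make the proposal concrete, here is a minimal, self-contained sketch of what such a functor could look like. It does not use Xapian at all: TermTransformer and LongTermHasher are hypothetical names, not part of the Xapian API, and the FNV-1a hash is only a stand-in, not the algorithm behind omindex's hash_long_term(). The one thing it illustrates is that query-time and index-time code would have to apply the same transformation for a long URL filter to match.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical interface, NOT part of the Xapian API: receives the term
// text extracted for a field and returns the text to use in the query.
struct TermTransformer {
    virtual ~TermTransformer() = default;
    virtual std::string operator()(const std::string &term) const = 0;
};

// Shortens over-long terms to exactly max_len characters by replacing the
// tail with a 16-digit hex hash of the whole term.
// Assumption: FNV-1a is a stand-in here; it is not what omindex uses.
class LongTermHasher : public TermTransformer {
    std::size_t max_len_;

  public:
    explicit LongTermHasher(std::size_t max_len) : max_len_(max_len) {
        assert(max_len_ > 16);  // need room for the 16 hex digits
    }

    std::string operator()(const std::string &term) const override {
        if (term.size() <= max_len_) return term;  // short terms pass through

        std::uint64_t h = 14695981039346656037ULL;  // FNV-1a offset basis
        for (unsigned char c : term) {
            h ^= c;
            h *= 1099511628211ULL;  // FNV-1a prime
        }
        char hex[17];
        std::snprintf(hex, sizeof hex, "%016llx",
                      static_cast<unsigned long long>(h));
        return term.substr(0, max_len_ - 16) + hex;
    }
};
```

An indexer and a search front end sharing one LongTermHasher instance (or one max_len value) would then produce the same term for the same URL, whichever side sees it.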

On 26/01/07, Fabrice Colin <fabrice.colin@gmail.com> wrote:
> I am using QueryParser::add_boolean_prefix("url", "U") to restrict searches to
> documents that have a specific URL.
> When the input has a URL containing a space, how should it be quoted ?

There isn't currently a way to quote such a prefixed boolean term, but shouldn't spaces be quoted as %20 in a url anyway?

> This leads me to a second question. At indexing time, long URLs are hashed just
> like what omindex does with hash_long_term(). Because of this, the QueryParser
> will always generate the wrong term when its input has a filter on one of these
> long URLs. Would it be possible to have something like the following ?
>
> void Xapian::QueryParser::add_boolean_prefix(
>     const std::string &field,
>     const std::string &prefix,
>     const TermTransformer *transform);

Perhaps, though for this case it seems unlikely that a user would really type in a 240+ character URL...

Cheers,
Olly

On 1/31/07, James Aylett <james-xapian@tartarus.org> wrote:
> On Tue, Jan 30, 2007 at 12:36:39PM +0800, Fabrice Colin wrote:
> > >There isn't currently a way to quote such a prefixed boolean term, but
> > >shouldn't spaces be quoted as %20 in a url anyway?
> >
> > Yes, for a URL, quoting makes sense, but for a file name filter, not
> > so much. For instance, entering something like 'file:"My CV.txt"'
> > is not completely unreasonable.
> >
> > Actually, this would be useful for searching indexes built by
> > omindex. As far as I can tell it doesn't escape U-prefixed terms,
> > so if a user wanted to find the document that has the term
> > 'Uhttp://localhost/some file.txt', he would have to enter
> > 'url:http://localhost/some%20file.txt', and the app would have to
> > unescape the U-prefixed term in the Query object generated by the
> > QueryParser.
>
> 'http://localhost/some file.txt' is not a valid URI; you MUST replace
> the SPC with either '+' or '%20'. omindex may not be getting all of
> this right, but it's the application's job rather than the user's.
>

To be honest, termprefixes.txt doesn't specify that the U prefix is for a "valid URI". It just says the "full URL". I know I am splitting hairs :-)

> ('file:"My CV.txt"' is similarly not a valid URI. Again, the
> application should be fixing things up somehow.)
>

Ah. This one is not a URI, it's a (valid) file name.

The problem is that "fixing things up" here means pre-processing the string before it's fed to the QueryParser, which partially nullifies the QueryParser's usefulness. Never mind, I will just have to add this to my TODO list :-)

Cheers.
Fabrice
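
The "unescape" step mentioned in this message can be sketched as a small percent-decoding helper. This is only an illustration under stated assumptions: percent_decode and hex_value are hypothetical names, the code does not touch Xapian, and walking the Query object to find the U-prefixed term is left out. Malformed escapes are copied through unchanged rather than rejected, which is a design choice, not a requirement.

```cpp
#include <cctype>
#include <string>

// Value of a hexadecimal digit, or -1 if c is not one.
int hex_value(unsigned char c) {
    if (std::isdigit(c)) return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Turns each well-formed %XX escape back into the byte it encodes.
// Anything that is not a '%' followed by two hex digits is kept as-is.
std::string percent_decode(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            int hi = hex_value(static_cast<unsigned char>(in[i + 1]));
            int lo = hex_value(static_cast<unsigned char>(in[i + 2]));
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out += in[i];
    }
    return out;
}
```

With this, the value typed as http://localhost/some%20file.txt decodes to http://localhost/some file.txt, which is the form the thread says omindex stores after the U prefix.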

On 2/1/07, James Aylett <james-xapian@tartarus.org> wrote:
> > >('file:"My CV.txt"' is similarly not a valid URI. Again, the
> > >application should be fixing things up somehow.)
> >
> > Ah. This one is not a URI, it's a (valid) file name.
>
> Not under Windows is it?, unless you have a driver called 'file' (this
> may not be true of XP+).
>

Who cares about Windows ? ;-)
I was referring to the string after the prefix, ie "My CV.txt".

> > The problem is that "fixing things up" here means pre-processing the
> > string before it's fed to the QueryParser, which partially nullifies
> > the QueryParser's usefulness. Never mind, I will just have to add
> > this to my TODO list :-)
>
> Yeah... it'd be nice if QP could call back for fixups in some way;
> there have been thoughts, but getting it so this would be possible is
> quite a job. Thing is, getting something like a URI into the system
> "pure" is a pretty specialised requirement...
>

Maybe it is yeah. In any case, that's something I need to cater for.

Fabrice
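
For what it's worth, the kind of pre-processing discussed here might look roughly like the sketch below, which rewrites file:"My CV.txt" into file:My%20CV.txt before the string reaches the parser. The function name escape_quoted_filter is hypothetical, nothing here calls Xapian, and it rests on an untested assumption: that the QueryParser keeps a value such as My%20CV.txt together as one boolean term. A literal '%' is encoded as %25 so the value can be decoded unambiguously afterwards.

```cpp
#include <string>

// Rewrites every field:"value" occurrence in query into field:value with
// spaces percent-encoded, so the value no longer contains a space.
// Matches only at the start of the query or after a space, so that
// e.g. profile:"..." is not mistaken for file:"...".
std::string escape_quoted_filter(const std::string &query,
                                 const std::string &field) {
    const std::string opener = field + ":\"";
    std::string out;
    std::size_t pos = 0;

    while (pos < query.size()) {
        std::size_t start = query.find(opener, pos);
        if (start == std::string::npos) break;

        std::size_t value = start + opener.size();
        std::size_t close = query.find('"', value);
        bool at_boundary = (start == 0 || query[start - 1] == ' ');

        if (close == std::string::npos || !at_boundary) {
            // Not a filter we can rewrite: copy it through untouched.
            out.append(query, pos, value - pos);
            pos = value;
            continue;
        }

        out.append(query, pos, start - pos);  // text before the filter
        out += field + ":";
        for (std::size_t i = value; i < close; ++i) {
            char c = query[i];
            if (c == ' ') out += "%20";
            else if (c == '%') out += "%25";
            else out += c;
        }
        pos = close + 1;  // continue after the closing quote
    }

    out.append(query, pos, std::string::npos);  // remaining tail, if any
    return out;
}
```

The application would still have to decode the resulting term afterwards, so this moves the "fixing up" outside the QueryParser rather than removing it, which is the drawback raised above.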