Displaying 20 results from an estimated 30 matches for "tokenisation".

2013 Apr 16
7
puppet-cleaner: makes puppet DSL code comply with a subset of the style guide
FWIW, I've written puppet-cleaner to help me make thousands of lines of puppet 2.6 DSL code comply with the puppet 2.7 style guide and expectations. I'm uploading it to github today for anyone to use. https://github.com/santana/puppet-cleaner Externally, you run puppet-clean file.pp and it can transform this: /* multiline comment trailing white space here -> */ class
2011 Aug 02
2
Positive experiences with Xapian
Hi Guys, I just wanted to take a moment to give some positive feedback regarding my experiences with Xapian recently. I've been doing a fair amount of research into search engines lately, as we have some fairly specific requirements for what we're attempting to do with them. Long story short, after a few weeks of playing around with just about everything under the sun (or at least,
2005 Dec 30
1
Query Parser, filenames and compound words
When I submit a filename to the query parser it breaks it up. Example: /home/user/file_name.ext becomes Xapian::Query((home:(pos=1) PHRASE 5 user:(pos=2) PHRASE 5 file:(pos=3) PHRASE 5 name:(pos=4) PHRASE 5 ext:(pos=5))) which does not find the document. If I do a single-term query without using the query parser, then I find the document. The Query Parser also breaks up hyphenated terms
2007 Nov 16
1
problem with searching plurals (with apostrophe)
Hello guys, I am using the acts_as_ferret plugin (0.4.1, latest) with the ferret gem (0.11.4, latest) on Rails 1.2.5 and Ruby 1.8.6 (Ubuntu Gutsy). I have this :Stores Model acts_as_ferret :fields => {:name => { :boost => 2 ,:store => :yes}, :short_desc => { :boost => 1.5,:store => :yes }, :tag_list => {:boost => 1
2024 Jan 07
1
Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints
...s findings includes this block of full- and half-width latin forms, coupled with an assumption that there's no lowercase vs uppercase forms in these alphabets. Assuming the latter is valid, just removing this block (or removing the parts of it which are Lu or Ll) should fix the problem as then tokenisation will switch mode - I tried this and it fixes your case at least: diff --git a/xapian-core/queryparser/word-breaker.cc b/xapian-core/queryparser/word-breaker.cc index 8108523ccd53..4fabc23f4b56 100644 --- a/xapian-core/queryparser/word-breaker.cc +++ b/xapian-core/queryparser/word-breaker.cc @@ -10...
2014 May 14
2
Starting work on Perf Test Module
Hello, I am beginning work on the perf test module. The initial steps that I aim to accomplish are: -> Download the Wikipedia dumps for multiple languages. -> Write Python scripts to tokenize the dump (will probably use something like nltk, which has powerful built-in tokenizers) -> Discuss and finalize the design of the search and query expansion perf tests as I want to complete them
2013 Feb 18
8
Error with service: "invalid byte sequence in US-ASCII"
I just built a new puppet master, and whenever I run puppet on it, it throws an error while processing a service resource: # puppet agent -t > Info: Retrieving plugin > Info: Caching catalog for i-45dc2b1d > Info: Applying configuration version 'g > 9ea47ad19bc706a754c00f00a024309948d3ea03' > Error: /Stage[main]/Ipa::Client::Basic/Service[sssd]: Could not
2022 Sep 22
7
[Bug 3474] New: ssh_config can escape double quotes with a backslash
https://bugzilla.mindrot.org/show_bug.cgi?id=3474 Bug ID: 3474 Summary: ssh_config can escape double quotes with a backslash Product: Portable OpenSSH Version: v9.0p1 Hardware: Other OS: Linux Status: NEW Severity: enhancement Priority: P5 Component: ssh Assignee:
2010 Nov 15
4
Stopword addition and stemming
Hi, Two questions which I'm unsure about: Stemming: I've turned on stemming, etc, but how can I confirm that it's being used in searches? What should I look/search for? Stopwords: I'm trying out xapian on a regional dataset (searching data from a *.co.us TLD, e.g.). I've noticed that searching for [bob co.us] results in *very* slow search times (tens of seconds), since it
2002 Jan 27
0
IdentityFile patch
By the way, I noticed in the previous IdentityFile patch I forgot to expand tilde. I fixed this by making the change in ssh.c instead of readconf.c, which is probably where it belongs, as far as the existing code is concerned: diff -ur openssh-3.0.2p1/auth.c openssh-3.0.2p1I/auth.c --- openssh-3.0.2p1/auth.c Sun Nov 11 17:06:07 2001 +++ openssh-3.0.2p1I/auth.c Sun Jan 27 12:05:14 2002 @@ -44,7
2018 Dec 17
2
LLVM Backend for a platform with no (normal) stack
Not only do FPGAs not support recursion, we don’t even support calls! All user code must be inlined into one kernel/component, which is then used to create HDL for the FPGA. Mark From: Bruce Hoult <brucehoult at sifive.com> Sent: December 17, 2018 9:28 AM To: Mendell, Mark P <mark.p.mendell at intel.com> Cc: jjones at prc-hsv.com; LLVM Developers Mailing List <llvm-dev at
2002 Jan 27
1
[PATCH] Add user-dependent IdentityFile to OpenSSH-3.0.2p1
Here is a patch to allow private key files to be placed system wide (for all users) in a secure (non-NFS) mounted location on systems where home directories are NFS mounted. This is especially important for users who use blank passphrases rather than ssh-agent (a good example of where this is necessary is for tunnelling lpd through ssh on systems that run lpd as user lp). IdentityFile now accepts
2016 Sep 19
2
Pull requests: CJK words and Snippet generator
...ity of the algorithms. > I'm not quite clear what your "n" above is - n is the number of terms in a document. I haven't done systematic testing of wall-clock time for the new feature. If it is crucial to go ahead with the patch, I could create a couple of benchmarks. > The tokenisation of the snippet uses the same code as indexing does, so > CJK should just work automatically, though it looks like there aren't > currently any testcases for this, so it would be worth checking (and > worth adding some) > > Normalisation could perhaps be done with a custom stemmi...
2019 Mar 07
3
Ask for advice on exact requirements to fix #699 mixed CJK numbers
I am working on "#699 Better tokenisation of mixed CJK numbers", and have implemented a partial patch for Chinese for this ticket. The current code works well with the special test cases, and all tests in xapian-core still pass. But I'm unsure about the exact requirements of the ticket, and about how much we could pay in performance on en...
2003 Jan 18
0
[Patch] User-dependent IdentityFile
Here is the user-dependent IdentityFile patch for openssh3.5 (BSD version), which allows private key files to be placed system wide (for all users) in a secure (non-NFS) mounted location. This addresses an important security hole on systems where home directories are NFS mounted, particularly if there are users who use blank passphrases (or when lpd is tunneled through ssh on systems running lpd
2016 Sep 07
2
Pull requests: CJK words and Snippet generator
On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote: > I think my main concerns are about efficiency (since that's a major > motivation for the current implementation, so slowing it down would be > annoying), and whether we can just make this the standard behaviour > rather than adding an option. The current implementation is O(n) and I took care to keep it at that. For the proposed term
2019 Jan 25
0
[klibc:update-dash] parser: Fix backquote support in here-document EOF mark
...: Fri, 25 Jan 2019 02:57:21 +0000 [klibc] parser: Fix backquote support in here-document EOF mark Currently using backquotes in a here-document EOF mark is broken because dash tries to do command substitution on it. This patch fixes it by checking whether we're looking for an EOF mark during tokenisation. Reported-by: Harald van Dijk <harald at gigawatt.nl> Signed-off-by: Herbert Xu <herbert at gondor.apana.org.au> Signed-off-by: Ben Hutchings <ben at decadent.org.uk> --- usr/dash/parser.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/usr/dash/parser.c b/usr/dash/par...
2020 Mar 28
0
[klibc:update-dash] dash: parser: Fix backquote support in here-document EOF mark
...kquote support in here-document EOF mark [ dash commit c166b718b496da63c4df7a0972df2fc6cd38256b ] Currently using backquotes in a here-document EOF mark is broken because dash tries to do command substitution on it. This patch fixes it by checking whether we're looking for an EOF mark during tokenisation. Reported-by: Harald van Dijk <harald at gigawatt.nl> Signed-off-by: Herbert Xu <herbert at gondor.apana.org.au> Signed-off-by: Ben Hutchings <ben at decadent.org.uk> --- usr/dash/parser.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/usr/dash/parser.c b/usr/dash/par...
2005 Jun 09
1
Query parser and stemming of norwegian letters
Hello, can I get an explanation of the following. Running the following code: .... pqp=new QueryParser(); Stem stem("norwegian"); cout << "DEBUG " << stem.stem_word(_sXapian)<< endl; pqp->set_stemmer(stem); pqp->set_database(*_pdatabase); pqp->set_default_op(Query::OP_AND); //Set the
2024 Jan 08
1
Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints
...tts wrote: > I've restarted trac. I now created a pull request: https://github.com/xapian/xapian/pull/329 Should I create a trac issue, too? > Assuming the latter is valid, just removing this block (or removing the > parts of it which are Lu or Ll) should fix the problem as then > tokenisation will switch mode - I tried this and it fixes your case at > least: Removing the whole block will cause word-breaker to not correctly handle halfwidth Katakana, such as "??????????" which it would treat as a single term, whereas it should be two: ??????and ????). My pull request caus...