thr3ads.net - search: "tokenisation"

Displaying 20 results from an estimated 30 matches for "tokenisation".

puppet-cleaner: makes puppet DSL code comply with a subset of the style guide

2013 Apr 16

FWIW, I've written puppet-cleaner to help me make thousands of lines of puppet 2.6 DSL code comply with the puppet 2.7 style guide and expectations. I'm uploading it to github today for anyone to use. https://github.com/santana/puppet-cleaner Externally, you run puppet-clean file.pp and it can transform this: /* multiline comment trailing white space here -> */ class

Positive experiences with Xapian

2011 Aug 02

Hi Guys, I just wanted to take a moment to give some positive feedback regarding my experiences with Xapian recently. I've been doing a fair amount of research into search engines recently, as we have some fairly specific requirements with what we're attempting to do with them. Long story short, after a few weeks of playing around with just about everything under the sun (or at least,

Query Parser, filenames and compound words

2005 Dec 30

When I submit a filename to the query parser it breaks it up Example: /home/user/file_name.ext becomes Xapian::Query((home:(pos=1) PHRASE 5 user:(pos=2) PHRASE 5 file:(pos=3) PHRASE 5 name:(pos=4) PHRASE 5 ext:(pos=5))) which does not find the document. If I do an single term query not using the query parser then I find the document. The Query Parser also breaks up hyphenated terms
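
To illustrate the effect described in this post, here is a minimal sketch of how splitting on non-word characters turns a path into five separate positional terms. It is not Xapian's actual tokeniser; the function name `split_into_word_terms` and the splitting rule (any run of non-alphanumeric ASCII characters is a separator) are assumptions chosen only to reproduce the terms shown in the quoted query.

```python
import re


def split_into_word_terms(text):
    """Split text on runs of non-alphanumeric characters (an assumed rule,
    not Xapian's real one) and pair each lowercased piece with a 1-based position."""
    pieces = [piece.lower() for piece in re.split(r"[^A-Za-z0-9]+", text) if piece]
    return list(enumerate(pieces, start=1))


terms = split_into_word_terms("/home/user/file_name.ext")
for position, term in terms:
    print(f"{term}:(pos={position})")
```

Under this assumed rule a document indexed with the whole path as a single term would not match any of the five pieces, which is consistent with the mismatch the poster reports.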

problem with searching plurals (with apostrophe)

2007 Nov 16

hello guys, i am using acts_as_ferret plugin(0.4.1 Latest) with ferret gem(0.11.4 Latest) on rails 1.2.5 and ruby 1.8.6(UBUNTU Gutsy) i have this :Stores Model acts_as_ferret :fields => {:name => { :boost => 2 ,:store => :yes}, :short_desc => { :boost => 1.5,:store => :yes }, :tag_list => {:boost => 1

Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

2024 Jan 07

...s findings includes this block of full- and half-width latin forms, coupled with an assumption that there's no lowercase vs uppercase forms in these alphabets. Assuming the latter is valid, just removing this block (or removing the parts of it which are Lu or Ll) should fix the problem as then tokenisation will switch mode - I tried this and it fixes your case at least: diff --git a/xapian-core/queryparser/word-breaker.cc b/xapian-core/queryparser/word-breaker.cc index 8108523ccd53..4fabc23f4b56 100644 --- a/xapian-core/queryparser/word-breaker.cc +++ b/xapian-core/queryparser/word-breaker.cc @@ -10...

Starting work on Perf Test Module

2014 May 14

Hello, I am beginning work on the perf test module. The initial steps that I aim to accomplish are: -> Download the Wikipedia dumps for multiple languages. -> Write Python scripts to tokenize the dump (will probably use something like nltk, which has powerful built-in tokenizers) -> Discuss and finalize the design of the search and query expansion perf tests as I want to complete them

Error with service: "invalid byte sequence in US-ASCII"

2013 Feb 18

I just built a new puppet master, and whenever I run puppet on it, it throws an error while processing a service resource: # puppet agent -t > Info: Retrieving plugin > Info: Caching catalog for i-45dc2b1d > Info: Applying configuration version 'g > 9ea47ad19bc706a754c00f00a024309948d3ea03' > Error: /Stage[main]/Ipa::Client::Basic/Service[sssd]: Could not

[Bug 3474] New: ssh_config can escape double quotes with a backslash

2022 Sep 22

https://bugzilla.mindrot.org/show_bug.cgi?id=3474 Bug ID: 3474 Summary: ssh_config can escape double quotes with a backslash Product: Portable OpenSSH Version: v9.0p1 Hardware: Other OS: Linux Status: NEW Severity: enhancement Priority: P5 Component: ssh Assignee:

Stopword addition and stemming

2010 Nov 15

Hi, Two questions which I'm unsure about: Stemming: I've turned on stemming, etc, but how can I confirm that it's being used in searches? What should I look/search for? Stopwords: I'm trying out xapian on a regional dataset (searching data from a *.co.us TLD, eg) . I've noticed that searching for [bob co.us] results in *very* slow search times (tens of seconds), since it

IdentityFile patch

2002 Jan 27

By the way, I noticed in the previous IdentityFile patch I forgot to expand tilde. I fixed this by making the change in ssh.c instead of readconf.c, which is probably where it belongs, as far as the existing code is concerned: diff -ur openssh-3.0.2p1/auth.c openssh-3.0.2p1I/auth.c --- openssh-3.0.2p1/auth.c Sun Nov 11 17:06:07 2001 +++ openssh-3.0.2p1I/auth.c Sun Jan 27 12:05:14 2002 @@ -44,7

LLVM Backend for a platform with no (normal) stack

2018 Dec 17

Not only do FPGAs not support recursion, we don’t even support calls! All user code must be inlined into one kernel/component, which is then used to create HDL for the FPGA. Mark From: Bruce Hoult <brucehoult at sifive.com> Sent: December 17, 2018 9:28 AM To: Mendell, Mark P <mark.p.mendell at intel.com> Cc: jjones at prc-hsv.com; LLVM Developers Mailing List <llvm-dev at

[PATCH] Add user-dependent IdentityFile to OpenSSH-3.0.2p1

2002 Jan 27

Here is a patch to allow private key files to be placed system wide (for all users) in a secure (non-NFS) mounted location on systems where home directories are NFS mounted. This is especially important for users who use blank passphrases rather than ssh-agent (a good example of where this is necessary is for tunnelling lpd through ssh on systems that run lpd as user lp). IdentityFile now accepts

Pull requests: CJK words and Snippet generator

2016 Sep 19

...ity of the algorithms. > I'm not quite clear what your "n" above is - n is the number of terms in a document. I haven't done systematic testing of wall-clock time for the new feature. If it is crucial to go ahead with the patch, I could create a couple of benchmarks. > The tokenisation of the snippet uses the same code as indexing does, so > CJK should just work automatically, though it looks like there aren't > currently any testcases for this, so it would be worth checking (and > worth adding some) > > Normalisation could perhaps be done with a custom stemmi...

Ask for advice on exact requirements to fix #699 mixed CJK numbers

2019 Mar 07

I am working on "#699 Better tokenisation of mixed CJK numbers", and have implemented a partial patch of Chinese for this ticket. Current code works well with special test cases and all tests in xapian-core could still pass. But I'm confused with exact requirements of the question, for how much we could pay with performance on en...

[Patch] User-dependent IdentityFile

2003 Jan 18

Here is the user-dependent IdentityFile patch for openssh3.5 (BSD version), which allows private key files to be placed system wide (for all users) in a secure (non-NFS) mounted location. This addresses an important security hole on systems where home directories are NFS mounted, particularly if there are users who use blank passphrases (or when lpd is tunneled through ssh on systems running lpd

Pull requests: CJK words and Snippet generator

2016 Sep 07

On Tue, Sep 6, 2016, at 09:16, Olly Betts wrote: > I think my main concerns are about efficiency (since that's a major > motivation for the current implementation, so slowing it down would be > annoying), and whether we can just make this the standard behaviour > rather than adding an option. The current implementation is O(n) and I took care to keep it at that. For the proposed term

[klibc:update-dash] parser: Fix backquote support in here-document EOF mark

2019 Jan 25

...: Fri, 25 Jan 2019 02:57:21 +0000 [klibc] parser: Fix backquote support in here-document EOF mark Currently using backquotes in a here-document EOF mark is broken because dash tries to do command substitution on it. This patch fixes it by checking whether we're looking for an EOF mark during tokenisation. Reported-by: Harald van Dijk <harald at gigawatt.nl> Signed-off-by: Herbert Xu <herbert at gondor.apana.org.au> Signed-off-by: Ben Hutchings <ben at decadent.org.uk> --- usr/dash/parser.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/usr/dash/parser.c b/usr/dash/par...

[klibc:update-dash] dash: parser: Fix backquote support in here-document EOF mark

2020 Mar 28

...kquote support in here-document EOF mark [ dash commit c166b718b496da63c4df7a0972df2fc6cd38256b ] Currently using backquotes in a here-document EOF mark is broken because dash tries to do command substitution on it. This patch fixes it by checking whether we're looking for an EOF mark during tokenisation. Reported-by: Harald van Dijk <harald at gigawatt.nl> Signed-off-by: Herbert Xu <herbert at gondor.apana.org.au> Signed-off-by: Ben Hutchings <ben at decadent.org.uk> --- usr/dash/parser.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/usr/dash/parser.c b/usr/dash/par...

Query parser and stemming of norwegian letters

2005 Jun 09

Hello, can I get an explanation of the following. Running the following code: .... pqp=new QueryParser(); Stem stem("norwegian"); cout << "DEBUG " << stem.stem_word(_sXapian)<< endl; pqp->set_stemmer(stem); pqp->set_database(*_pdatabase); pqp->set_default_op(Query::OP_AND); //Set the

Possible bug using FLAG_WORD_BREAKS with fullwidth Unicode codepoints

2024 Jan 08

...tts wrote: > I've restarted trac. I now created a pull request: https://github.com/xapian/xapian/pull/329 Should I create a trac issue, too? > Assuming the latter is valid, just removing this block (or removing the > parts of it which are Lu or Ll) should fix the problem as then > tokenisation will switch mode - I tried this and it fixes your case at > least: Removing the whole block will cause word-breaker to not correctly handle halfwidth Katakana, such as "??????????" which it would treat as a single term, whereas it should be two: ??????and ????). My pull request caus...
