Immanuel
2010-Jul-07 16:25 UTC
[R] use sliding window to count substrings found in large string
Hello everyone,

I'm looking for advice on how to do some tests on strings. What I want to do is the following (just an example; the real strings/sequences are about 200-400 characters long).

Given a set of strings:

String1 abcdefgh
String2 bcdefgop

use a sliding window of size x to create a vector of all subsequences of size x found in the set (order matters!).

Now create, for every string in the set, a vector containing the counts of how often each subsequence was found in that particular string.

It would be great if someone could give me a rough outline of how to start and which functions would work. I did read through the man pages and googled a lot, but I still don't know how to approach this.

Best regards,
Immanuel
Gabor Grothendieck
2010-Jul-07 16:50 UTC
[R] use sliding window to count substrings found in large string
On Wed, Jul 7, 2010 at 12:25 PM, Immanuel <mane.desk at googlemail.com> wrote:
> Hello everyone,
>
> I'm looking for advice on how to do some tests on strings.
> What I want to do is the following:
>
> (just an example; the real strings/sequences are about 200-400 characters long)
> Given a set of strings:
>
> String1 abcdefgh
> String2 bcdefgop
>
> use a sliding window of size x to create a vector of all subsequences
> of size x found in the set (order matters!).
>
> Now create, for every string in the set, a vector containing the counts
> of how often each subsequence was found in that particular string.
>
> It would be great if someone could give me a rough outline of how to
> start and which functions would work.
> I did read through the man pages and googled a lot, but I still don't know
> how to approach this.

Try this:

# generate an input string n long
set.seed(123)
n <- 300
lets <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")

# get rolling k-length sequences and count
# (substring() is vectorised: window i runs from position i to i+k-1)
k <- 3
table(substring(lets, 1:(n-k+1), k:n))
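As a small illustration of what the substring() call above produces, here is a minimal sketch applied to the first short example string from the question. The helper name kmers is made up for this illustration and does not appear in the thread; the guard for strings shorter than the window is also an added assumption.

```r
# Hypothetical helper (not from the thread): all length-k windows of one string.
kmers <- function(s, k) {
  n <- nchar(s)
  if (n < k) return(character(0))  # assumption: no window fits, return nothing
  # substring() is vectorised over start and stop positions,
  # so window i covers characters i .. i+k-1
  substring(s, 1:(n - k + 1), k:n)
}

w <- kmers("abcdefgh", 3)
print(w)         # "abc" "bcd" "cde" "def" "efg" "fgh"
print(table(w))  # each window occurs once in this string
```

The guard matters because, without it, 1:(n - k + 1) would count downwards for a string shorter than k and produce meaningless windows.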
Gabor Grothendieck
2010-Jul-07 17:26 UTC
[R] use sliding window to count substrings found in large string
On Wed, Jul 7, 2010 at 1:25 PM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> On Wed, Jul 7, 2010 at 1:15 PM, Immanuel <mane.desk at googlemail.com> wrote:
>> Hey,
>>
>> big help, thanks!
>> One little question remains, if I create
>> more than one string and table ...
>> ---------------------
>>
>> # generate an input string n long
>> set.seed(123)
>> n <- 300
>> lets_1 <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")
>> lets_2 <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")
>>
>> # get rolling k-length sequences and count
>> k <- 3
>> table_1 <- table(substring(lets_1, 1:(n-k+1), k:n))
>> table_2 <- table(substring(lets_2, 1:(n-k+1), k:n))
>> -----------------------
>>
>> is it possible to manipulate table_1 so that it contains zero entries
>> for all the substrings found in table_2 but not in table_1?
>>
>> best regards
>> Immanuel
>
> Turn them into factors with the appropriate levels before counting
> them with table:
>
> # generate an input string n long
> set.seed(123)
> n <- 300
> lets_1 <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")
> lets_2 <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")
>
> # get rolling k-length sequences and count
> k <- 3
> s1 <- substring(lets_1, 1:(n-k+1), k:n)
> s2 <- substring(lets_2, 1:(n-k+1), k:n)
> levs <- sort(unique(union(s1, s2)))
> table(factors(s1, levs))
> table(factors(s2, levs))

That should be factor, not factors:

table(factor(s1, levs))
table(factor(s2, levs))
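The shared-levels idea extends from two strings to a whole set. The following is only a sketch under stated assumptions: the function name count_windows and the choice to return a matrix (one row per string, one column per window) are illustrative and not from the thread. It uses the two short example strings from the original question.

```r
# Hypothetical helper (not from the thread): one row of counts per input string,
# one column per window seen anywhere in the set.
count_windows <- function(strings, k) {
  # sliding windows for each string (empty if the string is shorter than k)
  wins <- lapply(strings, function(s) {
    n <- nchar(s)
    if (n < k) character(0) else substring(s, 1:(n - k + 1), k:n)
  })
  # common set of levels across all strings
  levs <- sort(unique(unlist(wins)))
  # factor() with shared levels makes table() report 0 for absent windows
  rows <- lapply(wins, function(w) as.vector(table(factor(w, levels = levs))))
  counts <- do.call(rbind, unname(rows))
  dimnames(counts) <- list(names(strings), levs)
  counts
}

m <- count_windows(c(String1 = "abcdefgh", String2 = "bcdefgop"), 3)
print(m)
# String1 has a zero count for "fgo" and "gop";
# String2 has a zero count for "abc" and "fgh".
```

Building the rows with rbind rather than sapply keeps the result a matrix even when only one distinct window exists.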
Immanuel
2010-Jul-07 17:45 UTC
[R] use sliding window to count substrings found in large string
Hey,

saved my day. Now I can watch the football semi-final.

Thanks

> Turn them into factors with the appropriate levels before counting
> them with table:
>
> # generate an input string n long
> set.seed(123)
> n <- 300
> lets_1 <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")
> lets_2 <- paste(sample(letters[1:5], n, replace = TRUE), collapse = "")
>
> # get rolling k-length sequences and count
> k <- 3
> s1 <- substring(lets_1, 1:(n-k+1), k:n)
> s2 <- substring(lets_2, 1:(n-k+1), k:n)
> levs <- sort(unique(union(s1, s2)))
> table(factor(s1, levs))
> table(factor(s2, levs))