I'm new to R and very excited about its possibilities. But I'm struggling with some very simple things, probably because I haven't found the correct documentation. Here's a simple example which illustrates several of my problems.

Suppose I want to match a regexp against each string in a vector, and return all the matching substrings.

    regexp <- "[ab]+"
    strlist <- c( "abc", "dbabddadd", "aaa" )
    matches <- gregexpr(regexp,strlist)

With this input, I'd want to return list( list("ab"), list("bab", "a"), list("aaa") ).

Now the matches object prints out as

    [[1]]
    [1] 1
    attr(,"match.length")
    [1] 2

    [[2]]
    [1] 2 7
    attr(,"match.length")
    [1] 3 1

    [[3]]
    [1] 1
    attr(,"match.length")
    [1] 3

which, if I'm interpreting this correctly, means that it is a list (not a vector, because vectors can only have atomic elements) of three elements, each of which is a vector of integers (the matching positions) with an attribute match.length (the length of the corresponding match), which is in turn a vector of integers.

Question: is there a more compact standard print format for this? It's a bit disconcerting that printing the 2x2 list list(list(1,2),list(3,4)) takes 16 lines while the corresponding 2x2 array takes 2 lines! (I guess that arrays are "more native").

Now, matches[[1]], the first element of matches, describes the matches in the first string. To extract those strings, I can write

    substr( strlist[[1]],
            matches[[1]],
            attr(matches[[1]],"match.length")+matches[[1]]-1 )

which correctly gives "ab".

Question: This looks awfully clumsy; is there some more idiomatic way to do this, in particular to refer to the match.length attribute without using a quoted string or the attr function? attributes(matches[[1]])$match.length and attributes(matches[[1]])[[1]] work, but seem even clumsier.

Question: R uses names like xxx.yyy in many places. Is this just a convention to represent spaces (the way most languages use "_"), or is there some semantics attached to "."?
Question: Is it good practice in R to treat a string as a vector of characters so that R's powerful vector operations can be used on it? How would I do that?

Now suppose I want to list *all* the matches in matches[[2]]. I try:

    substr( strlist[[2]],
            matches[[2]],
            attr(matches[[2]],"match.length")+matches[[2]]-1 )

but only get the first one, so it seems that the recycling rule for vectors doesn't apply here (same thing with [2] instead of [[2]]). Where does recycling apply and not apply?

Question: Is there some operator (using promises?) to make strlist[[2]] into a (lazy) infinite vector/list?

Now suppose I want to list *all* the matches in all the strings. How would I do that? The naive way, substr(strlist,matches, ...) doesn't work, partly because the attr operator doesn't distribute over lists (I see why it can't, but...).

Thanks in advance for your patience with these very elementary questions,

-s

Stavros Macrakis, Cambridge, MA
Try this:

    regexp <- "[ab]+"
    strlist <- c( "abc", "dbabddadd", "aaa" )

    library(gsubfn)
    s <- strapply(strlist, regexp)
    s

    # compactly show the first few elements of each component
    str(s)

See the gsubfn home page at http://gsubfn.googlecode.com

On Sun, Aug 10, 2008 at 5:00 PM, Stavros Macrakis <macrakis at alum.mit.edu> wrote:
> I'm new to R and very excited about its possibilities. [...]
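For readers who would rather not install a package, the same result can be sketched in base R. This assumes regmatches() is available; as far as I know it was added to base R after this thread (R 2.14.0 onward), so it is an alternative to the strapply() call above and not what the poster used. The variable name found is purely illustrative.

```r
# Base-R sketch: extract every match of a pattern from each string.
regexp  <- "[ab]+"
strlist <- c("abc", "dbabddadd", "aaa")

# gregexpr() returns one integer vector of start positions per string,
# each carrying a "match.length" attribute.
matches <- gregexpr(regexp, strlist)

# regmatches() combines the strings with that match data and returns a
# list holding one character vector of matched substrings per input string.
# (Assumption: R >= 2.14.0, where regmatches() exists.)
found <- regmatches(strlist, matches)

# str() prints a compact summary of a nested object, roughly one line
# per component, instead of the long default list printout.
str(found)
```

Here found[[2]] holds the two matches from "dbabddadd", namely "bab" and "a". str() also works directly on matches and shows the match.length attribute alongside each position vector.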
Suppose I want to have a regexp match against a string, and return all the matching substrings in a vector of strings.

    regexp <- "[ab]+"
    strlist <- c( "abc", "dbabddadd", "aaa" )
    matches <- gregexpr(regexp,strlist)

With this input, I'd want to return list( list("ab"), list("bab", "a"), list("aaa") ).

Now the matches object prints out as

    [[1]]
    [1] 1
    attr(,"match.length")
    [1] 2

    [[2]]
    [1] 2 7
    attr(,"match.length")
    [1] 3 1

    [[3]]
    [1] 1
    attr(,"match.length")
    [1] 3

which, if I'm interpreting this correctly, means that it is a list (not a vector, because vectors can only have atomic elements) of three elements, each of which is a vector of integers (the matching positions) with an attribute match.length (the length of the corresponding match), which is in turn a vector of integers.

==Question: is there a more compact standard print format for this? It's a bit disconcerting that printing the 2x2 list list(list(1,2),list(3,4)) takes 16 lines while the corresponding 2x2 array takes 2 lines! (I guess that arrays are "more native").

Here is one way:

    > (mat <- t(sapply(matches,function(x)
    +     list(start.index=`attributes<-`(x,NULL),
    +          match.length=attr(x,"match.length")))))
         start.index match.length
    [1,] 1           2
    [2,] Integer,2   Integer,2
    [3,] 1           3

The object returned by this function is a 3x2 matrix of mode "list": each cell of the matrix holds a list element, which can itself be a vector.

    > mat[2,1]
    $start.index
    [1] 2 7

    > mat[[2,1]]
    [1] 2 7

also, see below...

==Now, matches[[1]], the first element of matches, describes the matches in the first string. To extract those strings, I can write

    substr( strlist[[1]],
            matches[[1]],
            attr(matches[[1]],"match.length")+matches[[1]]-1 )

which correctly gives "ab".

Question: This looks awfully clumsy; is there some more idiomatic way to do this, in particular to refer to the match.length attribute without using a quoted string or the attr function? attributes(matches[[1]])$match.length and attributes(matches[[1]])[[1]] work, but seem even clumsier.
Check out the gsubfn package. I'm still learning it myself, but it may provide the functionality you seek. For instance, I believe what you are trying to accomplish is

    > strapply(strlist,regexp,identity)

or

    > strapply(strlist,regexp,c)
    [[1]]
    [1] "ab"

    [[2]]
    [1] "bab" "a"

    [[3]]
    [1] "aaa"

==Question: R uses names like xxx.yyy in many places. Is this just a convention to represent spaces (the way most languages use "_"), or is there some semantics attached to "."?

In many examples that I have seen, programmers have used "." in place of the traditional "_" because "_" used to be an assignment operator in earlier versions of R. Now "_" is no longer an assignment operator, and its use in variable names is permitted as well.

The "." notation also plays a role in R's implementation of OOP. R has two object-oriented approaches: S3 and S4. In both, methods are associated with generic functions rather than with the object itself (which I understand is similar to Lisp's CLOS). For S3 methods, the name function.objectclass means that the generic "function" is to be applied to objects of class "objectclass". For instance, print() is a generic function:

    > print.regexp <- function(x)
    +     for(i in seq(along=x))
    +         cat(i,":", x[[i]], "| match.length =",
    +             attr(x[[i]],"match.length"),"\n")
    > class(matches) <- "regexp"
    > print(matches)
    1 : 1 | match.length = 2
    2 : 2 7 | match.length = 3 1
    3 : 1 | match.length = 3

You can assign a class (or classes) to each object; this information is used to make method dispatch decisions for generic functions. For S3 there is no checking of consistency between an object's class and its attributes; S4 is a more formal implementation of OOP in R. Check out

    (S3) http://www-128.ibm.com/developerworks/linux/library/l-r3.html
    (S4) http://developer.r-project.org/howMethodsWork.pdf

The first reference also mentions how to implement infinite sequences in R, which may answer part of your question below.
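To make the generic.class naming rule concrete beyond print(), here is a minimal sketch of a user-defined S3 generic. The names describe and matchset are hypothetical and chosen only for illustration; UseMethod(), structure() and gregexpr() are base R.

```r
# A hypothetical generic: UseMethod() picks a method based on class(x).
describe <- function(x, ...) UseMethod("describe")

# Fallback method, used when no class-specific method is found.
describe.default <- function(x, ...) "some object"

# Method for the hypothetical class "matchset". The dot here is what
# ties the method to the generic: <generic>.<class>.
describe.matchset <- function(x, ...) {
  paste(length(x), "match set(s)")
}

# Tag a gregexpr() result with the hypothetical class.
m <- structure(gregexpr("[ab]+", c("abc", "dbabddadd", "aaa")),
               class = "matchset")

res_classed <- describe(m)     # dispatches to describe.matchset
res_default <- describe(1:3)   # no integer method, so describe.default runs

# Outside of dispatch, a dot is just another character in a name.
match.length <- 3
```

Because dispatch relies only on the naming pattern, a function such as t.test() can look like a method for t() even though it is not one, which is one reason some style guides prefer "_" in ordinary names.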
==Question: Is it good practice in R to treat a string as a vector of characters so that R's powerful vector operations can be used on it? How would I do that?

I'm sure it can be done by defining your own objects and methods, but it's not done out-of-the-box (that I'm aware of). I believe the most common string operations used by R users are extraction and concatenation; these are achieved with substr(), substring() and paste(), rather than "[", c(), or "+", as you seem to have figured out. In my experience, R's standard objects and functions for string-like objects are immediately convenient for manipulating file and variable names, but not necessarily for hard-core text processing.

==Now suppose I want to list *all* the matches in matches[[2]]. I try:

    substr( strlist[[2]],
            matches[[2]],
            attr(matches[[2]],"match.length")+matches[[2]]-1 )

but only get the first one, so it seems that the recycling rule for vectors doesn't apply here (same thing with [2] instead of [[2]]). Where does recycling apply and not apply?

I don't know if there's a hard rule for that (though I usually expect recycling to work for mathematical operators and plotting functions), but in this case I hope the strapply() function above will solve your problem. Otherwise, an inelegant way would be to use Map() or mapply():

    > mapply(function(x,y) substr( strlist[[2]],x,y),
    +        matches[[2]],
    +        attr(matches[[2]],"match.length")+matches[[2]]-1)
    [1] "bab" "a"

==Question: Is there some operator (using promises?) to make strlist[[2]] into a (lazy) infinite vector/list?

Like an iterator? There is some mention of infinite sequences in the IBM DeveloperWorks article above, but I've personally never tried implementing one in R.

Hope this helps,

Satoshi
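A short base-R sketch may help with the recycling and string-as-vector questions raised in this message. It relies on documented behaviour of substr(), substring() and strsplit(): substr() returns one result per element of its first argument, while substring() recycles all of its arguments to the length of the longest. The variable names (starts, stops, all_matches, chars, and so on) are illustrative only.

```r
strlist <- c("abc", "dbabddadd", "aaa")
matches <- gregexpr("[ab]+", strlist)

starts <- matches[[2]]
stops  <- starts + attr(starts, "match.length") - 1

# substr(): the result has the same length as the first argument,
# so one string in means one substring out.
one <- substr(strlist[[2]], starts, stops)

# substring(): the text is recycled to match the longer start/stop vectors.
all_in_second <- substring(strlist[[2]], starts, stops)

# The same idea applied to every string, indexing both objects in step.
# Caveat: for a string with no match, gregexpr() reports -1, which this
# sketch does not handle specially.
all_matches <- lapply(seq_along(strlist), function(i) {
  m <- matches[[i]]
  substring(strlist[[i]], m, m + attr(m, "match.length") - 1)
})

# Treating a string as a vector of characters: split, operate, rejoin.
chars      <- strsplit(strlist[[2]], "")[[1]]
n_d        <- sum(chars == "d")
rev_string <- paste(rev(chars), collapse = "")
```

With these inputs, one is "bab", all_in_second is c("bab", "a"), and rev_string is "ddaddbabd". Once a string is split this way, ordinary vector tools such as rev(), table() and logical indexing apply to its characters.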