Hello List,

I have a dataset consisting of strings that I want to split while saving the delimiter.

Some example data:
“leucocyten + gramnegatieve staven +++ grampositieve staven ++”
“leucocyten – grampositieve coccen +”

I want to split the strings such that I get the following result:
c(“leucocyten +”, “gramnegatieve staven +++”, “grampositieve staven ++”)
c(“leucocyten –”, “grampositieve coccen +”)

I have tried strsplit with a regular expression with a positive lookahead, but I am not able to achieve the results that I want.

I have tried:
as.list(strsplit(x, split = “(?=[\\+-]{1,3}\\s)+”, perl=TRUE))

Which results in:
c(“leucocyten ”, “+”, “gramnegatieve staven ”, “+”, “+”, “+”, “grampositieve staven ++”)
c(“leucocyten ”, “–”, “grampositieve coccen +”)

Is there a function or regular expression that will make this possible?

Kind regards,
Emily
This seems to do the job but there are probably more elegant solutions:

f <- function(s) { sub("^ ","",unlist(strsplit(gsub("\\+ ","+@ ",s),"@"))) }
g <- function(s) { sub("^ ","",unlist(strsplit(gsub("- ","-@ ",s),"@"))) }
h <- function(s) { g(f(s)) }

To try it out:

s <- "leucocyten + gramnegatieve staven +++ grampositieve staven ++"
t <- "leucocyten - grampositieve coccen +"
h(s)
h(t)

HTH,
Eric

On Wed, Apr 12, 2023 at 7:56 PM Emily Bakker <emilybakker at outlook.com> wrote:
> [...]
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
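One possible way to fold the f/g/h helpers above into a single step is to capture the sign in a group and swap the space after it for a marker. This is only a sketch: `split_keep` and its `marker` argument are names made up for illustration, it assumes the marker character never occurs in the data, and unlike h() it returns a list with one element per input string.

```r
# Hypothetical helper (not from the thread): replace the space that follows
# a '+' or '-' with a marker, then split on that marker.
# Assumes the marker ("@" by default) never appears in the data.
split_keep <- function(s, marker = "@") {
  # "\\1" puts back the captured sign; the space is replaced by the marker
  marked <- gsub("([+-]) ", paste0("\\1", marker), s)
  # fixed = TRUE treats the marker as a literal string, not a regex
  strsplit(marked, marker, fixed = TRUE)
}

x <- c(
  "leucocyten + gramnegatieve staven +++ grampositieve staven ++",
  "leucocyten - grampositieve coccen +"
)
res <- split_keep(x)
print(res)
```

Note that this splits after any '-' that is followed by a space, so a name ending in a hyphen would also be cut at that point.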
On Wed, 12 Apr 2023 08:29:50 +0000
Emily Bakker <emilybakker at outlook.com> wrote:

> Some example data:
> “leucocyten + gramnegatieve staven +++ grampositieve staven ++”
> “leucocyten – grampositieve coccen +”
>
> I want to split the strings such that I get the following result:
> c(“leucocyten +”, “gramnegatieve staven +++”,
> “grampositieve staven ++”)
> c(“leucocyten –”, “grampositieve coccen +”)
>
> I have tried strsplit with a regular expression with a positive
> lookahead, but I am not able to achieve the results that I want.

It sounds like you need positive look-behind, not look-ahead: split on spaces only if they _follow_ one to three of '+' or '-'. Unfortunately, repetition quantifiers like {n,m} or + are not directly supported in look-behind expressions (nor in Perl itself). As a special case, you can use \K, where anything to the left of \K is a zero-width positive match:

x <- c(
 'leucocyten + gramnegatieve staven +++ grampositieve staven ++',
 'leucocyten - grampositieve coccen +'
)
strsplit(x, '[+-]{1,3}+\\K ', perl = TRUE)
# [[1]]
# [1] "leucocyten +" "gramnegatieve staven +++"
# "grampositieve staven ++"
#
# [[2]]
# [1] "leucocyten -" "grampositieve coccen +"

-- 
Best regards,
Ivan

P.S. It looks like your e-mail client has transformed every quote character into typographically-correct Unicode quotes “” and every minus into an en dash, which makes it slightly harder to work with your code, since typographically correct Unicode quotes are not R string delimiters. Is it really – that you'd like to split upon, or is it -?
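Since every run of '+' or '-' ends in one of those two characters, a fixed-length look-behind on just the last character may be enough for this data. The following is a minimal sketch rather than a replacement for the \K pattern above; it assumes ASCII '+' and '-' and a single space after each delimiter, and the variable name `parts` is only illustrative.

```r
x <- c(
  "leucocyten + gramnegatieve staven +++ grampositieve staven ++",
  "leucocyten - grampositieve coccen +"
)

# A one-character look-behind has a fixed length, so PCRE accepts it:
# split on a space only when the character just before it is '+' or '-'.
# The look-behind itself consumes nothing, so the sign stays in the piece.
parts <- strsplit(x, "(?<=[+-]) ", perl = TRUE)
print(parts)
```

Unlike '[+-]{1,3}+\\K ', this version does not limit the run to three signs; it only checks the character directly before the space.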
I thought replacing the spaces following instances of +++,++,+,- with "\n" and then reading with scan should succeed. Like Ivan Krylov I was fairly sure that you meant the minus sign to be "-" rather than "–", but perhaps you were using MS Word as an editor, which is inconsistent with effective use of R. If so, learn to use a proper programming editor, and in any case learn to post to rhelp in plain text.

-- 
David

scan(text=gsub("([-+]){1}\\s", "\\1\n", dat), what="", sep="\n")
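When `dat` holds several strings, a single scan() call reads all the pieces into one flat character vector. If one result per input string is wanted, as in the original question, the same idea can be applied string by string. A hedged sketch: `dat` is assumed to be the two example strings with ASCII signs, and `split_with_scan` is a hypothetical helper name.

```r
# Assumed input: the two example strings from the question, with ASCII '-'
dat <- c(
  "leucocyten + gramnegatieve staven +++ grampositieve staven ++",
  "leucocyten - grampositieve coccen +"
)

# Hypothetical helper: turn the whitespace after a '+' or '-' into a
# newline, then let scan() read the text one line per element.
split_with_scan <- function(s) {
  scan(text = gsub("([-+])\\s", "\\1\n", s),
       what = "", sep = "\n", quiet = TRUE)
}

# One character vector per input string
res <- lapply(dat, split_with_scan)
print(res)
```

Here quiet = TRUE only suppresses the "Read N items" message that scan() prints by default.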