8rino-Luca Pantani
2007-Jul-26 13:40 UTC
[R] substituting dots in the names of the columns (sub, gsub, regexpr)
Dear R users,

I have two problems related to the functions sub, grep, regexpr and the like.

The header of the file(s) I have to import looks like this:

    c("y (m)", "BD (g/cm3)", "PR (Mpa)", "Ks (m/s)", "SP g./g.",
      "P (m3/m3)", "theta1 (g/g)", "theta2 (g/g)", "AWC (g/g)")

To get rid of spaces and symbols in the column names, I use read.table(... check.names=TRUE) and I get:

    str <- c("y..m.", "BD..g.cm3.", "PR..Mpa.", "Ks..m.s.", "SP.g..g.",
             "P..m3.m3.", "theta1..g.g.", "theta2..g.g.", "AWC..g.g.")

Now my problem is to remove the trailing dots and to collapse the double dots, so that the names become:

    c("y.m", "BD.g.cm3", "PR.Mpa", "Ks.m.s", "SP.g.g", "P.m3.m3",
      "theta1.g.g", "theta2.g.g", "AWC.g.g")

I've searched the help pages for sub, regexpr and the like, and also searched the help archives. I understand that the dot is a special character, since

    sub("..", ".", str)
    [1] "..m."        "...g.cm3."   "...Mpa."     "...m.s."     "..g..g."
    [6] "..m3.m3."    ".eta1..g.g." ".eta2..g.g." ".C..g.g."

Therefore I tried

    sub("\\..", ".", str)
    [1] "y.m."        "BD.g.cm3."   "PR.Mpa."     "Ks.m.s."     "SP...g."
    [6] "P.m3.m3."    "theta1.g.g." "theta2.g.g." "AWC.g.g."

and I was surprised by the (to me) strange behaviour whereby "SP.g..g." is changed into "SP...g.". This is the first problem I cannot solve.

Then there is the problem of removing the trailing dot. In
http://tolstoy.newcastle.edu.au/R/e2/help/07/01/8665.html
I found a somewhat similar problem, but that solution does not work in this case, since:

    gsub("[.].*", "", str)
    [1] "y"      "BD"     "PR"     "Ks"     "SP"     "P"      "theta1" "theta2"
    [9] "AWC"

And this is the second problem.

Apart from these particular problems, I would like to learn more about regexp, sub and so on, since each time I have strings to manipulate I must face my ignorance of regular expressions and their syntax.

Is there any page with examples where I can improve my knowledge and stop being frustrated each time I have to manipulate strings?

8rino

--
Ottorino-Luca Pantani, Università di Firenze
Dip. Scienza del Suolo e Nutrizione della Pianta
P.zle Cascine 28 50144 Firenze Italia
Tel 39 055 3288 202 (348 lab) Fax 39 055 333 273
OLPantani at unifi.it
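
The surprising results in the question can be traced by looking at what each pattern actually matches. The following is only a sketch: `first_match` is a hypothetical helper that does not appear in the thread, and `nms` is a shortened copy of the poster's `str` vector (renamed to avoid masking the base function `str`).

```r
# Shortened copy of the names from the question (assumption: three are enough
# to show the effect).
nms <- c("y..m.", "SP.g..g.", "theta1..g.g.")

# Hypothetical helper: return the first substring that a pattern matches.
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x)
  regmatches(x, m)
}

# An unescaped dot matches any character, so ".." is simply "any two
# characters" and sub() replaces the first two characters of each name.
first_match("..", nms)      # "y." "SP" "th"

# "\\.." is a literal dot followed by ANY one character. In "SP.g..g." the
# first such pair is ".g", so sub() swallows the "g" and leaves "SP...g.".
first_match("\\..", nms)    # ".." ".g" ".."

# "[.].*" is a literal dot followed by everything up to the end of the
# string, which is why gsub() removed all text after the first dot.
first_match("[.].*", nms)   # "..m." ".g..g." "..g.g."
```

Reading the matches this way suggests the fix: the pattern needs to describe a run of literal dots (and separately a dot at the end of the string), rather than "a dot and whatever follows it".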
Gabor Grothendieck
2007-Jul-26 14:06 UTC
[R] substituting dots in the names of the columns (sub, gsub, regexpr)
Use \\. or [.] inside the quoted pattern to denote a literal dot (#1), or use fixed = TRUE to switch off the special meaning of the dot (#2), or use a zero-width lookahead assertion (?=[.]), which has to match but is not included in the text being replaced (#3). Try ?regexpr. Also, the links on the gsubfn home page (http://code.google.com/p/gsubfn/) point to a number of good resources on regular expressions.

    Str <- c("y..m.", "BD..g.cm3.", "PR..Mpa.", "Ks..m.s.", "SP.g..g.",
             "P..m3.m3.", "theta1..g.g.", "theta2..g.g.", "AWC..g.g.")

    # 1 - collapse runs of dots, then strip trailing dots
    tmp <- gsub("[.]+", ".", Str)
    sub("[.]+$", "", tmp)

    # 2 - treat ".." as a fixed string, then strip trailing dots
    tmp <- gsub("..", ".", Str, fixed = TRUE)
    sub("[.]+$", "", tmp)

    # 3 - both done at once using zero-width lookahead
    gsub("[.]*$|[.]*(?=[.])", "", Str, perl = TRUE)
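
As a quick check of the three approaches in this reply, the sketch below runs them side by side on a few of the names and confirms that they agree. It then tries a made-up name with three consecutive dots ("a...b." is an assumption for illustration, not data from the thread) to show where approach #2 appears to differ, since a fixed ".." only replaces non-overlapping pairs.

```r
Str <- c("y..m.", "BD..g.cm3.", "SP.g..g.", "theta1..g.g.")

# 1: regular expression with a bracketed literal dot
a1 <- sub("[.]+$", "", gsub("[.]+", ".", Str))

# 2: fixed string replacement of ".." by "."
a2 <- sub("[.]+$", "", gsub("..", ".", Str, fixed = TRUE))

# 3: single pass with a zero-width lookahead (needs perl = TRUE)
a3 <- gsub("[.]*$|[.]*(?=[.])", "", Str, perl = TRUE)

# All three give the same result on these names.
stopifnot(identical(a1, a2), identical(a1, a3))
a1   # "y.m" "BD.g.cm3" "SP.g.g" "theta1.g.g"

# Made-up edge case: three dots in a row.
x <- "a...b."
b1 <- sub("[.]+$", "", gsub("[.]+", ".", x))              # "a.b"
b2 <- sub("[.]+$", "", gsub("..", ".", x, fixed = TRUE))  # "a..b"
```

For names produced by check.names on headers like the ones in the question, runs longer than two dots are unlikely, so all three routes should serve; the regular-expression versions are the safer choice if longer runs can occur.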
Felix Andrews
2007-Jul-26 14:15 UTC
[R] substituting dots in the names of the columns (sub, gsub, regexpr)
Hi,

A dot in a regular expression matches any character, so you have to escape each literal dot with a backslash. Inside an R string that is written \\, because the backslash itself has to be escaped (to confuse things...). A plus sign matches one or more repetitions of the preceding item. A dollar sign matches the end of the string. So:

    gsub("\\.$", "", gsub("\\.+", ".", str))
    [1] "y.m"        "BD.g.cm3"   "PR.Mpa"     "Ks.m.s"     "SP.g.g"     "P.m3.m3"    "theta1.g.g"
    [8] "theta2.g.g" "AWC.g.g"

Learn more at ?regexp

Felix

--
Felix Andrews
PhD candidate
Integrated Catchment Assessment and Management Centre
The Fenner School of Environment and Society
The Australian National University (Building 48A), ACT 0200
Beijing Bag, Locked Bag 40, Kingston ACT 2604
http://www.neurofractal.org/felix/
voice:+86_1051404394 (in China)
mobile:+86_13522529265 (in China)
mobile:+61_410400963 (in Australia)
xmpp:foolish.android at gmail.com
3358 543D AAC6 22C2 D336 80D9 360B 72DD 3E4C F5D8
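
For repeated use, the nested gsub call above could be wrapped in a small function and applied to the column names after import. This is a minimal sketch: `clean_names` is a hypothetical name, not something defined in the thread, and make.names() is used here only to imitate what read.table(check.names = TRUE) does to the raw headers.

```r
# Hypothetical helper built from the two substitutions suggested above.
clean_names <- function(x) {
  x <- gsub("\\.+", ".", x)   # collapse each run of dots into a single dot
  gsub("\\.$", "", x)         # then drop the one trailing dot, if present
}

# A few raw headers from the question.
raw <- c("y (m)", "BD (g/cm3)", "SP g./g.", "P (m3/m3)")

# make.names() replaces invalid characters with dots, as check.names does.
checked <- make.names(raw)
checked                # "y..m." "BD..g.cm3." "SP.g..g." "P..m3.m3."

cleaned <- clean_names(checked)
cleaned                # "y.m" "BD.g.cm3" "SP.g.g" "P.m3.m3"
```

With a data frame `d` returned by read.table, the same helper would be applied as `names(d) <- clean_names(names(d))`. Note that two different headers could collapse to the same cleaned name, so it may be worth checking the result with anyDuplicated().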