jessica.gervais at tudor.lu
2008-Feb-12 14:30 UTC
[R] regular expression for na.strings / read.table
Dear all,

I am working with a csv file. Some of the data in the file are not valid, and they are marked with a star '*'. For example: *789.

I have attached to this email an example file (test.txt) that looks like the data I have to work with.

I see 2 possibilities, neither of which I can get to work in R:

1 - First and easiest solution:
Read the data with read.csv in R, and define as NA strings all cells containing a star (*). Something which would look like this...

> DATA<-read.csv("test.txt",na.strings=list(length(grep("\\*",DATA,value=T))==0))
> DATA
  X1 X.789 LNM. X78 X56  X89 X56.1 X100
1  2   700  AUW  78  56   89    56  100
2  3   400  TOC  78  56   89    56   10
3  4   389  RMN  78  56   89    56  *89
4  5   400  LNM  78  56 *452    56  100
5  6   200  UTC  78 *40   89    56  100
6  7   100  GAT  78  56    8    56 *100
7  8    79 *LNM  78  56    9    56  100
8  9    89  TCG  78  56  800    56 *100
9 10   78*  LNM  78  56   89    56  100

...but which would actually work (the stars are still there)! Does anyone know how to do that?

2 - Second solution:
- first read the file with DATA<-read.csv("test.txt")
- then replace all fields containing a * with NA by applying the following function to the object DATA:

DATA_cleaned<-apply(DATA,c(1,2),function(x){if(length(grep("\\*",x,value=TRUE))==1){x<-NA}})
DATA_cleaned
     X1   X.789 LNM. X78  X56  X89  X56.1 X100
[1,] NULL NULL  NULL NULL NULL NULL NULL  NULL
[2,] NULL NULL  NULL NULL NULL NULL NULL  NULL
[3,] NULL NULL  NULL NULL NULL NULL NULL  NA
[4,] NULL NULL  NULL NULL NULL NA   NULL  NULL
[5,] NULL NULL  NULL NULL NA   NULL NULL  NULL
[6,] NULL NULL  NULL NULL NULL NULL NULL  NA
[7,] NULL NULL  NA   NULL NULL NULL NULL  NULL
[8,] NULL NULL  NULL NULL NULL NULL NULL  NA
[9,] NULL NA    NULL NULL NULL NULL NULL  NULL

The stars have disappeared, but so has everything else! The problem comes from the fact that, if a field does not contain any *, the command if(length(grep("\\*",x,value=T))==1) returns NULL instead of FALSE!

If you have any idea, please let me know!

Many thanks,

Jessica

____________________________________

Jessica Gervais
Mail: jessica.gervais at tudor.lu

Resource Centre for Environmental Technologies,
Public Research Centre Henri Tudor,
Technoport Schlassgoart,
66 rue de Luxembourg,
P.O. BOX 144,
L-4002 Esch-sur-Alzette, Luxembourg

(See attached file: test.txt)
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: test.txt
Url: https://stat.ethz.ch/pipermail/r-help/attachments/20080212/b67d1cbd/attachment.txt

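A note on why both attempts in the question fail, with a minimal sketch of one possible repair. na.strings is compared literally against each field, so it cannot hold a pattern. In the second attempt, an if with no else branch evaluates to NULL whenever its condition is FALSE, which is where all the NULL cells come from. The sample data below is an assumed stand-in for test.txt (column names are invented), and f is a hypothetical helper used only to show the NULL behaviour.

```r
# Assumed stand-in for the attached test.txt; column names are illustrative.
csv_text <- c(
  "id,value,code",
  "1,700,AUW",
  "2,*452,TOC",
  "3,78*,*LNM"
)
# Read everything as character so the starred fields survive unchanged.
DATA <- read.csv(text = csv_text, colClasses = "character")

# An if without else evaluates to NULL when the condition is FALSE.
f <- function(x) if (grepl("*", x, fixed = TRUE)) NA
no_star_result <- f("700")    # NULL, not "700"

# Vectorised alternative: build a logical mask over all cells,
# then assign NA through the mask.
m <- as.matrix(DATA)
starred <- grepl("*", m, fixed = TRUE)
dim(starred) <- dim(m)        # grepl() drops the matrix dimensions
DATA[starred] <- NA
```

After this, the columns are still character; as.numeric() or type.convert() can be applied per column if numeric types are wanted.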
Using brute force you can do something like:

my.df<-read.table(stdin(),head=T,sep=",")
X1,X.789,LNM.,X78,X56,X89,X56.1,X100
1,2,700,AUW,78,56,89,56,100
2,3,400,TOC,78,56,89,56,10
3,4,389,RMN,78,56,89,56,*89
4,5,400,LNM,78,56,*452,56,100
5,6,200,UTC,78,*40,89,56,100
6,7,100,GAT,78,56,8,56,*100
7,8,79,*LNM,78,56,9,56,100
8,9,89,TCG,78,56,800,56,*100
9,10,78*,LNM,78,56,89,56,100

X56.fix.index<-grep("\\*",my.df$X56)
my.df$X56[X56.fix.index]<-NA
# go through as.character() so that a factor column gives its values, not its level codes
my.df$X56<-as.numeric(as.character(my.df$X56))

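One caveat when converting a cleaned column back to numeric: if read.table has stored the column as a factor (the default before R 4.0.0), as.numeric() returns the internal level codes rather than the printed values. The small sketch below uses made-up values to show the difference; going through as.character() first is safe for both factor and character columns.

```r
# Made-up column holding one starred value, stored as a factor.
x <- factor(c("56", "*40", "8", "100"))
x[grep("\\*", x)] <- NA

codes  <- as.numeric(x)                # level codes, not the data
values <- as.numeric(as.character(x))  # the values as printed

print(codes)
print(values)
```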
Henrique Dallazuanna
2008-Feb-12 15:07 UTC
[R] regular expression for na.strings / read.table
as.data.frame(sapply(DATA, function(x){x[grep(pattern="\\*", x)]<-NA;x}))

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O

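A possible variant of the column-wise idea above, sketched under the assumption that the numeric columns should end up numeric again. It uses lapply() so that each column is handled separately, and type.convert() to re-detect the type once the starred entries are gone. The data and the helper name star_to_na are made up for the example.

```r
# Made-up sample in the spirit of test.txt; column names are illustrative.
csv_text <- c(
  "X1,X789,LNM,X100",
  "1,700,AUW,100",
  "2,400,*LNM,*89",
  "3,78*,TOC,10"
)
DATA <- read.csv(text = csv_text, colClasses = "character")

# Hypothetical helper: blank out starred entries, then let R re-detect
# the column type (integer, numeric, or character).
star_to_na <- function(col) {
  col[grepl("*", col, fixed = TRUE)] <- NA
  type.convert(col, as.is = TRUE)
}

# DATA[] keeps the data frame structure while replacing every column.
DATA[] <- lapply(DATA, star_to_na)
str(DATA)
```

Reading with colClasses = "character" is a design choice here: it avoids factors entirely, so the same code behaves the same way on old and new R versions.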
Here is one way of doing it:

> # read the file in as lines, do the convert and then re-read
> x <- readLines(textConnection(" X1 X.789 LNM. X78 X56 X89 X56.1 X100
+ 1 2 700 AUW 78 56 89 56 100
+ 2 3 400 TOC 78 56 89 56 10
+ 3 4 389 RMN 78 56 89 56 *89
+ 4 5 400 LNM 78 56 *452 56 100
+ 5 6 200 UTC 78 *40 89 56 100
+ 6 7 100 GAT 78 56 8 56 *100
+ 7 8 79 *LNM 78 56 9 56 100
+ 8 9 89 TCG 78 56 800 56 *100
+ 9 10 78* LNM 78 56 89 56 100"))
> x.c <- gsub("\\*[[:alnum:]]*|[[:alnum:]]*\\*", "NA", x)
> x.new <- read.table(textConnection(x.c), header=TRUE)
> closeAllConnections()
>
> x.new
  X1 X.789 LNM. X78 X56 X89 X56.1 X100
1  2   700  AUW  78  56  89    56  100
2  3   400  TOC  78  56  89    56   10
3  4   389  RMN  78  56  89    56   NA
4  5   400  LNM  78  56  NA    56  100
5  6   200  UTC  78  NA  89    56  100
6  7   100  GAT  78  56   8    56   NA
7  8    79 <NA>  78  56   9    56  100
8  9    89  TCG  78  56 800    56   NA
9 10    NA  LNM  78  56  89    56  100

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
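The same read, substitute, re-read idea can probably be applied directly to a comma-separated file on disk. The sketch below is only an illustration: the file contents are invented and written to a temporary file, and the pattern assumes that no field is quoted or contains an embedded comma.

```r
# Invented stand-in for test.txt, written to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "X1,X789,LNM,X100",
  "1,700,AUW,100",
  "2,400,*LNM,*89",
  "3,78*,TOC,10"
), path)

raw <- readLines(path)

# Replace every comma-delimited field that contains a star with the
# literal text NA, which read.csv() treats as missing by default.
cleaned <- gsub("[^,]*\\*[^,]*", "NA", raw)

DATA <- read.csv(text = cleaned)
unlink(path)
print(DATA)
```

Because the substitution happens before parsing, read.csv() sees clean numeric columns and assigns their types itself, so no conversion step is needed afterwards.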