I am trying to process text fields scanned in from a csv file that is output from the Windows database program FileMakerPro. The characters onscreen look like regular text, but R does not like their underlying binary form. For example, one of the text fields contains a name and a number, but R recognizes the number as something other than what it appears to be in plain text. The character string "Draszt 03", after being read into R using scan and ="", becomes "Draszt 03" where the 3 is displayed in my R session as a superscript. Here is the result pasted into this email I'm composing in emacs:

  "Draszt 0%/1???iso8859-15??"

Another clue for the knowledgeable: when I try to display the vector element causing trouble, I get

  <CHARSXP: "Draszt 0%/1???iso8859-15??">

where again the superscript part is just "3" in my R session. I'm working in Linux, R version 1.9.1, 2004-06-21. Your help will be much appreciated.

Scott Waichler
Pacific Northwest National Laboratory
scott.waichler at pnl.gov
Assuming that the problem is that your input file has additional embedded characters added by the database program, you could try extracting just the text using the UNIX strings program:

  strings myfile.csv > myfile.txt

and see if myfile.txt works with R. If not, check what the differences are between it and the .csv file.

Date: Thu, 14 Oct 2004 11:31:33 -0700
From: Scott Waichler <scott.waichler at pnl.gov>
To: <r-help at stat.math.ethz.ch>
Subject: [R] Problem with number characters
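For readers who would rather stay inside R, the idea behind the strings suggestion can be roughly sketched as below. This is only an approximation under stated assumptions: unlike strings, it does not require runs of four or more printable characters; it simply drops every byte outside printable ASCII, tab, and newline. The helper name strip_nonprintable and the demo file contents are hypothetical, not from the thread.

```r
# Hypothetical helper (not from the thread): read a file as raw bytes and
# keep only printable ASCII (0x20-0x7E) plus tab (9) and newline (10).
strip_nonprintable <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  codes <- as.integer(bytes)
  keep <- (codes >= 32L & codes <= 126L) | codes %in% c(9L, 10L)
  rawToChar(bytes[keep])
}

# Demo on a temporary file with an embedded vertical tab (^K, byte 0x0B).
# The file contents are an assumption made up for illustration.
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("Draszt 0"), as.raw(0x0b), charToRaw("3\n")), tmp)

cleaned <- strip_nonprintable(tmp)
print(cleaned)
unlink(tmp)
```

Comparing the cleaned text with the original file then shows which bytes were dropped. Note that this also discards legitimate accented characters, so it suits a quick diagnosis better than a permanent fix.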
Gabor wrote:
> Assuming that the problem is that your input file has
> additional embedded characters added by the data base
> program you could try extracting just the text using
> the UNIX strings program:
>
>   strings myfile.csv > myfile.txt

Spencer wrote:
> "strsplit" can break character strings into single
> characters, and "%in%" can be used to classify them.

The first suggestion helped me identify and remove some of the embedded characters, namely "^K". Many more remained hidden. The second suggestion gave me the idea of splitting the string on whitespace first and seeing whether the embedded character problem would go away along with the "blank" spaces. It did. In the snippet below, x is the character variable I am trying to process:

  # Split on runs of whitespace, then rejoin the pieces with single spaces
  str.vec <- strsplit(x, "\\s+", perl=TRUE)[[1]]
  if (length(str.vec) > 0) {
    x <- paste(str.vec, collapse=" ")
    # Trim any leading and trailing whitespace left over
    x <- gsub("^\\s+", "", x, perl=TRUE)
    x <- gsub("\\s+$", "", x, perl=TRUE)
  }

There were no problems in processing x thereafter. Thank you, gentlemen.

Scott Waichler
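Spencer's suggestion, taken literally, might look something like the sketch below: split the string into single characters with strsplit, then use %in% to classify each one against a whitelist. The sample string (with an embedded vertical tab standing in for "^K") and the whitelist are assumptions chosen for illustration; the real data in this thread may contain other hidden characters.

```r
# Sample input is an assumption: "\v" is a vertical tab (^K, byte 0x0B)
x <- "Draszt 0\v3"

# Break the string into single characters
chars <- strsplit(x, "")[[1]]

# Whitelist of characters considered harmless (an illustrative choice)
allowed <- c(letters, LETTERS, 0:9, " ")

# Classify each character, then inspect and drop the suspicious ones
is_ok <- chars %in% allowed
suspicious <- chars[!is_ok]
cleaned <- paste(chars[is_ok], collapse = "")

print(suspicious)
print(cleaned)
```

A whitelist is deliberately conservative: anything not listed is reported, which makes hidden characters visible instead of silently passing them through.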
Hi Scott,

What's the result of running the Linux "file" command on your input file? Does it give "ISO-8859 text" or something else?

Example:

  [bobby at thor bobby]$ file test2.txt
  test2.txt: ISO-8859 text

Best regards,
Bobby
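If file does report ISO-8859 text, one possible follow-up in R is to convert the strings explicitly with iconv and then deal with the odd character by name. The sketch below assumes the troublesome byte is 0xB3, which is SUPERSCRIPT THREE in ISO-8859-15; the thread does not confirm this, and the sample bytes are made up for illustration.

```r
# Assumed input: "Draszt 0" followed by byte 0xB3 (superscript three in
# ISO-8859-15). This byte value is a guess, not confirmed by the thread.
x <- rawToChar(c(charToRaw("Draszt 0"), as.raw(0xb3)))

# Convert from the assumed source encoding to UTF-8
x_utf8 <- iconv(x, from = "ISO-8859-15", to = "UTF-8")

# Inspect the code points; 179 (U+00B3) would be the superscript three
print(utf8ToInt(x_utf8))

# Replace the superscript three with an ordinary digit
repaired <- sub("\u00b3", "3", x_utf8, fixed = TRUE)
print(repaired)
```

Printing the code points first is a cheap way to check the guess about the encoding before replacing anything.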