T.Wunder at stud.uni-heidelberg.de
2010-Mar-01 14:45 UTC
[R] file reading /problems with encoding
Hello,

I'm a little frightened by a problem that occurred recently when I tried to read in an XML file (in order to replace some variables in the string with values from a data frame). The biggest problem is the encoding of the XML file. Since it is generated by Word 2007, its encoding is UTF-8 (as can be seen in the XML header).

I'm establishing a file connection with

> channel <- file(filename, open="r+", encoding="UTF-8")
> ## filename = name of the file

To read the whole file, I'm using the readLines() function as follows:

> t <- readLines(channel, n=-1, warn=F, encoding="UTF-8")

Finally, I'm merging the lines of this character vector with the following:

> xml <- ""
> for(i in 1:length(t)) {
>   xml <- paste(xml, t[i], sep="")
> }

(is there a better way of doing this?)

However, when I execute those lines, I get a warning like:
"incorrect input in the input-connection"
When I look at the output variable xml, the cause is fairly clear: the string stops at a combination of Chinese or Japanese characters (which normally shouldn't be a problem for UTF-8 encoding).

So that is the problem. How can I read the whole XML file into R as a string? I need the correct encoding, because I want to grep for special characters like "?".

Thank you for your help!

Kind regards, Tom

p.s. I'm not likely to use the XML-package, since I didn't want to parse the xml file :)
On 01.03.2010 15:45, T.Wunder at stud.uni-heidelberg.de wrote:
> Hello,
>
> I'm a little frightened by a problem that occurred recently when I
> tried to read in an XML file (in order to replace some variables in the
> string with values from a data frame). The biggest problem is the
> encoding of the XML file. Since it is generated by Word 2007, its
> encoding is UTF-8 (as can be seen in the XML header).
>
> I'm establishing a file connection with
>> channel <- file(filename, open="r+", encoding="UTF-8")
>> ## filename = name of the file
>
> To read the whole file, I'm using the readLines() function as follows:
>> t <- readLines(channel, n=-1, warn=F, encoding="UTF-8")
>
> Finally, I'm merging the lines of this character vector with the following:
>> xml <- ""
>> for(i in 1:length(t)) {
>>   xml <- paste(xml, t[i], sep="")
>> }

You can do the merging without a loop:

xml <- paste(t, collapse="")

As for the other problem you are reporting: can you make (the relevant part of) your file available (say, on some web site) so that we can test what is going on?

Best,
Uwe Ligges

> (is there a better way of doing this?)
>
> However, when I execute those lines, I get a warning like:
> "incorrect input in the input-connection"
> When I look at the output variable xml, the cause is fairly clear: the
> string stops at a combination of Chinese or Japanese characters (which
> normally shouldn't be a problem for UTF-8 encoding).
>
> So that is the problem. How can I read the whole XML file into R as a
> string? I need the correct encoding, because I want to grep for special
> characters like "?".
>
> Thank you for your help!
>
> Kind regards, Tom
>
> p.s.
> I'm not likely to use the XML-package, since I didn't want to parse the xml file :)
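
For reference, one possible way to get the whole file into a single string is sketched below. It rests on an assumption the thread does not confirm: that the warning comes from the connection re-encoding the UTF-8 input into a native encoding that cannot represent the Chinese or Japanese characters. Under that assumption, the sketch skips encoding= on the connection and only lets readLines() mark the strings as UTF-8. The helper name read_utf8_string and the temporary demo file are made up for illustration.

```r
# Hypothetical helper (not from the thread): return a UTF-8 file as one string.
read_utf8_string <- function(path) {
  # Read the lines as they are and only mark them as UTF-8.
  # No encoding= is set on the connection, so nothing is re-encoded.
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8")
  # Merge all lines without a loop, as suggested in the reply above.
  paste(lines, collapse = "")
}

# Self-contained demo: a temporary file with an XML header and CJK text.
tmp <- tempfile(fileext = ".xml")
demo <- c("<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
          "<w:t>\u65e5\u672c\u8a9e</w:t>")
# useBytes = TRUE writes the UTF-8 bytes unchanged.
writeLines(enc2utf8(demo), tmp, useBytes = TRUE)

xml <- read_utf8_string(tmp)

# Search for a literal non-ASCII character in the merged string.
found <- grepl("\u65e5", xml, fixed = TRUE)
```

With a real file, the call would simply be xml <- read_utf8_string(filename). Whether this removes the warning in the original setup depends on the assumption above, so testing it against the actual file is still needed.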