Dear all

Let's say I have a plain text file as follows:

> cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
+       "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
+       "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]"),
+     sep = "\n", file = "tmp.txt")

I would like to read this file into R and convert it into a data frame like this:

> DF <- data.frame(ID = c("001", "002", "003"),
+                  Writer = c("Steven Moffat", "Joss Whedon", "J. Michael Straczynski"),
+                  Rating = c("8.9", "8.8", "7.4"),
+                  Text = c("Doctor Who", "Buffy", "Babylon [5]"),
+                  stringsAsFactors = FALSE)

My initial thought was to use readLines on the text file and then some regular expressions and strsplit(..); but having confused myself after several attempts, I was wondering whether there is another way, perhaps using read.table instead? My end goal is to convert DF into an XML structure.

Thank you kindly in advance for your time,
Tony Breyal

# Windows Vista
> sessionInfo()
R version 2.11.0 (2010-04-22)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United Kingdom.1252
    LC_CTYPE=English_United Kingdom.1252
    LC_MONETARY=English_United Kingdom.1252
    LC_NUMERIC=C
    LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] XML_2.8-1

loaded via a namespace (and not attached):
[1] tools_2.11.0
Hi Tony,

On Thu, May 6, 2010 at 9:58 AM, Tony B <tony.breyal at googlemail.com> wrote:
> My initial thoughts were to use readLines on the text file and maybe
> do some regular expressions and also use strsplit(..); but having
> confused myself after several attempts I was wondering if there is a
> way, perhaps using maybe read.table instead? My end goal is to
> hopefully convert DF into an XML structure.

I can't think of an easy way to do it with a simple read.table call.

As you suggested, I'd whip this into shape by loading the file into a character vector with readLines and then using strsplit and regular expressions.

If your data really is this well behaved, why not split your lines on "]" and then do some mincing? For instance:

## Simulate a readLines on your file
lines <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")

## Create an empty data.frame
## (stringsAsFactors=FALSE keeps the character columns from becoming factors)
df <- data.frame(id=character(length(lines)),
                 writer=character(length(lines)),
                 rating=numeric(length(lines)),
                 text=character(length(lines)),
                 stringsAsFactors=FALSE)

pieces <- strsplit(lines, "]", fixed=TRUE)

## Store into their separate pieces for more processing
## (note: splitting on "]" also drops the final "]" of "Babylon [5]")
ids <- sapply(pieces, '[[', 1)
writers <- sapply(pieces, '[[', 2)
ratings <- sapply(pieces, '[[', 3)
texts <- sapply(pieces, '[[', 4)

## You can use regexes again, or strsplit judiciously
## (the writers still carry leading/trailing spaces after this step)
clean.ids <- sapply(strsplit(ids, ' '), '[', 2)
clean.writers <- sapply(strsplit(writers, ':', fixed=TRUE), '[', 2)
...

Honestly, if your data isn't all that well behaved, I'd probably do this in another language like Python to whip it into a "cleaner" tab-separated file that can easily be read into R. I tend to like Python's matching behavior with regexes a bit better ...

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
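
For reference, here is one way the split-and-clean idea above might be carried through to a finished data frame. It is only a sketch under assumptions: the helper names trim and field are made up for this example, every line is assumed to carry exactly three bracketed tags before the title, and the title is taken with sub() rather than from the fourth strsplit piece, because splitting on "]" would also cut the closing bracket off "Babylon [5]".

```r
## Simulate a readLines on the file from the original post
lines <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")

## Hypothetical helper: strip leading and trailing whitespace
trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)

## Split on "]" as suggested; pieces 1 to 3 hold the tags
pieces <- strsplit(lines, "]", fixed = TRUE)

## Hypothetical helper: take the i-th piece of every line,
## drop everything up to the first colon, then trim
field <- function(i) trim(sub("^[^:]*:", "", sapply(pieces, "[", i)))

## The title is whatever follows the first three "[...]" tags,
## so brackets inside the title survive
text <- trim(sub("^(\\[[^]]*\\][[:space:]]*){3}", "", lines))

DF <- data.frame(ID = field(1),
                 Writer = field(2),
                 Rating = field(3),
                 Text = text,
                 stringsAsFactors = FALSE)
print(DF)
```

Rating is left as character to match the data frame asked for in the original post; as.numeric(DF$Rating) would convert it if numbers are wanted.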
Try this:

> cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
+       "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
+       "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ]Babylon"),
+     sep = "\n", file = "tmp.txt")
>
> # read in the data and parse it assuming it has the same structure
> input <- readLines('tmp.txt')
> # parse it item by item
> x.id <- sub(".*\\[ID: ([[:digit:]]+).*", "\\1", input)
> x.writer <- sub(".*\\[Writer:([^]]+).*", '\\1', input)
> x.rating <- sub(".*\\[Rating: ([0-9.]+).*", '\\1', input)
> x.prog <- sub(".*\\](.*)", '\\1', input)
> #create dataframe
> data.frame(id=x.id, writer=x.writer, rating=x.rating, prog=x.prog)
   id                  writer rating       prog
1 001           Steven Moffat    8.9 Doctor Who
2 002             Joss Whedon    8.8      Buffy
3 003  J. Michael Straczynski    7.4    Babylon
>

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
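
If a single pattern is preferred over four separate sub() calls, something along these lines may work. It is a sketch rather than a tested recipe for real data: it assumes every line has exactly the [ID: ...] [Writer: ...] [Rating: ...] layout shown in the thread, it writes the sample to tempfile() with writeLines() instead of to "tmp.txt", and it relies on regexec(), regmatches(), paste0() and trimws(), which are base R functions but only in releases newer than the 2.11.0 from the original post. The names path, pattern and fields are chosen for the example.

```r
## Write the sample data to a temporary file (assumption: tempfile(), not "tmp.txt")
path <- tempfile(fileext = ".txt")
writeLines(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
             "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
             "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]"),
           path)
input <- readLines(path)

## One capture group per field; "[^]]*" means "anything up to the next ]".
## The last group takes the rest of the line, so "[5]" in a title is kept.
pattern <- paste0("^\\[ID:([^]]*)\\][[:space:]]*",
                  "\\[Writer:([^]]*)\\][[:space:]]*",
                  "\\[Rating:([^]]*)\\]",
                  "(.*)$")

## regmatches() returns, per line, the full match followed by the four groups;
## a line that does not match yields NA in every field
m <- regmatches(input, regexec(pattern, input))
fields <- t(sapply(m, function(x) trimws(x[2:5])))

DF <- data.frame(ID = fields[, 1],
                 Writer = fields[, 2],
                 Rating = fields[, 3],
                 Text = fields[, 4],
                 stringsAsFactors = FALSE)
print(DF)
```

Because the title is captured as "everything after the third tag", this also copes with the "]Babylon" variant where no space follows the rating tag.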