thr3ads.net - R help - [R] reading formatted txt file into a data frame [May 2010]

Tony B

2010-May-06 13:58 UTC

[R] reading formatted txt file into a data frame

Dear all

Let's say I have a plain text file as follows:

> cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
+       "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
+       "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]"),
+     sep = "\n", file = "tmp.txt")

I would somehow like to read this file into R and convert it into a
data frame like this:

> DF <- data.frame(ID = c("001", "002", "003"),
+                  Writer = c("Steven Moffat", "Joss Whedon",
+                             "J. Michael Straczynski"),
+                  Rating = c("8.9", "8.8", "7.4"),
+                  Text = c("Doctor Who", "Buffy", "Babylon [5]"),
+                  stringsAsFactors = FALSE)


My initial thoughts were to use readLines on the text file and maybe
do some regular expressions and also use strsplit(..); but having
confused myself after several attempts I was wondering if there is a
way, perhaps using maybe read.table instead?  My end goal is to
hopefully convert DF into an XML structure.

Thank you kindly in advance for your time,
Tony Breyal

# Windows Vista
> sessionInfo()
R version 2.11.0 (2010-04-22)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United Kingdom.1252
    LC_CTYPE=English_United Kingdom.1252
    LC_MONETARY=English_United Kingdom.1252
    LC_NUMERIC=C
    LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] XML_2.8-1

loaded via a namespace (and not attached):
[1] tools_2.11.0

Steve Lianoglou

2010-May-06 16:14 UTC

Hi Tony,

On Thu, May 6, 2010 at 9:58 AM, Tony B <tony.breyal at googlemail.com> wrote:
> My initial thoughts were to use readLines on the text file and maybe
> do some regular expressions and also use strsplit(..); but having
> confused myself after several attempts I was wondering if there is a
> way, perhaps using maybe read.table instead? My end goal is to
> hopefully convert DF into an XML structure.

I can't think of an easy way to do it with a simple read.table call.

As you suggested, I'd try to whip this into shape by loading into a
character vector using "readLines" / strsplit / regular expression.

If your data is so well behaved, why not try splitting your lines by
"]", then do some mincing.

For instance:

## Simulate a readLines on your file
lines <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")

## Create an empty data.frame (character columns kept as strings)
df <- data.frame(id=character(length(lines)),
                 writer=character(length(lines)),
                 rating=numeric(length(lines)),
                 text=character(length(lines)),
                 stringsAsFactors=FALSE)

pieces <- strsplit(lines, "]", fixed=TRUE)

## Store into their separate pieces for more processing
## (note: splitting on "]" also cuts the closing bracket off "Babylon [5]")
ids <- sapply(pieces, '[[', 1)
writers <- sapply(pieces, '[[', 2)
ratings <- sapply(pieces, '[[', 3)
texts <- sapply(pieces, '[[', 4)

## You can use regexes again, or strsplit judiciously
clean.ids <- sapply(strsplit(ids, ' '), '[', 2)
clean.writers <- sapply(strsplit(writers, ':', fixed=TRUE), '[', 2)
...
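
For readers who want to see the "use regexes again" route carried through to a data frame, here is a minimal sketch. It swaps the strsplit step for a single pattern applied with regexec() and regmatches() from base R (available in reasonably recent R versions, not the 2.11.0 used in this thread), so that a "]" inside the free text survives. The pattern, the trim() helper and the variable names are assumptions made for illustration, not part of the original reply, and the sketch assumes every field appears in the order shown and that at least one line matches.

```r
lines <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")

# One pattern, four capture groups: ID, Writer, Rating, then the free text.
pattern <- "^\\[ID:([^]]*)\\] *\\[Writer:([^]]*)\\] *\\[Rating:([^]]*)\\] *(.*)$"

# Per line: the full match followed by the four groups,
# or character(0) when the line does not match the pattern.
m <- regmatches(lines, regexec(pattern, lines))
matched <- sapply(m, length) == 5
if (!all(matched)) {
  warning("unparsed lines: ", paste(which(!matched), collapse = ", "))
}

# Stack the matched lines into a character matrix (column 1 is the full match).
fields <- do.call(rbind, m[matched])

# Hypothetical helper: strip leading and trailing whitespace.
trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)

DF <- data.frame(ID     = trim(fields[, 2]),
                 Writer = trim(fields[, 3]),
                 Rating = trim(fields[, 4]),
                 Text   = trim(fields[, 5]),
                 stringsAsFactors = FALSE)
print(DF)
```

Keeping every column as character mirrors the data frame asked for in the original post; Rating can be converted with as.numeric() afterwards if numbers are wanted.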

Honestly, if your data isn't all that well behaved, I'd probably do
this in another language like Python to whip it into a "cleaner"
tab-separated file that can easily be read into R. I tend to like
Python's matching behavior with regexes a bit better ...
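
Whatever tool does the cleaning, the R side of that tab-separated route might look like the sketch below. The cleaning step itself is not shown: the `cleaned` data frame simply stands in for its output, and the temporary file name is an assumption for illustration. The one point worth noting is colClasses: without it, read.delim() would read the ID column as numbers and drop the leading zeros.

```r
# Stand-in for the output of an external cleaning step (assumed, not shown).
cleaned <- data.frame(ID     = c("001", "002", "003"),
                      Writer = c("Steven Moffat", "Joss Whedon",
                                 "J. Michael Straczynski"),
                      Rating = c("8.9", "8.8", "7.4"),
                      Text   = c("Doctor Who", "Buffy", "Babylon [5]"),
                      stringsAsFactors = FALSE)

# Write a plain tab-separated file with a header row and no quoting.
tsv <- tempfile(fileext = ".tsv")
write.table(cleaned, file = tsv, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back; colClasses keeps ID as text ("001", not 1) and
# turns Rating into a number. quote = "" disables quote handling.
DF <- read.delim(tsv,
                 colClasses = c("character", "character", "numeric", "character"),
                 quote = "", stringsAsFactors = FALSE)
str(DF)
```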

-steve
-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

jim holtman

2010-May-06 16:24 UTC

Try this:

> cat(c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
+       "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
+       "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ]Babylon"),
+     sep = "\n", file = "tmp.txt")
> # read in the data and parse it assuming it has the same structure
> input <- readLines('tmp.txt')
> # parse it item by item
> x.id <- sub(".*\\[ID: ([[:digit:]]+).*", "\\1", input)
> x.writer <- sub(".*\\[Writer:([^]]+).*", '\\1', input)
> x.rating <- sub(".*\\[Rating: ([0-9.]+).*", '\\1', input)
> x.prog <- sub(".*\\](.*)", '\\1', input)
> # create dataframe
> data.frame(id=x.id, writer=x.writer, rating=x.rating, prog=x.prog)
   id                   writer rating        prog
1 001           Steven Moffat     8.9  Doctor Who
2 002             Joss Whedon     8.8       Buffy
3 003  J. Michael Straczynski     7.4     Babylon
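
Two details of the transcript above are easy to miss: the writer and programme values keep their surrounding spaces, and the third record was entered as "Babylon" rather than "Babylon [5]", because the greedy ".*\\](.*)" pattern would stop at the last "]" and return an empty string for the original text. The following is a small sketch of one way to adjust for both, keeping the same sub() style. The trim() helper and the rewritten x.prog pattern are assumptions added for illustration, not part of the original reply, and the input is defined inline instead of being read from tmp.txt.

```r
# The three records as given in the original post, including "[5]".
input <- c("[ID: 001 ] [Writer: Steven Moffat ] [Rating: 8.9 ] Doctor Who",
           "[ID: 002 ] [Writer: Joss Whedon ] [Rating: 8.8 ] Buffy",
           "[ID: 003 ] [Writer: J. Michael Straczynski ] [Rating: 7.4 ] Babylon [5]")

# Hypothetical helper: strip leading and trailing whitespace.
trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)

x.id     <- sub(".*\\[ID: ([[:digit:]]+).*", "\\1", input)
x.writer <- trim(sub(".*\\[Writer:([^]]+).*", "\\1", input))
x.rating <- as.numeric(sub(".*\\[Rating: ([0-9.]+).*", "\\1", input))

# Remove exactly three leading "[...]" fields; whatever remains is the
# programme text, so brackets inside it are left alone.
x.prog   <- trim(sub("^(\\[[^]]*\\][[:space:]]*){3}", "", input))

result <- data.frame(id = x.id, writer = x.writer, rating = x.rating,
                     prog = x.prog, stringsAsFactors = FALSE)
print(result)
```

This still assumes every line starts with exactly three bracketed fields in the same order; a line that deviates from that layout is passed through unchanged by sub() and would need a separate check.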

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
