thr3ads.net - R help - [R] extracting data using strings as delimiters [Sep 2007]


lucy b

2007-Sep-25 20:39 UTC

[R] extracting data using strings as delimiters

Dear List,

I have an ascii text file with data I'd like to extract. Example:

Year Built:  1873 Gross Building Area:  578 sq ft
Total Rooms:  6 Living Area:  578 sq ft

There is a lot of data I'd like to ignore in each record, so I'm
hoping there is a way to use strings as delimiters to get the data I
want (e.g. tell R to take data between "Built:" and "Gross";
incidentally, the data is not always numeric). I think an ugly way would be to
start at the end of each record and use a substitution expression to
chip away at it, but I'm afraid it will take forever to run. Is there
a way to use strings as delimiters in an expression?

Thanks in advance for ideas.

LB

Katharine Mullen

2007-Sep-25 21:07 UTC


Have you seen help(strsplit)?
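
As a minimal sketch of what strsplit() could look like here, assuming the record is already in a character string (the variable names `line`, `parts` and `year_built` are made up for illustration): the split pattern is a regular expression, so the two delimiter strings can be joined with `|`.

```r
# One record from the original post
line <- "Year Built:  1873 Gross Building Area:  578 sq ft"

# Split wherever either delimiter string occurs (the pattern is a regex)
parts <- strsplit(line, "Built:|Gross")[[1]]

# parts[1] is the text before "Built:", parts[2] the text between the delimiters
year_built <- trimws(parts[2])
year_built
```

This only works as intended when each delimiter occurs once per line; with repeated delimiters the positions in `parts` shift.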


Marc Schwartz

2007-Sep-25 21:15 UTC

I don't know that any of the default base functions for reading files
(such as read.table) enable the use of a regex as a delimiter. If your
text file is consistent in the use of the colon ':' as a separator, you
might be able to use that. Each of the above lines would then be broken
into 3 fields using:

> DF <- read.table("YourFile.txt", sep = ":")
> DF
           V1                         V2          V3
1  Year Built   1873 Gross Building Area   578 sq ft
2 Total Rooms              6 Living Area   578 sq ft


You could then parse them further using appropriate functions if needed,
such as gsub():
> as.data.frame(lapply(DF[, -1], function(x) gsub("[^0-9]", "", x)))
    V2  V3
1 1873 578
2    6 578


This now gives you the numeric data in two columns. For further
processing you would need to know that the rows follow some predictable
order, for example alternating between the two kinds of line. See ?gsub
and ?regex for more information.
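
A rough sketch of that further processing, assuming the lines strictly alternate between a "Year Built" line and a "Total Rooms" line. The data frame `num` below is a hypothetical digits-only result for two records, typed in by hand rather than read from a file, and the column names in `records` are invented for illustration.

```r
# Hypothetical digits-only result for two records (four input lines)
num <- data.frame(V2 = c("1873", "6", "1873", "6"),
                  V3 = c("578", "578", "578", "578"),
                  stringsAsFactors = FALSE)

# Assumption: odd rows are "Year Built" lines, even rows are "Total Rooms" lines
odd  <- seq(1, nrow(num), by = 2)
even <- seq(2, nrow(num), by = 2)

# One row per record, with the values converted to numbers
records <- data.frame(
  year_built  = as.numeric(num$V2[odd]),
  gross_area  = as.numeric(num$V3[odd]),
  total_rooms = as.numeric(num$V2[even]),
  living_area = as.numeric(num$V3[even])
)
records
```

If a record is missing one of its lines, the odd/even assumption breaks and the columns would be misaligned, so it is worth checking the line count first.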

Hope that provides some help. You also might want to look at ?readLines
and ?strsplit as other ways to read in the data and then post-process it
once in an R object.

Marc Schwartz

jim holtman

2007-Sep-25 21:25 UTC


Here is one way.  You can set up a list of the patterns to match
against and then apply it to the lines.  I am not sure what the rest
of the text file looks like, but this will return all the values that
match.
> x <- readLines(textConnection("Year Built:  1873 Gross Building Area:  578 sq ft
+ Total Rooms:  6 Living Area:  578 sq ft
+ Year Built:  1873 Gross Building Area:  578 sq ft
+ Total Rooms:  6 Living Area:  578 sq ft"))
>
> # list for pattern matches
> m.list <- list(year=".*Year Built:(.*)Gross.*",
+     Buildarea=".*Building Area:(.*)sq ft.*",
+     rooms=".*Rooms:(.*)Liv.*",
+     Livingarea=".*Living Area:(.*)sq ft.*")
>
> # use lapply to process the patterns and return a list with the name of the
> # pattern and its value
> lapply(names(m.list), function(.pat){
+     # see which lines have the desired patterns
+     whichLines <- grep(m.list[[.pat]], x)
+     if (length(whichLines) > 0){
+         return(list(pattern=.pat, values=sub(m.list[[.pat]], "\\1", x[whichLines])))
+     }
+     else return(NULL)
+ })
[[1]]
[[1]]$pattern
[1] "year"

[[1]]$values
[1] "  1873 " "  1873 "


[[2]]
[[2]]$pattern
[1] "Buildarea"

[[2]]$values
[1] "  578 " "  578 "


[[3]]
[[3]]$pattern
[1] "rooms"

[[3]]$values
[1] "  6 " "  6 "


[[4]]
[[4]]$pattern
[1] "Livingarea"

[[4]]$values
[1] "  578 " "  578 "
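
The values above still carry their surrounding spaces. As a possible follow-up, here is a sketch that reuses the same patterns but trims the captured text and binds the result into a data frame. It assumes every pattern matches the same number of lines, which holds for this sample but may not hold for a real file; `values` and `result` are names chosen for illustration.

```r
x <- c("Year Built:  1873 Gross Building Area:  578 sq ft",
       "Total Rooms:  6 Living Area:  578 sq ft",
       "Year Built:  1873 Gross Building Area:  578 sq ft",
       "Total Rooms:  6 Living Area:  578 sq ft")

m.list <- list(year = ".*Year Built:(.*)Gross.*",
               Buildarea = ".*Building Area:(.*)sq ft.*",
               rooms = ".*Rooms:(.*)Liv.*",
               Livingarea = ".*Living Area:(.*)sq ft.*")

# For each pattern, keep only the matching lines and trim the captured text
values <- lapply(m.list, function(pat) {
  hits <- grep(pat, x, value = TRUE)
  trimws(sub(pat, "\\1", hits))
})

# Equal-length character vectors can be bound into one data frame
result <- as.data.frame(values, stringsAsFactors = FALSE)
result
```

Because lapply() runs over the named list itself, the pattern names carry through as list names and then as column names.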





-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

(Ted Harding)

2007-Sep-25 22:20 UTC

The scope of what you're trying to achieve is not clear,
though on the basis of your examples above you'd have to
use a different separator pattern for each type of line.

For your first example, a simple method is on the lines of

gsub(".*Built:" , "",
     "Year Built:  1873 Gross Building Area:  578 sq ft")
[1] "  1873 Gross Building Area:  578 sq ft"

and then just take the first white-space-delimited field
from the result.
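
The second step might be sketched as follows, assuming the wanted value is a single word with no embedded spaces (`rest` and `first_field` are illustrative names):

```r
line <- "Year Built:  1873 Gross Building Area:  578 sq ft"

# Step 1: drop everything up to and including "Built:"
rest <- sub(".*Built:", "", line)

# Step 2: trim the leading spaces, split on runs of white space, keep the first field
first_field <- strsplit(trimws(rest), "[[:space:]]+")[[1]][1]
first_field
```

For values that can contain spaces, the text would instead have to be cut at the next delimiter string (here "Gross") rather than at the first white space.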

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 25-Sep-07                                       Time: 23:20:01
------------------------------ XFMail ------------------------------

Gabor Grothendieck

2007-Sep-25 22:59 UTC


Perhaps you could clarify what the general rule is, but assuming
that what you want is any word after a colon, it can be done with
strapply in the gsubfn package like this:

Lines <- c("Year Built:  1873 Gross Building Area:  578 sq ft",
"Total Rooms:  6 Living Area:  578 sq ft")

library(gsubfn)
strapply(Lines, ": *(\\w+)", backref = -1)

# or if each line has same number of returned words
strapply(Lines, ": *(\\w+)", backref = -1, simplify = rbind)

This matches a colon (:) followed by zero or more spaces ( *)
followed by a word ((\\w+)), and backref = -1 causes it to return
only the first backreference (i.e. the portion within parentheses)
but not the match itself.
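
For comparison, roughly the same extraction can be sketched in base R without gsubfn, using gregexpr() and regmatches(). This is a swapped-in technique, not the strapply call above; since it returns the whole match, the leading colon and spaces are removed in a second step.

```r
Lines <- c("Year Built:  1873 Gross Building Area:  578 sq ft",
           "Total Rooms:  6 Living Area:  578 sq ft")

# Locate every "colon, optional spaces, word" match in each line
m <- gregexpr(": *\\w+", Lines)
hits <- regmatches(Lines, m)

# Strip the leading colon and spaces, leaving only the word
words <- lapply(hits, function(h) sub("^: *", "", h))
words
```

The result is a list with one character vector per input line; do.call(rbind, words) would give a matrix when every line yields the same number of words.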
