Dear List,

I have an ASCII text file with data I'd like to extract. Example:

Year Built: 1873 Gross Building Area: 578 sq ft
Total Rooms: 6 Living Area: 578 sq ft

There is a lot of data I'd like to ignore in each record, so I'm hoping there is a way to use strings as delimiters to get the data I want (e.g. tell R to take data between "Built:" and "Gross" - incidentally, not always numeric). I think an ugly way would be to start at the end of each record and use a substitution expression to chip away at it, but I'm afraid it will take forever to run. Is there a way to use strings as delimiters in an expression?

Thanks in advance for ideas.

LB
Have you seen help(strsplit)?

On Tue, 25 Sep 2007, lucy b wrote:
> [...] Is there a way to use strings as delimiters in an expression?
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
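To make the pointer to help(strsplit) a little more concrete: the split argument of strsplit() is treated as a regular expression by default, so label strings can serve as delimiters. The following is only a minimal sketch. It assumes the labels "Year Built:" and "Gross Building Area:" appear exactly as in the example record, and the variable names rec and parts are made up for illustration.

```r
rec <- "Year Built: 1873 Gross Building Area: 578 sq ft"

# split on either label; `split` is a regular expression by default
parts <- strsplit(rec, "Year Built:|Gross Building Area:")[[1]]

# the first piece is empty because the record starts with a delimiter,
# so drop empty pieces and trim the surrounding white space
parts <- trimws(parts[nzchar(parts)])
parts
```

Because the pieces are kept as text, this also copes with values that are not numeric, as long as the labels themselves are stable.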
On Tue, 2007-09-25 at 16:39 -0400, lucy b wrote:
> I have an ascii text file with data I'd like to extract. Example:
>
> Year Built: 1873 Gross Building Area: 578 sq ft
> Total Rooms: 6 Living Area: 578 sq ft
>
> [...] Is there a way to use strings as delimiters in an expression?

I don't know that any of the base file-reading functions (read.table() and the like) enable the use of a regex as a delimiter.

If your text file is consistent in the use of the colon ':' as a separator, you might be able to use that. Each of the above lines would then be broken into 3 fields using:

DF <- read.table("YourFile.txt", sep = ":")

> DF
           V1                        V2         V3
1  Year Built  1873 Gross Building Area  578 sq ft
2 Total Rooms             6 Living Area  578 sq ft

You could then parse them further using appropriate functions if needed, such as gsub():

> as.data.frame(lapply(DF[, -1], function(x) gsub("[^0-9]", "", x)))
    V2  V3
1 1873 578
2    6 578

This now gives you the numeric data in two columns. For further processing, you would need to know that the rows follow some predictable or alternating order.

See ?gsub and ?regex for more information.

Hope that provides some help. You also might want to look at ?readLines and ?strsplit as other ways to read in the data and then post-process it once it is in an R object.

Marc Schwartz
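For anyone who wants to try the colon-separator idea without a file on disk, here is a self-contained sketch. It assumes a version of R in which read.table() accepts a text argument, and it passes stringsAsFactors = FALSE so the fields stay character; the names txt and nums are illustrative only. Note that stripping non-digits only makes sense for numeric fields, so the non-numeric values mentioned in the original question would need different handling.

```r
txt <- c("Year Built: 1873 Gross Building Area: 578 sq ft",
         "Total Rooms: 6 Living Area: 578 sq ft")

# read from the character vector instead of a file; keep fields as character
DF <- read.table(text = txt, sep = ":", stringsAsFactors = FALSE)

# drop the first (label-only) column, strip everything that is not a digit
# from the remaining columns, and convert the result to numbers
nums <- as.data.frame(lapply(DF[, -1], function(x) {
    as.numeric(gsub("[^0-9]", "", x))
}))
nums
```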
Here is one way. You can set up a list of the patterns to match against and then apply it to the string. I am not sure what the rest of the text file looks like, but this will return all the values that match.

> x <- readLines(textConnection("Year Built: 1873 Gross Building Area: 578 sq ft
+ Total Rooms: 6 Living Area: 578 sq ft
+ Year Built: 1873 Gross Building Area: 578 sq ft
+ Total Rooms: 6 Living Area: 578 sq ft"))
>
> # list for pattern matches
> m.list <- list(year=".*Year Built:(.*)Gross.*",
+     Buildarea=".*Building Area:(.*)sq ft.*",
+     rooms=".*Rooms:(.*)Liv.*",
+     Livingarea=".*Living Area:(.*)sq ft.*")
>
> # use lapply to process the patterns and return a list with the name of the
> # pattern and its value
> lapply(names(m.list), function(.pat){
+     # see which lines have the desired patterns
+     whichLines <- grep(m.list[[.pat]], x)
+     if (length(whichLines) > 0){
+         return(list(pattern=.pat, values=sub(m.list[[.pat]], "\\1", x[whichLines])))
+     }
+     else return(NULL)
+ })
[[1]]
[[1]]$pattern
[1] "year"

[[1]]$values
[1] " 1873 " " 1873 "


[[2]]
[[2]]$pattern
[1] "Buildarea"

[[2]]$values
[1] " 578 " " 578 "


[[3]]
[[3]]$pattern
[1] "rooms"

[[3]]$values
[1] " 6 " " 6 "


[[4]]
[[4]]$pattern
[1] "Livingarea"

[[4]]$values
[1] " 578 " " 578 "

On 9/25/07, lucy b <lucy.lists at gmail.com> wrote:
> [...] Is there a way to use strings as delimiters in an expression?

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
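As a possible follow-up to the pattern-list approach, the matched values can be trimmed and collected into a data frame with one column per field. This is only a sketch under a strong assumption: every field occurs exactly once per record, so the per-field vectors have the same length and line up row by row. The names vals and records are made up for illustration.

```r
x <- c("Year Built: 1873 Gross Building Area: 578 sq ft",
       "Total Rooms: 6 Living Area: 578 sq ft",
       "Year Built: 1873 Gross Building Area: 578 sq ft",
       "Total Rooms: 6 Living Area: 578 sq ft")

# same patterns as in the message above
m.list <- list(year = ".*Year Built:(.*)Gross.*",
               Buildarea = ".*Building Area:(.*)sq ft.*",
               rooms = ".*Rooms:(.*)Liv.*",
               Livingarea = ".*Living Area:(.*)sq ft.*")

# for each pattern, keep the captured text from the lines that match it
# and trim the surrounding white space
vals <- lapply(m.list, function(pat) {
    hits <- grepl(pat, x)
    trimws(sub(pat, "\\1", x[hits]))
})

# assumption: each field occurs once per record, so the vectors line up
records <- as.data.frame(vals, stringsAsFactors = FALSE)
records
```

If some records lack a field, the vectors will differ in length and as.data.frame() will fail, so real data would need a record identifier to join on.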
On 25-Sep-07 20:39:11, lucy b wrote:
> I have an ascii text file with data I'd like to extract. Example:
>
> Year Built: 1873 Gross Building Area: 578 sq ft
> Total Rooms: 6 Living Area: 578 sq ft
>
> [...] Is there a way to use strings as delimiters in an expression?

The scope of what you're trying to achieve is not clear, though on the basis of your examples above you'd have to use a different separator pattern for each type of line.

For your first example, a simple method is on the lines of

gsub(".*Built:", "", "Year Built: 1873 Gross Building Area: 578 sq ft")
[1] " 1873 Gross Building Area: 578 sq ft"

and then just take the first white-space-delimited field from the result.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 25-Sep-07                                       Time: 23:20:01
------------------------------ XFMail ------------------------------
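The second step described above (taking the first white-space-delimited field) might be sketched as follows. The variable names are illustrative, and trimws() is used here as one way to avoid an empty leading field; it is not part of the original suggestion.

```r
rec <- "Year Built: 1873 Gross Building Area: 578 sq ft"

# drop everything up to and including the "Built:" label
rest <- gsub(".*Built:", "", rec)

# split on runs of white space; trimming first avoids an empty first field
fields <- strsplit(trimws(rest), "[[:space:]]+")[[1]]
fields[1]
```

This keeps only one word, so a value made of several words would be cut short; in that case a pattern anchored on the following label (here "Gross") is probably the safer choice.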
Perhaps you could clarify what the general rule is, but assuming that what you want is any word after a colon, it can be done with strapply in the gsubfn package like this:

Lines <- c("Year Built: 1873 Gross Building Area: 578 sq ft",
           "Total Rooms: 6 Living Area: 578 sq ft")

library(gsubfn)
strapply(Lines, ": *(\\w+)", backref = -1)

# or if each line has same number of returned words
strapply(Lines, ": *(\\w+)", backref = -1, simplify = rbind)

This matches a colon (:) followed by zero or more spaces ( *) followed by a word ((\\w+)), and backref = -1 causes it to return only the first backreference (i.e. the portion within parentheses) but not the match itself.

On 9/25/07, lucy b <lucy.lists at gmail.com> wrote:
> [...] Is there a way to use strings as delimiters in an expression?
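If installing gsubfn is not an option, roughly the same result can be had in base R with gregexpr() and regmatches(). To be clear, this is a substitute technique and not strapply itself: it finds each ": word" match and then strips the colon and spaces afterwards, since these base functions return whole matches. The names m, words and res are made up for illustration.

```r
Lines <- c("Year Built: 1873 Gross Building Area: 578 sq ft",
           "Total Rooms: 6 Living Area: 578 sq ft")

# find every "colon, optional spaces, word" match on each line
m <- regmatches(Lines, gregexpr(": *\\w+", Lines))

# strip the leading colon and spaces, leaving only the word itself
words <- lapply(m, function(s) sub("^: *", "", s))

# assumption: each line yields the same number of words, so rbind is safe
res <- do.call(rbind, words)
res
```

As with the strapply version, only a single word after each colon is returned, so "578 sq ft" comes back as "578".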