Min-Han Tan
2010-Apr-23 04:24 UTC
[R] Problem with parsing a dataset - help earnestly sought
Dear fellow R-help members,
I hope to seek your advice on how to parse/manage a dataset with hundreds of
columns. Two examples of these columns, 'cancer.problems', and
'neuro.problems' are depicted below. Essentially, I need to parse this
into
a useful dataset, and unfortunately, I am not familiar with perl or any such
language.
data <- data.frame(id=c(1:10))
data$cancer.problems <- c("Y; DX AGE: 28; COLON", "",
"Y; DX AGE: 27;", "Y;
LIVER","","Y","Y; DX AGE:
24;","Y","Y;DX AGE: 44;","Y;DX AGE: 39;
TESTIS")
data$neuro.problems <- c("Y: DX AGE:
80-89;","Y","","Y; DX AGE: 74;
STROKE","Y; DEMENTIA","Y","","Y; DX AGE:
33; CHOREA", "Y", "Y; WEAKNESS")
As can be seen, the semi-colon delimiter follows its own set of rules, which
are internally consistent - with all 3 elements of data, it should be
"Status; Age; Tissue Type". However, if there is only tissue type, it
is"
Status; Tissue Type", without the trailing semi-colon. However, if there is
Age available, it is "Status; Age;".
The main challenge for me is how to parse/convert this dataset into a useful
and consistent data.frame, or list, where I can capture Status, Age and
Tissue Type as separate fields. Due to the varying application of the
delimiter, I cannot use strsplit consistently. I have tried a convoluted
method by identifying "AGE" as the character string identifying 3
element
fields per below, but faced problems with unlist, given the empty fields.
age.present <- grepl("AGE",data[,2])
data.3column <- strsplit(data[age.present,2],";")
data.2column <- strsplit(data[!age.present,2],";")
data$cancer.status[age.present] <- unlist(data.3column)
[(1:sum(age.present)*3)-2]
...
Your advice is earnestly sought.
Thanks.
Min-Han
[[alternative HTML version deleted]]
Jan van der Laan
2010-Apr-23 06:57 UTC
[R] Problem with parsing a dataset - help earnestly sought
Something like this?
# Remove everything after ; to give the status
status <- sub(';.*$', '', data$cancer.problems)
# Remove everything before the last ; to give tissue
# In case a no ; in the string this goes wrong; correct
tissue <- sub('^.*;[ \n]*', '', data$cancer.problems)
tissue[! grepl(';', data$cancer.problems)] <- ''
# Select the part between ;'s to give age
indices <- regexpr(';.*;', data$cancer.problems)
lengths <- attr(indices, "match.length")
age <- rep(NA, length(data$cancer.problems))
age[indices>0] <- substring(data$cancer.problems[indices>0],
indices[indices>0]+1, indices[indices>0]+lengths[indices>0]-2)
Hope it helps.
Regards,
Jan
On Fri, Apr 23, 2010 at 6:24 AM, Min-Han Tan <minhan.science at gmail.com>
wrote:> Dear fellow R-help members,
>
> I hope to seek your advice on how to parse/manage a dataset with hundreds
of
> columns. Two examples of these columns, 'cancer.problems', and
> 'neuro.problems' are depicted below. Essentially, I need to parse
this into
> a useful dataset, and unfortunately, I am not familiar with perl or any
such
> language.
>
> data <- data.frame(id=c(1:10))
> data$cancer.problems <- c("Y; DX AGE: 28; COLON",
"", "Y; DX AGE: 27;", "Y;
> LIVER","","Y","Y; DX AGE:
24;","Y","Y;DX AGE: 44;","Y;DX AGE: 39;
TESTIS")
> data$neuro.problems <- c("Y: DX AGE:
80-89;","Y","","Y; DX AGE: 74;
> STROKE","Y; DEMENTIA","Y","","Y; DX
AGE: 33; CHOREA", "Y", "Y; WEAKNESS")
>
> As can be seen, the semi-colon delimiter follows its own set of rules,
which
> are internally consistent - with all 3 elements of data, it should be
> "Status; Age; Tissue Type". However, if there is only tissue
type, it is"
> Status; Tissue Type", without the trailing semi-colon. However, if
there is
> Age available, it is "Status; Age;".
>
> The main challenge for me is how to parse/convert this dataset into a
useful
> and consistent data.frame, or list, where I can capture Status, Age and
> Tissue Type as separate fields. Due to the varying application of the
> delimiter, I cannot use strsplit consistently. I have tried a convoluted
> method by identifying "AGE" as the character string identifying 3
element
> fields per below, but faced problems with unlist, given the empty fields.
>
> age.present <- grepl("AGE",data[,2])
> data.3column <- strsplit(data[age.present,2],";")
> data.2column <- strsplit(data[!age.present,2],";")
> data$cancer.status[age.present] <- unlist(data.3column)
> [(1:sum(age.present)*3)-2]
> ...
>
> Your advice is earnestly sought.
>
> Thanks.
>
> Min-Han
>
> ? ? ? ?[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
Gabor Grothendieck
2010-Apr-23 09:13 UTC
[R] Problem with parsing a dataset - help earnestly sought
We can use strapply in the gsubfn package. It extracts
fields matching regular expressions.
strapply extracts the parenthesized part of the regular
expression (or the entire regular expression if nothing
parenthesized), applies the function to it and returns
the result. See http://gsubfn.googlecode.com
This works with the rules stated below and works on your
example but the general rules may only be apparent with
more data in which case you may need to make appropriate
adjustments.
Note that the regular expressions:
\w refers to a word character and must be written \\w when within in quotes.
+ means one or more occurrences in a row
$ means end of string
library(gsubfn)
NULL2NA <- function(x) if (is.null(x)) NA else x
extract <- function(x) {
# age is "word" that comes after DX AGE:
age <- strapply(x, "DX AGE: (\\w+)", c)
age <- sapply(age, null2NA)
# tissue is 2 or more word characters at end
tissue <- strapply(x, "\\w\\w+$", c)
tissue <- sapply(tissue, null2NA)
data.frame(age, tissue)
}
extract(data[,2])
extract(data[,3])
On Fri, Apr 23, 2010 at 12:24 AM, Min-Han Tan <minhan.science at
gmail.com> wrote:> Dear fellow R-help members,
>
> I hope to seek your advice on how to parse/manage a dataset with hundreds
of
> columns. Two examples of these columns, 'cancer.problems', and
> 'neuro.problems' are depicted below. Essentially, I need to parse
this into
> a useful dataset, and unfortunately, I am not familiar with perl or any
such
> language.
>
> data <- data.frame(id=c(1:10))
> data$cancer.problems <- c("Y; DX AGE: 28; COLON",
"", "Y; DX AGE: 27;", "Y;
> LIVER","","Y","Y; DX AGE:
24;","Y","Y;DX AGE: 44;","Y;DX AGE: 39;
TESTIS")
> data$neuro.problems <- c("Y: DX AGE:
80-89;","Y","","Y; DX AGE: 74;
> STROKE","Y; DEMENTIA","Y","","Y; DX
AGE: 33; CHOREA", "Y", "Y; WEAKNESS")
>
> As can be seen, the semi-colon delimiter follows its own set of rules,
which
> are internally consistent - with all 3 elements of data, it should be
> "Status; Age; Tissue Type". However, if there is only tissue
type, it is"
> Status; Tissue Type", without the trailing semi-colon. However, if
there is
> Age available, it is "Status; Age;".
>
> The main challenge for me is how to parse/convert this dataset into a
useful
> and consistent data.frame, or list, where I can capture Status, Age and
> Tissue Type as separate fields. Due to the varying application of the
> delimiter, I cannot use strsplit consistently. I have tried a convoluted
> method by identifying "AGE" as the character string identifying 3
element
> fields per below, but faced problems with unlist, given the empty fields.
>
> age.present <- grepl("AGE",data[,2])
> data.3column <- strsplit(data[age.present,2],";")
> data.2column <- strsplit(data[!age.present,2],";")
> data$cancer.status[age.present] <- unlist(data.3column)
> [(1:sum(age.present)*3)-2]
> ...
>
> Your advice is earnestly sought.
>
> Thanks.
>
> Min-Han
>
> ? ? ? ?[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>