Min-Han Tan
2010-Apr-23 04:24 UTC
[R] Problem with parsing a dataset - help earnestly sought
Dear fellow R-help members, I hope to seek your advice on how to parse/manage a dataset with hundreds of columns. Two examples of these columns, 'cancer.problems', and 'neuro.problems' are depicted below. Essentially, I need to parse this into a useful dataset, and unfortunately, I am not familiar with perl or any such language. data <- data.frame(id=c(1:10)) data$cancer.problems <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;", "Y; LIVER","","Y","Y; DX AGE: 24;","Y","Y;DX AGE: 44;","Y;DX AGE: 39; TESTIS") data$neuro.problems <- c("Y: DX AGE: 80-89;","Y","","Y; DX AGE: 74; STROKE","Y; DEMENTIA","Y","","Y; DX AGE: 33; CHOREA", "Y", "Y; WEAKNESS") As can be seen, the semi-colon delimiter follows its own set of rules, which are internally consistent - with all 3 elements of data, it should be "Status; Age; Tissue Type". However, if there is only tissue type, it is" Status; Tissue Type", without the trailing semi-colon. However, if there is Age available, it is "Status; Age;". The main challenge for me is how to parse/convert this dataset into a useful and consistent data.frame, or list, where I can capture Status, Age and Tissue Type as separate fields. Due to the varying application of the delimiter, I cannot use strsplit consistently. I have tried a convoluted method by identifying "AGE" as the character string identifying 3 element fields per below, but faced problems with unlist, given the empty fields. age.present <- grepl("AGE",data[,2]) data.3column <- strsplit(data[age.present,2],";") data.2column <- strsplit(data[!age.present,2],";") data$cancer.status[age.present] <- unlist(data.3column) [(1:sum(age.present)*3)-2] ... Your advice is earnestly sought. Thanks. Min-Han [[alternative HTML version deleted]]
Jan van der Laan
2010-Apr-23 06:57 UTC
[R] Problem with parsing a dataset - help earnestly sought
Something like this? # Remove everything after ; to give the status status <- sub(';.*$', '', data$cancer.problems) # Remove everything before the last ; to give tissue # In case a no ; in the string this goes wrong; correct tissue <- sub('^.*;[ \n]*', '', data$cancer.problems) tissue[! grepl(';', data$cancer.problems)] <- '' # Select the part between ;'s to give age indices <- regexpr(';.*;', data$cancer.problems) lengths <- attr(indices, "match.length") age <- rep(NA, length(data$cancer.problems)) age[indices>0] <- substring(data$cancer.problems[indices>0], indices[indices>0]+1, indices[indices>0]+lengths[indices>0]-2) Hope it helps. Regards, Jan On Fri, Apr 23, 2010 at 6:24 AM, Min-Han Tan <minhan.science at gmail.com> wrote:> Dear fellow R-help members, > > I hope to seek your advice on how to parse/manage a dataset with hundreds of > columns. Two examples of these columns, 'cancer.problems', and > 'neuro.problems' are depicted below. Essentially, I need to parse this into > a useful dataset, and unfortunately, I am not familiar with perl or any such > language. > > data <- data.frame(id=c(1:10)) > data$cancer.problems <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;", "Y; > LIVER","","Y","Y; DX AGE: 24;","Y","Y;DX AGE: 44;","Y;DX AGE: 39; TESTIS") > data$neuro.problems <- c("Y: DX AGE: 80-89;","Y","","Y; DX AGE: 74; > STROKE","Y; DEMENTIA","Y","","Y; DX AGE: 33; CHOREA", "Y", "Y; WEAKNESS") > > As can be seen, the semi-colon delimiter follows its own set of rules, which > are internally consistent - with all 3 elements of data, it should be > "Status; Age; Tissue Type". However, if there is only tissue type, it is" > Status; Tissue Type", without the trailing semi-colon. However, if there is > Age available, it is "Status; Age;". > > The main challenge for me is how to parse/convert this dataset into a useful > and consistent data.frame, or list, where I can capture Status, Age and > Tissue Type as separate fields. Due to the varying application of the > delimiter, I cannot use strsplit consistently. I have tried a convoluted > method by identifying "AGE" as the character string identifying 3 element > fields per below, but faced problems with unlist, given the empty fields. > > age.present <- grepl("AGE",data[,2]) > data.3column <- strsplit(data[age.present,2],";") > data.2column <- strsplit(data[!age.present,2],";") > data$cancer.status[age.present] <- unlist(data.3column) > [(1:sum(age.present)*3)-2] > ... > > Your advice is earnestly sought. > > Thanks. > > Min-Han > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Gabor Grothendieck
2010-Apr-23 09:13 UTC
[R] Problem with parsing a dataset - help earnestly sought
We can use strapply in the gsubfn package. It extracts fields matching regular expressions. strapply extracts the parenthesized part of the regular expression (or the entire regular expression if nothing parenthesized), applies the function to it and returns the result. See http://gsubfn.googlecode.com This works with the rules stated below and works on your example but the general rules may only be apparent with more data in which case you may need to make appropriate adjustments. Note that the regular expressions: \w refers to a word character and must be written \\w when within in quotes. + means one or more occurrences in a row $ means end of string library(gsubfn) NULL2NA <- function(x) if (is.null(x)) NA else x extract <- function(x) { # age is "word" that comes after DX AGE: age <- strapply(x, "DX AGE: (\\w+)", c) age <- sapply(age, null2NA) # tissue is 2 or more word characters at end tissue <- strapply(x, "\\w\\w+$", c) tissue <- sapply(tissue, null2NA) data.frame(age, tissue) } extract(data[,2]) extract(data[,3]) On Fri, Apr 23, 2010 at 12:24 AM, Min-Han Tan <minhan.science at gmail.com> wrote:> Dear fellow R-help members, > > I hope to seek your advice on how to parse/manage a dataset with hundreds of > columns. Two examples of these columns, 'cancer.problems', and > 'neuro.problems' are depicted below. Essentially, I need to parse this into > a useful dataset, and unfortunately, I am not familiar with perl or any such > language. > > data <- data.frame(id=c(1:10)) > data$cancer.problems <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;", "Y; > LIVER","","Y","Y; DX AGE: 24;","Y","Y;DX AGE: 44;","Y;DX AGE: 39; TESTIS") > data$neuro.problems <- c("Y: DX AGE: 80-89;","Y","","Y; DX AGE: 74; > STROKE","Y; DEMENTIA","Y","","Y; DX AGE: 33; CHOREA", "Y", "Y; WEAKNESS") > > As can be seen, the semi-colon delimiter follows its own set of rules, which > are internally consistent - with all 3 elements of data, it should be > "Status; Age; Tissue Type". However, if there is only tissue type, it is" > Status; Tissue Type", without the trailing semi-colon. However, if there is > Age available, it is "Status; Age;". > > The main challenge for me is how to parse/convert this dataset into a useful > and consistent data.frame, or list, where I can capture Status, Age and > Tissue Type as separate fields. Due to the varying application of the > delimiter, I cannot use strsplit consistently. I have tried a convoluted > method by identifying "AGE" as the character string identifying 3 element > fields per below, but faced problems with unlist, given the empty fields. > > age.present <- grepl("AGE",data[,2]) > data.3column <- strsplit(data[age.present,2],";") > data.2column <- strsplit(data[!age.present,2],";") > data$cancer.status[age.present] <- unlist(data.3column) > [(1:sum(age.present)*3)-2] > ... > > Your advice is earnestly sought. > > Thanks. > > Min-Han > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >