David Duffy
2006-Jul-23 23:19 UTC
[R] Re constructing a dataframe from a database of newspaper articles
> From: Bob Green <bgreen at dyson.brisnet.org.au>
>
> I am hoping for some assistance with formatting a large text file which
> consists of a series of individual records. Each record includes specific
> labels/field names (a sample of 1 record (one of the longest ones) is
> below - at end of post). What I want to do is reformat the data, so that
> each individual record becomes a row (some cells will have a lot of text).
> For example, the column variables I want are (a) HD in one column
> (b) BY in one column (c) WC data in one column, (d) PD data in one
> column, (e) SC data in one column (f) PG data in one column & (g) LP and TD
> text in one column - this column can contain quite a lot of text, e.g. 1900
> words. The other fields are unwanted.
>
> If there were 150 individual records, when formatted this would be a 7
> column by 150 row dataset.

Most transparently,

txt <- readLines("c:\\cm-mht1.txt")
# one record per headline (HD) line
no_of_records <- length(grep("^HD", txt))
res <- matrix(nrow=no_of_records, ncol=8)
idx <- 0
for (i in seq_along(txt)) {
  # an HD line starts a new record (a new row of res)
  if (regexpr("^HD", txt[i])!=-1) idx <- idx+1
  if (regexpr("^HD", txt[i])!=-1) res[idx, 1] <- txt[i]
  if (regexpr("^BY", txt[i])!=-1) res[idx, 2] <- txt[i]
  ...
  if (regexpr("^TD", txt[i])!=-1) res[idx, 8] <- txt[i]
}
# merge LP (column 7) and TD (column 8) into one column, then drop column 8
res[,7] <- paste(res[,7], res[,8], sep="; ")
res <- res[,-8]

| David Duffy (MBBS PhD)                                         ,-_|\
| email: davidD at qimr.edu.au  ph: INT+61+7+3362-0217 fax: -0101  /     *
| Epidemiology Unit, Queensland Institute of Medical Research   \_,-._/
| 300 Herston Rd, Brisbane, Queensland 4029, Australia  GPG 4D0B994A v
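For anyone who wants to try the one-line-per-field idea without the original file, here is a minimal sketch on made-up data. It is a vectorised variant of the loop above, not the code as posted: cumsum() over the HD lines gives each line its record number, and each label fills one column. The sample records are invented, and the column order HD, BY, WC, PD, SC, PG, LP, TD is an assumption taken from the list in the question.

```r
# Hypothetical sample: two short records, one line per field.
txt <- c(
  "HD Headline one",
  "BY Jane Smith",
  "WC 120 words",
  "PD 1 July 2006",
  "SC Example Times",
  "PG 3",
  "LP Lead paragraph one.",
  "TD Body text one.",
  "HD Headline two",
  "WC 80 words",
  "LP Lead paragraph two.",
  "TD Body text two."
)

# Assumed label order (from the question; the post elides the middle ones).
tags <- c("HD", "BY", "WC", "PD", "SC", "PG", "LP", "TD")

is_hd <- grepl("^HD", txt)
rec <- cumsum(is_hd)   # record number of every line (0 before the first HD)

res <- matrix(NA_character_, nrow = sum(is_hd), ncol = length(tags),
              dimnames = list(NULL, tags))
for (k in seq_along(tags)) {
  hit <- grepl(paste0("^", tags[k]), txt) & rec > 0
  res[rec[hit], k] <- txt[hit]   # fields missing from a record stay NA
}

# Merge LP and TD into one column, then drop TD: 7 columns remain.
res[, "LP"] <- paste(res[, "LP"], res[, "TD"], sep = "; ")
res <- res[, colnames(res) != "TD", drop = FALSE]
dat <- as.data.frame(res, stringsAsFactors = FALSE)
```

Note that paste() turns a missing LP or TD into the string "NA", so records lacking one of the two would need extra handling. Like the loop above, this only works when each field sits on a single line.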
David Duffy
2006-Jul-24 00:40 UTC
[R] Re constructing a dataframe from a database of newspaper articles
On Mon, 24 Jul 2006, David Duffy wrote:

> > From: Bob Green <bgreen at dyson.brisnet.org.au>
> >
> > I am hoping for some assistance with formatting a large text file which
> > consists of a series of individual records. Each record includes specific
> > labels/field names (a sample of 1 record (one of the longest ones) is
> > below - at end of post). What I want to do is reformat the data, so that
> > each individual record becomes a row (some cells will have a lot of text).
> > For example, the column variables I want are (a) HD in one column
> > (b) BY in one column (c) WC data in one column, (d) PD data in one
> > column, (e) SC data in one column (f) PG data in one column & (g) LP and TD
> > text in one column - this column can contain quite a lot of text, e.g. 1900
> > words. The other fields are unwanted.
> >
> > If there were 150 individual records, when formatted this would be a 7
> > column by 150 row dataset.

Oops, I forgot to add the bit about multiple lines per field...

txt <- readLines("c:\\cm-mht1.txt")
txt <- gsub("[ ]+"," ",txt)     # collapse runs of spaces into one
txt <- gsub("^[ ]+","",txt)     # strip leading spaces
no_of_records <- length(grep("^HD",txt))
res <- matrix("", nrow=no_of_records, ncol=7)
idx <- 0    # current record (row of res)
typ <- 0    # current column; 0 means an unwanted field
for (i in seq_along(txt)) {
  if (regexpr("^HD", txt[i])!=-1) {
    idx <- idx+1
    typ <- 1
  } else if (regexpr("^BY", txt[i])!=-1) {
    typ <- 2
  }
  ...
  } else if (regexpr("(^LP)|(^TD)", txt[i])!=-1) {
    typ <- 7
  } else if (regexpr("^[A-Z][A-Z]", txt[i])!=-1) {
    # any other two-letter label: skip until the next wanted field
    typ <- 0
  }
  # lines without a label keep the previous typ, so they are
  # appended to the field they continue
  if (typ>0) {
    res[idx,typ] <- paste(res[idx,typ], txt[i], sep=" ")
  }
}

| David Duffy (MBBS PhD)                                         ,-_|\
| email: davidD at qimr.edu.au  ph: INT+61+7+3362-0217 fax: -0101  /     *
| Epidemiology Unit, Queensland Institute of Medical Research   \_,-._/
| 300 Herston Rd, Brisbane, Queensland 4029, Australia  GPG 4D0B994A v
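To see the multi-line handling run end to end, here is a small self-contained sketch of the same idea on invented data. It replaces the chain of else-if branches with a lookup vector (col_of, a hypothetical name). The columns assigned to WC, PD, SC and PG are an assumption based on the order in the question, since the post elides those branches, and the sample lines, including the unwanted CR and CO labels, are made up.

```r
# Hypothetical sample: two records with multi-line and unwanted fields.
txt <- c(
  "HD  Headline one",
  "BY  Jane Smith",
  "CR  Unwanted credit line",
  "LP  First line of lead",
  "    continues here.",
  "TD  Body text",
  "    more body text.",
  "CO  unwanted company code",
  "HD  Headline two",
  "LP  Second lead."
)

txt <- gsub("[ \t]+", " ", txt)   # collapse runs of blanks into one space
txt <- sub("^ ", "", txt)         # drop the single leading space left behind

# Assumed label-to-column map; LP and TD share column 7.
col_of <- c(HD = 1, BY = 2, WC = 3, PD = 4, SC = 5, PG = 6, LP = 7, TD = 7)

res <- matrix("", nrow = sum(grepl("^HD", txt)), ncol = 7)
idx <- 0   # current record
typ <- 0   # current column; 0 means "skip this field"
for (line in txt) {
  if (grepl("^[A-Z][A-Z]", line)) {          # line starts with a label
    tag <- substr(line, 1, 2)
    if (tag == "HD") idx <- idx + 1          # a headline opens a new record
    typ <- if (tag %in% names(col_of)) col_of[[tag]] else 0
  }
  # Unlabelled lines keep the previous typ, so they extend that field.
  if (typ > 0 && idx > 0) {
    res[idx, typ] <- paste(res[idx, typ], line, sep = " ")
  }
}
res[] <- sub("^ ", "", res)   # remove the space added by the first paste()
```

One caveat applies to both versions: because leading blanks are stripped first, a continuation line that happens to begin with two capital letters matches "^[A-Z][A-Z]" and is treated as a new label. If the real file indents continuation lines, testing for labels before stripping the blanks may be safer, but that depends on the file layout.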