thr3ads.net - R help - [R] Re constructing a dataframe from a database of newspaper articles [Jul 2006]

David Duffy

2006-Jul-23 23:19 UTC

[R] Re constructing a dataframe from a database of newspaper articles

> From: Bob Green <bgreen at dyson.brisnet.org.au>
>
> I am hoping for some assistance with formatting a large text file which
> consists of a series of individual records. Each record includes specific
> labels/field names (a sample of 1 record (one of the longest ones) is
> below - at end of post). What I want to do is reformat the data, so that
> each individual record becomes a row (some cells will have a lot of text).
> For example, the column variables I want are (a) HD in one column,
> (b) BY in one column, (c) WC data in one column, (d) PD data in one
> column, (e) SC data in one column, (f) PG data in one column & (g) LP
> and TD text in one column - this column can contain quite a lot of text,
> e.g. 1900 words. The other fields are unwanted.
>
> If there were 150 individual records, when formatted this would be a 7
> column by 150 row dataset.

Most transparently,

txt <- readLines("c:\\cm-mht1.txt")
# one record per HD line
no_of_records <- length(grep("^HD", txt))
# 8 columns for now; columns 7 and 8 are merged after the loop
res <- matrix(nr=no_of_records, nc=8)
idx <- 0
for (i in seq_along(txt)) {
  # an HD line starts a new record
  if (regexpr("^HD", txt[i])!=-1) idx <- idx+1

  if (regexpr("^HD", txt[i])!=-1) res[idx, 1] <- txt[i]
  if (regexpr("^BY", txt[i])!=-1) res[idx, 2] <- txt[i]
  ...
  if (regexpr("^TD", txt[i])!=-1) res[idx, 8] <- txt[i]
}
# combine columns 7 and 8 into one text column, then drop column 8
res[,7] <- paste(res[,7], res[,8], sep="; ")
res <- res[,-8]
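
The chain of if() tests above can also be written as a lookup from label to column number. What follows is only a sketch under assumptions: the sample lines in txt are invented for illustration (the real data would come from readLines), each field is taken to sit on a single line, and the label is taken to be the first two characters of the line. The names cols, tag, rec and keep are mine, not part of the original suggestion.

```r
# Hypothetical sample: two short records, one line per field.
txt <- c("HD First headline", "BY Jane Doe", "WC 120 words",
         "PD 1 July 2006", "SC ABCD", "PG 3", "LP Lead paragraph.",
         "TD Rest of the text.", "CO unwanted field",
         "HD Second headline", "WC 80 words", "PD 2 July 2006",
         "SC EFGH", "PG 5", "LP Another lead.", "TD More text.")

# Label -> column number; LP and TD get separate columns and are merged below.
cols <- c(HD = 1, BY = 2, WC = 3, PD = 4, SC = 5, PG = 6, LP = 7, TD = 8)

tag  <- substr(txt, 1, 2)        # two-letter label at the start of each line
rec  <- cumsum(tag == "HD")      # record number, incremented at every HD line
keep <- tag %in% names(cols) & rec > 0   # wanted fields inside a record

# Fill the matrix in one step using (row, column) index pairs.
res <- matrix(NA_character_, nrow = max(rec), ncol = length(cols))
res[cbind(rec[keep], unname(cols[tag[keep]]))] <- txt[keep]

# Merge LP and TD into one column, as in the loop version.
res[, 7] <- paste(res[, 7], res[, 8], sep = "; ")
res <- res[, -8]
```

A field missing from a record (BY in the second sample record) is left as NA. As in the loop version, a missing LP or TD would show up as the text "NA" after paste().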


| David Duffy (MBBS PhD)                                         ,-_|\
| email: davidD at qimr.edu.au  ph: INT+61+7+3362-0217 fax: -0101  /     *
| Epidemiology Unit, Queensland Institute of Medical Research   \_,-._/
| 300 Herston Rd, Brisbane, Queensland 4029, Australia  GPG 4D0B994A v

David Duffy

2006-Jul-24 00:40 UTC

[R] Re constructing a dataframe from a database of newspaper articles

On Mon, 24 Jul 2006, David Duffy wrote:
> > From: Bob Green <bgreen at dyson.brisnet.org.au>
> >
> > I am hoping for some assistance with formatting a large text file which
> > consists of a series of individual records. Each record includes specific
> > labels/field names (a sample of 1 record (one of the longest ones) is
> > below - at end of post). What I want to do is reformat the data, so that
> > each individual record becomes a row (some cells will have a lot of text).
> > For example, the column variables I want are (a) HD in one column,
> > (b) BY in one column, (c) WC data in one column, (d) PD data in one
> > column, (e) SC data in one column, (f) PG data in one column & (g) LP
> > and TD text in one column - this column can contain quite a lot of text,
> > e.g. 1900 words. The other fields are unwanted.
> >
> > If there were 150 individual records, when formatted this would be a 7
> > column by 150 row dataset.

Oops, I forgot to add the bit about multiple lines per field...

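txt <- readLines("c:\\cm-mht1.txt")
# squeeze runs of blanks, then drop leading blanks (continuation lines)
txt <- gsub("[ ]+"," ",txt)
txt <- gsub("^[ ]+","",txt)
no_of_records <- length(grep("^HD", txt))
res <- matrix("", nr=no_of_records, nc=7)
idx <- 0   # current record (row)
typ <- 0   # current field (column); 0 means "unwanted, skip"
for (i in seq_along(txt)) {
  if (regexpr("^HD", txt[i])!=-1) {
    idx <- idx+1
    typ <- 1
  } else if (regexpr("^BY", txt[i])!=-1) {
    typ <- 2
  }
  ...
  } else if (regexpr("(^LP)|(^TD)", txt[i])!=-1) {
    typ <- 7
  } else if (regexpr("^[A-Z][A-Z]", txt[i])!=-1) {
    # any other two-capital label is an unwanted field; note this also
    # matches a continuation line that happens to start with two capitals
    typ <- 0
  }
  # lines without a label keep the previous typ, so they are appended
  if (typ>0) {
    res[idx,typ] <- paste(res[idx,typ], txt[i], sep=" ")
  }
}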
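
For reference, here is a self-contained variant of the same state-machine idea that runs as is. It is a sketch under assumptions, not a tested solution for the real file: the sample record is invented, continuation lines are assumed to be indented in the raw file, and a label is assumed to be two capitals followed by a blank or end of line. Unlike the code above, it tests for the label before stripping indentation and drops the label from the stored text. The names col_of, fields, cur and articles are mine.

```r
# Hypothetical sample record with wrapped fields and one unwanted field (CO).
txt <- c("HD   Council approves",
         "     new library",
         "BY   Jane Doe",
         "WC   350 words",
         "PD   1 July 2006",
         "SC   ABCD",
         "PG   3",
         "LP   The council voted on Monday.",
         "TD   The vote was close.",
         "     Work starts next year.",
         "CO   unwanted field",
         "     unwanted continuation")

# Label -> output column; LP and TD share the TEXT column.
col_of <- c(HD = "HD", BY = "BY", WC = "WC", PD = "PD",
            SC = "SC", PG = "PG", LP = "TEXT", TD = "TEXT")
fields <- unique(unname(col_of))

res <- matrix("", nrow = sum(grepl("^HD( |$)", txt)), ncol = length(fields),
              dimnames = list(NULL, fields))
idx <- 0    # current record (row)
cur <- NA   # current column name; NA means "unwanted field, skip"
for (line in txt) {
  if (grepl("^[A-Z]{2}( |$)", line)) {      # a new field starts on this line
    lab <- substr(line, 1, 2)
    if (lab == "HD") idx <- idx + 1         # HD opens a new record
    cur <- if (lab %in% names(col_of)) col_of[[lab]] else NA
    line <- substring(line, 3)              # drop the label itself
  }
  if (idx > 0 && !is.na(cur)) {
    line <- gsub("[ ]+", " ", trimws(line)) # tidy blanks
    res[idx, cur] <- trimws(paste(res[idx, cur], line))
  }
}
articles <- as.data.frame(res, stringsAsFactors = FALSE)
```

With 150 records in txt, articles would have 150 rows and the 7 columns asked for; str(articles) is a quick way to check.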


