Hi,
I'm trying to read a file containing HTML markup (discussion board
posts) and write the parts of each post (date, author, title, body) to
fields of a record in an output file. This is a one-off job and I'm
trying to use R to do it.
The file looks something like this:
<br><ul>Created: --- - Dr. Johnsons's article -
concerns<br><p>After writing some
text..........</p><br>-- by
Anon.<br>
<br><ul>Created: --- - RE:Dr. Johnson's article -
concerns<br><p>With
some advance notice about some text........<br>-- by
Anon.<br>
So the <br> tags indicate where the field entries begin and end, and
"Created" and "by Anon" mark the beginning and end of each post.
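To make that layout concrete, here is a minimal sketch of how a single post might be pulled apart on the <br> tags. The post string is a made-up stand-in, not a line from the real file. The variable names (post, parts, header, body, author) are only illustrative. The date and title are left together in header because the real date format is not shown above. It also assumes the last <br>-separated piece is always the "-- by" line.

```r
## Hypothetical single post, already cut out of the file (not real data)
post <- "Created: --- - Some title<br><p>Some body text</p><br>-- by Anon."

## Split the post on the literal <br> tags
parts <- strsplit(post, "<br>", fixed = TRUE)[[1]]
n <- length(parts)

## First piece: drop the "Created: " label, leaving date and title together
header <- sub("^Created: ", "", parts[1])

## Last piece: drop the "-- by " label, leaving the author
author <- sub("^-- by ", "", parts[n])

## Everything in between is the body; rejoin in case the body itself
## contained <br> tags, then strip the <p> and </p> tags
body <- paste(parts[-c(1, n)], collapse = "<br>")
body <- gsub("</?p>", "", body)
```

This would not care whether the closing </p> is present, which matters because the second sample post above has none.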
The file is named "Module_1.txt". Here is what I have so far:
## adapted from http://finzi.psych.upenn.edu/R/Rhelp02a/archive/64261.html
## gives post beginning and ending points
starts <- gregexpr("Created", readChar("Module_1.txt",
                   file.info("Module_1.txt")$size))[[1]]
ends <- gregexpr("by Anon", readChar("Module_1.txt",
                 file.info("Module_1.txt")$size))[[1]]
## open connection
chk <- file("Module_1.txt", "r")
#seek(chk, origin = "start", ends[1]) # moves through file
## initialize an array
hold <- array(rep(NA, length(starts)))
## write a function to read the connection
catchtext <- function(start, end, source) {
    for(i in 1:length(starts))
    {
        hold[i] <- readChar(chk, nchar = ends[i] - starts[i])
    }
}
#close(chk)
and my output is
> hold
  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
 [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[126] NA NA
At this point I'm just trying to get the entire file cut up into
post-sized chunks. Later I'll go through and output the separate bits
into fields. As can be seen, I'm having trouble moving through the
connection to the places I want to read from it. Suggestions welcome.
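For comparison, a minimal sketch of the chunking step that skips the connection and cuts the already-read string at the match positions. This is one possible approach under stated assumptions, not something I have run against the real file. The txt value is a made-up two-post stand-in; with the real data it would presumably come from readChar("Module_1.txt", file.info("Module_1.txt")$size) as above. The names txt, end_marker and posts are only illustrative, and the sketch assumes every "Created" is followed by its own "by Anon".

```r
## Hypothetical stand-in for the contents of Module_1.txt (not real data)
txt <- paste(
  "<br><ul>Created: --- - First title<br><p>First body</p><br>-- by Anon.<br>\n",
  "<br><ul>Created: --- - RE: First title<br><p>Second body<br>-- by Anon.<br>\n",
  sep = ""
)

end_marker <- "by Anon"

## Character position where each post starts
starts <- as.integer(gregexpr("Created", txt, fixed = TRUE)[[1]])

## Character position of the last character of each end marker
ends <- as.integer(gregexpr(end_marker, txt, fixed = TRUE)[[1]]) +
  nchar(end_marker) - 1L

## Sanity check on the pairing assumption stated above
stopifnot(length(starts) == length(ends), all(starts < ends))

## substring() is vectorised over its start and stop arguments,
## so this returns one element per post without a loop
posts <- substring(txt, starts, ends)
```

Here posts would be a character vector with one element per post, each running from "Created" through "by Anon".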
Thanks,
Scot
> version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          6.2
year           2008
month          02
day            08
svn rev        44383
language       R
version.string R version 2.6.2 (2008-02-08)
--
Scot McNary
smcnary at charm dot net