Sorry to answer my own question - I guess here's one way to read this
table. Other suggestions are still welcome.
Chris
------
library(XML)

x <- htmlParse("<table>
<tr><td rowspan=2>ab</td><td>X</td></tr>
<tr><td rowspan=2>YZ</td></tr>
<tr><td>c</td></tr>
</table>", asText = TRUE)
# split by rows
z <- getNodeSet(x, "//tr")
# create empty data.frame - probably not the best solution...
t1 <- data.frame(matrix(NA, nrow = 3, ncol = 2))
for (i in 1:3) {
  # rowspan of each cell in row i (defaults to 1 if the attribute is missing)
  rowspan <- as.numeric(xpathSApply(z[[i]], ".//td", xmlGetAttr, "rowspan", 1))
  val <- xpathSApply(z[[i]], ".//td", xmlValue)
  # fill values into empty cells
  n <- which(is.na(t1[i, ]))
  t1[i, n] <- val
  for (j in seq_along(rowspan)) {
    if (rowspan[j] > 1) {
      ## repeat value down column
      t1[(i + 1):(i + rowspan[j] - 1), n[j]] <- val[j]
    }
  }
}
t1
  X1 X2
1 ab  X
2 ab YZ
3  c YZ
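
The loop above hard-codes 3 rows and 2 columns. Here is a rough sketch of how the same idea might be wrapped in a function that takes its dimensions from the table itself. The name rowspan_table is made up for illustration (it is not part of the XML package), and the sketch assumes one table per document, no colspan attributes, and that the first row contains every column.

```r
library(XML)

# Hypothetical helper, not part of any package: expands rowspan cells
# into a character matrix. Assumes one table, no colspan attributes,
# and a first row that contains every column.
rowspan_table <- function(html) {
  doc  <- htmlParse(html, asText = TRUE)
  rows <- getNodeSet(doc, "//tr")
  nr   <- length(rows)
  nc   <- length(getNodeSet(rows[[1]], "./td | ./th"))
  out  <- matrix(NA_character_, nrow = nr, ncol = nc)
  for (i in seq_len(nr)) {
    val  <- as.character(unlist(xpathSApply(rows[[i]], "./td | ./th", xmlValue)))
    span <- as.numeric(unlist(xpathSApply(rows[[i]], "./td | ./th",
                                          xmlGetAttr, "rowspan", 1)))
    # columns in this row not already filled by a rowspan from above
    free <- which(is.na(out[i, ]))
    for (j in seq_along(val)) {
      # clamp so a rowspan cannot run past the last row
      last <- min(i + span[j] - 1, nr)
      out[i:last, free[j]] <- val[j]
    }
  }
  out
}

html <- "<table>
<tr><td rowspan=2>ab</td><td>X</td></tr>
<tr><td rowspan=2>YZ</td></tr>
<tr><td>c</td></tr>
</table>"
res <- rowspan_table(html)
res
```

This returns a character matrix rather than a data.frame, so cells can be filled without worrying about column types; as.data.frame(res) converts it if needed.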
If you are interested, I used this code in the pmcTable function at
https://github.com/cstubben/pubmed . To get Table 1, this now works...
doc <- pmc("PMC3544749")  # downloads XML from OAI service
t1 <- pmcTable(doc, 1)    # parse table... also saves caption and footnotes to attributes
t1[1:4,1:4]
  Category                         Gen Name Rv number Description
1 Lipids and Fatty Acid Metabolism kasB     Rv2246    3-oxoacyl-[acyl-carrier protein] synthase 2 kasb
2 Mycolic acid synthesis           mmaA4    Rv0642c   Methoxy mycolic acid synthase 4
3 Mycolic acid synthesis           pcaA     Rv0470c   Mycolic acid synthase (cyclopropane synthase)
4 Mycolic acid synthesis           pcaA     Rv0470c   Mycolic acid synthase (cyclopropane synthase)
--
Chris Stubben
Los Alamos National Lab
Bioscience Division
MS M888
Los Alamos, NM 87545