Here is my code.
There are three ways to get the text to be parsed by the htmlParse function.
1. A file on my computer
options(encoding="gbk")
library(XML)
xmltext1 <- htmlParse("/home/tiger/Desktop/27174.htm" )
# /home/tiger/Desktop/27174.htm is a local copy of
# http://www.jb51.net/article/27174.htm, downloaded to my computer.
2. URL
options(encoding="gbk")
library(XML)
xmltext2 <- htmlParse("http://www.jb51.net/article/27174.htm" )
3. readLines
options(encoding="gbk")
library(XML)
txt <- readLines("http://www.jb51.net/article/27174.htm")
xmltext3 <- htmlParse(txt,asText=TRUE)
Methods 1 and 2 are fine: both return the full content to be parsed.
When I run method 3, to my surprise, xmltext3 contains some of the content, but
much of it is missing. The result is not the same as with methods 1 and 2. Why?
Only a small part of the HTML comes back:
> xmltext3
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="zh-cn"><head>
<meta http-equiv="Content-Type" content="text/html;
charset=gb2312">
<title>PYTHONæ£å</title>
</head></html>
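One thing that might be worth checking, offered only as a sketch: the output stops right after a garbled title, so the lines returned by readLines may no longer be in the encoding that the page's meta tag declares (gb2312). This is a guess, not a confirmed cause. The helper below, parse_lines, is a hypothetical name and not part of the XML package. It collapses the lines into one string and passes the encoding to htmlParse explicitly through its encoding argument. The sample lines are made up so that the snippet runs offline.

```r
library(XML)

# Hypothetical helper (not part of the XML package): parse HTML held as a
# character vector of lines, stating the encoding explicitly instead of
# leaving the parser to rely on the page's <meta> charset declaration.
parse_lines <- function(lines, enc) {
  # Join the lines into a single document string
  html <- paste(lines, collapse = "\n")
  htmlParse(html, asText = TRUE, encoding = enc)
}

# Made-up sample input, used only so that the sketch runs without a network
sample_lines <- c("<html><head><title>demo</title></head>",
                  "<body><p>hello</p></body></html>")
doc <- parse_lines(sample_lines, "UTF-8")

# Extract the paragraph text to confirm that the body was parsed
print(xpathSApply(doc, "//p", xmlValue))
```

For the page above, the call would be something like parse_lines(txt, "UTF-8"), assuming the session locale is UTF-8 and readLines has already converted the text from gbk. If the text is still in its original encoding, "gbk" would be the value to try instead.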