Hello
I tried to use CSVSource with the TextDocCol function in the tm package, but
a) data from several columns is concatenated into one entry, and
b) data in a large text column is broken into several entries.
I had hoped that it would be possible to assign columns as metadata to one
entry, with one specific column being the original text to analyze.
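To make clear what I am hoping for, here is a rough sketch in base R
only (not tm). The column names and the choice of "model" as the text
column are assumptions made purely for illustration, and this is not
how CSVSource or TextDocCol work:

```r
# Two rows in the same layout as cars.csv; the file has no header row,
# so the column names below are hypothetical.
csv_text <- paste(
  '1997,"Ford","Mustang","3000.00"',
  '1999,"Chevy","Venture",4900.00',
  sep = "\n"
)

cars_df <- read.csv(text = csv_text, header = FALSE,
                    col.names = c("year", "make", "model", "price"),
                    stringsAsFactors = FALSE)

# Assumption: "model" stands in for the column holding the text to analyze.
text_column <- "model"

# One document per CSV row ...
docs <- cars_df[[text_column]]

# ... and every other column kept as per-document metadata.
meta <- cars_df[, setdiff(names(cars_df), text_column), drop = FALSE]
```

Here docs has one element per row of the file, and meta has one row of
metadata for each of those elements.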
Here is an example from the vignette (the backslashes in the output are
not in the original data):
> cars <- system.file("texts", "cars.csv", package = "tm")
> tdc <- TextDocCol(CSVSource(cars))
Read 5 items
> inspect(tdc)
A text document collection with 5 text documents
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
[1] "1997,\"Ford\",\"Mustang\",\"3000.00\""
[[2]]
[1] "1999,\"Chevy\",\"Venture\",4900.00"
[[3]]
[1] "1996,\"Chrylser\",\"Cherokee\",\"4799.00\""
[[4]]
[1] "2005,\"Ferrari\",\"Modena\",\"80999.00\""
[[5]]
[1] "1973,\"Tank\",\"\",\"9900.00\""
I also have a question about the best workflow for text mining/analysis:
My original data is in a MySQL table. Is it possible to import the
data directly into TextDocCol without creating an intermediate CSV
file?
I am using
> R.Version()
$platform
[1] "powerpc-apple-darwin8.10.1"
$arch
[1] "powerpc"
$os
[1] "darwin8.10.1"
$system
[1] "powerpc, darwin8.10.1"
$status
[1] ""
$major
[1] "2"
$minor
[1] "6.1"
$year
[1] "2007"
$month
[1] "11"
$day
[1] "26"
$`svn rev`
[1] "43537"
$language
[1] "R"
$version.string
[1] "R version 2.6.1 (2007-11-26)"
--
Armin Goralczyk, M.D.
--
Universitätsmedizin Göttingen
Abteilung Allgemein- und Viszeralchirurgie
Rudolf-Koch-Str. 40
39099 Göttingen
--
Dept. of General Surgery
University of Göttingen
Göttingen, Germany
--
http://www.gwdg.de/~agoralc