bryan rasmussen
2013-Feb-19 00:19 UTC
[Rd] best way to extract this meaningful data from a table
I have a table with a structure like the following:

lang | basic id | doc id       | topics |
se   | 447157   | MD_2002_0014 | 12     |

loaded.topics <- read.table("path to file", header=TRUE, sep="|", fileEncoding="utf-8")

In that table the meaningful data (in this context) is the text before the first underscore in doc id, which is the document type (for example MD, as above), and topics. However, topics can hold more than one value; multiple values are comma separated. If there is no actual topic I have a 0, although I can also have an empty column if I want.

So what I want is the best way to extract the meaningful data, that is, the comma-separated values of each topics column and the actual document type, so that I can start to do reports of how many documents of type X have no topics, the median number of topics per document type, etc.

Do I have to loop through the table and build up a new table with the info I want, or is there a smarter way to do it? If there is a smarter way, what is it?

Thanks,
Bryan Rasmussen
Kasper Daniel Hansen
2013-Feb-19 02:57 UTC
[Rd] best way to extract this meaningful data from a table
This is not an R-devel question, so please do not reply to this list.

I would try

sapply(strsplit(loaded.topics$doc.id, "_"), function(xx) xx[1])

to get the MD part.

Kasper
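A minimal sketch of how the same vectorised idea could be extended to the topics column and the per-type reports asked about, with no explicit loop over rows. The sample rows, the absence of a trailing "|", and the names doc.type, n.topics, topic.list, no.topics and median.topics are assumptions made for illustration; they are not from the original messages. Columns are read as character because, depending on the R version, read.table may return doc.id as a factor, which strsplit() does not accept.

```r
# Inline sample standing in for the real file (rows are made up for illustration).
sample_text <- "lang|basic id|doc id|topics
se|447157|MD_2002_0014|12
se|447158|MD_2002_0015|0
se|447159|XY_2003_0001|3,7,9
se|447160|XY_2003_0002|4"

# Read every column as character so the string functions below work directly.
loaded.topics <- read.table(text = sample_text, header = TRUE, sep = "|",
                            strip.white = TRUE, colClasses = "character")

# Document type: everything before the first underscore in doc.id.
loaded.topics$doc.type <- sub("_.*$", "", loaded.topics$doc.id)

# Split the comma-separated topics; "0" or an empty field counts as no topic.
topic.list <- strsplit(loaded.topics$topics, ",", fixed = TRUE)
topic.list <- lapply(topic.list, function(x) {
  x <- trimws(x)
  x[!is.na(x) & nzchar(x) & x != "0"]
})
loaded.topics$n.topics <- vapply(topic.list, length, integer(1))

# Number of documents with no topics, per document type.
no.topics <- tapply(loaded.topics$n.topics == 0, loaded.topics$doc.type, sum)

# Median number of topics per document type.
median.topics <- tapply(loaded.topics$n.topics, loaded.topics$doc.type, median)
```

Keeping topic.list as a list alongside the data frame leaves the individual topic values available for later reports, while n.topics and doc.type are ordinary columns that tapply() or aggregate() can summarise.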