Melis Mete
2010-Apr-04 15:04 UTC
[R] How to add a column to dtm showing a part from directory source?
Hello Experts,

I'm new to R and am having trouble with my graduation project. I have 20 subfolders containing almost 20,000 txt files. What I need to do is create a dtm and add a column to it showing the "class" of each txt file. My directory source looks like "C:\\R\\20news-18828\\comp.graphics" for the comp.graphics subfolder. I need only the "comp.graphics" part to appear in the CLASS column. Please help...

--
View this message in context: http://n4.nabble.com/How-to-add-a-column-to-dtm-showing-a-part-from-directory-source-tp1750923p1750923.html
Sent from the R help mailing list archive at Nabble.com.
jim holtman
2010-Apr-04 21:02 UTC
[R] How to add a column to dtm showing a part from directory source?
Here is one way to parse the data. I just took the lines you had in the email to show how to do it. You can do the same thing on your complete object:

> x <- readLines(textConnection("C:\\Program Files\\R\\20news18828/talk.politics.guns/54215
+ C:\\Program Files\\R\\20news18828/talk.politics.guns/54216
+ C:\\Program Files\\R\\20news18828/talk.politics.guns/54217
+ C:\\Program Files\\R\\20news18828/talk.politics.misc/178341
+ C:\\Program Files\\R\\20news18828/talk.politics.misc/178342
+ C:\\Program Files\\R\\20news18828/talk.politics.misc/178343
+ C:\\Program Files\\R\\20news18828/talk.politics.mideast/75964
+ C:\\Program Files\\R\\20news18828/talk.politics.mideast/75965"))
> # parse the data with 'strsplit' (split at the '/' character)
> x.parsed <- strsplit(x, '/')
> # now create a matrix with the directory in the first column and the file in the second
> # each element of 'x.parsed' has 3 pieces and we only want the last two, 'c(2,3)'
> x.names <- t(sapply(x.parsed, '[', c(2,3)))
>
> x.names
     [,1]                    [,2]
[1,] "talk.politics.guns"    "54215"
[2,] "talk.politics.guns"    "54216"
[3,] "talk.politics.guns"    "54217"
[4,] "talk.politics.misc"    "178341"
[5,] "talk.politics.misc"    "178342"
[6,] "talk.politics.misc"    "178343"
[7,] "talk.politics.mideast" "75964"
[8,] "talk.politics.mideast" "75965"
>

2010/4/4 MeLiS MeLiS <black.angel.18@hotmail.com>

> Hello again,
>
> I tried what you sent me and I get:
> ...
> [15742] "C:\\Program Files\\R\\20news18828/talk.politics.guns/54215"
> [15743] "C:\\Program Files\\R\\20news18828/talk.politics.guns/54216"
> [15744] "C:\\Program Files\\R\\20news18828/talk.politics.guns/54217"
> ...
> [17608] "C:\\Program Files\\R\\20news18828/talk.politics.misc/178341"
> [17609] "C:\\Program Files\\R\\20news18828/talk.politics.misc/178342"
> [17610] "C:\\Program Files\\R\\20news18828/talk.politics.misc/178343"
> ...
> [16602] "C:\\Program Files\\R\\20news18828/talk.politics.mideast/75964"
> [16603] "C:\\Program Files\\R\\20news18828/talk.politics.mideast/75965"
> ...
>
> This is the closest thing to what I need. For the example above, I only
> need to take the "talk.politics.guns", "talk.politics.misc" and
> "talk.politics.mideast" parts into the list.
> This help document
> (http://127.0.0.1:29974/library/base/html/list.files.html) mentions
> "pattern". Do I need to use this to achieve what I want? I really did not
> understand how to use it.
>
> ------------------------------
> Date: Sun, 4 Apr 2010 12:43:58 -0400
> Subject: Re: [R] How to add a column to dtm showing a part from directory source?
> From: jholtman@gmail.com
> To: black.angel.18@hotmail.com
>
> You can use 'list.files(startPath, recursive=TRUE)' to get a list of all
> the file names and then strip off the paths to create the data that you
> need. Is this what you want to do?
>
> 2010/4/4 MeLiS MeLiS <black.angel.18@hotmail.com>
>
>         word1   word2   word3   ...   CLASS
> doc1                                  comp.graphics
> doc2                                  rec.autos
> doc3                                  rec.motorcycles
> ...                                   ...
>
> This is basically my dtm. I will apply a classification algorithm later to
> categorize newly arriving txt documents, so many of the existing ones will
> be used for machine learning. I have a folder called 20news-18828, and this
> folder includes 20 subfolders, some of which are comp.graphics, rec.autos,
> rec.motorcycles, etc. These subfolders include thousands of txt files.
> After some algorithms I created the dtm showing the most used words in the
> txt files, as you may guess. Now I have to add a column called "CLASS". The
> class column should tell me which subfolder doc1 is in.
> I hope this helps you understand.
>
> ------------------------------
> Date: Sun, 4 Apr 2010 12:07:56 -0400
> Subject: Re: [R] How to add a column to dtm showing a part from directory source?
> From: jholtman@gmail.com
> To: black.angel.18@hotmail.com
>
> I would like to help, but it is not clear what you are asking for since
> there is no example of what you might want in the "dtm" (whatever that is
> supposed to be). What do you mean by the "class" information? An example
> would be helpful. You can recursively go down the subfolders extracting
> information; you just need to tell us what the information is.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
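The thread also suggests using 'list.files(startPath, recursive=TRUE)' and stripping off the paths, but never shows the last step of attaching the result as a CLASS column. The following is only a rough sketch of how that might look in base R. The temporary folder, the file names, the zero-filled matrix standing in for the dtm, and the names 'root', 'doc.class' and 'dtm.df' are all assumptions made for illustration; they are not from the original posts.

```r
# Build a tiny throwaway folder layout so the sketch runs anywhere
# (hypothetical stand-in for the real 20news-18828 folder).
root <- file.path(tempdir(), "20news-demo")
for (cl in c("comp.graphics", "rec.autos")) {
  dir.create(file.path(root, cl), recursive = TRUE, showWarnings = FALSE)
}
writeLines("placeholder text", file.path(root, "comp.graphics", "1001"))
writeLines("placeholder text", file.path(root, "comp.graphics", "1002"))
writeLines("placeholder text", file.path(root, "rec.autos", "2001"))
writeLines("placeholder text", file.path(root, "rec.autos", "2002"))

# List every file below the root, keeping the full path.
files <- list.files(root, recursive = TRUE, full.names = TRUE)

# The class is the name of the folder that directly contains each file:
# dirname() drops the file name, basename() keeps the last folder.
doc.class <- basename(dirname(files))

# Hypothetical stand-in for the dtm: one row per file, in the same
# order as 'files', with made-up word columns.
dtm <- matrix(0, nrow = length(files), ncol = 3,
              dimnames = list(paste0("doc", seq_along(files)),
                              c("word1", "word2", "word3")))

# A matrix can hold only one type, so convert to a data frame before
# adding the character CLASS column.
dtm.df <- data.frame(dtm, CLASS = doc.class, stringsAsFactors = FALSE)
print(dtm.df)
```

This only works if the rows of the dtm are in the same order as the file list, which is worth checking against the document names of the real object before trusting the CLASS column. Compared with splitting on '/', basename(dirname(...)) does not depend on how many path components precede the class folder.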