Not To Miss
2013-Mar-11 04:19 UTC
[R] how to convert a data.frame to tree structure object such as dendrogram
I have a data.frame object like:

> data.frame(x=c('A','A','B','B'), y=c('Ab','Ac','Ba','Bd'))
  x  y
1 A Ab
2 A Ac
3 B Ba
4 B Bd

How could I create a tree structure object like this:

      |---Ab
  A---|
_|    |---Ac
 |
 |    |---Ba
  B---|
      |---Bd

Thanks,
Zech
MacQueen, Don
2013-Mar-11 20:12 UTC
[R] how to convert a data.frame to tree structure object such as dendrogram
You will have to decide what R data structure is a "tree structure". But maybe this will get you started:

> foo <- data.frame(x=c('A','A','B','B'), y=c('Ab','Ac','Ba','Bd'))
> split(foo$y, foo$x)
$A
[1] "Ab" "Ac"

$B
[1] "Ba" "Bd"

I suppose it is at least a little bit tree-like.

-- 
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
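For readers following along, here is a minimal sketch of how the split() result can be treated as a one-level tree. It assumes the columns are character vectors (stringsAsFactors = FALSE is an addition here, not part of the post above), and the name `tree` is purely illustrative.

```r
# Same toy data as in the post; stringsAsFactors = FALSE is an assumption added here
foo <- data.frame(x = c("A", "A", "B", "B"),
                  y = c("Ab", "Ac", "Ba", "Bd"),
                  stringsAsFactors = FALSE)

# One list element per parent; each element holds that parent's children
tree <- split(foo$y, foo$x)

names(tree)     # the parent nodes
tree[["A"]]     # the children of "A"
lengths(tree)   # number of children per parent
```

A named list like this only captures two levels. Deeper nesting needs the split to be repeated inside each element.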
Bert Gunter
2013-Mar-13 20:12 UTC
[R] how to convert a data.frame to tree structure object such as dendrogram
Here is a simpler, less clumsy version of my previous recursive R solution that I sent you privately, which I'll also cc to the list this time. It's now almost a one-liner.

To avoid problems with unused factor levels, I still prefer to have character vectors, not factors, as the data frame columns, so:

df <- data.frame(a=c('A','A','A','B','B','C','C','C'),
                 b=c('Aa','Ab','Ab','Ba','Bd','C1','C2','C3'),
                 c=c('Aa1','Ab1','Ab2','Ba1','Bd2','C11','C12','C13'),
                 stringsAsFactors=FALSE)

makeTree2 <- function(x, i, n) {
    ## x: row indices of df in the current branch
    ## i: column being split on; n: index of the last (leaf) column
    if (i == n) df[x, i]
    else {
        spl <- split(x, df[x, i])
        lapply(spl, function(x) makeTree2(x, i+1, n))  ## Can't use Recall()
    }
}

This is now called as

> makeTree2(seq_len(nrow(df)), 1, ncol(df))  ## no list structure needed for x

yielding (with the root implicit now)

$A
$A$Aa
[1] "Aa1"

$A$Ab
[1] "Ab1" "Ab2"


$B
$B$Ba
[1] "Ba1"

$B$Bd
[1] "Bd2"


$C
$C$C1
[1] "C11"

$C$C2
[1] "C12"

$C$C3
[1] "C13"

On Wed, Mar 13, 2013 at 10:25 AM, Not To Miss <not.to.miss at gmail.com> wrote:
> The ideal solution, I think, is probably recursive. At the last minute I
> decided to write a Python script to do this (using Python instead of Perl or
> R because of Python's mutable dict data structure), although I would have
> preferred to keep all my code in one R piece. I post the code here just in
> case you are interested. It generates a dict of dict of dict ...
>
> Hopefully I will not get beaten up for posting Python code on an R mailing
> list.
> :-)
>
> import sys
> tree = {}
> ## input file is a table with TAB-delimited columns
> for line in open(sys.argv[1]):
>     if line.startswith('#'): continue
>     items = line.strip().split('\t')
>     tmp = tree
>     for item in items:
>         if item not in tmp:
>             tmp[item] = {}
>         tmp = tmp[item]
>
> The tree looks like this for the example:
> {'A': {'Aa': {'Aa1': {}}, 'Ab': {'Ab1': {}, 'Ab2': {}}},
>  'C': {'C3': {'C13': {}}, 'C2': {'C12': {}}, 'C1': {'C11': {}}},
>  'B': {'Bd': {'Bd2': {}}, 'Ba': {'Ba1': {}}}}
>
> On Wed, Mar 13, 2013 at 10:35 AM, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>> On Mar 12, 2013, at 9:22 PM, Not To Miss wrote:
>>
>>> Nope, Bert, you miss me? :-D
>>>
>>> I apologize that I didn't provide a more realistic example and describe
>>> the problem more clearly. The real data are just too complicated to post in
>>> emails, so I made up a simple example, which perhaps seems a little
>>> oversimplified now, but the basic structure is the same. Here is a more
>>> appropriate one:
>>>
>>> > data.frame(a=c('A','A','A','B','B','C','C','C'),
>>> +            b=c('Aa','Ab','Ab','Ba','Bd','C1','C2','C3'),
>>> +            c=c('Aa1','Ab1','Ab2','Ba1','Bd2','C11','C12','C13'))
>>>   a  b   c
>>> 1 A Aa Aa1
>>> 2 A Ab Ab1
>>> 3 A Ab Ab2
>>> 4 B Ba Ba1
>>> 5 B Bd Bd2
>>> 6 C C1 C11
>>> 7 C C2 C12
>>> 8 C C3 C13
>>>
>>> The data structure to convert to:
>>>
>>>         |---Aa------Aa1
>>>     A---|         /--Ab1
>>>    |    |---Ab--|
>>>    |              \--Ab2
>>>    |    |---Ba------Ba1
>>>     B---|
>>>    |    |---Bd------Bd2
>>>    |
>>>    |    /---C1-----C11
>>>     C---|----C2-----C12
>>>         \---C3-----C13
>>>
>>> It's multi-level nested, and I won't know the number of rows and columns
>>> of the data.frame ahead of time. If it's not easy to do in R, I plan to
>>> write a Perl script to do the conversion, since I am more familiar with
>>> Perl. Thanks, Don and Greg, for suggesting solutions.
>>
>> After a bit of coding I am going to say your proposed answer is wrong (or
>> at least improperly specified).
>> The first level can be recovered as you suggest:
>>
>> > sapply(unique(dfrm[[1]]), function(x) dfrm[[2]][grep(x, dfrm[[2]]) ])
>> $A
>> [1] "Aa" "Ab" "Ab"
>>
>> $B
>> [1] "Ba" "Bd"
>>
>> $C
>> [1] "C1" "C2" "C3"
>>
>> But the second level cannot be as you imagined. The third level items
>> beginning with "C1" all get associated together and there are no terminal
>> nodes for C2 or C3 at the third level.
>>
>> > sapply(unique(dfrm[[2]]), function(x) dfrm[[3]][grep(x, dfrm[[3]]) ])
>> $Aa
>> [1] "Aa1"
>>
>> $Ab
>> [1] "Ab1" "Ab2"
>>
>> $Ba
>> [1] "Ba1"
>>
>> $Bd
>> [1] "Bd2"
>>
>> $C1
>> [1] "C11" "C12" "C13"
>>
>> $C2
>> character(0)
>>
>> $C3
>> character(0)
>>
>> lev1 <- sapply(unique(dfrm[[1]]), function(x) dfrm[[2]][grep(x, dfrm[[2]]) ])
>> lapply(lev1, function(ll) lapply(ll, function(lll) dfrm[[3]][grep(lll, dfrm[[3]]) ]) )
>>
>> $A
>> $A[[1]]
>> [1] "Aa1"
>>
>> $A[[2]]
>> [1] "Ab1" "Ab2"
>>
>> $A[[3]]
>> [1] "Ab1" "Ab2"
>>
>>
>> $B
>> $B[[1]]
>> [1] "Ba1"
>>
>> $B[[2]]
>> [1] "Bd2"
>>
>>
>> $C
>> $C[[1]]
>> [1] "C11" "C12" "C13"
>>
>> $C[[2]]
>> character(0)
>>
>> $C[[3]]
>> character(0)
>>
>> --
>> David.
>>
>> On Tue, Mar 12, 2013 at 2:18 PM, Bert Gunter <gunter.berton at gene.com> wrote:
>>>
>>> So Mr. "not.tomiss" missed?
>>>
>>> :(
>>>
>>> -- Bert
>>>
>>> On Tue, Mar 12, 2013 at 1:08 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>> >
>>> > On Mar 12, 2013, at 9:37 AM, Not To Miss wrote:
>>> >
>>> >> Thanks. Is there any more elegant solution? What if I don't know how many
>>> >> levels of nesting ahead of time?
>>> >
>>> > It's even worse than what you now offer as a potential complication.
>>> > You did not provide an example of a data object that would illustrate the
>>> > complexity of the task nor what you consider the correct procedure (i.e. the
>>> > order of the columns to be used for splitting) nor the correct results. The
>>> > task is woefully underspecified at the moment.
>>> > It's a bit akin to asking "how do I do classification" without saying
>>> > what you want to classify.
>>> >
>>> > --
>>> > David.
>>> >>
>>> >> On Tue, Mar 12, 2013 at 8:51 AM, Greg Snow <538280 at gmail.com> wrote:
>>> >>
>>> >>> You can use the lapply or rapply functions on the resulting list to break
>>> >>> each piece into a list itself, then apply the lapply or rapply function to
>>> >>> those resulting lists, ...
>>> >>>
>>> >>> --
>>> >>> Gregory (Greg) L. Snow Ph.D.
>>> >>>
>>> >>> On Mon, Mar 11, 2013 at 3:41 PM, Not To Miss <not.to.miss at gmail.com> wrote:
>>> >>>
>>> >>>> Thanks. That's just a simple example - what if there are more columns and
>>> >>>> more rows? Is there any easy way to create a nested list?
>>> >>>>
>>> >>>> Best,
>>> >>>> Zech

-- 
Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
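As a hedged sketch of the recursive split idea discussed in this thread, the variant below takes the data frame as an argument instead of reading a global `df`, and adds a small printer for the result. The names `make_tree`, `print_tree`, and `d` are hypothetical and not from any post; the code assumes a base data.frame with character (not factor) columns, ordered from root to leaf.

```r
# Hypothetical variant of the recursive split: the data frame is passed in
make_tree <- function(d, rows = seq_len(nrow(d)), i = 1L) {
  if (i == ncol(d)) return(d[rows, i])      # last column: return the leaves
  groups <- split(rows, d[rows, i])         # row indices grouped by column i
  lapply(groups, function(r) make_tree(d, r, i + 1L))
}

# Hypothetical helper: print the nested list as an indented outline
print_tree <- function(node, depth = 0L) {
  pad <- strrep("  ", depth)
  if (is.list(node)) {
    for (nm in names(node)) {
      cat(pad, nm, "\n", sep = "")
      print_tree(node[[nm]], depth + 1L)
    }
  } else {
    cat(paste0(pad, node, "\n"), sep = "")
  }
}

# The three-level example from the thread
d <- data.frame(a = c("A", "A", "A", "B", "B", "C", "C", "C"),
                b = c("Aa", "Ab", "Ab", "Ba", "Bd", "C1", "C2", "C3"),
                c = c("Aa1", "Ab1", "Ab2", "Ba1", "Bd2", "C11", "C12", "C13"),
                stringsAsFactors = FALSE)

tree <- make_tree(d)
print_tree(tree)
```

Because split() groups rows by exact equality of the values in each column, labels that share a prefix (such as "C1" and "C11") are never merged, which is the difficulty the grep-based attempt ran into. With factor columns, unused levels would produce empty branches, so converting to character first is the safer assumption.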