Frank Gibbons
2005-Jun-16 21:59 UTC
[R] possible bug in merge with duplicate blank names in 'by' field.
Run this:

>p <- c('a', 'c', '', ''); a <- c(10, 20, 30, 40); d1 <-
>data.frame(Promoter=p, ip=a) # Note duplicate empty names in p.
>p <- c('b', 'c', 'd', ''); a <- c(15, 20, 30, 40); d2 <-
>data.frame(Promoter=p, ip=a)
>all <- merge(x=d1, y=d2, by="Promoter", all=T)
>all <- merge(x=all, y=d2, by="Promoter", all=T)
>all

Data is this:

>d1
>  Promoter ip
>1        a 10
>2        c 20
>3          30
>4          40
>
>d2
>  Promoter ip
>1        b 15
>2        c 20
>3        d 30
>4          40

Output looks like this:

>  Promoter ip.x ip.y ip
>1            40   30 30
>2            40   40 30
>3            40   30 40
>4            40   40 40
>5        b   15   NA NA
>6        c   20   20 20
>7        d   30   NA NA
>8        a   NA   10 10

The weird thing about this is (in my view) that each instance of '' is considered unique, so with each successive merge, all combinatorial possibilities are explored, like a SQL outer join (Cartesian product). For non-empty names, an inner join is performed.

Dealing with genomic data (10^4 datapoints), it's easy to have a couple of blanks buried in the middle of things, and to combine several replicates with successive merges. I couldn't understand how my three replicates of 6000 points, in which I expected substantial overlap in the labels, were taking so long to merge and ultimately generating 57000 labels. The culprit turned out to be a few hundred blanks buried in the middle.

Why does the empty ("null") name merit special treatment? Perhaps I'm missing something. I hesitate to submit this as a bug, since technically I guess you could say that blank names, especially duplicates, are not kosher. But on the other hand, this combinatorial behaviour seems to occur only for blanks.

-Frank

PhD, Computational Biologist,
Harvard Medical School BCMP/SGM-322, 250 Longwood Ave, Boston MA 02115, USA.
Tel: 617-432-3555 Fax: 617-432-3557
http://llama.med.harvard.edu/~fgibbons
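For anyone hitting the same row explosion, here is a minimal sketch of a pre-merge check on the toy data above. It counts empty and repeated keys and works out how many rows a full outer join should give if '' is treated like any other key. The helpers repeated_keys() and expected_rows() are hypothetical names made up for this illustration, not part of R, and the sketch assumes the key column contains no NA values.

```r
# Toy data from the post above.
d1 <- data.frame(Promoter = c("a", "c", "", ""), ip = c(10, 20, 30, 40))
d2 <- data.frame(Promoter = c("b", "c", "d", ""), ip = c(15, 20, 30, 40))

# Hypothetical helper: keys that occur more than once in a 'by' column.
repeated_keys <- function(keys) {
  keys <- as.character(keys)          # works for factor or character columns
  unique(keys[duplicated(keys)])
}

# Hypothetical helper: number of rows a full outer join (all = TRUE) should
# produce when every key, including "", is matched as an ordinary string.
# A key present on both sides contributes n_x * n_y rows; a key present on
# one side only contributes its own count.
expected_rows <- function(x_keys, y_keys) {
  x_keys <- as.character(x_keys)
  y_keys <- as.character(y_keys)
  keys <- union(x_keys, y_keys)
  n_x <- vapply(keys, function(k) sum(x_keys == k), numeric(1))
  n_y <- vapply(keys, function(k) sum(y_keys == k), numeric(1))
  sum(pmax(n_x, 1) * pmax(n_y, 1))
}

n_empty  <- sum(as.character(d1$Promoter) == "")   # empty keys in d1
repeated <- repeated_keys(d1$Promoter)             # keys seen more than once
n_expect <- expected_rows(d1$Promoter, d2$Promoter)

merged <- merge(x = d1, y = d2, by = "Promoter", all = TRUE)
if (nrow(merged) != n_expect) {
  warning("merge() returned ", nrow(merged), " rows, expected ", n_expect)
}
```

If nrow(merged) comes out larger than n_expect, the merge is doing something other than an ordinary key match on the repeated keys, which is the symptom described above.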
Gabor Grothendieck
2005-Jun-17 03:03 UTC
[R] possible bug in merge with duplicate blank names in 'by' field.
What version of R are you using? I don't get the same result on my system:

> R.version.string # Windows XP
[1] "R version 2.1.0, 2005-06-10"
> p <- c('a', 'c', '', ''); a <- c(10, 20, 30, 40); d1 <-
+ data.frame(Promoter=p, ip=a) # Note duplicate empty names in p.
> p <- c('b', 'c', 'd', ''); a <- c(15, 20, 30, 40); d2 <-
+ data.frame(Promoter=p, ip=a)
> all <- merge(x=d1, y=d2, by="Promoter", all=T)
> all <- merge(x=all, y=d2, by="Promoter", all=T)
> all
  Promoter ip.x ip.y ip
1            30   40 40
2            40   40 40
3        a   10   NA NA
4        c   20   20 20
5        b   NA   15 15
6        d   NA   30 30
Prof Brian Ripley
2005-Jun-17 07:26 UTC
[R] possible bug in merge with duplicate blank names in 'by' field.
What version of R is this (please do see the posting guide)?

In both 2.1.0 and 2.1.1 beta I get

> all
  Promoter ip.x ip.y ip
1            30   40 40
2            40   40 40
3        a   10   NA NA
4        c   20   20 20
5        b   NA   15 15
6        d   NA   30 30

so cannot reproduce your result. Are you sure that the `blanks' really are empty and not some character that is printing as empty on your unstated OS?

BTW ' ' is what is normally called `blank'.

BTW, these are not `names' but character strings: `names' has other meanings in R.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
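One way to answer the question of whether the `blanks' really are empty is to look at character counts and raw bytes rather than at the printed output. The following is only a sketch: the vector p is made-up illustration data with a space and a tab added on purpose, and the whitespace test relies on the POSIX class [[:space:]], which may not catch every invisible character (a non-breaking space, for instance, depends on the locale), so the byte dump is the more reliable check.

```r
# Illustration data (an assumption, not from the thread): one truly empty
# string, one space, one tab.
p <- c("a", "c", "", " ", "\t")

# Truly empty strings have zero characters.
truly_empty <- p == ""

# Strings that print as nothing but still contain whitespace characters.
looks_empty <- nchar(p) > 0 & gsub("[[:space:]]+", "", p) == ""

# Show the raw bytes of the suspicious entries, e.g. 20 for a space.
suspicious_bytes <- lapply(p[looks_empty], charToRaw)
print(suspicious_bytes)
```

Entries flagged by looks_empty would not match '' in a merge even though they print the same way, so they are worth normalising before merging.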
Frank Gibbons
2005-Jun-17 13:46 UTC
[R] possible bug in merge with duplicate blank names in 'by' field.
Thanks for your quick responses, Gabor and Brian.

I'm currently running R version 1.9.1 on Linux. Actually, I have just tested this on R v.2.1.0 running under Windows XP, and indeed, as you both indicate, the problem does not exist on that version for that OS. So, at an appropriate time I'll upgrade my Linux installation to the most recent version (1.9.1 is a year old, I guess).

-Frank

PhD, Computational Biologist,
Harvard Medical School BCMP/SGM-322, 250 Longwood Ave, Boston MA 02115, USA.
Tel: 617-432-3555 Fax: 617-432-3557
http://llama.med.harvard.edu/~fgibbons
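Since the behaviour turned out to depend on the R version, a script that relies on it could record the version and warn on older installations. This is a minimal sketch: the threshold "2.1.0" is an assumption taken from the versions reported in this thread (1.9.1 showed the problem, 2.1.0 did not), not a documented fix version.

```r
# Assumed threshold, based only on what was reported in this thread.
min_version <- "2.1.0"

# getRversion() returns the running version as a comparable object.
version_ok <- getRversion() >= min_version

if (!version_ok) {
  warning("R ", as.character(getRversion()),
          " may mishandle duplicate empty keys in merge(); tested on ",
          min_version, " or later")
}

# Worth including in any report to the list.
print(R.version.string)
```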