thr3ads.net - R help - [R] possible bug in merge with duplicate blank names in 'by' field. [Jun 2005]


Frank Gibbons

2005-Jun-16 21:59 UTC

[R] possible bug in merge with duplicate blank names in 'by' field.

Run this:
>p <- c('a', 'c', '', ''); a <- c(10, 20, 30, 40)
>d1 <- data.frame(Promoter=p, ip=a) # Note duplicate empty names in p.
>p <- c('b', 'c', 'd', ''); a <- c(15, 20, 30, 40)
>d2 <- data.frame(Promoter=p, ip=a)
>all <- merge(x=d1, y=d2, by="Promoter", all=T)
>all <- merge(x=all, y=d2, by="Promoter", all=T)
>all
Data is this:
>d1
>   Promoter ip
>1        a 10
>2        c 20
>3          30
>4          40
>
>d2
>   Promoter ip
>1        b 15
>2        c 20
>3        d 30
>4          40
Output looks like this:
>   Promoter ip.x ip.y ip
>1            40   30 30
>2            40   40 30
>3            40   30 40
>4            40   40 40
>5        b   15   NA NA
>6        c   20   20 20
>7        d   30   NA NA
>8        a   NA   10 10
The weird thing about this is (in my view) that each instance of '' is 
considered unique, so with each successive merge, all combinatorial 
possibilities are explored, like a SQL outer join (Cartesian product). For 
non-empty names, an inner join is performed.

Dealing with genomic data (10^4 datapoints), it's easy to have a couple of 
blanks buried in the middle of things, and to combine several replicates 
with successive merges. I couldn't understand how my three replicates of 
6000 points, in which I expected  substantial overlap in the labels, were 
taking so long to merge and ultimately generating 57000 labels. The culprit 
turned out to be a few hundred blanks buried in the middle.
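
The kind of row inflation described above can be estimated before merging. What follows is only a minimal sketch, assuming the behaviour of current R versions, where a key that occurs m times in x and n times in y contributes m * n rows to the result. The names dup_d1, m, n and expected_rows are illustrative and not part of the original post; d1 and d2 are the toy data frames from the example.

```r
# Toy data from the post; stringsAsFactors = FALSE keeps the keys as plain strings.
d1 <- data.frame(Promoter = c("a", "c", "", ""), ip = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)
d2 <- data.frame(Promoter = c("b", "c", "d", ""), ip = c(15, 20, 30, 40),
                 stringsAsFactors = FALSE)

# Keys repeated within one table are the ones that can multiply rows.
dup_d1 <- unique(d1$Promoter[duplicated(d1$Promoter)])

# Count how often each key occurs on each side of the merge.
keys <- union(d1$Promoter, d2$Promoter)
m <- vapply(keys, function(k) sum(d1$Promoter == k), integer(1))
n <- vapply(keys, function(k) sum(d2$Promoter == k), integer(1))

# With all = TRUE, a key present on both sides yields m * n rows and a key
# present on one side only keeps its own rows, hence pmax(., 1).
expected_rows <- sum(pmax(m, 1) * pmax(n, 1))

merged <- merge(d1, d2, by = "Promoter", all = TRUE)
nrow(merged) == expected_rows
```

If expected_rows is much larger than the number of distinct keys, the repeated keys (here the empty string in d1) are worth inspecting before running several merges in a row.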

Why does the empty ("null") name merit special treatment? Perhaps I'm
missing something. I hesitate to submit this as a bug, since technically I
guess you could say that blank names, especially duplicates, are not 
kosher. But on the other hand, this combinatorial behaviour seems to occur 
only for blanks.

-Frank

PhD, Computational Biologist,
Harvard Medical School BCMP/SGM-322, 250 Longwood Ave, Boston MA 02115, USA.
Tel: 617-432-3555       Fax: 617-432-3557
http://llama.med.harvard.edu/~fgibbons

Gabor Grothendieck

2005-Jun-17 03:03 UTC


What version of R are you using? I don't get the same
result on my system:

> R.version.string # Windows XP
[1] "R version 2.1.0, 2005-06-10"
> p <- c('a', 'c', '', ''); a <- c(10, 20, 30, 40); d1 <-
+ data.frame(Promoter=p, ip=a) # Note duplicate empty names in p.
> p <- c('b', 'c', 'd', ''); a <- c(15, 20, 30, 40); d2 <-
+ data.frame(Promoter=p, ip=a)
> all <- merge(x=d1, y=d2, by="Promoter", all=T)
> all <- merge(x=all, y=d2, by="Promoter", all=T)
> all
  Promoter ip.x ip.y ip
1            30   40 40
2            40   40 40
3        a   10   NA NA
4        c   20   20 20
5        b   NA   15 15
6        d   NA   30 30



Prof Brian Ripley

2005-Jun-17 07:26 UTC


What version of R is this (please do see the posting guide)?

In both 2.1.0 and 2.1.1 beta I get

> all
  Promoter ip.x ip.y ip
1            30   40 40
2            40   40 40
3        a   10   NA NA
4        c   20   20 20
5        b   NA   15 15
6        d   NA   30 30

so cannot reproduce your result. Are you sure that the `blanks' really are 
empty and not some character that is printing as empty on your unstated 
OS?
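
One way to answer that question is to look at the strings themselves rather than at how they print. This is only a sketch: the vector keys is a hypothetical example, not data from this thread, and it mixes a truly empty string with a space and a non-breaking space, both of which can look empty on screen.

```r
# Hypothetical key vector: "" is truly empty, " " is a space, and
# "\u00a0" is a non-breaking space.
keys <- c("a", "", " ", "\u00a0")

# nzchar() is FALSE only for zero-length strings.
is_empty <- !nzchar(keys)

# nchar() reports the number of characters in each string.
n_chars <- nchar(keys)

# charToRaw() shows the underlying bytes, which exposes invisible characters.
bytes <- lapply(keys, charToRaw)

data.frame(key = keys, is_empty = is_empty, n_chars = n_chars)
```

Only the second element is reported as empty; the raw bytes of the other two show which character is actually stored.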

BTW ' ' is what is normally called `blank'.

BTW, these are not `names' but character strings: `names' has other 
meanings in R.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Frank Gibbons

2005-Jun-17 13:46 UTC


Thanks for your quick responses, Gabor and Brian.

I'm currently running R version 1.9.1 on Linux. Actually, I have just 
tested this on R v.2.1.0 running under Windows XP, and indeed, as you both 
indicate, the problem does not exist on that version for that OS. So, at an 
appropriate time I'll upgrade my Linux installation to the most recent 
version (1.9.1 is a year old, I guess).
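
Until an upgrade is possible, one possible workaround is to drop rows with empty keys before merging. The sketch below is not something proposed in the thread: d1_clean, d2_clean and old_r are illustrative names, and the 2.1.0 threshold is assumed only because it is the earliest version reported here as unaffected, not because it is known to be where the behaviour changed.

```r
# Toy data from the original post.
d1 <- data.frame(Promoter = c("a", "c", "", ""), ip = c(10, 20, 30, 40),
                 stringsAsFactors = FALSE)
d2 <- data.frame(Promoter = c("b", "c", "d", ""), ip = c(15, 20, 30, 40),
                 stringsAsFactors = FALSE)

# Flag installations older than the version reported as working in this thread
# (assumed threshold, see the note above).
old_r <- getRversion() < "2.1.0"

# Keep only rows whose key is a non-empty string, then merge as before.
d1_clean <- d1[nzchar(d1$Promoter), ]
d2_clean <- d2[nzchar(d2$Promoter), ]
merged <- merge(d1_clean, d2_clean, by = "Promoter", all = TRUE)
merged
```

This discards the measurements attached to empty keys, so it only makes sense when those rows cannot be matched to anything meaningful anyway.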

-Frank

