thr3ads.net - R help - [R] Choose between duplicated rows [Apr 2012]


francy

2012-Apr-14 19:03 UTC

[R] Choose between duplicated rows

Dear r experts,

Sorry for this basic question, but I can't seem to find a solution...

I have this data frame:

df <- data.frame(id = c("id1", "id1", "id1", "id2", "id2", "id2"),
                 A = c(11905, 11907, 11907, 11829, 11829, 11829),
                 v1 = c(NA, 3, NA, 1, 2, NA),
                 v2 = c(NA, 2, NA, 2, NA, NA),
                 v3 = c(NA, 1, NA, 1, NA, NA),
                 v4 = c("N", "Y", "N", "Y", "N", "N"),
                 v5 = c(0, 0, 0, 1, 0, 0),
                 numMiss = c(3, 0, 3, 0, 2, 3))

> df
   id     A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA  N  0       3
2 id1 11907  3  2  1  Y  0       0
3 id1 11907 NA NA NA  N  0       3
4 id2 11829  1  2  1  Y  1       0
5 id2 11829  2 NA NA  N  0       2
6 id2 11829 NA NA NA  N  0       3


Of the rows that share the same value of "A" within an id, I need to keep
only the one with the fewest missing values across all the variables
(i.e. min(numMiss)), to get this:

   id     A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA  N  0       3
2 id1 11907  3  2  1  Y  0       0
4 id2 11829  1  2  1  Y  1       0

Then, of the rows that have the same id, I have to choose the record with
the smallest value of "A", like this:

   id     A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA  N  0       3
4 id2 11829  1  2  1  Y  1       0

For groupings I have used the package "plyr" before, but this would involve
a sort of double-grouping by id and by duplicated values of A... Could you
please help me understand how this can be done?

Thank you very much.
-f
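
The two steps described in the question can be sketched in base R with order() and duplicated(). This is only an illustration on the sample data from the post, not code from the thread; the names s1, step1 and step2 are made up here, and ties are resolved by keeping whichever row sorts first.

```r
# Sample data from the question
df <- data.frame(
  id = c("id1", "id1", "id1", "id2", "id2", "id2"),
  A = c(11905, 11907, 11907, 11829, 11829, 11829),
  v1 = c(NA, 3, NA, 1, 2, NA),
  v2 = c(NA, 2, NA, 2, NA, NA),
  v3 = c(NA, 1, NA, 1, NA, NA),
  v4 = c("N", "Y", "N", "Y", "N", "N"),
  v5 = c(0, 0, 0, 1, 0, 0),
  numMiss = c(3, 0, 3, 0, 2, 3)
)

# Step 1: within each (id, A) pair keep the row with the smallest numMiss.
# After sorting, duplicated() flags every row that follows the best one.
s1 <- df[order(df$id, df$A, df$numMiss), ]
step1 <- s1[!duplicated(s1[, c("id", "A")]), ]

# Step 2: within each id keep the row with the smallest A.
# step1 is still sorted by id and then A, so the first row per id is the one wanted.
step2 <- step1[!duplicated(step1$id), ]

print(step1)
print(step2)
```

On the sample data, step1 holds original rows 1, 2 and 4 and step2 holds rows 1 and 4, which are the two tables requested in the question.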






--
View this message in context:
http://r.789695.n4.nabble.com/Choose-between-duplicated-rows-tp4557833p4557833.html
Sent from the R help mailing list archive at Nabble.com.

jim holtman

2012-Apr-14 20:08 UTC


try this:

> x  # print data
   id     A v1 v2 v3 v4 v5 numMiss
1 id1 11905 NA NA NA  N  0       3
2 id1 11907  3  2  1  Y  0       0
3 id1 11907 NA NA NA  N  0       3
4 id2 11829  1  2  1  Y  1       0
5 id2 11829  2 NA NA  N  0       2
6 id2 11829 NA NA NA  N  0       3
> # select best data
> xBest <- do.call(rbind, lapply(split(x, x$A), function(.grp){
+     best <- which.min(apply(.grp, 1, function(a) sum(is.na(a))))
+     .grp[best, ]
+ }))
> xBest
       id     A v1 v2 v3 v4 v5 numMiss
11829 id2 11829  1  2  1  Y  1       0
11905 id1 11905 NA NA NA  N  0       3
11907 id1 11907  3  2  1  Y  0       0
>
> xWorst <- do.call(rbind, lapply(split(x, x$A), function(.grp){
+     worst <- which.max(apply(.grp, 1, function(a) sum(is.na(a))))
+     .grp[worst, ]
+ }))
> xWorst
       id     A v1 v2 v3 v4 v5 numMiss
11829 id2 11829 NA NA NA  N  0       3
11905 id1 11905 NA NA NA  N  0       3
11907 id1 11907 NA NA NA  N  0       3



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
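
The reply above counts missing values per row with apply(.grp, 1, function(a) sum(is.na(a))). As a small aside, the same count can be had with rowSums(is.na(...)); the sketch below compares the two on the sample data. The names miss_apply and miss_rowsums are invented for this illustration.

```r
# Sample data from the question
df <- data.frame(
  id = c("id1", "id1", "id1", "id2", "id2", "id2"),
  A = c(11905, 11907, 11907, 11829, 11829, 11829),
  v1 = c(NA, 3, NA, 1, 2, NA),
  v2 = c(NA, 2, NA, 2, NA, NA),
  v3 = c(NA, 1, NA, 1, NA, NA),
  v4 = c("N", "Y", "N", "Y", "N", "N"),
  v5 = c(0, 0, 0, 1, 0, 0),
  numMiss = c(3, 0, 3, 0, 2, 3)
)

# Row-wise NA count as in the reply: apply() walks over the rows
miss_apply <- apply(df, 1, function(a) sum(is.na(a)))

# Same count without an explicit per-row function:
# is.na() on a data frame gives a logical matrix, rowSums() adds up each row
miss_rowsums <- rowSums(is.na(df))

print(miss_apply)
print(miss_rowsums)
```

Both counts agree with the numMiss column already present in the sample data, so on this data frame the per-row count could be replaced by that column.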


Tyler Rinker

2012-Apr-14 20:15 UTC


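My solution:

# Split by every id/A combination (columns 1 and 2)
SP <- split(df, df[, 1:2])
# Return the row of x with the smallest value in column `col`
minner <- function(x, col = 'numMiss') {
    x[which.min(unlist(x[, col])), , drop = FALSE]
}
# Step 1: least missing values within each id/A group
NEW <- do.call('rbind', lapply(SP, minner))
# Step 2: smallest A within each id
SP2 <- split(NEW, NEW[, 'id'])
do.call('rbind', lapply(SP2, function(x) minner(x, 'A')))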
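
One detail of splitting on two columns at once: by default split() creates a group for every id/A combination, including combinations that never occur in the data. The sketch below shows this on a trimmed copy of the sample data (only three of the columns, to keep it short) and what drop = TRUE changes. The names all_groups, used_groups and group_sizes are invented for this illustration.

```r
# Trimmed copy of the sample data from the question
df <- data.frame(
  id = c("id1", "id1", "id1", "id2", "id2", "id2"),
  A = c(11905, 11907, 11907, 11829, 11829, 11829),
  numMiss = c(3, 0, 3, 0, 2, 3)
)

# Default: one list element per id x A combination, including empty ones
all_groups <- split(df, df[, c("id", "A")])

# drop = TRUE: only the combinations that actually occur in the data
used_groups <- split(df, df[, c("id", "A")], drop = TRUE)

# Number of rows in each group of the default split
group_sizes <- sapply(all_groups, nrow)

print(group_sizes)
print(names(used_groups))
```

The empty groups do no harm in the solution above, because selecting from a zero-row data frame returns zero rows and rbind simply skips them; drop = TRUE just avoids creating them.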

Cheers,
Tyler

francy

2012-Apr-15 20:08 UTC


I also tried using Jim's code, but it doesn't work as expected with my real
dataset. This is what I did:

Best.na <- do.call(rbind, lapply(split(x, x$A), function(.grp){
    best <- which.min(apply(.grp, 1, function(a) sum(is.na(a))))
    .grp[best, ]
}))

df.split <- split(Best.na, Best.na$id)

Best.date <- lapply(df.split, function(x){
    # Select by given criterion
    y <- x[which(max(x$A) == x$A), ]
    y
})
Best.date <- do.call(rbind, Best.date)


Thank you again for your help.  
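
Two things in the code above differ from the original question: the first step groups by A alone rather than by id and A, and the second step selects with max(x$A), whereas the question asked for the least value of "A". The thread does not say whether either is the cause of the unexpected result on the real dataset. The sketch below only illustrates the min/max difference, using ave() on the three rows that the question expects after the first step; the names step1, keep_min and keep_max are invented here.

```r
# Sample data from the question
df <- data.frame(
  id = c("id1", "id1", "id1", "id2", "id2", "id2"),
  A = c(11905, 11907, 11907, 11829, 11829, 11829),
  v1 = c(NA, 3, NA, 1, 2, NA),
  v2 = c(NA, 2, NA, 2, NA, NA),
  v3 = c(NA, 1, NA, 1, NA, NA),
  v4 = c("N", "Y", "N", "Y", "N", "N"),
  v5 = c(0, 0, 0, 1, 0, 0),
  numMiss = c(3, 0, 3, 0, 2, 3)
)

# Rows the question expects after the first step (rows 1, 2 and 4)
step1 <- df[c(1, 2, 4), ]

# ave() returns the per-id minimum (or maximum) of A, repeated for every row
keep_min <- step1$A == ave(step1$A, step1$id, FUN = min)
keep_max <- step1$A == ave(step1$A, step1$id, FUN = max)

print(step1[keep_min, ])  # smallest A per id, as asked in the question
print(step1[keep_max, ])  # largest A per id, as in the code above
```

On the sample data the minimum keeps rows 1 and 4, while the maximum keeps rows 2 and 4, so the two criteria disagree for id1.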

