thr3ads.net - R help - [R] identify duplicate from more than one column [Nov 2011]

jour4life

2011-Nov-13 04:16 UTC

[R] identify duplicate from more than one column

Hi all,

I've searched everywhere to try to find out how to do this and have had no
luck. I am trying to construct identifiers for couples in a dataset.
Essentially, I want to identify couples using more than one column as
identifiers. Take for instance:

obs  unit    home  z  sex  age
1    015029  18    1  1    053
2    015029  18    1  2    049
3    015029  01    1  1    038
4    015029  01    1  2    033
5    015029  02    1  1    036
6    015029  02    1  2    033
7    015029  03    1  1    023
8    015029  03    1  2    019
9    015029  04    1  2    045
10   015029  05    1  2    047

Where unit is the housing unit, home is household. Of course, there are more
values for unit, although these first ten observations consist of the same
unit (which could possibly be an apartment complex). Nonetheless, I want to
construct an identifier for couples if unit and home match, but only if both
a male and a female are within the same household. Taking the example data
above, I want to see this:

obs  unit    home  z  sex  age  couple
1    015029  18    1  1    053  1
2    015029  18    1  2    049  1
3    015029  01    1  1    038  2
4    015029  01    1  2    033  2
5    015029  02    1  1    036  3
6    015029  02    1  2    033  3
7    015029  03    1  1    023  4
8    015029  03    1  2    019  4
9    015029  04    1  2    045  0
10   015029  05    1  2    047  0

As you can see in the last two observations, there were no males identified
within the same household, so the last two observations would not get
couple identifiers but rather some other identifier (the same one for all
such rows) so I can detect them and remove them later. I've tried using the
duplicated() function, but it was not very useful.
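
For reference, a minimal sketch of how duplicated() can be applied to more than one column at once. The data frame `toy` and the column `shared` are made up for illustration. Note that this only flags rows that share unit and home with another row; it does not check that both sexes are present, which is probably why it was not enough on its own here.

```r
## toy data (illustrative values, column names as in the example above)
toy <- data.frame(
  unit = rep("015029", 6),
  home = c("18", "18", "01", "01", "04", "05"),
  sex  = c(1, 2, 1, 2, 2, 2)
)

## duplicated() on a data frame compares whole rows, so passing only the
## key columns tests the unit/home combination
key <- toy[c("unit", "home")]

## fromLast = TRUE also flags the first row of each repeated combination
toy$shared <- duplicated(key) | duplicated(key, fromLast = TRUE)
toy
```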

Any help would be greatly appreciated!!! 

Thanks,

Carlos

--
View this message in context:
http://r.789695.n4.nabble.com/identify-duplicate-from-more-than-one-column-tp4035888p4035888.html
Sent from the R help mailing list archive at Nabble.com.

Joshua Wiley

2011-Nov-13 06:19 UTC

Hi Carlos,

Here is one option:

## read in your data
dat <- read.table(textConnection("
obs     unit            home       z    sex     age
1       015029  18             1        1       053
2       015029  18             1        2       049
3       015029  01             1        1       038
4       015029  01             1        2       033
5       015029  02             1        1       036
6       015029  02             1        2       033
7       015029  03             1        1       023
8       015029  03             1        2       019
9       015029  04             1        2       045
10      015029  05             1        2       047"),
  header = TRUE, stringsAsFactors = FALSE)
closeAllConnections()

## create a unique ID for matching unit and home
dat$mID <- with(dat, paste(unit, home, sep = ''))

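## somewhat messy way of creating a couple number
## for each mID, if there is more than 1 row, and more than 1 sex
## it creates a couple id, otherwise 0
## (groups are kept in order of first appearance, and every group returns
## one value per row; returning a single 0 for a multi-row group would make
## the result shorter than the data)
i <- 0L
grp <- factor(dat$mID, levels = unique(dat$mID))
ids <- lapply(split(dat$sex, grp), function(x) {
  if (length(x) > 1 && length(unique(x)) > 1) {
    i <<- i + 1L
    rep(i, length(x))
  } else rep(0L, length(x))
})
## put the per-group results back in the original row order
dat$couple <- unsplit(ids, grp)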

## view results
dat
   obs  unit home z sex age     mID couple
1    1 15029   18 1   1  53 1502918      1
2    2 15029   18 1   2  49 1502918      1
3    3 15029    1 1   1  38  150291      2
4    4 15029    1 1   2  33  150291      2
5    5 15029    2 1   1  36  150292      3
6    6 15029    2 1   2  33  150292      3
7    7 15029    3 1   1  23  150293      4
8    8 15029    3 1   2  19  150293      4
9    9 15029    4 1   2  45  150294      0
10  10 15029    5 1   2  47  150295      0

See these functions for more details:

?ave # where I got my idea
?split
?lapply
?`<<-`
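
A possible variant of the same idea without the global counter, offered as a sketch rather than a tested replacement. It assumes the sample data above; the names `dat2`, `n_sex` and `is_couple` are made up for illustration. Reading every column as character keeps the leading zeros, and a separator in paste() keeps different unit/home pairs from collapsing into the same string.

```r
## same sample data, read as character so leading zeros survive
dat2 <- read.table(text = "
obs unit   home z sex age
1   015029 18   1 1   053
2   015029 18   1 2   049
3   015029 01   1 1   038
4   015029 01   1 2   033
5   015029 02   1 1   036
6   015029 02   1 2   033
7   015029 03   1 1   023
8   015029 03   1 2   019
9   015029 04   1 2   045
10  015029 05   1 2   047
", header = TRUE, colClasses = "character")

## the separator avoids ambiguous IDs (e.g. 1502 + 918 vs 15029 + 18)
dat2$mID <- paste(dat2$unit, dat2$home, sep = "_")

## number of distinct sexes per household, repeated for every row
n_sex <- ave(as.integer(dat2$sex), dat2$mID,
             FUN = function(x) length(unique(x)))
is_couple <- n_sex > 1

## couples are numbered in order of first appearance, all other rows get 0
dat2$couple <- 0L
dat2$couple[is_couple] <- match(dat2$mID[is_couple],
                                unique(dat2$mID[is_couple]))
dat2
```

Because ave() and match() both return one value per row, the result always has the same length as the data.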

Cheers,

Josh


-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/

jour4life

2011-Nov-13 18:37 UTC

Thanks Jim and David!

It seems like both were great options. Both of your suggestions of pasting
the two IDs together worked well, and keeping the pasted ID as a character is
better. Jim's example was interesting, though it gave me the following error:

Error in `$<-.data.frame`(`*tmp*`, "coupleid", value = c(1L, 1L, 2L, 2L,  :
  replacement has 123586 rows, data has 123631

Since this was a large dataframe, I don't know exactly where the error
occurred. It seems like it was detecting missing values in some of the
rows, but when I checked with the is.na() function, it didn't report any
missing values in the variables used (i.e. the new mID or sex).

What do you guys think may be happening?
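
A mismatch like this (123586 values for 123631 rows, so 45 short) usually means that a per-group function returned fewer values than its group has rows, rather than that data are missing. One plausible cause, which is an assumption and not confirmed in this thread, is a household with several rows but only one sex. A hedged sketch for listing such households, using made-up data (`toy`) and the column names mID and sex from above:

```r
## toy data: household "B" has two rows but only one sex
toy <- data.frame(
  mID = c("A", "A", "B", "B", "C"),
  sex = c(1L, 2L, 2L, 2L, 1L)
)

## rows and distinct sexes per household
n_rows <- tapply(toy$sex, toy$mID, length)
n_sex  <- tapply(toy$sex, toy$mID, function(x) length(unique(x)))

## households for which a function returning a single value would drop rows
suspect <- names(n_rows)[n_rows > 1 & n_sex == 1]

## how many rows would be missing from the result in that case
missing_rows <- sum(n_rows[suspect] - 1)
suspect
missing_rows
```

If the count computed this way on the real data equals the shortfall in the error message, these households are the likely culprits.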

Thanks,

Carlos


jour4life

2011-Nov-13 21:46 UTC

Hi Josh,

I'm sorry, it was meant for you. I guess that error doesn't matter for now.
Essentially, I want to apply conditions like the following, and continue
doing so for several variables.

At the end of the day, I'm only going to keep the couple ID and remove the
duplicates. But, before I do that, I want to see how I can write a line/s
that will let me observe both sexes (in the couple) and identify which one
has a certain characteristic and apply that to a new variable. For instance, 

if the man moved residence, but the woman did not, migration = 1,
else if the woman moved residence, but not the man, migration = 2,
else if both man and woman migrated, then migration = 3,
else if neither the man nor the woman migrated, then migration = 0, etc.

However, in order for me to program this and identify them to construct the
variables, I have to ensure that both are in the same couple id, and observe
both sexes in the couple before I remove the duplicates. I thought the
previous example would help me get at this problem, but it still does not
make sense to me.

Using the newly created coupleid (Thanks to you guys!) this is what I want
to see, where mig = migration: 1 = moved and 0 = did not move:

    coupleid  z  sex  age  mig  mig.new
1   01502918  1  1    053  1    3
2   01502918  1  2    049  1    3
3   01502901  1  1    038  0    2
4   01502901  1  2    033  1    2
5   01502902  1  1    036  1    3
6   01502902  1  2    033  1    3
7   01502903  1  1    023  0    0
8   01502903  1  2    019  0    0
9   01502904  1  1    045  0    2
10  01502905  1  2    047  1    2
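
The migration rules above can be sketched with ave(), which computes a value per couple and repeats it on every row of that couple. This is only an illustration under stated assumptions: sex is coded 1 = man and 2 = woman, mig is coded 0/1, each coupleid has exactly one man and one woman, and the data frame `toy` with its values is made up.

```r
## toy data: four couples, one man (sex 1) and one woman (sex 2) each
toy <- data.frame(
  coupleid = rep(c("a", "b", "c", "d"), each = 2),
  sex      = rep(1:2, times = 4),
  mig      = c(1, 1, 0, 1, 1, 0, 0, 0)
)

## 1 if the man (resp. the woman) of the couple moved, else 0,
## repeated on both rows of the couple
man   <- with(toy, ave(mig * (sex == 1), coupleid, FUN = max))
woman <- with(toy, ave(mig * (sex == 2), coupleid, FUN = max))

## man only -> 1, woman only -> 2, both -> 3, neither -> 0
toy$mig.new <- man + 2 * woman
toy
```

Since both rows of a couple carry the same mig.new, the duplicates can be dropped afterwards without losing the information.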


I hope this makes sense, and thanks again, Josh!

Carlos


jour4life

2011-Nov-14 04:20 UTC

Hi William,

This worked like a charm! I was thinking about using reshape(), but was
unsure how to approach it. Though I have a whole lot of variables, I
decided to keep only those variables that contained both sexes'
characteristics, reshape them into wide format, and merge with the rest of
the data later, and it worked perfectly.
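
William's message is not included in this archive page, so the following is only a guess at the kind of reshape() call described here, not his actual code. It assumes sex is coded 1 = man and 2 = woman and mig is coded 0/1; the data frame `toy` and its values are made up.

```r
## toy long-format data: one row per person, two persons per couple
toy <- data.frame(
  coupleid = rep(c("a", "b", "c", "d"), each = 2),
  sex      = rep(1:2, times = 4),
  mig      = c(1, 1, 0, 1, 1, 0, 0, 0)
)

## one row per couple; mig becomes mig.1 (man) and mig.2 (woman)
wide <- reshape(toy, idvar = "coupleid", timevar = "sex",
                direction = "wide")

## with both partners on one row the couple-level variable is a simple sum:
## man only -> 1, woman only -> 2, both -> 3, neither -> 0
wide$mig.new <- wide$mig.1 + 2 * wide$mig.2
wide
```

The wide table can then be joined back to the remaining variables with merge() on coupleid.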

Thank you guys so much. All ideas were great and I greatly appreciate your
help!!

Best,

Carlos

