thr3ads.net - R help - [R] Recoding multiple columns consistently [Aug 2007]

Ron Crump

2007-Aug-29 00:01 UTC

[R] Recoding multiple columns consistently

Hi,

I have a dataframe that contains pedigree information;
that is individual, sire and dam identities as separate
columns. It also has date of birth.

These identifiers are not numeric, or not sequential.

Obviously, an identifier can appear in one or two columns,
depending on whether it was a parent or not. These should
be consistent.

Not all identifiers appear in the individual column - it
is possible for a parent not to have its own record if its
parents were not known.

Missing parental (sire and/or dam) identifiers can occur.

I need to export the data for use in another program that
requires the pedigree to be coded as integers, increasing
with date of birth (therefore sire and dam always have
lower identifiers than their offspring) and with missing
values coded as 0.

How would I go about doing this?

And a second, simpler related question, if I have a column with
n different values (may be strings or non-sequential integers)
identifying levels (possibly with repeated occurrences), how
can I recode them to be sequential from 1 to n?

I can solve both problems in fortran, so could use loops to
do it in R, but feel there should be a quicker, more elegant,
"more R" solution.

Thanks for your help.

Ron.

Uwe Ligges

2007-Aug-29 07:59 UTC

[R] Recoding multiple columns consistently

Ron Crump wrote:
> Hi,
> 
> I have a dataframe that contains pedigree information;
> that is individual, sire and dam identities as separate
> columns. It also has date of birth.
> 
> These identifiers are not numeric, or not sequential.
> 
> Obviously, an identifier can appear in one or two columns,
> depending on whether it was a parent or not. These should
> be consistent.
> 
> Not all identifiers appear in the individual column - it
> is possible for a parent not to have its own record if its
> parents were not known.
> 
> Missing parental (sire and/or dam) identifiers can occur.
> 
> I need to export the data for use in another program that
> requires the pedigree to be coded as integers, increasing
> with date of birth (therefore sire and dam always have
> lower identifiers than their offspring) and with missing
> values coded as 0.
> 
> How would I go about doing this?
> 
> And a second, simpler related question, if I have a column with
> n different values (may be strings or non-sequential integers)
> identifying levels (possibly with repeated occurrences), how
> can I recode them to be sequential from 1 to n?

rank(x, ties.method="first")
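
One point worth checking with a small example: rank() with ties.method="first" gives every element its own rank, so repeated values do not end up sharing a code. If repeated values should map to the same integer from 1 to n, a minimal sketch (the vector x is made-up example data) could use factor() or match() instead:

```r
# Hypothetical example vector with repeated values.
x <- c("b", "a", "c", "a", "b")

# rank() with ties.method = "first" returns a permutation of 1..length(x),
# so the two "a"s (and the two "b"s) receive different numbers.
r <- rank(x, ties.method = "first")

# Codes 1..n in sorted order of the distinct values.
codes_sorted <- as.integer(factor(x))

# Codes 1..n in order of first appearance.
codes_first_seen <- match(x, unique(x))

print(r)                 # 3 1 5 2 4
print(codes_sorted)      # 2 1 3 1 2
print(codes_first_seen)  # 1 2 3 2 1
```

Which of the two codings is preferable depends on whether the order of the codes matters for the target program.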


For the first question above you can proceed as follows, for example:
order() the identifiers by date, make them unique() and assign the result
to a new "levels" object. Then turn each column into an ordered factor:
   factor(some_column, levels=levels, ordered = TRUE)
and as.numeric(factor_object) gives the integer codes you are after.
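
The steps described above can be sketched as follows. This is only an illustration under assumptions: the data frame ped and its column names (id, sire, dam, dob) are made up, every id is assumed to occur once in the id column, and every record is assumed to have a birth date. Parents without a record of their own have no birth date, so the sketch places them ahead of all dated individuals. It uses match() against the level vector, which gives the same codes as as.integer(factor(x, levels = lev)).

```r
# Hypothetical pedigree; X5 and Z2 are parents with no record of their own.
ped <- data.frame(
  id   = c("C9", "A7", "B3", "D1"),
  sire = c("A7", NA,   "X5", "B3"),
  dam  = c("Z2", NA,   NA,   "C9"),
  dob  = as.Date(c("2003-04-01", "2001-02-10", "2002-06-15", "2005-09-30")),
  stringsAsFactors = FALSE
)

# Parents that never appear in the id column (no birth date available).
parents  <- unique(c(ped$sire, ped$dam))
parents  <- parents[!is.na(parents)]
founders <- setdiff(parents, ped$id)

# Level vector: undated founders first, then recorded individuals by birth date.
lev <- c(founders, ped$id[order(ped$dob)])

# Recode one column against the shared levels; missing values become 0.
recode <- function(x) {
  code <- match(x, lev)
  code[is.na(code)] <- 0L
  code
}

out <- data.frame(id   = recode(ped$id),
                  sire = recode(ped$sire),
                  dam  = recode(ped$dam))
out <- out[order(out$id), ]
print(out)
```

Because all three columns are recoded against the same level vector, an identifier receives the same integer wherever it appears.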

Uwe Ligges




> I can solve both problems in fortran, so could use loops to
> do it in R, but feel there should be a quicker, more elegant,
> "more R" solution.
> 
> Thanks for your help.
> 
> Ron.

Jim Lemon

2007-Aug-29 11:01 UTC

[R] Recoding multiple columns consistently

Ron Crump wrote:
> Hi,
> 
> I have a dataframe that contains pedigree information;
> that is individual, sire and dam identities as separate
> columns. It also has date of birth.
> 
> These identifiers are not numeric, or not sequential.
> 
> Obviously, an identifier can appear in one or two columns,
> depending on whether it was a parent or not. These should
> be consistent.
> 
> Not all identifiers appear in the individual column - it
> is possible for a parent not to have its own record if its
> parents were not known.
> 
> Missing parental (sire and/or dam) identifiers can occur.
> 
> I need to export the data for use in another program that
> requires the pedigree to be coded as integers, increasing
> with date of birth (therefore sire and dam always have
> lower identifiers than their offspring) and with missing
> values coded as 0.
> 
> How would I go about doing this?
>

Hi Ron,
Without the genealogical coding system for the output, I can only make a 
guess. It seems as though you are going from a series of records for 
which the index is the individual, followed by fields containing sire, 
dam and date of birth (perhaps not in that order).

I think you want to transform this into a network (maybe hierarchical 
unless consanguinity intervenes) with individuals coded as positive 
integers (and maybe some or all of the original information attached to 
those identifiers). At a guess, I would recode the birthdates as 
integers, preserving the order and including a rule for breaking ties.
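
As a small illustration of recoding birthdates to integers with a tie-breaking rule, one possible sketch (the dob vector is made-up data, and "first come, first ranked" is just one choice of rule) is:

```r
# Hypothetical birth dates; the first and third are tied.
dob <- as.Date(c("2003-04-01", "2001-02-10", "2003-04-01", "2002-06-15"))

# Rank the dates; ties are broken by position in the vector.
birth_order <- rank(as.numeric(dob), ties.method = "first")
print(birth_order)  # 3 1 4 2
```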

Assuming that you want an inverted tree for each individual, construct a 
linked list beginning with the individual with two pointers to the 
parents (their integer identifiers). Each parent has two links pointing 
to their parents, and so on. Whenever a pointer is zero, the linking 
stops. I don't know whether this can be represented in any of the tree 
diagrams in R, but it certainly could be coded.
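
The parent-pointer idea above can be sketched in R with two integer vectors instead of an explicit linked list. Everything here is an assumption for illustration: the vectors sire and dam are made-up data in which position i holds the parents of individual i, and 0 means the parent is unknown. The recursion stops whenever a pointer is zero.

```r
# Hypothetical integer-coded pedigree: element i holds the parent of individual i.
sire <- c(0L, 0L, 0L, 1L, 3L, 4L)
dam  <- c(0L, 0L, 0L, 0L, 2L, 5L)

# Collect all known ancestors of individual i by following the parent pointers.
ancestors <- function(i) {
  parents <- c(sire[i], dam[i])
  parents <- parents[parents > 0L]   # a zero pointer ends that branch
  found <- parents
  for (p in parents) {
    found <- c(found, ancestors(p))
  }
  sort(unique(found))
}

print(ancestors(6L))  # 1 2 3 4 5
print(ancestors(4L))  # 1
```

The recursion terminates as long as every parent has a lower code than its offspring, which is the property the target coding is meant to guarantee.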

I think a bit more information for non-genealogists about the formats 
might elicit a more specific answer.
> And a second, simpler related question, if I have a column with
> n different values (may be strings or non-sequential integers)
> identifying levels (possibly with repeated occurrences), how
> can I recode them to be sequential from 1 to n?
> 
> I can solve both problems in fortran, so could use loops to
> do it in R, but feel there should be a quicker, more elegant,
> "more R" solution.

Sounds like "sort".

Jim
