Hi,

I have a dataframe that contains pedigree information; that is, individual, sire and dam identities as separate columns. It also has date of birth.

These identifiers are not numeric, or not sequential.

Obviously, an identifier can appear in one or two columns, depending on whether it was a parent or not. These should be consistent.

Not all identifiers appear in the individual column: it is possible for a parent not to have its own record if its parents were not known.

Missing parental (sire and/or dam) identifiers can occur.

I need to export the data for use in another program that requires the pedigree to be coded as integers, increasing with date of birth (therefore sire and dam always have lower identifiers than their offspring) and with missing values coded as 0.

How would I go about doing this?

And a second, simpler related question: if I have a column with n different values (may be strings or non-sequential integers) identifying levels (possibly with repeated occurrences), how can I recode them to be sequential from 1 to n?

I can solve both problems in Fortran, so could use loops to do it in R, but feel there should be a quicker, more elegant, "more R" solution.

Thanks for your help.

Ron.
Ron Crump wrote:
> I need to export the data for use in another program that
> requires the pedigree to be coded as integers, increasing
> with date of birth (therefore sire and dam always have
> lower identifiers than their offspring) and with missing
> values coded as 0.
>
> How would I go about doing this?
>
> And a second, simpler related question, if I have a column with
> n different values (may be strings or non-sequential integers)
> identifying levels (possibly with repeated occurrences), how
> can I recode them to be sequential from 1 to n?

rank(x, ties.method="first")

For the first question above you can do as follows, for example: order() the identifiers by date, make them unique() and assign them to a new "levels" object. Then turn the columns into ordered factors:

    factor(some_column, levels=levels, ordered = TRUE)

and as.numeric(factor_object) then gives you the integer codes.

Uwe Ligges
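The order()/unique()/factor() recipe above can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names (id, sire, dam, dob) and the example data are hypothetical, each individual is assumed to have exactly one record, and parents without a record of their own are simply placed ahead of everyone else because they have no birth date to sort on.

```r
# Hypothetical example data; the column names id, sire, dam, dob are assumptions.
ped <- data.frame(
  id   = c("C3", "B7", "A9", "D1"),
  sire = c("B7", "X2", NA,   "B7"),
  dam  = c("A9", NA,   NA,   "C3"),
  dob  = as.Date(c("2003-05-01", "2001-03-10", "2000-01-15", "2005-07-20")),
  stringsAsFactors = FALSE
)

# Parents with no record of their own have no birth date.
# Assumption: they may be coded ahead of all recorded individuals.
founders <- setdiff(unique(na.omit(c(ped$sire, ped$dam))), ped$id)

# Levels: founders first, then recorded individuals by date of birth.
# order() keeps tied dates in their original row order.
lev <- c(founders, ped$id[order(ped$dob)])

# Integer code = position of the identifier in lev; missing parents become 0.
recode <- function(x) {
  code <- as.integer(factor(x, levels = lev))
  code[is.na(code)] <- 0L
  code
}

out <- data.frame(id   = recode(ped$id),
                  sire = recode(ped$sire),
                  dam  = recode(ped$dam))
out <- out[order(out$id), ]

# Sanity check: every parent code is lower than its offspring's code.
stopifnot(all(out$sire < out$id), all(out$dam < out$id))
```

The final stopifnot() would also flag records where a parent's date of birth is not earlier than its offspring's, which this sketch does not try to repair.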
Ron Crump wrote:
> I need to export the data for use in another program that
> requires the pedigree to be coded as integers, increasing
> with date of birth (therefore sire and dam always have
> lower identifiers than their offspring) and with missing
> values coded as 0.
>
> How would I go about doing this?

Hi Ron,

Without the genealogical coding system for the output, I can only make a guess. It seems as though you are going from a series of records for which the index is the individual, followed by fields containing sire, dam and date of birth (perhaps not in that order). I think you want to transform this into a network (maybe hierarchical unless consanguinity intervenes) with individuals coded as positive integers (and maybe some or all of the original information attached to those identifiers).

At a guess, I would recode the birthdates as integers, preserving the order and including a rule for breaking ties. Assuming that you want an inverted tree for each individual, construct a linked list beginning with the individual with two pointers to the parents (their integer identifiers). Each parent has two links pointing to their parents, and so on. Whenever a pointer is zero, the linking stops. I don't know whether this can be represented in any of the tree diagrams in R, but it certainly could be coded.
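One possible way to code the pointer-following idea is sketched below. It assumes a pedigree that has already been recoded to integers, with row i describing individual i and 0 meaning an unknown parent; the data and the function name ancestors() are hypothetical.

```r
# Hypothetical integer-coded pedigree: row i describes individual i,
# and 0 marks an unknown parent.
ped <- data.frame(
  id   = 1:5,
  sire = c(0L, 0L, 1L, 3L, 3L),
  dam  = c(0L, 0L, 0L, 2L, 4L)
)

# Follow the sire and dam "pointers" upwards; a branch stops at a 0.
# The recursion terminates because parents always have lower codes.
ancestors <- function(i, ped) {
  parents <- c(ped$sire[i], ped$dam[i])
  parents <- parents[parents != 0L]
  found <- parents
  for (p in parents) {
    found <- c(found, ancestors(p, ped))
  }
  sort(unique(found))
}

anc5 <- ancestors(5L, ped)  # all known ancestors of individual 5
```

Here the data frame itself plays the role of the linked list, since the sire and dam columns hold the row numbers of the parents.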
I think a bit more information for non-genealogists about the formats might elicit a more specific answer.

> And a second, simpler related question, if I have a column with
> n different values (may be strings or non-sequential integers)
> identifying levels (possibly with repeated occurrences), how
> can I recode them to be sequential from 1 to n?
>
> I can solve both problems in fortran, so could use loops to
> do it in R, but feel there should be quicker, more elegant,
> "more R" solution.

Sounds like "sort".

Jim
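For the second question, a minimal sketch using sort(), unique() and match() might look like this. The vector x is made-up example data, and whether the codes should follow sorted order or order of first appearance is an assumption either way, since the question does not say.

```r
# Made-up example column with repeated values.
x <- c("pig", "cow", "pig", "emu", "cow")

# Codes follow the sorted order of the distinct values:
# cow = 1, emu = 2, pig = 3.
codes_sorted <- match(x, sort(unique(x)))

# Codes follow the order of first appearance instead:
# pig = 1, cow = 2, emu = 3.
codes_seen <- match(x, unique(x))

# factor() sorts its levels by default, so this agrees with codes_sorted.
codes_factor <- as.integer(factor(x))
```

Note that rank(x, ties.method = "first") gives repeated values different ranks, so it yields codes from 1 to n only when no value is repeated.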