Irene Gallego Romero
2010-Jan-28 12:05 UTC
[R] Conditional editing of rows in a data frame
Dear R users,

I have a dataframe (main.table) with ~30,000 rows and 6 columns, of which here are a few rows:

      id chr window    gene    xp.norm xp.top
129 1_32   1     32  TAS1R1 1.28882115  FALSE
130 1_32   1     32  ZBTB48 1.28882115  FALSE
131 1_32   1     32  KLHL21 1.28882115  FALSE
132 1_32   1     32   PHF13 1.28882115  FALSE
133 1_33   1     33   PHF13 1.02727430  FALSE
134 1_33   1     33   THAP3 1.02727430  FALSE
135 1_33   1     33 DNAJC11 1.02727430  FALSE
136 1_33   1     33  CAMTA1 1.02727430  FALSE
137 1_34   1     34  CAMTA1 1.40312732   TRUE
138 1_35   1     35  CAMTA1 1.52104538  FALSE
139 1_36   1     36  CAMTA1 1.04853732  FALSE
140 1_37   1     37  CAMTA1 0.64794094  FALSE
141 1_38   1     38  CAMTA1 1.23026086   TRUE
142 1_38   1     38   VAMP3 1.23026086   TRUE
143 1_38   1     38    PER3 1.23026086   TRUE
144 1_39   1     39    PER3 1.18154967   TRUE
145 1_39   1     39    UTS2 1.18154967   TRUE
146 1_39   1     39 TNFRSF9 1.18154967   TRUE
147 1_39   1     39   PARK7 1.18154967   TRUE
148 1_39   1     39  ERRFI1 1.18154967   TRUE
149 1_40   1     40 no_gene 1.79796879  FALSE
150 1_41   1     41 SLC45A1 0.20193560  FALSE

I want to create two new columns, xp.bg and xp.n.top, using the following criteria:

If gene is the same in consecutive rows, xp.bg is the minimum value of xp.norm in those rows; if gene is not the same, xp.bg is simply the value of xp.norm for that row;

Likewise, if there's a run of contiguous xp.top = TRUE values, xp.n.top is the minimum value in that range, and if xp.top is FALSE or NA, xp.n.top is NA, or 0 (I don't care).

So, in the above example,
xp.bg for rows 136:141 should be 0.64794094, and is equal to xp.norm for all other rows,
xp.n.top for row 137 is 1.40312732, 1.18154967 for rows 141:148, and 0/NA for all other rows.

Is there a way to combine indexing and if statements or some such to accomplish this? I want to do this without using split(main.table, main.table$gene), because there's about 20,000 unique entries for gene, and one of the entries, no_gene, is repeated throughout. I thought briefly of subsetting the rows where xp.top is TRUE, but I then don't know how to set the range for min, so that it only looks at what would originally have been consecutive rows, and searching the help has not proved particularly useful.

Thanks in advance,
Irene Gallego Romero

--
Irene Gallego Romero
Leverhulme Centre for Human Evolutionary Studies
University of Cambridge
Fitzwilliam St
Cambridge
CB1 3QH
UK
email: ig247 at cam.ac.uk
On Jan 28, 2010, at 7:05 AM, Irene Gallego Romero wrote:

> I want to create two new columns, xp.bg and xp.n.top, using the
> following criteria:
>
> If gene is the same in consecutive rows, xp.bg is the minimum value of
> xp.norm in those rows; if gene is not the same, xp.bg is simply the
> value of xp.norm for that row;

Assuming that gene values are adjacent in a dataframe named df1, then this would work:

    df1$xp.bg <- with(df1, ave(xp.norm, gene, FUN=min))

> Likewise, if there's a run of contiguous xp.top = TRUE values,
> xp.n.top is the minimum value in that range, and if xp.top is false or
> NA, xp.n.top is NA, or 0 (I don't care).

    # +1 where xp.top switches FALSE -> TRUE, -1 where it switches back, else 0
    df1$seqgrp <- c(0, diff(df1$xp.top))
    # running count of switches gives one group id per run
    df1$seqgrp2 <- cumsum(df1$seqgrp != 0)
    df1$xp.n.top <- with(df1, ave(xp.norm, seqgrp2, FUN=min))
    # blank out the rows where xp.top is FALSE
    is.na(df1$xp.n.top) <- !df1$xp.top

    > df1$xp.bg <- with(df1, ave(xp.norm, gene, FUN=min))
    > df1
          id chr window    gene   xp.norm xp.top seqgrp seqgrp2 xp.n.top     xp.bg
    129 1_32   1     32  TAS1R1 1.2888211  FALSE      0       0       NA 1.2888211
    130 1_32   1     32  ZBTB48 1.2888211  FALSE      0       0       NA 1.2888211
    131 1_32   1     32  KLHL21 1.2888211  FALSE      0       0       NA 1.2888211
    132 1_32   1     32   PHF13 1.2888211  FALSE      0       0       NA 1.0272743
    133 1_33   1     33   PHF13 1.0272743  FALSE      0       0       NA 1.0272743
    134 1_33   1     33   THAP3 1.0272743  FALSE      0       0       NA 1.0272743
    135 1_33   1     33 DNAJC11 1.0272743  FALSE      0       0       NA 1.0272743
    136 1_33   1     33  CAMTA1 1.0272743  FALSE      0       0       NA 0.6479409
    137 1_34   1     34  CAMTA1 1.4031273   TRUE      1       1 1.403127 0.6479409
    138 1_35   1     35  CAMTA1 1.5210454  FALSE     -1       2       NA 0.6479409
    139 1_36   1     36  CAMTA1 1.0485373  FALSE      0       2       NA 0.6479409
    140 1_37   1     37  CAMTA1 0.6479409  FALSE      0       2       NA 0.6479409
    141 1_38   1     38  CAMTA1 1.2302609   TRUE      1       3 1.181550 0.6479409
    142 1_38   1     38   VAMP3 1.2302609   TRUE      0       3 1.181550 1.2302609
    143 1_38   1     38    PER3 1.2302609   TRUE      0       3 1.181550 1.1815497
    144 1_39   1     39    PER3 1.1815497   TRUE      0       3 1.181550 1.1815497
    145 1_39   1     39    UTS2 1.1815497   TRUE      0       3 1.181550 1.1815497
    146 1_39   1     39 TNFRSF9 1.1815497   TRUE      0       3 1.181550 1.1815497
    147 1_39   1     39   PARK7 1.1815497   TRUE      0       3 1.181550 1.1815497
    148 1_39   1     39  ERRFI1 1.1815497   TRUE      0       3 1.181550 1.1815497
    149 1_40   1     40 no_gene 1.7979688  FALSE     -1       4       NA 1.7979688
    150 1_41   1     41 SLC45A1 0.2019356  FALSE      0       4       NA 0.2019356

And if the adjacent-gene assumption of the first request above were not met, then the first portion of this method could be used instead to create group indices.

--
David.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
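The run-based grouping that the reply alludes to can also be sketched with rle(). This is only a sketch under stated assumptions, not code from the thread: run_id() is a hypothetical helper, and the six-row df1 below is made-up toy data that merely reuses the column names from the question. It groups by runs of consecutive equal values rather than by value, so a gene that reappears later (such as no_gene) starts a new group, and an NA in xp.top ends a run instead of propagating through cumsum().

```r
# Hypothetical helper: label each run of consecutive equal values 1, 2, 3, ...
run_id <- function(x) {
  r <- rle(as.vector(x))          # as.vector() turns a factor into character
  rep(seq_along(r$lengths), r$lengths)
}

# Made-up toy data; only the column names come from the question.
df1 <- data.frame(
  gene    = c("CAMTA1", "CAMTA1", "no_gene", "PER3", "PER3", "no_gene"),
  xp.norm = c(1.40, 0.65, 1.80, 1.23, 1.18, 0.20),
  xp.top  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# xp.bg: minimum xp.norm within each run of identical consecutive genes.
df1$xp.bg <- ave(df1$xp.norm, run_id(df1$gene), FUN = min)

# xp.n.top: minimum xp.norm within each run of xp.top, kept only where TRUE.
df1$xp.n.top <- ave(df1$xp.norm, run_id(df1$xp.top), FUN = min)
df1$xp.n.top[!(df1$xp.top %in% TRUE)] <- NA   # FALSE and NA rows become NA

print(df1)
```

On this toy data the first no_gene row keeps xp.bg = 1.80, whereas grouping on gene itself would assign it 0.20, the minimum over both no_gene rows.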
If DF is your data frame then:

    DF$xp.bg <- ave(DF$xp.norm, DF$gene, FUN = min)

will create a new column such that the entry in each row has the minimum xp.norm of all rows with the same gene. ave does use split internally, but I think it would be worth trying anyways since it's only one short line of code. See help(ave).

On Thu, Jan 28, 2010 at 7:05 AM, Irene Gallego Romero <ig247 at cam.ac.uk> wrote:

> Is there a way to combine indexing and if statements or some such to
> accomplish this? I want to do this without using split(main.table,
> main.table$gene), because there's about 20,000 unique entries for
> gene, and one of the entries, no_gene, is repeated throughout.
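One way to check whether the ave() one-liner matches the "consecutive rows" criterion in the question is to look for genes that occur in more than one run. A minimal sketch, assuming made-up toy data (the DF below is not the poster's table; only the column names are taken from the thread):

```r
# Made-up toy data; only the column names come from the thread.
DF <- data.frame(
  gene    = c("CAMTA1", "CAMTA1", "no_gene", "PER3", "PER3", "no_gene"),
  xp.norm = c(1.40, 0.65, 1.80, 1.23, 1.18, 0.20)
)

# The one-liner from the reply: minimum over ALL rows sharing a gene.
DF$xp.bg <- ave(DF$xp.norm, DF$gene, FUN = min)

# Genes that appear in more than one run of consecutive rows. For these,
# the per-gene minimum can differ from the per-run minimum.
gene_runs   <- rle(as.vector(DF$gene))
split_genes <- unique(gene_runs$values[duplicated(gene_runs$values)])

print(split_genes)
print(DF[DF$gene %in% split_genes, ])
```

If split_genes comes back empty, every gene occupies a single run, so the per-gene and per-run minima coincide and the one-liner is sufficient.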