Jeff Newmiller
2016-Jul-03 18:34 UTC
[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
I still get the impression from your mixing of information types that you are thinking like this is Excel.

Perhaps something like

drug_study$admin_period <- ave( "Y" == drug_study$drug_admin, drug_study$ID, FUN=cumsum )
library(dplyr)
result0 <- ( drug_study
           %>% filter( 0 != admin_period )
           %>% group_by( ID, admin_period )
           %>% summarise( start = min( date ) )
           %>% mutate( admin_period1 = admin_period - 1 )
           )
result <- ( result0
          %>% select( -admin_period )
          %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
                        , by = c( ID="ID", admin_period="admin_period1" )
                        )
          %>% mutate( ddays = end - start )
          )
-- 
Sent from my phone. Please excuse my brevity.

On July 3, 2016 10:24:51 AM PDT, Kevin Wamae <KWamae at kemri-wellcome.org> wrote:
>HI Jeff, it's been an uphill task working with the dataset and I am not
>the first to complain. Nonetheless, data-cleaning is ongoing and since
>I cannot wait for that to get done, I decided to make the most of what
>the dataset looks like at this time. It appears the process may take a
>while.
>
>Thanks for the script. From the output, I noticed that "result"
>contains the first and last date for each of the individuals and not
>taking into account the variable "drug-admin".
>
>ID     start   end
>J1/3   1/5/09  12/25/10
>R1/3   1/4/07  12/15/08
>R10/1  1/4/07  3/5/12
>
>My aim is to pick the date, for example in 2007, where drug-admin ==
>"Y" as my start and the date in the subsequent year (2008 in this case)
>where drug-admin == "Y" as my end. Then, I should populate the variable
>"study_id" with "start" up to the entry just above the one whose date
>matches "end", as the output below shows (I hope its structure is
>maintained as I have copied it from R-Studio). The goal for now is to
>then get difference in days between "date" and "study_id" and still get
>to keep that column for "study_id" as I might use it later.
>
>From the output, it can be seen that for this individual, the dates run
>from 2007 to 2008. However, for some individuals, the dates run from
>2008-2009, 2009-2010 and so on. Therefore, I need to make the script
>deal with all the years as the dates range from 2001-2016
>
>ID    date      drug_admin  year  month  study_id
>R1/3  5/11/07   Y           2007  5      5/11/07
>R1/3  5/16/07               2007  5      5/11/07
>R1/3  5/22/07               2007  5      5/11/07
>R1/3  5/28/07               2007  5      5/11/07
>R1/3  6/5/07                2007  6      5/11/07
>R1/3  6/11/07               2007  6      5/11/07
>R1/3  6/18/07               2007  6      5/11/07
>R1/3  6/25/07               2007  6      5/11/07
>R1/3  7/2/07                2007  7      5/11/07
>R1/3  7/16/07               2007  7      5/11/07
>R1/3  7/29/07               2007  7      5/11/07
>R1/3  8/2/07                2007  8      5/11/07
>R1/3  8/7/07                2007  8      5/11/07
>R1/3  8/13/07               2007  8      5/11/07
>R1/3  9/18/07               2007  9      5/11/07
>R1/3  9/24/07               2007  9      5/11/07
>R1/3  10/6/07               2007  10     5/11/07
>R1/3  10/8/07               2007  10     5/11/07
>R1/3  10/15/07              2007  10     5/11/07
>R1/3  10/22/07              2007  10     5/11/07
>R1/3  10/29/07              2007  10     5/11/07
>R1/3  11/8/07               2007  11     5/11/07
>R1/3  11/12/07              2007  11     5/11/07
>R1/3  11/19/07              2007  11     5/11/07
>R1/3  11/29/07              2007  11     5/11/07
>R1/3  12/6/07               2007  12     5/11/07
>R1/3  12/10/07              2007  12     5/11/07
>R1/3  12/21/07              2007  12     5/11/07
>R1/3  1/7/08                2008  1      5/11/07
>R1/3  1/14/08               2008  1      5/11/07
>R1/3  1/21/08               2008  1      5/11/07
>R1/3  1/28/08               2008  1      5/11/07
>R1/3  2/4/08    Y           2008  2
>
>Regards
>-------------------------------------------------------------------------------
>Kevin Wamae
>
>On 7/3/16, 7:05 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>
>>result <- setNames( data.frame( aggregate( date~ID, data=drug_study,
>>FUN=min ), aggregate( date~ID, data=drug_study, FUN=max )[2] ), c(
>>"ID", "start", "end" ) )
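As a minimal sketch of what the `ave( ..., FUN=cumsum )` line in the reply above computes, the following uses a small made-up data frame (the IDs, dates and values are assumptions for illustration, not the poster's data). The comparison is wrapped in `as.numeric()` here only to make the 0/1 coding explicit; the reply passes the logical vector directly.

```r
# Toy data, invented for illustration only
drug_study <- data.frame(
  ID = c("A", "A", "A", "A", "B", "B", "B"),
  date = as.Date(c("2007-05-11", "2007-05-16", "2008-02-04", "2008-02-11",
                   "2008-01-07", "2008-03-03", "2008-03-10")),
  drug_admin = c("Y", "", "Y", "", "", "Y", ""),
  stringsAsFactors = FALSE
)

# Within each ID, the running sum steps up by one at every "Y" row,
# so each row is labelled with the administration period it falls in.
# Rows before an ID's first "Y" get period 0.
drug_study$admin_period <- ave(
  as.numeric(drug_study$drug_admin == "Y"),
  drug_study$ID,
  FUN = cumsum
)

print(drug_study)
```

Rows with `admin_period` equal to 0 precede any administration for that ID, which appears to be why the reply filters them out with `filter( 0 != admin_period )`.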
Kevin Wamae
2016-Jul-03 18:55 UTC
[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
Hi Jeff, "like this is Excel"? I don't follow. Pardon me for any mix-up.

Thanks for the code. After running it, this is the error I get:

Error: cannot join on columns 'admin_period' x 'admin_period1': index out of bounds

Regards
-------------------------------------------------------------------------------
Kevin Wamae | Ph.D. Student (IDeAL)
KEMRI-Wellcome Trust Collaborative Research Programme
Centre for Geographic Medicine Research
P.O. Box 230-80108, Kilifi, Kenya

______________________________________________________________________

This e-mail contains information which is confidential. It is intended only for the use of the named recipient. If you have received this e-mail in error, please let us know by replying to the sender, and immediately delete it from your system. Please note, that in these circumstances, the use, disclosure, distribution or copying of this information is strictly prohibited. KEMRI-Wellcome Trust Programme cannot accept any responsibility for the accuracy or completeness of this message as it has been transmitted over a public network. Although the Programme has taken reasonable precautions to ensure no viruses are present in emails, it cannot accept responsibility for any loss or damage arising from the use of the email or attachments. Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of KEMRI-Wellcome Trust Programme.
______________________________________________________________________
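One plausible reading of the error above: the left-hand table is built with `select( -admin_period )`, which drops the very column that `by = c( ..., admin_period="admin_period1" )` names as the left-hand join key. The base R sketch below only checks column names on a made-up stand-in for `result0` (its values are assumptions); it does not reproduce the dplyr error message itself.

```r
# Hypothetical stand-in for result0, with the same column names as in the thread
result0 <- data.frame(
  ID = c("A", "A"),
  admin_period = c(1, 2),
  start = as.Date(c("2007-05-11", "2008-02-04")),
  admin_period1 = c(0, 1)
)

# Columns left over after dropping admin_period,
# i.e. what select( -admin_period ) would keep
left_cols <- setdiff(names(result0), "admin_period")
print(left_cols)

# The key named on the left side of `by` is no longer available
print("admin_period" %in% left_cols)
```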
Jeff Newmiller
2016-Jul-03 19:09 UTC
[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
Typo on the second line:

result <- ( result0
          %>% select( -admin_period1 )
          %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
                        , by = c( ID="ID", admin_period="admin_period1" )
                        )
          %>% mutate( ddays = end - start )
          )
-- 
Sent from my phone. Please excuse my brevity.
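The code in this thread produces one row per administration period. The original request also asked for a per-row "study_id" column holding the start date of the period each visit falls in, plus the number of days since that start. A minimal base R sketch of that step follows, on made-up data and assuming `date` is already of class `Date`. The column name `days_since_start` and the helper variable `key` are hypothetical, and this is one possible approach rather than the method proposed in the thread.

```r
# Toy data, invented for illustration only
drug_study <- data.frame(
  ID = c("A", "A", "A", "A", "A", "B", "B"),
  date = as.Date(c("2007-05-01", "2007-05-11", "2007-05-16",
                   "2008-02-04", "2008-02-11",
                   "2009-03-02", "2009-03-09")),
  drug_admin = c("", "Y", "", "Y", "", "Y", ""),
  stringsAsFactors = FALSE
)

# Period counter per ID, as in the thread (0 = before the first "Y")
drug_study$admin_period <- ave(
  as.numeric(drug_study$drug_admin == "Y"),
  drug_study$ID,
  FUN = cumsum
)

# One combined key per (ID, period) pair, so ave() only sees groups
# that actually occur in the data
key <- paste(drug_study$ID, drug_study$admin_period)

# Earliest date within each (ID, period), computed on the numeric
# representation and converted back to Date
start_num <- ave(as.numeric(drug_study$date), key, FUN = min)
drug_study$study_id <- as.Date(start_num, origin = "1970-01-01")

# Visits before the first administration belong to no period
drug_study$study_id[drug_study$admin_period == 0] <- NA

# Days elapsed since the start of the current period
drug_study$days_since_start <- as.numeric(drug_study$date - drug_study$study_id)

print(drug_study)
```

Unlike the expected output posted earlier in the thread, this sketch also fills `study_id` on the final "Y" row of an ID (with that row's own date), since it treats every "Y" as opening a new period.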