Jeff Newmiller
2016-Jul-03 16:05 UTC
[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
Your goal of putting character representations of dates in certain rows of a column is hard to imagine a use for. Your goal of identifying start and end dates seems reasonable enough. It can be accomplished using aggregate from base R (less external dependency) or summarise from dplyr (faster, simpler syntax):

result <- setNames( data.frame( aggregate( date~ID, data=drug_study, FUN=min )
                              , aggregate( date~ID, data=drug_study, FUN=max )[2]
                              )
                  , c( "ID", "start", "end" )
                  )

or

library( dplyr )
result <- (   drug_study
          %>% group_by( ID )
          %>% summarise( start=min( date ), end=max( date ) )
          )
--
Sent from my phone. Please excuse my brevity.

On July 3, 2016 5:19:01 AM PDT, Kevin Wamae <KWamae at kemri-wellcome.org> wrote:
>Hi John, attached is the file in txt. Kindly let me know if it fails
>again..
>
>Regards
>-------------------------------------------------------------------------------
>Kevin Wame | Ph.D. Student (IDeAL)
>KEMRI-Wellcome Trust Collaborative Research Programme
>Centre for Geographic Medicine Research
>P.O. Box 230-80108, Kilifi, Kenya
>
>
>On 7/3/16, 3:16 PM, "John Kane" <jrkrideau at inbox.com> wrote:
>
>The data set did not show up. The R-help list tends to strip out most
>file types as a safety precaution. Try renaming the file from xxx.csv
>to xxx.txt and it should come through alright.
>
>John Kane
>Kingston ON Canada
>
>
>> -----Original Message-----
>> From: kwamae at kemri-wellcome.org
>> Sent: Sun, 3 Jul 2016 09:39:59 +0000
>> To: jdnewmil at dcn.davis.ca.us, r-help at r-project.org
>> Subject: Re: [R] R - Populate Another Variable Based on Multiple
>> Conditions | For a Large Dataset
>>
>> Hi Jeff, pardon me, I was surely not making it easy. I hope this time I
>> will.
>>
>> Attached is snippet of the dataset in csv format and below is the
>> R.script I have managed so far.
>>
>> -------------------------------------------------------------------------------
>> drug_study <- read.csv("drug_study.csv", header = T); head(drug_study)
>> drug_study$date <- as.Date(drug_study$date, "%m/%d/%Y")
>> drug_study$study_id <- "" #create new column
>>
>> individual <- unique (drug_study$ID) #vector of individuals
>> datalength <- dim(drug_study)[1] #number of rows in dataframe
>>
>> for (i in 1:length(individual)) {
>>   for (j in 1:datalength) {
>>     start_admin <- drug_study[c(drug_study$ID == individual[i] &
>>       drug_study$year == 2007 & drug_study$drug_admin == "Y" &
>>       drug_study$month == 5),2] #capture date of start
>>     end_admin <- drug_study[(drug_study$ID == individual[i] &
>>       drug_study$year == 2008 & drug_study$drug_admin == "Y" &
>>       drug_study$month == 2),2] #capture date of end
>>
>>     if(drug_study[j,1] == individual[i] & drug_study[j,2] > start_admin
>>        & drug_study[j,2] < end_admin) {
>>       drug_study[j,6] <- paste(start_admin) #populate respective row if condition is met
>>     }
>>   }
>> }
>> -------------------------------------------------------------------------------
>>
>> For this dataset, there exists three individuals, J1/3, R1/3, R10/1.
>>
>> The script works for the last two individuals but not J1/3 with the error
>> below:
>>
>> Error in if (drug_study[j, 1] == individual[i] & drug_study[j, 2] > start_admin & :
>>   argument is of length zero
>>
>> I figured it's because this individual's start_admin and end_admin dates
>> aren't captured because the if-loop fails. There's my first problem,
>> there are thousands of individuals with varying
>> start_admin and end_admin dates and I need a script to capture these for
>> every individual.
>>
>> Secondly, the above script is taking almost an hour to run for the entire
>> dataset, just for the individuals whose start_admin and end_admin dates
>> can be captured by the if-loop.
>>
>> I need help in coming up with a script that will tackle the problem
>> taking into account the different start_admin and end_admin dates and be
>> resourceful with regards to time.
>>
>> Regards
>> -------------------------------------------------------------------------------
>> Kevin Kariuki
>>
>> On 7/3/16, 8:42 AM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>>
>> You are making this hard on yourself by not paying attention to the Posting
>> Guide listed in the footer of every email on this list. You would
>> probably also find [1] helpful.
>>
>> [1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> On July 2, 2016 3:41:07 PM PDT, Kevin Wamae <KWamae at kemri-wellcome.org>
>> wrote:
>> >Hi Jeff, sorry for referring to you as Jennifer earlier, accept my
>> >apologies.
>> >
>> >I attached a sample dataset in the question, am afraid it must have
>> >failed to attach.
>> >
>> >I have attached it again..
>> >
>> >Regards
>> >-------------------------------------------------------------------------------
>> >Kevin Kariuki
>> >
>> >On 7/2/16, 7:37 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>> >
>> >I can understand you not wanting to supply your actual data online, but
>> >only you know what your data looks like so only you can create a
>> >simulated data set that we could show you how to work with.
>> >--
>> >Sent from my phone. Please excuse my brevity.
>> >
>> >On July 2, 2016 2:57:39 AM PDT, Kevin Wamae <KWamae at kemri-wellcome.org>
>> >wrote:
>> >>I have a drug-trial study dataset (attached image).
>> >>
>> >>Since it's a large and complex dataset (at least to me), I hope to be
>> >>as clear as possible with my question.
>> >>The dataset is from a study where individuals are given drugs and
>> >>followed up over a period spanning two consecutive years. Individuals
>> >>do not start treatment on the same day and once they start, the
>> >>variable "drug-admin" is marked "x" as well as the time they stop
>> >>treatment in the following year.
>> >>There exists another variable, "study_id", that I hope to populate as
>> >>can be seen in the dataset, with the following conditions:
>> >>
>> >>For every individual
>> >>- if the individual has entries that show they received drugs both
>> >>  on the start and end date (marked with the "x")
>> >>- if the start of drug administration falls in month == 2 | 3 and
>> >>  end of administration falls in month == 2 | 4
>> >>- then, using the date that marks the start of drug administration,
>> >>  populate the variable _"study_id"_ in all the rows that fall within the
>> >>  timeframe that the individual was given drugs but excluding the end of
>> >>  drug administration.
>> >>I have tried my level best and while I have explored several examples
>> >>online, I haven't managed to solve this. The dataset contains close to
>> >>6000 individuals spanning 10 years and my best bet was to use a loop
>> >>which keeps crashing R after running for close to 30min. I have also
>> >>read that dplyr may do the job but my attempts have been in vain.
>> >>
>> >>sample code
>> >>-------------------------------------------------------------------------------
>> >>individual <- unique (df$ID) #vector of individuals
>> >>datalength <- dim(df)[1] #number of rows in dataframe
>> >>
>> >>for (i in 1:length(individual)) {
>> >>  for (j in 1:datalength) {
>> >>start_admin <- df[(df$year == 2007] & df$drug_admin == "x" & c(df$month
>> >>== 2 | df$month == 3),1] #capture date of start
>> >>end_admin <- df[(df$year == 2008] & df$drug_admin == "x" & c(df$month
>> >>== 2 | df$month == 4),1] #capture date of end
>> >>
>> >>if(df[datalength,1] == individual(i) & df[datalength,2] > start_admin
>> >>& df[datalength,2] < end_admin) {
>> >>df[datalength,6] <- start_admin #populate respective row if condition is met
>> >>  }
>> >> }
>> >>}
>> >>-------------------------------------------------------------------------------
>> >>
>> >>Above is the code that keeps failing..
>> >>
>> >>Any help is highly appreciated....
>> >>
>> >>______________________________________________________________________
>> >>
>> >>This e-mail contains information which is confidential. It is intended
>> >>only for the use of the named recipient.
>> >>If you have received this e-mail in error, please let us know by
>> >>replying to the sender, and immediately delete it from your system.
>> >>Please note, that in these circumstances, the use, disclosure,
>> >>distribution or copying of this information is strictly prohibited.
>> >>KEMRI-Wellcome Trust Programme cannot accept any responsibility for
>> >>the accuracy or completeness of this message as it has been
>> >>transmitted over a public network. Although the Programme has taken
>> >>reasonable precautions to ensure no viruses are present in emails, it
>> >>cannot accept responsibility for any loss or damage arising from the
>> >>use of the email or attachments. Any views expressed in this message
>> >>are those of the individual sender, except where the sender
>> >>specifically states them to be the views of KEMRI-Wellcome Trust
>> >>Programme.
>> >>______________________________________________________________________
>> >>
>> >>______________________________________________
>> >>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >>https://stat.ethz.ch/mailman/listinfo/r-help
>> >>PLEASE do read the posting guide
>> >>http://www.R-project.org/posting-guide.html
>> >>and provide commented, minimal, self-contained, reproducible code.
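The "argument is of length zero" error quoted in this message can be reproduced in a few lines. The sketch below is only an illustration under stated assumptions: `toy` is a made-up stand-in for `drug_study`, and the names `start_j`, `cond`, `has_start` and `in_window` are hypothetical. It shows that a subset matching no rows has length zero, that comparing against it gives a zero-length logical, and one possible guard.

```r
# Hypothetical toy data (not the poster's file): ID "J1/3" has no "Y" row
toy <- data.frame(
  ID         = c("R1/3", "R1/3", "J1/3"),
  date       = as.Date(c("2007-05-11", "2007-05-16", "2009-01-05")),
  drug_admin = c("Y", "", ""),
  stringsAsFactors = FALSE
)

# A condition that matches no rows returns a zero-length vector
start_j <- toy$date[toy$ID == "J1/3" & toy$drug_admin == "Y"]

# Comparing against a zero-length vector gives a zero-length logical,
# which is what if() rejects with "argument is of length zero"
cond <- toy$date[3] > start_j

# One possible guard: only compare when exactly one start date was found;
# && short-circuits, so the comparison is skipped when has_start is FALSE
has_start <- length(start_j) == 1
in_window <- has_start && toy$date[3] > start_j
```

A guard like this only avoids the error. It does not address the run time of the nested loops, which the grouped approaches elsewhere in the thread are meant to replace.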
Kevin Wamae
2016-Jul-03 17:24 UTC
[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
Hi Jeff, it's been an uphill task working with the dataset and I am not the first to complain. Nonetheless, data-cleaning is ongoing and since I cannot wait for that to get done, I decided to make the most of what the dataset looks like at this time. It appears the process may take a while.

Thanks for the script. From the output, I noticed that "result" contains the first and last date for each of the individuals and does not take into account the variable "drug-admin".

ID     start   end
J1/3   1/5/09  12/25/10
R1/3   1/4/07  12/15/08
R10/1  1/4/07  3/5/12

My aim is to pick the date, for example in 2007, where drug-admin == "Y" as my start and the date in the subsequent year (2008 in this case) where drug-admin == "Y" as my end. Then, I should populate the variable "study_id" with "start" up to the entry just above the one whose date matches "end", as the output below shows (I hope its structure is maintained as I have copied it from R-Studio). The goal for now is to then get the difference in days between "date" and "study_id" and still get to keep that column for "study_id" as I might use it later.

From the output, it can be seen that for this individual, the dates run from 2007 to 2008. However, for some individuals, the dates run from 2008-2009, 2009-2010 and so on. Therefore, I need to make the script deal with all the years as the dates range from 2001-2016.

ID    date      drug_admin  year  month  study_id
R1/3  5/11/07   Y           2007  5      5/11/07
R1/3  5/16/07               2007  5      5/11/07
R1/3  5/22/07               2007  5      5/11/07
R1/3  5/28/07               2007  5      5/11/07
R1/3  6/5/07                2007  6      5/11/07
R1/3  6/11/07               2007  6      5/11/07
R1/3  6/18/07               2007  6      5/11/07
R1/3  6/25/07               2007  6      5/11/07
R1/3  7/2/07                2007  7      5/11/07
R1/3  7/16/07               2007  7      5/11/07
R1/3  7/29/07               2007  7      5/11/07
R1/3  8/2/07                2007  8      5/11/07
R1/3  8/7/07                2007  8      5/11/07
R1/3  8/13/07               2007  8      5/11/07
R1/3  9/18/07               2007  9      5/11/07
R1/3  9/24/07               2007  9      5/11/07
R1/3  10/6/07               2007  10     5/11/07
R1/3  10/8/07               2007  10     5/11/07
R1/3  10/15/07              2007  10     5/11/07
R1/3  10/22/07              2007  10     5/11/07
R1/3  10/29/07              2007  10     5/11/07
R1/3  11/8/07               2007  11     5/11/07
R1/3  11/12/07              2007  11     5/11/07
R1/3  11/19/07              2007  11     5/11/07
R1/3  11/29/07              2007  11     5/11/07
R1/3  12/6/07               2007  12     5/11/07
R1/3  12/10/07              2007  12     5/11/07
R1/3  12/21/07              2007  12     5/11/07
R1/3  1/7/08                2008  1      5/11/07
R1/3  1/14/08               2008  1      5/11/07
R1/3  1/21/08               2008  1      5/11/07
R1/3  1/28/08               2008  1      5/11/07
R1/3  2/4/08    Y           2008  2

Regards
-------------------------------------------------------------------------------
Kevin Wame

On 7/3/16, 7:05 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:

result <- setNames( data.frame( aggregate( date~ID, data=drug_study, FUN=min ), aggregate( date~ID, data=drug_study, FUN=max )[2] ), c( "ID", "start", "end" ) )
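The fill described in this message (carry the start date down to the row just above the end row, then take day differences) could be sketched in base R roughly as follows. This is a minimal sketch under assumptions that are not confirmed by the thread: rows are sorted by date within each ID, and each individual has at most one start and one end row marked "Y". The data frame `toy` and the names `n_seen`, `block`, `block_min` and `days_since_start` are made up for illustration.

```r
# Hypothetical toy data standing in for drug_study (not the poster's file)
toy <- data.frame(
  ID         = c(rep("R1/3", 5), rep("J1/3", 2)),
  date       = as.Date(c("2007-05-01", "2007-05-11", "2007-05-16",
                         "2008-01-28", "2008-02-04",
                         "2009-01-05", "2009-01-12")),
  drug_admin = c("", "Y", "", "", "Y", "", ""),
  stringsAsFactors = FALSE
)

# Running count of "Y" rows within each ID: 0 before the start row,
# 1 from the start row up to the row before the end, 2 from the end row on
n_seen <- ave(as.integer(toy$drug_admin == "Y"), toy$ID, FUN = cumsum)

# Earliest date within each (ID, count) block;
# for the count == 1 block this is the start date
block <- paste(toy$ID, n_seen)
block_min <- ave(as.numeric(toy$date), block, FUN = min)

# Keep the start date only on rows inside the administration window
toy$study_id <- as.Date(ifelse(n_seen == 1, block_min, NA), origin = "1970-01-01")

# Days elapsed since the start of administration (NA outside the window)
toy$days_since_start <- as.numeric(toy$date - toy$study_id)
```

Keeping `study_id` as a Date rather than a character string means the day difference is a plain subtraction, and individuals with no "Y" row (here "J1/3") simply get NA instead of triggering an error.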
Jeff Newmiller
2016-Jul-03 18:34 UTC
[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset
I still get the impression from your mixing of information types that you are thinking like this is Excel.

Perhaps something like

# running count of "Y" rows within each ID; 0 means before the first "Y"
drug_study$admin_period <- ave( "Y" == drug_study$drug_admin, drug_study$ID, FUN=cumsum )
library(dplyr)
# one row per ID and period, holding the date the period starts
result0 <- (   drug_study
           %>% filter( 0 != admin_period )
           %>% group_by( ID, admin_period )
           %>% summarise( start = min( date ) )
           %>% mutate( admin_period1 = admin_period - 1 )
           )
# the end of period p is the start of period p + 1
# (the left side drops admin_period1 and keeps admin_period, which the join needs)
result <- (   result0
          %>% select( -admin_period1 )
          %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
                        , by = c( ID="ID", admin_period="admin_period1" ) )
          %>% mutate( ddays = end - start )
          )
--
Sent from my phone. Please excuse my brevity.
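To see what the period numbering and the self-join in the reply above are doing, here is a small base R sketch of the same idea on made-up data. It is an illustration, not the code from the reply: `toy`, `starts`, `ends` and `periods` are hypothetical names, `merge` stands in for the dplyr join, and rows are assumed to be sorted by date within each ID.

```r
# Hypothetical toy data: two individuals, each with a start and an end "Y" row
toy <- data.frame(
  ID         = c("A", "A", "A", "A", "B", "B", "B"),
  date       = as.Date(c("2007-05-11", "2007-06-05", "2008-02-04", "2008-03-01",
                         "2008-03-10", "2008-04-02", "2009-02-20")),
  drug_admin = c("Y", "", "Y", "", "Y", "", "Y"),
  stringsAsFactors = FALSE
)

# Number the periods: the counter goes up by one at every "Y" row within an ID
toy$admin_period <- ave(as.integer(toy$drug_admin == "Y"), toy$ID, FUN = cumsum)

# Each "Y" row is the start of its numbered period
starts <- toy[toy$drug_admin == "Y", c("ID", "date", "admin_period")]
names(starts)[2] <- "start"

# The end of period p is the start of period p + 1, so shift the period number down
ends <- data.frame(ID = starts$ID,
                   admin_period = starts$admin_period - 1L,
                   end = starts$start,
                   stringsAsFactors = FALSE)

# Pair each period with the following one; the last period per ID has no match and drops out
periods <- merge(starts, ends, by = c("ID", "admin_period"))
periods$ddays <- as.numeric(periods$end - periods$start)
```

Because the pairing is driven by the running count rather than by hard-coded years or months, the same steps apply whether an individual's dates run 2007-2008 or 2009-2010.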