thr3ads.net - R help - [R] R - Populate Another Variable Based on Multiple Conditions

Kevin Wamae

2016-Jul-02 22:41 UTC

[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

Hi Jeff, sorry for referring to you as Jennifer earlier; please accept my
apologies.

I attached a sample dataset to the question, but I am afraid it must have
failed to attach.

I have attached it again.


Regards
-------------------------------------------------------------------------------
Kevin Kariuki
 

On 7/2/16, 7:37 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:

I can understand you not wanting to supply your actual data online, but only you
know what your data looks like so only you can create a simulated data set that
we could show you how to work with.
-- 
Sent from my phone. Please excuse my brevity.

On July 2, 2016 2:57:39 AM PDT, Kevin Wamae <KWamae at kemri-wellcome.org> wrote:
>I have a drug-trial study dataset (attached image).
>
>Since it's a large and complex dataset (at least to me), I hope to be
>as clear as possible with my question.
>The dataset is from a study where individuals are given drugs and
>followed up over a period spanning two consecutive years. Individuals
>do not start treatment on the same day. Once they start, the variable
>"drug_admin" is marked "x", and it is marked "x" again at the time
>they stop treatment in the following year.
>There exists another variable, "study_id", that I hope to populate as
>can be seen in the dataset, with the following conditions:
>
>For every individual:
>-    if the individual has entries that show they received drugs both
>on the start and end date (marked with the "x"),
>-    if the start of drug administration falls in month 2 or 3 and the
>end of administration falls in month 2 or 4,
>-    then, using the date that marks the start of drug administration,
>populate the variable "study_id" in all the rows that fall within the
>timeframe that the individual was given drugs but excluding the end of
>drug administration.
>I have tried my level best and while I have explored several examples
>online, I haven't managed to solve this. The dataset contains close to
>6000 individuals spanning 10 years and my best bet was to use a loop
>which keeps crashing R after running for close to 30min. I have also
>read that dplyr may do the job but my attempts have been in vain.
>
>sample code
>-------------------------------------------------------------------------------------------------------------------------------------------------------------------
>individual <- unique(df$ID)  # vector of individuals
>datalength <- dim(df)[1]     # number of rows in dataframe
>
>for (i in 1:length(individual)) {
>  for (j in 1:datalength) {
>    # capture date of start
>    start_admin <- df[df$year == 2007 & df$drug_admin == "x" &
>                      (df$month == 2 | df$month == 3), 1]
>    # capture date of end
>    end_admin <- df[df$year == 2008 & df$drug_admin == "x" &
>                    (df$month == 2 | df$month == 4), 1]
>
>    # populate respective row if condition is met
>    if (df[datalength, 1] == individual[i] & df[datalength, 2] >= start_admin &
>        df[datalength, 2] < end_admin) {
>      df[datalength, 6] <- start_admin
>    }
>  }
>}
>
>-------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>Above is the code that keeps failing..
>
>Any help is highly appreciated....
>
>
>______________________________________________________________________
>
>This e-mail contains information which is confidential. It is intended
>only for the use of the named recipient. If you have received this
>e-mail in error, please let us know by replying to the sender, and
>immediately delete it from your system.  Please note, that in these
>circumstances, the use, disclosure, distribution or copying of this
>information is strictly prohibited. KEMRI-Wellcome Trust Programme
>cannot accept any responsibility for the  accuracy or completeness of
>this message as it has been transmitted over a public network. Although
>the Programme has taken reasonable precautions to ensure no viruses are
>present in emails, it cannot accept responsibility for any loss or
>damage arising from the use of the email or attachments. Any views
>expressed in this message are those of the individual sender, except
>where the sender specifically states them to be the views of
>KEMRI-Wellcome Trust Programme.
>______________________________________________________________________
>
>
>------------------------------------------------------------------------
>
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.



-------------- next part --------------
A non-text attachment was scrubbed...
Name: XsOgd.png
Type: image/png
Size: 100935 bytes
Desc: XsOgd.png
URL:
<https://stat.ethz.ch/pipermail/r-help/attachments/20160702/c493765e/attachment.png>

Jeff Newmiller

2016-Jul-03 05:42 UTC

[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

You are making this hard on yourself by not paying attention to the Posting
Guide listed in the footer of every email on this list. You would probably
also find [1] helpful.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
-- 
Sent from my phone. Please excuse my brevity.
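
As an illustration of the kind of reproducible example the Posting Guide and
[1] ask for, here is a minimal sketch. The column names (ID, date, year,
month, drug_admin) are taken from the code posted in this thread; the IDs and
all values are invented for illustration and are not the study data. Since
the list scrubs non-text attachments, the output of dput() can be pasted
straight into the body of an email.

```r
# Hypothetical toy data: column names follow the thread, values are invented.
toy <- data.frame(
  ID         = rep(c("A1", "B2"), each = 4),
  date       = as.Date(c("2007-02-10", "2007-06-01", "2008-01-15", "2008-04-03",
                         "2007-03-05", "2007-09-20", "2008-02-11", "2008-05-30")),
  drug_admin = c("x", "", "", "x",
                 "x", "", "x", ""),
  stringsAsFactors = FALSE
)

# Derive year and month from the date so the three columns cannot disagree
toy$year  <- as.integer(format(toy$date, "%Y"))
toy$month <- as.integer(format(toy$date, "%m"))

# dput() prints R code that recreates the object, including column types;
# paste that output into the email instead of attaching a file or image.
dput(toy)
```

A handful of rows that cover each case of interest (a complete start/end
pair, a missing end mark, a start in the wrong month) is usually enough.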


Kevin Wamae

2016-Jul-03 09:39 UTC

[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

Hi Jeff, pardon me, I was surely not making it easy. I hope this time I will.

Attached is a snippet of the dataset in CSV format, and below is the R script
I have managed so far.

-----------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------

drug_study <- read.csv("drug_study.csv", header = TRUE)
head(drug_study)
drug_study$date <- as.Date(drug_study$date, "%m/%d/%Y")
drug_study$study_id <- ""  # create new column

individual <- unique(drug_study$ID)  # vector of individuals
datalength <- dim(drug_study)[1]     # number of rows in dataframe

for (i in 1:length(individual)) {
  for (j in 1:datalength) {
    # capture date of start
    start_admin <- drug_study[c(drug_study$ID == individual[i] &
                                drug_study$year == 2007 &
                                drug_study$drug_admin == "Y" &
                                drug_study$month == 5), 2]
    # capture date of end
    end_admin <- drug_study[(drug_study$ID == individual[i] &
                             drug_study$year == 2008 &
                             drug_study$drug_admin == "Y" &
                             drug_study$month == 2), 2]

    if (drug_study[j, 1] == individual[i] & drug_study[j, 2] >= start_admin &
        drug_study[j, 2] < end_admin) {
      # populate respective row if condition is met
      drug_study[j, 6] <- paste(start_admin)
    }
  }
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this dataset there are three individuals: J1/3, R1/3 and R10/1.

The script works for the last two individuals but fails for J1/3 with the
error below:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Error in if (drug_study[j, 1] == individual[i] & drug_study[j, 2] >= start_admin &  :
  argument is of length zero
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
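
The error can be reproduced in isolation. The sketch below assumes (this is
a guess from the hard-coded filters in the script, not something confirmed
in the thread) that no row for J1/3 matches year 2007, month 5 and
drug_admin "Y", so the subset that should yield start_admin comes back
empty. The variable row_date is a hypothetical stand-in for drug_study[j, 2].

```r
# An empty Date vector, standing in for a subset that matched no rows
start_admin <- as.Date(character(0))
row_date    <- as.Date("2007-06-01")

# Comparing against a zero-length vector yields a zero-length logical
cond <- row_date >= start_admin
length(cond)   # 0

# if () needs exactly one TRUE/FALSE value, so a zero-length condition
# raises an error; tryCatch() captures the message instead of stopping
res <- tryCatch(
  if (cond) "inside window" else "outside window",
  error = function(e) conditionMessage(e)
)
print(res)

# Guard: test the length first; && stops evaluating once the left side
# is FALSE, so the empty comparison is never reached
has_start <- length(start_admin) == 1
in_window <- has_start && row_date >= start_admin
in_window   # FALSE
```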

I figured it's because this individual's start_admin and end_admin dates
aren't captured, so the if statement fails. That's my first problem: there
are thousands of individuals with varying start_admin and end_admin dates,
and I need a script to capture these for every individual.
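
One possible way to capture the dates per individual without hard-coding a
year or month is to take the earliest and latest marked row for each ID. This
is only a sketch under stated assumptions: the toy data and the IDs P1 to P3
are invented, each individual is assumed to have at most one start mark and
one end mark, and the marker is assumed to be "Y" as in the script above.

```r
# Hypothetical toy data; only the column names come from the thread
drug_study <- data.frame(
  ID   = rep(c("P1", "P2", "P3"), each = 4),
  date = as.Date(c(
    "2007-05-10", "2007-08-01", "2008-02-14", "2008-03-01",    # P1
    "2007-03-02", "2007-07-15", "2008-04-20", "2008-06-01",    # P2
    "2007-02-11", "2007-09-09", "2008-01-05", "2008-05-05")),  # P3
  drug_admin = c("Y", "", "Y", "",
                 "Y", "", "Y", "",
                 "Y", "", "",  ""),   # P3 has a start mark but no end mark
  stringsAsFactors = FALSE
)

# Keep only the marked rows and group their dates by individual
marked <- drug_study[drug_study$drug_admin == "Y", ]
by_id  <- split(marked$date, as.character(marked$ID))

# Helper: apply f to each group via numeric days, then restore the Date class
date_per_id <- function(f) {
  days <- vapply(by_id, function(d) as.numeric(f(d)), numeric(1))
  as.Date(unname(days), origin = "1970-01-01")
}

# One row per individual: earliest mark, latest mark, number of marks
windows <- data.frame(
  ID          = names(by_id),
  start_admin = date_per_id(min),
  end_admin   = date_per_id(max),
  n_marks     = unname(lengths(by_id)),
  stringsAsFactors = FALSE
)

# Individuals without both marks cannot define a treatment window
windows$complete <- windows$n_marks == 2
print(windows)
```

Individuals with complete == FALSE can then be inspected separately instead
of stopping the whole run.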

Secondly, the above script takes almost an hour to run on the entire dataset,
and that is just for the individuals whose start_admin and end_admin dates
can be captured.

I need help in coming up with a script that tackles the problem, takes into
account the different start_admin and end_admin dates, and is efficient with
regard to run time.
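
A vectorised approach is one way to avoid the nested loop, which visits every
row once per individual. The sketch below is not a confirmed solution from
the thread. It assumes the eligibility rules from the original question
(start in month 2 or 3, end in month 2 or 4 of the following year, both marks
present), the "Y" marker from the script above, and invented toy data with
hypothetical IDs P1 to P3. The month sets are plain vectors and can be
changed.

```r
# Hypothetical toy data; only the column names come from the thread
drug_study <- data.frame(
  ID   = c(rep("P1", 5), rep("P2", 3), rep("P3", 3)),
  date = as.Date(c(
    "2007-03-02", "2007-07-15", "2008-01-10", "2008-04-20", "2008-06-01",
    "2007-05-10", "2007-08-01", "2008-02-14",   # P2 starts in month 5
    "2007-02-11", "2007-09-09", "2008-01-05")), # P3 has no end mark
  drug_admin = c("Y", "", "", "Y", "",
                 "Y", "", "Y",
                 "Y", "", ""),
  stringsAsFactors = FALSE
)
drug_study$study_id <- ""

# Earliest and latest marked date per individual
marked <- drug_study[drug_study$drug_admin == "Y", ]
by_id  <- split(marked$date, as.character(marked$ID))
to_date <- function(x) as.Date(x, origin = "1970-01-01")
start <- to_date(vapply(by_id, function(d) as.numeric(min(d)), numeric(1)))
end   <- to_date(vapply(by_id, function(d) as.numeric(max(d)), numeric(1)))

# Eligibility per individual (assumed rules, see the note above)
month_of <- function(d) as.integer(format(d, "%m"))
year_of  <- function(d) as.integer(format(d, "%Y"))
ok <- lengths(by_id) == 2 &
  month_of(start) %in% c(2, 3) &
  month_of(end)   %in% c(2, 4) &
  year_of(end) == year_of(start) + 1L

# Look up each row's window by ID; idx is NA if the ID has no marked rows
idx       <- match(as.character(drug_study$ID), names(by_id))
row_start <- start[idx]
row_end   <- end[idx]

# Rows from the start date up to, but excluding, the end date
fill <- !is.na(idx) & ok[idx] &
  drug_study$date >= row_start & drug_study$date < row_end
drug_study$study_id[fill] <- format(row_start[fill])
print(drug_study)
```

Each step here works on whole columns, so the cost grows roughly with the
number of rows rather than with rows times individuals.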

Regards
-------------------------------------------------------------------------------
Kevin Kariuki
