thr3ads.net - R help - [R] R - Populate Another Variable Based on Multiple Conditions

Jeff Newmiller

2016-Jul-03 21:43 UTC

[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

There are a great many hits when I search on the keywords "kaplan meier
plot R"... so my first reaction is that you should be referring to some of
the existing packages for doing this type of analysis. I do not do this type of
analysis normally, so am probably not your best helper... perhaps someone else
will chime in if you show that you have read some existing KM examples.

My second reaction is that if you want to avoid losing records you should also
avoid adding records. Your example extends from the first matching date to and
including the next matching date, which conflicts with analysis of successive
treatment periods. You may have a good reason for doing this, but in my
experience this is usually a mistake.

Finally, I think you should more closely study the use of the ave function that
I already used if you want to work with the data in its original form. It should
not be too difficult to generate your diff_days column using ave if you have the
admin_period column that I showed you how to make.
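
A minimal sketch of the ave approach mentioned above, assuming the rows are
sorted by date within each ID and that date is already of class Date. The toy
values are made up; drug_study, drug_admin, admin_period and diff_days follow
the names used in this thread, and pasting ID and admin_period into one
grouping key is only one possible way to group by both columns.

```r
# Toy data modelled on the thread's example (values invented for illustration).
drug_study <- data.frame(
  ID         = c("R1/3", "R1/3", "R1/3", "R1/3", "R1/3", "R2/1", "R2/1"),
  date       = as.Date(c("2007-05-11", "2007-05-22", "2008-01-14",
                         "2008-02-04", "2008-02-11",
                         "2007-05-15", "2007-05-20")),
  drug_admin = c("Y", "", "", "Y", "", "", "Y"),
  stringsAsFactors = FALSE
)

# Running count of "Y" rows within each ID: 0 before the first
# administration, then 1, 2, ... for successive treatment periods.
drug_study$admin_period <- ave("Y" == drug_study$drug_admin,
                               drug_study$ID, FUN = cumsum)

# One grouping key per (ID, admin_period) pair, so that ave only ever
# sees combinations that actually occur in the data.
period_key <- paste(drug_study$ID, drug_study$admin_period)

# Days elapsed since the first date of the current period, written back
# into the original data frame (no rows added or dropped).
drug_study$diff_days <- ave(as.numeric(drug_study$date), period_key,
                            FUN = function(d) d - min(d))

# Rows before the first administration belong to no treatment period.
drug_study$diff_days[drug_study$admin_period == 0] <- NA

print(drug_study)
```

Each "Y" row starts a new period, so its diff_days is 0 and the preceding
period ends on the row just above it.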
-- 
Sent from my phone. Please excuse my brevity.

On July 3, 2016 1:47:17 PM PDT, Kevin Wamae <KWamae at kemri-wellcome.org> wrote:
>Hi Bert, my first task is to make a Kaplan Meier Plot to evaluate the
>risk of developing disease in the treated vs the non-treated
>individuals. I therefore figured it might be easier to compute dates
>first as any further analysis will be based on time, in this case days.
>I keep getting recommendations on how to tweak my analysis and it keeps
>coming down to dates between the start of drug administration and the
>end of it.
>
>Can you suggest an "easier" way to go about this?
>
>Regards
>-------------------------------------------------------------------------------
>Kevin Wame 
> 
>
>On 7/3/16, 11:28 PM, "Bert Gunter" <bgunter.4567 at gmail.com> wrote:
>
>I haven't followed this thread closely, but if it's not too late, I
>might suggest that you stop worrying about how you want your data
>frame to look and start worrying about how you want to display/analyze
>your data. As Jeff suggested, you and your supervisor are probably
>being driven by paradigms from Excel, SPSS, or whatever that are
>simply unnecessary for R. My guess would be that if you explained the
>sort of analyses/plots you wish to do, you will find it can be done
>fairly directly from your existing data. At the very least it would
>give Jeff and other helpeRs a better idea of what you might need
>rather than what you and your supervisor think you need.
>
>
>Cheers,
>Bert
>
>
>Bert Gunter
>
>"The trouble with having an open mind is that people keep coming along
>and sticking things into it."
>-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
>On Sun, Jul 3, 2016 at 1:08 PM, Kevin Wamae <KWamae at kemri-wellcome.org> wrote:
>> Hi Jeff, it works well on a dataset with 100000 rows and I figure
>> it will work well with the "real" dataset. You've been of great help
>> and I am starting to make headway.
>>
>> It creates a new dataframe (result), as shown below, that doesn't
>> quite have the result I want.
>>
>> ID      admin_period  start     end       ddays
>> J1/3    1             5/11/07   8/13/07   94
>> J1/3    2             8/13/07   11/12/07  91
>> J1/3    3             11/12/07  2/4/08    84
>> J1/3    4             2/4/08    5/5/08    91
>> J1/3    5             5/5/08    5/4/09    364
>> J1/3    6             5/4/09    5/17/10   378
>> J1/3    7             5/17/10   5/16/11   364
>> J10/1   1             5/11/07   8/13/07   94
>> J10/1   2             8/13/07   11/12/07  91
>> J10/1   3             11/12/07  2/4/08    84
>> J10/1   4             2/4/08    5/5/08    91
>> J10/1   5             5/5/08    5/8/09    368
>> J10/1   6             5/8/09    5/17/10   374
>> J10/1   7             5/17/10   5/16/11   364
>> J102/1  1             5/15/07   8/15/07   92
>> J102/1  2             8/15/07   11/13/07  90
>> J102/1  3             11/13/07  2/5/08    84
>> J102/1  4             2/5/08    5/6/08    91
>> J102/1  5             5/6/08    5/5/09    364
>> J102/1  6             5/5/09    5/19/10   379
>>
>> My supervisor doesn't want me to create a new dataset; she's afraid I
>> might lose some data, and I cannot fight that.
>>
>> I might be mixing things up, which I think is what you alluded to
>> earlier.
>>
>> After consultation with my supervisor, this is what we've agreed. For
>> every individual, given the start and end date, create a new column
>> (say, diff_days) and for every row that falls within the range of start
>> and end date, get the difference between the date in that row and the
>> start date and put it in the diff_days column. Below is an example of
>> the result. As can be seen, 5/11/2007 is the start while 2/4/2008 is
>> the end. diff_days has been populated excluding the end date because
>> that is the start of the study in 2008 that will continue into 2009;
>> thus from 2/4/2008 I should compute diff_days until 2009, and so on
>> (I hope this makes sense).
>>
>> ID      date    drug_admin      year    month   diff_days
>> R1/3    5/11/2007       Y       2007    5       0
>> R1/3    5/16/2007               2007    5       6
>> R1/3    5/22/2007               2007    5       11
>> R1/3    5/28/2007               2007    5       17
>> R1/3    1/14/2008               2008    1       248
>> R1/3    1/21/2008               2008    1       255
>> R1/3    1/28/2008               2008    1       263
>> R1/3    2/4/2008        Y       2008    2
>>
>>
>> Regards
>>
>-------------------------------------------------------------------------------
>> Kevin Wame
>>
>>
>> On 7/3/16, 10:09 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>>
>> Typo on the second line
>>
>> result <- (   result0
>>           %>% select( -admin_period1 )
>>           %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
>>                         , by = c( ID="ID", admin_period="admin_period1" )
>>                         )
>>           %>% mutate( ddays = end - start )
>>           )
>> --
>> Sent from my phone. Please excuse my brevity.
>>
>> On July 3, 2016 11:55:14 AM PDT, Kevin Wamae <KWamae at kemri-wellcome.org> wrote:
>>>Hi Jeff, "like it's Excel", I don't follow. Pardon me for any mix-up.
>>>
>>>Thanks for the code.  After running it, this is the error I get.
>>>
>>>Error: cannot join on columns 'admin_period' x 'admin_period1': index
>>>out of bounds
>>>
>>>Regards
>>>-------------------------------------------------------------------------------
>>>Kevin Wame | Ph.D. Student (IDeAL)
>>>KEMRI-Wellcome Trust Collaborative Research Programme
>>>Centre for Geographic Medicine Research
>>>P.O. Box 230-80108, Kilifi, Kenya
>>>
>>>
>>>On 7/3/16, 9:34 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>>>
>>>I still get the impression from your mixing of information types that
>>>you are thinking like this is Excel.
>>>
>>>Perhaps something like
>>>
>>>drug_study$admin_period <- ave( "Y" == drug_study$drug_admin,
>>>                                drug_study$ID, FUN=cumsum )
>>>library(dplyr)
>>>result0 <- (   drug_study
>>>          %>% filter( 0 != admin_period )
>>>          %>% group_by( ID, admin_period )
>>>          %>% summarise( start = min( date ) )
>>>          %>% mutate( admin_period1 = admin_period -1 )
>>>          )
>>>result <- (   result0
>>>          %>% select( -admin_period )
>>>          %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
>>>                        , by = c( ID="ID", admin_period="admin_period1" )
>>>                        )
>>>          %>% mutate( ddays = end - start )
>>>          )
>>>--
>>>Sent from my phone. Please excuse my brevity.
>>>
>>>On July 3, 2016 10:24:51 AM PDT, Kevin Wamae
>>><KWamae at kemri-wellcome.org> wrote:
>>>>Hi Jeff, it's been an uphill task working with the dataset and I am not
>>>>the first to complain. Nonetheless, data-cleaning is ongoing and since
>>>>I cannot wait for that to get done, I decided to make the most of what
>>>>the dataset looks like at this time. It appears the process may take a
>>>>while.
>>>>
>>>>Thanks for the script. From the output, I noticed that "result"
>>>>contains the first and last date for each of the individuals and does
>>>>not take into account the variable "drug_admin".
>>>>
>>>>ID        start               end
>>>>J1/3      1/5/09      12/25/10
>>>>R1/3      1/4/07      12/15/08
>>>>R10/1     1/4/07      3/5/12
>>>>
>>>>My aim is to pick the date, for example in 2007, where drug_admin ==
>>>>"Y" as my start and the date in the subsequent year (2008 in this case)
>>>>where drug_admin == "Y" as my end. Then, I should populate the variable
>>>>"study_id" with "start" up to the entry just above the one whose date
>>>>matches "end", as the output below shows (I hope its structure is
>>>>maintained as I have copied it from R-Studio). The goal for now is to
>>>>then get the difference in days between "date" and "study_id" and still
>>>>keep that column for "study_id" as I might use it later.
>>>>
>>>>From the output, it can be seen that for this individual, the dates run
>>>>from 2007 to 2008. However, for some individuals, the dates run from
>>>>2008-2009, 2009-2010 and so on. Therefore, I need to make the script
>>>>deal with all the years as the dates range from 2001-2016.
>>>>
>>>>ID    date      drug_admin  year  month  study_id
>>>>R1/3  5/11/07   Y           2007  5      5/11/07
>>>>R1/3  5/16/07               2007  5      5/11/07
>>>>R1/3  5/22/07               2007  5      5/11/07
>>>>R1/3  5/28/07               2007  5      5/11/07
>>>>R1/3  6/5/07                2007  6      5/11/07
>>>>R1/3  6/11/07               2007  6      5/11/07
>>>>R1/3  6/18/07               2007  6      5/11/07
>>>>R1/3  6/25/07               2007  6      5/11/07
>>>>R1/3  7/2/07                2007  7      5/11/07
>>>>R1/3  7/16/07               2007  7      5/11/07
>>>>R1/3  7/29/07               2007  7      5/11/07
>>>>R1/3  8/2/07                2007  8      5/11/07
>>>>R1/3  8/7/07                2007  8      5/11/07
>>>>R1/3  8/13/07               2007  8      5/11/07
>>>>R1/3  9/18/07               2007  9      5/11/07
>>>>R1/3  9/24/07               2007  9      5/11/07
>>>>R1/3  10/6/07               2007  10     5/11/07
>>>>R1/3  10/8/07               2007  10     5/11/07
>>>>R1/3  10/15/07              2007  10     5/11/07
>>>>R1/3  10/22/07              2007  10     5/11/07
>>>>R1/3  10/29/07              2007  10     5/11/07
>>>>R1/3  11/8/07               2007  11     5/11/07
>>>>R1/3  11/12/07              2007  11     5/11/07
>>>>R1/3  11/19/07              2007  11     5/11/07
>>>>R1/3  11/29/07              2007  11     5/11/07
>>>>R1/3  12/6/07               2007  12     5/11/07
>>>>R1/3  12/10/07              2007  12     5/11/07
>>>>R1/3  12/21/07              2007  12     5/11/07
>>>>R1/3  1/7/08                2008  1      5/11/07
>>>>R1/3  1/14/08               2008  1      5/11/07
>>>>R1/3  1/21/08               2008  1      5/11/07
>>>>R1/3  1/28/08               2008  1      5/11/07
>>>>R1/3  2/4/08    Y           2008  2
>>>>
>>>>
>>>>Regards
>>>>-------------------------------------------------------------------------------
>>>>Kevin Wame
>>>>
>>>>###############################################################
>>>>
>>>>###############################################################
>>>>
>>>>
>>>>
>>>>On 7/3/16, 7:05 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>>>>
>>>>result <- setNames( data.frame( aggregate( date~ID, data=drug_study, FUN=min )
>>>>                              , aggregate( date~ID, data=drug_study, FUN=max )[2] )
>>>>                  , c( "ID", "start", "end" ) )
>>>>
>>>>
>>>>______________________________________________________________________
>>>>
>>>>This e-mail contains information which is confidential. It is intended
>>>>only for the use of the named recipient. If you have received this
>>>>e-mail in error, please let us know by replying to the sender, and
>>>>immediately delete it from your system.  Please note, that in these
>>>>circumstances, the use, disclosure, distribution or copying of this
>>>>information is strictly prohibited. KEMRI-Wellcome Trust Programme
>>>>cannot accept any responsibility for the  accuracy or completeness of
>>>>this message as it has been transmitted over a public network. Although
>>>>the Programme has taken reasonable precautions to ensure no viruses are
>>>>present in emails, it cannot accept responsibility for any loss or
>>>>damage arising from the use of the email or attachments. Any views
>>>>expressed in this message are those of the individual sender, except
>>>>where the sender specifically states them to be the views of
>>>>KEMRI-Wellcome Trust Programme.
>>>>______________________________________________________________________
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

Bert Gunter

2016-Jul-04 06:32 UTC

[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

A Kaplan-Meier plot requires for each individual (in each treatment
group, if there is more than one):

1. Survival time, which in your case appears to mean time without disease;
2. Status at end of time on study: whether the individual was censored
(still without disease) or died (in your case, was diseased) on the
last date they are seen in the study.

AFAICT, the 2nd piece of information is not present in your data; if
this is so, then you cannot do the K-M plot or, indeed, any survival
analysis. That is, you can quit the analysis right now.

If you have the status, where is it?

If, for example, the last date for each individual is the date at
which disease is first seen, then you can simply convert the date
column to the Date class with ?as.Date (the year and month columns
appear to be useless as they repeat info already available in the date
columns), and then:

survtimes_byID <- with(datasetname, tapply(date, ID,
                       function(x) diff(range(x))))

will give you a vector of survival times (in days), one per ID. See ?with,
?tapply for details.

If the status info is in some other form, then this advice should be
ignored of course and you have to incorporate it into your data in
some other way.
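
As a rough sketch of the case described above (the last date for each
individual is either the date disease is first seen or the date of
censoring), the two required pieces can be assembled per ID and passed to
survfit from the survival package. Everything in the toy data frame is
invented for illustration: visits, treated and especially the diseased
column are hypothetical names, since the thread's data show no status
information.

```r
library(survival)  # recommended package, included with standard R installs

# Made-up visit records; 'diseased' is TRUE on the visit where disease
# was first seen (a hypothetical column, not present in the thread's data).
visits <- data.frame(
  ID       = c("A", "A", "A", "B", "B", "C", "C", "C"),
  date     = c("5/11/2007", "6/11/2007", "8/13/2007",
               "5/11/2007", "7/2/2007",
               "5/15/2007", "6/18/2007", "9/24/2007"),
  treated  = c("Y", "Y", "Y", "N", "N", "Y", "Y", "Y"),
  diseased = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Convert the character dates to the Date class (month/day/4-digit year).
visits$date <- as.Date(visits$date, format = "%m/%d/%Y")

# One value per individual: days on study, event status, treatment group.
time_days <- with(visits, tapply(date, ID,
                                 function(x) as.numeric(diff(range(x)))))
status    <- with(visits, tapply(diseased, ID, any))
group     <- with(visits, tapply(treated, ID, function(x) x[1]))

per_id <- data.frame(ID      = names(time_days),
                     time    = as.vector(time_days),
                     status  = as.integer(status),  # 1 = diseased, 0 = censored
                     treated = as.vector(group))

# Kaplan-Meier estimate by treatment group, then the plot itself.
fit <- survfit(Surv(time, status) ~ treated, data = per_id)
plot(fit, lty = 1:2, xlab = "Days on study",
     ylab = "Proportion disease-free")
```

Without something playing the role of the status column, the Surv() call
has nothing to distinguish an event from censoring, which is the point made
above.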


Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Jul 3, 2016 at 2:43 PM, Jeff Newmiller <jdnewmil at
dcn.davis.ca.us> wrote:> There are a great many hits when I search on the keywords "kaplan
meier plot R"... so my first reaction is that you should be referring to
some of the existing packages for doing this type of analysis. I do not do this
type of analysis normally, so am probably not your best helper... perhaps
someone else will chime in if you show that you have read some existing KM
examples.
>
> My second reaction is that if you want to avoid losing records you should
also avoid adding records. Your example extends from the first matching date to
and including the next matching date, which conflicts with analysis of
successive treatment periods. You may have a good reason for doing this, but in
my experience this is usually a mistake.
>
> Finally, I think you should more closely study the use of the ave function
that I already used if you want to work with the data in its original form. It
should not be too difficult to generate your diff_days column using ave if you
have the admin_period column that I showed you how to make.
> --
> Sent from my phone. Please excuse my brevity.
>
> On July 3, 2016 1:47:17 PM PDT, Kevin Wamae <KWamae at
kemri-wellcome.org> wrote:
>>Hi Bert, my first task is to make a Kaplan Meier Plot to evaluate the
>>risk of developing disease in the treated vs the non-treated
>>individuals. I therefore figured it might be easier to compute dates
>>first as any further analysis will be based on time, in this case days.
>>I keep getting recommendations on how to tweak my analysis and keeps
>>coming down to dates between the start of drug administration and the
>>end of it.
>>
>>Can you suggest an ?easier? way to go about this..
>>
>>Regards
>>-------------------------------------------------------------------------------
>>Kevin Wame
>>
>>
>>On 7/3/16, 11:28 PM, "Bert Gunter" <bgunter.4567 at
gmail.com> wrote:
>>
>>I haven't followed this thread closely, but if it's not too
late, I
>>might suggest that you stop worrying about how you want your data
>>frame to look and start worrying about you want to display/analyze
>>your data. As Jeff suggested, you and your supervisor are probably
>>being driven by paradigms from Excel, SPSS, or whatever that are
>>simply unnecessary for R. My guess would be that if you explained the
>>sort of analyses/plots you wish to do, you will find it can be done
>>fairly directly from your existing data. At the very least it would
>>give Jeff and other helpeRs a better idea of what you might need
>>rather than what you and your supervisor think you need.
>>
>>
>>Cheers,
>>Bert
>>
>>
>>Bert Gunter
>>
>>"The trouble with having an open mind is that people keep coming
along
>>and sticking things into it."
>>-- Opus (aka Berkeley Breathed in his "Bloom County" comic
strip )
>>
>>
>>On Sun, Jul 3, 2016 at 1:08 PM, Kevin Wamae <KWamae at
kemri-wellcome.org>
>>wrote:
>>> Hi Jeff, It works on well on a dataset with 100000 rows and I
figure
>>it will work well with the ?real? dataset. You?ve been of great help
>>and I am starting to make headway.
>>>
>>> It creates a new dataframe (result), as shown below that doesn?t
>>quite have the result as I would want it.
>>>
>>> ID      admin_period    start   end     ddays
>>> J1/3    1       5/11/07 8/13/07 94
>>> J1/3    2       8/13/07 11/12/07        91
>>> J1/3    3       11/12/07        2/4/08             84
>>> J1/3    4       2/4/08              5/5/08                  91
>>> J1/3    5       5/5/08               5/4/09            364
>>> J1/3    6       5/4/09               5/17/10    378
>>> J1/3    7       5/17/10 5/16/11 364
>>> J10/1   1       5/11/07 8/13/07 94
>>> J10/1   2       8/13/07 11/12/07        91
>>> J10/1   3       11/12/07        2/4/08              84
>>> J10/1   4       2/4/08                5/5/08    91
>>> J10/1   5       5/5/08                5/8/09    368
>>> J10/1   6       5/8/09               5/17/10    374
>>> J10/1   7       5/17/10 5/16/11 364
>>> J102/1  1       5/15/07 8/15/07 92
>>> J102/1  2       8/15/07 11/13/07        90
>>> J102/1  3       11/13/07        2/5/08             84
>>> J102/1  4       2/5/08                5/6/08    91
>>> J102/1  5       5/6/08                5/5/09    364
>>> J102/1  6       5/5/09                5/19/10   379
>>>
>>> My supervisor doesn?t want me to create a new dataset, she?s afraid
I
>>might lose some data?I cannot fight that.
>>>
>>> Like you mentioned earlier, I might be mixing up things which I
think
>>is what you alluded to earlier.
>>>
>>> After consultation with my supervisor, this is what we?ve agreed.
For
>>every individual, given the start and end date, create a new column
>>(say, diff_days) and for every row that falls within the range of start
>>and end_date, get the difference between the date in that row and start
>>date and add it to the diff_days column. Below is an example of the
>>result. As it can be seen 5/11/2007 is the start while 2/4/2008 is the
>>end. The diff_days has been populated excluding the end date and that
>>is because that is the start of the study in 2008 that will continue
>>into 2009 and thus from 2/4/2008, I should compute diff_days till 2009
>>and so no (I hope this makes sense).
>>>
>>> ID      date    drug_admin      year    month   diff_days
>>> R1/3    5/11/2007       Y       2007    5       0
>>> R1/3    5/16/2007               2007    5       6
>>> R1/3    5/22/2007               2007    5       11
>>> R1/3    5/28/2007               2007    5       17
>>> R1/3    1/14/2008               2008    1       248
>>> R1/3    1/21/2008               2008    1       255
>>> R1/3    1/28/2008               2008    1       263
>>> R1/3    2/4/2008        Y       2008    2
>>>
>>>
>>> Regards
>>>
>>-------------------------------------------------------------------------------
>>> Kevin Wame
>>>
>>>
>>> On 7/3/16, 10:09 PM, "Jeff Newmiller" <jdnewmil at
dcn.davis.ca.us>
>>wrote:
>>>
>>> Typo on the second line
>>>
>>> result <- (   result0
>>>           %>% select( -admin_period1 )
>>>           %>% inner_join( result0 %>% select( ID,
admin_period1,
>>end=start )
>>>                        , by = c( ID="ID", admin_period
>>="admin_period1" )
>>>                         )
>>>           %>% mutate( ddays = end - start )
>>>           )
>>> --
>>> Sent from my phone. Please excuse my brevity.
>>>
>>> On July 3, 2016 11:55:14 AM PDT, Kevin Wamae
>><KWamae at kemri-wellcome.org> wrote:
>>>>Hi Jeff, ?likes its Excel?, I don?t follow. Pardon me for any
mix up.
>>>>
>>>>Thanks for the code.  After running it, this is the error I get.
>>>>
>>>>Error: cannot join on columns 'admin_period' x
'admin_period1': index
>>>>out of bounds
>>>>
>>>>Regards
>>>>-------------------------------------------------------------------------------
>>>>Kevin Wame | Ph.D. Student (IDeAL)
>>>>KEMRI-Wellcome Trust Collaborative Research Programme
>>>>Centre for Geographic Medicine Research
>>>>P.O. Box 230-80108, Kilifi, Kenya
>>>>
>>>>
>>>>On 7/3/16, 9:34 PM, "Jeff Newmiller" <jdnewmil at
dcn.davis.ca.us>
>>wrote:
>>>>
>>>>I still get the impression from your mixing of information types
that
>>>>you are thinking like this is Excel.
>>>>
>>>>Perhaps something like
>>>>
>>>>drug_study$admin_period  <- ave( "Y" ==
drug_study$drug_admin,
>>>>drug_study$ID, FUN=cumsum )
>>>>library(dplyr)
>>>>result0 <- (   drug_study
>>>>          %>% filter( 0 != admin_period )
>>>>          %>% group_by( ID, admin_period )
>>>>          %>% summarise( start = min( date ) )
>>>>          %>% mutate( admin_period1 = admin_period -1 )
>>>>          )
>>>>result <- (   result0
>>>>          %>% select( -admin_period )
>>>>     %>% inner_join( result0 %>% select( ID,
admin_period1, end=start
>>)
>>>>                     , by = c( ID="ID", admin_period
="admin_period1"
>>)
>>>>                        )
>>>>          %>% mutate( ddays = end - start )
>>>>          )
>>>>--
>>>>Sent from my phone. Please excuse my brevity.
>>>>
>>>>On July 3, 2016 10:24:51 AM PDT, Kevin Wamae
>>>><KWamae at kemri-wellcome.org> wrote:
>>>>>HI Jeff, it?s been an uphill task working with the dataset
and I am
>>>>not
>>>>>the first to complain. Nonetheless, data-cleaning is ongoing
and
>>since
>>>>>I cannot wait for that to get done, I decided to make the
most of
>>what
>>>>>the dataset looks like at this time. It appears the process
may take
>>a
>>>>>while.
>>>>>
>>>>>Thanks for the script. From the output, I noticed that
?result?
>>>>>contains the first and last date for each of the individuals
and not
>>>>>taking into account the variable ?drug-admin?.
>>>>>
>>>>>ID        start               end
>>>>>J1/3      1/5/09      12/25/10
>>>>>R1/3      1/4/07      12/15/08
>>>>>R10/1     1/4/07      3/5/12
>>>>>
>>>>>My aim is to pick the date, for example in 2007, where
drug-admin =>>>>>?Y? as my start and the date in the subsequent
year (2008 in this
>>>>case)
>>>>>where drug-admin == ?Y? as my end. Then, I should populate
the
>>>>variable
>>>>>?study_id? with ?start? up to the entry just above the one
whose
>>date
>>>>>matches ?end?, as the output below shows (I hope its
structure is
>>>>>maintained as I have copied it from R-Studio). The goal for
now is
>>to
>>>>>then get difference in days between ?date? and ?study_id?
and still
>>>>get
>>>>>to keep that column for ?study_id? as I might use it later.
>>>>>
>>>>>From the output, it can be seen that for this individual,
the dates
>>>>run
>>>>>from 2007 to 2008. However, for some individuals, the dates
run from
>>>>>2008-2009, 2009-2010 and so on. Therefore, I need to make
the script
>>>>>deal with all the years as the dates range from 2001-2016
>>>>>
>>>>>ID    date    drug_admin      year    month   study_id
>>>>>R1/3  5/11/07 Y       2007    5       5/11/07
>>>>>R1/3  5/16/07         2007    5       5/11/07
>>>>>R1/3  5/22/07         2007    5       5/11/07
>>>>>R1/3  5/28/07         2007    5       5/11/07
>>>>>R1/3  6/5/07                  2007    6       5/11/07
>>>>>R1/3  6/11/07         2007    6       5/11/07
>>>>>R1/3  6/18/07         2007    6       5/11/07
>>>>>R1/3  6/25/07         2007    6       5/11/07
>>>>>R1/3  7/2/07                  2007    7       5/11/07
>>>>>R1/3  7/16/07         2007    7       5/11/07
>>>>>R1/3  7/29/07         2007    7       5/11/07
>>>>>R1/3  8/2/07                  2007    8       5/11/07
>>>>>R1/3  8/7/07                  2007    8       5/11/07
>>>>>R1/3  8/13/07         2007    8       5/11/07
>>>>>R1/3  9/18/07         2007    9       5/11/07
>>>>>R1/3  9/24/07         2007    9       5/11/07
>>>>>R1/3  10/6/07         2007    10      5/11/07
>>>>>R1/3  10/8/07         2007    10      5/11/07
>>>>>R1/3  10/15/07                2007    10      5/11/07
>>>>>R1/3  10/22/07                2007    10      5/11/07
>>>>>R1/3  10/29/07                2007    10      5/11/07
>>>>>R1/3  11/8/07         2007    11      5/11/07
>>>>>R1/3  11/12/07                2007    11      5/11/07
>>>>>R1/3  11/19/07                2007    11      5/11/07
>>>>>R1/3  11/29/07                2007    11      5/11/07
>>>>>R1/3  12/6/07         2007    12      5/11/07
>>>>>R1/3  12/10/07                2007    12      5/11/07
>>>>>R1/3  12/21/07                2007    12      5/11/07
>>>>>R1/3  1/7/08                  2008    1       5/11/07
>>>>>R1/3  1/14/08         2008    1       5/11/07
>>>>>R1/3  1/21/08         2008    1       5/11/07
>>>>>R1/3  1/28/08         2008    1       5/11/07
>>>>>R1/3  2/4/08          Y       2008    2
>>>>>
>>>>>
>>>>>Regards
>>>>>-------------------------------------------------------------------------------
>>>>>Kevin Wame
>>>>>
>>>>>###############################################################
>>>>>
>>>>>###############################################################
>>>>>
>>>>>
>>>>>
>>>>>On 7/3/16, 7:05 PM, "Jeff Newmiller" <jdnewmil
at dcn.davis.ca.us>
>>wrote:
>>>>>
>>>>>result <- setNames( data.frame( aggregate( date~ID,
data=drug_study,
>>>>>FUN=min ),  aggregate( date~ID, data=drug_study, FUN=max
)[2] ), c(
>>>>>"ID", "start", "end" ) )
>>>>>
>>>>>
>

Kevin Wamae

2016-Jul-04 06:48 UTC


[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

Hi Jeff, thanks and I will explore your suggestions too..

Regards
-------------------------------------------------------------------------------
Kevin Wame 

 

On 7/4/16, 12:43 AM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:

There are a great many hits when I search on the keywords "kaplan meier
plot R"... so my first reaction is that you should be referring to some of
the existing packages for doing this type of analysis. I do not do this type of
analysis normally, so am probably not your best helper... perhaps someone else
will chime in if you show that you have read some existing KM examples.

My second reaction is that if you want to avoid losing records you should also
avoid adding records. Your example extends from the first matching date to and
including the next matching date, which conflicts with analysis of successive
treatment periods. You may have a good reason for doing this, but in my
experience this is usually a mistake.

Finally, I think you should more closely study the use of the ave function that
I already used if you want to work with the data in its original form. It should
not be too difficult to generate your diff_days column using ave if you have the
admin_period column that I showed you how to make.
-- 
Sent from my phone. Please excuse my brevity.
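
The ave-based route suggested above might look roughly like the following. This is only a sketch on made-up rows: the toy `drug_study` values, the `days`, `grp` and `period_start` helper names, and the assumption that rows are already sorted by date within each ID are all illustrative, not taken from the real dataset.

```r
# Hypothetical toy data; column names follow the thread (ID, date, drug_admin)
drug_study <- data.frame(
  ID         = c(rep("R1/3", 5), rep("R10/1", 3)),
  date       = as.Date(c("2007-05-11", "2007-05-16", "2007-05-22",
                         "2008-02-04", "2008-02-11",
                         "2007-05-15", "2007-05-20", "2007-08-15")),
  drug_admin = c("Y", "", "", "Y", "", "Y", "", "Y"),
  stringsAsFactors = FALSE
)

# Running count of "Y" rows within each ID (0 would mean "before the first dose");
# assumes rows are in date order within each ID
drug_study$admin_period <- ave(as.integer(drug_study$drug_admin %in% "Y"),
                               drug_study$ID, FUN = cumsum)

# Work on day numbers so ave() only ever sees plain numeric vectors
days <- as.numeric(drug_study$date)

# One grouping key per ID/period pair, so no empty combinations are created
grp <- paste(drug_study$ID, drug_study$admin_period)
period_start <- ave(days, grp, FUN = min)

# Days elapsed since the start of the row's own administration period
drug_study$diff_days <- days - period_start
drug_study$diff_days[drug_study$admin_period == 0] <- NA

print(drug_study)
```

No rows are added or dropped, which fits the supervisor's constraint. Each "Y" row starts a new period with `diff_days` of 0 rather than being left blank as in the hand-made example.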

On July 3, 2016 1:47:17 PM PDT, Kevin Wamae <KWamae at kemri-wellcome.org> wrote:
>Hi Bert, my first task is to make a Kaplan-Meier plot to evaluate the
>risk of developing disease in the treated vs the non-treated
>individuals. I therefore figured it might be easier to compute dates
>first, as any further analysis will be based on time, in this case days.
>I keep getting recommendations on how to tweak my analysis, and it keeps
>coming down to the dates between the start of drug administration and the
>end of it.
>
>Can you suggest an "easier" way to go about this?
>
>Regards
>-------------------------------------------------------------------------------
>Kevin Wame 
> 
>
>On 7/3/16, 11:28 PM, "Bert Gunter" <bgunter.4567 at gmail.com> wrote:
>
>I haven't followed this thread closely, but if it's not too late, I
>might suggest that you stop worrying about how you want your data
>frame to look and start worrying about how you want to display/analyze
>your data. As Jeff suggested, you and your supervisor are probably
>being driven by paradigms from Excel, SPSS, or whatever that are
>simply unnecessary for R. My guess would be that if you explained the
>sort of analyses/plots you wish to do, you will find it can be done
>fairly directly from your existing data. At the very least it would
>give Jeff and other helpeRs a better idea of what you might need
>rather than what you and your supervisor think you need.
>
>
>Cheers,
>Bert
>
>
>Bert Gunter
>
>"The trouble with having an open mind is that people keep coming along
>and sticking things into it."
>-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
>On Sun, Jul 3, 2016 at 1:08 PM, Kevin Wamae <KWamae at kemri-wellcome.org> wrote:
>> Hi Jeff, it works well on a dataset with 100000 rows and I figure
>> it will work well with the "real" dataset. You've been of great help
>> and I am starting to make headway.
>>
>> It creates a new data frame (result), as shown below, that doesn't
>> quite have the result as I would want it.
>>
>> ID      admin_period  start     end       ddays
>> J1/3    1             5/11/07   8/13/07   94
>> J1/3    2             8/13/07   11/12/07  91
>> J1/3    3             11/12/07  2/4/08    84
>> J1/3    4             2/4/08    5/5/08    91
>> J1/3    5             5/5/08    5/4/09    364
>> J1/3    6             5/4/09    5/17/10   378
>> J1/3    7             5/17/10   5/16/11   364
>> J10/1   1             5/11/07   8/13/07   94
>> J10/1   2             8/13/07   11/12/07  91
>> J10/1   3             11/12/07  2/4/08    84
>> J10/1   4             2/4/08    5/5/08    91
>> J10/1   5             5/5/08    5/8/09    368
>> J10/1   6             5/8/09    5/17/10   374
>> J10/1   7             5/17/10   5/16/11   364
>> J102/1  1             5/15/07   8/15/07   92
>> J102/1  2             8/15/07   11/13/07  90
>> J102/1  3             11/13/07  2/5/08    84
>> J102/1  4             2/5/08    5/6/08    91
>> J102/1  5             5/6/08    5/5/09    364
>> J102/1  6             5/5/09    5/19/10   379
>>
>> My supervisor doesn't want me to create a new dataset; she's afraid I
>> might lose some data... I cannot fight that.
>>
>> I might be mixing things up, which I think is what you alluded to
>> earlier.
>>
>> After consultation with my supervisor, this is what we've agreed. For
>> every individual, given the start and end date, create a new column
>> (say, diff_days), and for every row that falls within the range of start
>> and end date, get the difference between the date in that row and the
>> start date and put it in the diff_days column. Below is an example of the
>> result. As can be seen, 5/11/2007 is the start while 2/4/2008 is the
>> end. diff_days has been populated excluding the end date, because
>> that date is the start of the 2008 study period that will continue
>> into 2009; from 2/4/2008 I should compute diff_days until 2009,
>> and so on (I hope this makes sense).
>>
>> ID      date        drug_admin  year  month  diff_days
>> R1/3    5/11/2007   Y           2007  5      0
>> R1/3    5/16/2007               2007  5      6
>> R1/3    5/22/2007               2007  5      11
>> R1/3    5/28/2007               2007  5      17
>> R1/3    1/14/2008               2008  1      248
>> R1/3    1/21/2008               2008  1      255
>> R1/3    1/28/2008               2008  1      263
>> R1/3    2/4/2008    Y           2008  2
>>
>>
>> Regards
>> -------------------------------------------------------------------------------
>> Kevin Wame
>>
>>
>> On 7/3/16, 10:09 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>>
>> Typo on the second line
>>
>> result <- (   result0
>>           %>% select( -admin_period1 )
>>           %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
>>                         , by = c( ID="ID", admin_period="admin_period1" )
>>                         )
>>           %>% mutate( ddays = end - start )
>>           )
>> --
>> Sent from my phone. Please excuse my brevity.
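
The corrected join above pairs each administration period's start with the start of the following period. As a rough base-R illustration of the same pairing (a swapped-in lag-style alternative, not the dplyr code from the thread), the sketch below shifts the start dates up by one row within each ID. The toy rows and the `starts`, `next_start` and `periods` names are assumptions made for the example.

```r
# Hypothetical toy data; column names follow the thread
drug_study <- data.frame(
  ID         = c("A", "A", "A", "A", "B", "B", "B"),
  date       = as.Date(c("2007-05-11", "2007-05-16", "2007-08-13", "2007-11-12",
                         "2007-05-15", "2007-06-01", "2007-08-15")),
  drug_admin = c("Y", "", "Y", "Y", "Y", "", "Y"),
  stringsAsFactors = FALSE
)

# Each "Y" row marks the start of an administration period
starts <- drug_study[drug_study$drug_admin %in% "Y", c("ID", "date")]
starts <- starts[order(starts$ID, starts$date), ]
names(starts)[2] <- "start"

# Within each ID, the end of a period is the start of the next one
# (NA for the last period, which has no successor)
next_start <- ave(as.numeric(starts$start), starts$ID,
                  FUN = function(d) c(d[-1], NA))
starts$end   <- as.Date(next_start, origin = "1970-01-01")
starts$ddays <- as.numeric(starts$end - starts$start)

# Like the inner join, keep only periods that have a following start date
periods <- starts[!is.na(starts$end), ]
print(periods)
```

As with the inner join, the last period of each individual drops out because it has no later start date to serve as its end.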
>>
>> On July 3, 2016 11:55:14 AM PDT, Kevin Wamae <KWamae at kemri-wellcome.org> wrote:
>>>Hi Jeff, "like this is Excel"? I don't follow. Pardon me for any mix-up.
>>>
>>>Thanks for the code. After running it, this is the error I get.
>>>
>>>Error: cannot join on columns 'admin_period' x 'admin_period1': index out of bounds
>>>
>>>Regards
>>>-------------------------------------------------------------------------------
>>>Kevin Wame | Ph.D. Student (IDeAL)
>>>KEMRI-Wellcome Trust Collaborative Research Programme
>>>Centre for Geographic Medicine Research
>>>P.O. Box 230-80108, Kilifi, Kenya
>>>
>>>
>>>On 7/3/16, 9:34 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>>>
>>>I still get the impression from your mixing of information types that
>>>you are thinking like this is Excel.
>>>
>>>Perhaps something like
>>>
>>>drug_study$admin_period <- ave( "Y" == drug_study$drug_admin,
>>>                                drug_study$ID, FUN=cumsum )
>>>library(dplyr)
>>>result0 <- (   drug_study
>>>          %>% filter( 0 != admin_period )
>>>          %>% group_by( ID, admin_period )
>>>          %>% summarise( start = min( date ) )
>>>          %>% mutate( admin_period1 = admin_period - 1 )
>>>          )
>>>result <- (   result0
>>>          %>% select( -admin_period )
>>>          %>% inner_join( result0 %>% select( ID, admin_period1, end=start )
>>>                        , by = c( ID="ID", admin_period="admin_period1" )
>>>                        )
>>>          %>% mutate( ddays = end - start )
>>>          )
>>>--
>>>Sent from my phone. Please excuse my brevity.
>>>
>>>On July 3, 2016 10:24:51 AM PDT, Kevin Wamae <KWamae at kemri-wellcome.org> wrote:
>>>>Hi Jeff, it's been an uphill task working with the dataset and I am not
>>>>the first to complain. Nonetheless, data-cleaning is ongoing, and since
>>>>I cannot wait for that to get done, I decided to make the most of what
>>>>the dataset looks like at this time. It appears the process may take a
>>>>while.
>>>>
>>>>Thanks for the script. From the output, I noticed that "result"
>>>>contains the first and last date for each of the individuals, without
>>>>taking into account the variable "drug_admin".
>>>>
>>>>ID        start     end
>>>>J1/3      1/5/09    12/25/10
>>>>R1/3      1/4/07    12/15/08
>>>>R10/1     1/4/07    3/5/12
>>>>
>>>>My aim is to pick the date, for example in 2007, where drug_admin ==
>>>>"Y" as my start and the date in the subsequent year (2008 in this case)
>>>>where drug_admin == "Y" as my end. Then, I should populate the variable
>>>>"study_id" with "start" up to the entry just above the one whose date
>>>>matches "end", as the output below shows (I hope its structure is
>>>>maintained as I have copied it from R-Studio). The goal for now is to
>>>>then get the difference in days between "date" and "study_id" and still
>>>>keep the "study_id" column, as I might use it later.
>>>>
>>>>From the output, it can be seen that for this individual, the dates run
>>>>from 2007 to 2008. However, for some individuals, the dates run from
>>>>2008-2009, 2009-2010 and so on. Therefore, I need to make the script
>>>>deal with all the years, as the dates range from 2001-2016.
>>>>ID    date      drug_admin  year  month  study_id
>>>>R1/3  5/11/07   Y           2007  5      5/11/07
>>>>R1/3  5/16/07               2007  5      5/11/07
>>>>R1/3  5/22/07               2007  5      5/11/07
>>>>R1/3  5/28/07               2007  5      5/11/07
>>>>R1/3  6/5/07                2007  6      5/11/07
>>>>R1/3  6/11/07               2007  6      5/11/07
>>>>R1/3  6/18/07               2007  6      5/11/07
>>>>R1/3  6/25/07               2007  6      5/11/07
>>>>R1/3  7/2/07                2007  7      5/11/07
>>>>R1/3  7/16/07               2007  7      5/11/07
>>>>R1/3  7/29/07               2007  7      5/11/07
>>>>R1/3  8/2/07                2007  8      5/11/07
>>>>R1/3  8/7/07                2007  8      5/11/07
>>>>R1/3  8/13/07               2007  8      5/11/07
>>>>R1/3  9/18/07               2007  9      5/11/07
>>>>R1/3  9/24/07               2007  9      5/11/07
>>>>R1/3  10/6/07               2007  10     5/11/07
>>>>R1/3  10/8/07               2007  10     5/11/07
>>>>R1/3  10/15/07              2007  10     5/11/07
>>>>R1/3  10/22/07              2007  10     5/11/07
>>>>R1/3  10/29/07              2007  10     5/11/07
>>>>R1/3  11/8/07               2007  11     5/11/07
>>>>R1/3  11/12/07              2007  11     5/11/07
>>>>R1/3  11/19/07              2007  11     5/11/07
>>>>R1/3  11/29/07              2007  11     5/11/07
>>>>R1/3  12/6/07               2007  12     5/11/07
>>>>R1/3  12/10/07              2007  12     5/11/07
>>>>R1/3  12/21/07              2007  12     5/11/07
>>>>R1/3  1/7/08                2008  1      5/11/07
>>>>R1/3  1/14/08               2008  1      5/11/07
>>>>R1/3  1/21/08               2008  1      5/11/07
>>>>R1/3  1/28/08               2008  1      5/11/07
>>>>R1/3  2/4/08    Y           2008  2
>>>>
>>>>
>>>>Regards
>>>>-------------------------------------------------------------------------------
>>>>Kevin Wame
>>>>
>>>>###############################################################
>>>>
>>>>###############################################################
>>>>
>>>>
>>>>
>>>>On 7/3/16, 7:05 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>>>>
>>>>result <- setNames( data.frame( aggregate( date~ID, data=drug_study, FUN=min ),
>>>>                                aggregate( date~ID, data=drug_study, FUN=max )[2] ),
>>>>                    c( "ID", "start", "end" ) )
>>>>
>>>>



______________________________________________________________________

This e-mail contains information which is confidential. It is intended only for
the use of the named recipient. If you have received this e-mail in error,
please let us know by replying to the sender, and immediately delete it from
your system.  Please note, that in these circumstances, the use, disclosure,
distribution or copying of this information is strictly prohibited.
KEMRI-Wellcome Trust Programme cannot accept any responsibility for the 
accuracy or completeness of this message as it has been transmitted over a
public network. Although the Programme has taken reasonable precautions to
ensure no viruses are present in emails, it cannot accept responsibility for any
loss or damage arising from the use of the email or attachments. Any views
expressed in this message are those of the individual sender, except where the
sender specifically states them to be the views of KEMRI-Wellcome Trust
Programme.
______________________________________________________________________

Kevin Wamae

2016-Jul-04 06:57 UTC


[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

Hi Bert, the "status" at the end of the study does exist in the original
dataset; what was missing was the time between events. There are also many
events that fall between the first and last day to be explored in this work.

The suggestion I received then was to compute the time between the initial date
for each individual and all subsequent events, up to the last day of the study.
The rationale is that once I have that column of differences in days, I can
then use it to make any other calculations that arise.

Let me try your suggested script and see how that goes. Highly appreciated.

Regards
-------------------------------------------------------------------------------
Kevin Wame 
 

On 7/4/16, 9:32 AM, "Bert Gunter" <bgunter.4567 at gmail.com> wrote:

A Kaplan-Meier plot requires, for each individual (in each treatment
group, if there is more than one):

1. Survival time, which in your case appears to mean time without disease;
2. Status at end of time on study: whether the individual was censored
(still without disease) or died (in your case, was diseased) on the
last date they are seen in the study.
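
To make those two inputs concrete, here is a minimal sketch using the survival package (a recommended package that ships with standard R installations). The `km_input` data frame, its column names and every value in it are invented for illustration; the real data would need one row per individual with a time and a status.

```r
library(survival)

# Hypothetical one-row-per-individual input:
#   time   = days on study without disease
#   status = 1 if disease was seen at the last date, 0 if censored
km_input <- data.frame(
  ID     = c("A", "B", "C", "D", "E", "F"),
  group  = c("treated", "treated", "treated",
             "untreated", "untreated", "untreated"),
  time   = c(269, 92, 180, 60, 150, 210),
  status = c(0, 1, 0, 1, 1, 0)
)

# Kaplan-Meier estimate, one curve per treatment group
fit <- survfit(Surv(time, status) ~ group, data = km_input)
print(fit)

# Step plot of the estimated disease-free proportion over time
plot(fit, lty = 1:2, xlab = "Days on study", ylab = "Proportion disease-free")
legend("bottomleft", legend = names(fit$strata), lty = 1:2)
```

Without the `status` column there is nothing to pass as the second argument of `Surv()`.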

AFAICT, the 2nd piece of information is not present in your data; if
this is so, then you cannot do the K-M plot or, indeed, any survival
analysis. That is, you can quit the analysis right now.

If you have the status, where is it?

If, for example, the last date for each individual is the date at
which disease is first seen, then you can simply convert the date
column to the Date class with ?as.Date (the year and month columns
appear to be useless as they repeat info already available in the date
columns), and then:

survtimes_byID <- with(datasetname, tapply(date, ID,
function(x)diff(range(x))))

will give you a list of survival times (in days) by ID. See ?with,
?tapply for details.
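
A small worked version of the two steps just described (date conversion, then tapply) might look like this. The rows are made up, and the `%m/%d/%Y` format string is an assumption about how the dates are stored.

```r
# Hypothetical rows; dates stored as month/day/year text, as in the thread
datasetname <- data.frame(
  ID   = c("A", "A", "A", "B", "B"),
  date = c("5/11/2007", "5/16/2007", "2/4/2008", "5/15/2007", "8/15/2007"),
  stringsAsFactors = FALSE
)

# Convert the text to the Date class so that date arithmetic works
datasetname$date <- as.Date(datasetname$date, format = "%m/%d/%Y")

# Days between the first and last date seen for each ID
survtimes_byID <- with(datasetname,
                       tapply(date, ID, function(x) diff(range(x))))
print(survtimes_byID)
```

Since each ID yields a single value, tapply should simplify the result to a named numeric array indexed by ID (here A = 269 and B = 92 days).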

If the status info is in some other form, then this advice should be
ignored of course and you have to incorporate it into your data in
some other way.


Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Jul 3, 2016 at 2:43 PM, Jeff Newmiller <jdnewmil at
dcn.davis.ca.us> wrote:> There are a great many hits when I search on the keywords "kaplan
meier plot R"... so my first reaction is that you should be referring to
some of the existing packages for doing this type of analysis. I do not do this
type of analysis normally, so am probably not your best helper... perhaps
someone else will chime in if you show that you have read some existing KM
examples.
>
> My second reaction is that if you want to avoid losing records you should
also avoid adding records. Your example extends from the first matching date to
and including the next matching date, which conflicts with analysis of
successive treatment periods. You may have a good reason for doing this, but in
my experience this is usually a mistake.
>
> Finally, I think you should more closely study the use of the ave function
that I already used if you want to work with the data in its original form. It
should not be too difficult to generate your diff_days column using ave if you
have the admin_period column that I showed you how to make.
> --
> Sent from my phone. Please excuse my brevity.
>
> On July 3, 2016 1:47:17 PM PDT, Kevin Wamae <KWamae at
kemri-wellcome.org> wrote:
>>Hi Bert, my first task is to make a Kaplan Meier Plot to evaluate the
>>risk of developing disease in the treated vs the non-treated
>>individuals. I therefore figured it might be easier to compute dates
>>first as any further analysis will be based on time, in this case days.
>>I keep getting recommendations on how to tweak my analysis and keeps
>>coming down to dates between the start of drug administration and the
>>end of it.
>>
>>Can you suggest an ?easier? way to go about this..
>>
>>Regards
>>-------------------------------------------------------------------------------
>>Kevin Wame
>>
>>
>>On 7/3/16, 11:28 PM, "Bert Gunter" <bgunter.4567 at
gmail.com> wrote:
>>
>>I haven't followed this thread closely, but if it's not too
late, I
>>might suggest that you stop worrying about how you want your data
>>frame to look and start worrying about you want to display/analyze
>>your data. As Jeff suggested, you and your supervisor are probably
>>being driven by paradigms from Excel, SPSS, or whatever that are
>>simply unnecessary for R. My guess would be that if you explained the
>>sort of analyses/plots you wish to do, you will find it can be done
>>fairly directly from your existing data. At the very least it would
>>give Jeff and other helpeRs a better idea of what you might need
>>rather than what you and your supervisor think you need.
>>
>>
>>Cheers,
>>Bert
>>
>>
>>Bert Gunter
>>
>>"The trouble with having an open mind is that people keep coming
along
>>and sticking things into it."
>>-- Opus (aka Berkeley Breathed in his "Bloom County" comic
strip )
>>
>>
>>On Sun, Jul 3, 2016 at 1:08 PM, Kevin Wamae <KWamae at
kemri-wellcome.org>
>>wrote:
>>> Hi Jeff, It works on well on a dataset with 100000 rows and I
figure
>>it will work well with the ?real? dataset. You?ve been of great help
>>and I am starting to make headway.
>>>
>>> It creates a new dataframe (result), as shown below that doesn?t
>>quite have the result as I would want it.
>>>
>>> ID      admin_period    start   end     ddays
>>> J1/3    1       5/11/07 8/13/07 94
>>> J1/3    2       8/13/07 11/12/07        91
>>> J1/3    3       11/12/07        2/4/08             84
>>> J1/3    4       2/4/08              5/5/08                  91
>>> J1/3    5       5/5/08               5/4/09            364
>>> J1/3    6       5/4/09               5/17/10    378
>>> J1/3    7       5/17/10 5/16/11 364
>>> J10/1   1       5/11/07 8/13/07 94
>>> J10/1   2       8/13/07 11/12/07        91
>>> J10/1   3       11/12/07        2/4/08              84
>>> J10/1   4       2/4/08                5/5/08    91
>>> J10/1   5       5/5/08                5/8/09    368
>>> J10/1   6       5/8/09               5/17/10    374
>>> J10/1   7       5/17/10 5/16/11 364
>>> J102/1  1       5/15/07 8/15/07 92
>>> J102/1  2       8/15/07 11/13/07        90
>>> J102/1  3       11/13/07        2/5/08             84
>>> J102/1  4       2/5/08                5/6/08    91
>>> J102/1  5       5/6/08                5/5/09    364
>>> J102/1  6       5/5/09                5/19/10   379
>>>
>>> My supervisor doesn?t want me to create a new dataset, she?s afraid
I
>>might lose some data?I cannot fight that.
>>>
>>> Like you mentioned earlier, I might be mixing up things which I
think
>>is what you alluded to earlier.
>>>
>>> After consultation with my supervisor, this is what we?ve agreed.
For
>>every individual, given the start and end date, create a new column
>>(say, diff_days) and for every row that falls within the range of start
>>and end_date, get the difference between the date in that row and start
>>date and add it to the diff_days column. Below is an example of the
>>result. As it can be seen 5/11/2007 is the start while 2/4/2008 is the
>>end. The diff_days has been populated excluding the end date and that
>>is because that is the start of the study in 2008 that will continue
>>into 2009 and thus from 2/4/2008, I should compute diff_days till 2009
>>and so no (I hope this makes sense).
>>>
>>> ID      date    drug_admin      year    month   diff_days
>>> R1/3    5/11/2007       Y       2007    5       0
>>> R1/3    5/16/2007               2007    5       6
>>> R1/3    5/22/2007               2007    5       11
>>> R1/3    5/28/2007               2007    5       17
>>> R1/3    1/14/2008               2008    1       248
>>> R1/3    1/21/2008               2008    1       255
>>> R1/3    1/28/2008               2008    1       263
>>> R1/3    2/4/2008        Y       2008    2
>>>
>>>
>>> Regards
>>>
>>-------------------------------------------------------------------------------
>>> Kevin Wame
>>>
>>>
>>> On 7/3/16, 10:09 PM, "Jeff Newmiller" <jdnewmil at
dcn.davis.ca.us>
>>wrote:
>>>
>>> Typo on the second line
>>>
>>> result <- (   result0
>>>           %>% select( -admin_period1 )
>>>           %>% inner_join( result0 %>% select( ID,
admin_period1,
>>end=start )
>>>                        , by = c( ID="ID", admin_period
>>="admin_period1" )
>>>                         )
>>>           %>% mutate( ddays = end - start )
>>>           )
>>> --
>>> Sent from my phone. Please excuse my brevity.
>>>
>>> On July 3, 2016 11:55:14 AM PDT, Kevin Wamae
>><KWamae at kemri-wellcome.org> wrote:
>>>>Hi Jeff, ?likes its Excel?, I don?t follow. Pardon me for any
mix up.
>>>>
>>>>Thanks for the code.  After running it, this is the error I get.
>>>>
>>>>Error: cannot join on columns 'admin_period' x
'admin_period1': index
>>>>out of bounds
>>>>
>>>>Regards
>>>>-------------------------------------------------------------------------------
>>>>Kevin Wame | Ph.D. Student (IDeAL)
>>>>KEMRI-Wellcome Trust Collaborative Research Programme
>>>>Centre for Geographic Medicine Research
>>>>P.O. Box 230-80108, Kilifi, Kenya
>>>>
>>>>
>>>>On 7/3/16, 9:34 PM, "Jeff Newmiller" <jdnewmil at
dcn.davis.ca.us>
>>wrote:
>>>>
>>>>I still get the impression from your mixing of information types
that
>>>>you are thinking like this is Excel.
>>>>
>>>>Perhaps something like
>>>>
>>>>drug_study$admin_period  <- ave( "Y" ==
drug_study$drug_admin,
>>>>drug_study$ID, FUN=cumsum )
>>>>library(dplyr)
>>>>result0 <- (   drug_study
>>>>          %>% filter( 0 != admin_period )
>>>>          %>% group_by( ID, admin_period )
>>>>          %>% summarise( start = min( date ) )
>>>>          %>% mutate( admin_period1 = admin_period -1 )
>>>>          )
>>>>result <- (   result0
>>>>          %>% select( -admin_period )
>>>>     %>% inner_join( result0 %>% select( ID,
admin_period1, end=start
>>)
>>>>                     , by = c( ID="ID", admin_period
="admin_period1"
>>)
>>>>                        )
>>>>          %>% mutate( ddays = end - start )
>>>>          )
>>>>--
>>>>Sent from my phone. Please excuse my brevity.
>>>>
>>>>On July 3, 2016 10:24:51 AM PDT, Kevin Wamae
>>>><KWamae at kemri-wellcome.org> wrote:
>>>>>Hi Jeff, it's been an uphill task working with the dataset and I am not
>>>>>the first to complain. Nonetheless, data-cleaning is ongoing and since
>>>>>I cannot wait for that to get done, I decided to make the most of what
>>>>>the dataset looks like at this time. It appears the process may take a
>>>>>while.
>>>>>
>>>>>Thanks for the script. From the output, I noticed that 'result'
>>>>>contains the first and last date for each of the individuals and does
>>>>>not take into account the variable 'drug_admin'.
>>>>>
>>>>>ID        start               end
>>>>>J1/3      1/5/09      12/25/10
>>>>>R1/3      1/4/07      12/15/08
>>>>>R10/1     1/4/07      3/5/12
>>>>>
>>>>>My aim is to pick the date, for example in 2007, where drug_admin ==
>>>>>'Y' as my start and the date in the subsequent year (2008 in this case)
>>>>>where drug_admin == 'Y' as my end. Then, I should populate the variable
>>>>>'study_id' with 'start' up to the entry just above the one whose date
>>>>>matches 'end', as the output below shows (I hope its structure is
>>>>>maintained as I have copied it from R-Studio). The goal for now is to
>>>>>then get the difference in days between 'date' and 'study_id' and still
>>>>>get to keep that column for 'study_id' as I might use it later.
>>>>>
>>>>>From the output, it can be seen that for this individual, the dates run
>>>>>from 2007 to 2008. However, for some individuals, the dates run from
>>>>>2008-2009, 2009-2010 and so on. Therefore, I need to make the script
>>>>>deal with all the years as the dates range from 2001-2016.
>>>>>
>>>>>ID    date      drug_admin  year  month  study_id
>>>>>R1/3  5/11/07   Y           2007  5      5/11/07
>>>>>R1/3  5/16/07               2007  5      5/11/07
>>>>>R1/3  5/22/07               2007  5      5/11/07
>>>>>R1/3  5/28/07               2007  5      5/11/07
>>>>>R1/3  6/5/07                2007  6      5/11/07
>>>>>R1/3  6/11/07               2007  6      5/11/07
>>>>>R1/3  6/18/07               2007  6      5/11/07
>>>>>R1/3  6/25/07               2007  6      5/11/07
>>>>>R1/3  7/2/07                2007  7      5/11/07
>>>>>R1/3  7/16/07               2007  7      5/11/07
>>>>>R1/3  7/29/07               2007  7      5/11/07
>>>>>R1/3  8/2/07                2007  8      5/11/07
>>>>>R1/3  8/7/07                2007  8      5/11/07
>>>>>R1/3  8/13/07               2007  8      5/11/07
>>>>>R1/3  9/18/07               2007  9      5/11/07
>>>>>R1/3  9/24/07               2007  9      5/11/07
>>>>>R1/3  10/6/07               2007  10     5/11/07
>>>>>R1/3  10/8/07               2007  10     5/11/07
>>>>>R1/3  10/15/07              2007  10     5/11/07
>>>>>R1/3  10/22/07              2007  10     5/11/07
>>>>>R1/3  10/29/07              2007  10     5/11/07
>>>>>R1/3  11/8/07               2007  11     5/11/07
>>>>>R1/3  11/12/07              2007  11     5/11/07
>>>>>R1/3  11/19/07              2007  11     5/11/07
>>>>>R1/3  11/29/07              2007  11     5/11/07
>>>>>R1/3  12/6/07               2007  12     5/11/07
>>>>>R1/3  12/10/07              2007  12     5/11/07
>>>>>R1/3  12/21/07              2007  12     5/11/07
>>>>>R1/3  1/7/08                2008  1      5/11/07
>>>>>R1/3  1/14/08               2008  1      5/11/07
>>>>>R1/3  1/21/08               2008  1      5/11/07
>>>>>R1/3  1/28/08               2008  1      5/11/07
>>>>>R1/3  2/4/08    Y           2008  2
>>>>>
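One possible way to fill a per-row 'study_id' like the one in the table above, and then a day difference, is to reuse the ave()-based period counter. This is only a sketch under assumptions: dates are in m/d/yy form, a blank 'drug_admin' means no administration, and the last toy row (2/11/08) is invented; 'period_key', 'start_num' and 'diff_days' are illustrative names.

```r
# A few rows in the shape of the table above; the 2/11/08 row is made up.
drug_study <- data.frame(
  ID = c("R1/3", "R1/3", "R1/3", "R1/3", "R1/3"),
  date = as.Date(c("5/11/07", "5/16/07", "1/28/08", "2/4/08", "2/11/08"),
                 format = "%m/%d/%y"),
  drug_admin = c("Y", "", "", "Y", ""),
  stringsAsFactors = FALSE
)

# cumsum() relies on row order, so sort by ID and date first.
drug_study <- drug_study[order(drug_study$ID, drug_study$date), ]

# Running count of administrations within each ID (0 = before the first "Y").
drug_study$admin_period <- ave(as.integer(drug_study$drug_admin == "Y"),
                               drug_study$ID, FUN = cumsum)

# One grouping key per (ID, period); drop = TRUE avoids empty groups.
period_key <- interaction(drug_study$ID, drug_study$admin_period, drop = TRUE)

# Earliest date in each period, computed on day numbers and converted back.
start_num <- ave(as.numeric(drug_study$date), period_key, FUN = min)
drug_study$study_id <- as.Date(start_num, origin = "1970-01-01")

# Rows before the first administration have no study start.
drug_study$study_id[drug_study$admin_period == 0] <- NA

# Days elapsed since the start of the current period.
drug_study$diff_days <- as.numeric(drug_study$date - drug_study$study_id)
print(drug_study)
```

Note one difference from the table: here the row with the next 'Y' (2/4/08) opens a new period and gets its own date as 'study_id', rather than being left blank.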
>>>>>
>>>>>Regards
>>>>>-------------------------------------------------------------------------------
>>>>>Kevin Wame
>>>>>
>>>>>
>>>>>On 7/3/16, 7:05 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
>>>>>
>>>>>result <- setNames( data.frame( aggregate( date~ID, data=drug_study, FUN=min ),
>>>>>                                aggregate( date~ID, data=drug_study, FUN=max )[2] ),
>>>>>                    c( "ID", "start", "end" ) )
>>>>>
>>>>>
>>>>>______________________________________________________________________
>>>>>
>>>>>This e-mail contains information which is confidential. It is intended
>>>>>only for the use of the named recipient. If you have received this
>>>>>e-mail in error, please let us know by replying to the sender, and
>>>>>immediately delete it from your system. Please note, that in these
>>>>>circumstances, the use, disclosure, distribution or copying of this
>>>>>information is strictly prohibited. KEMRI-Wellcome Trust Programme
>>>>>cannot accept any responsibility for the accuracy or completeness of
>>>>>this message as it has been transmitted over a public network. Although
>>>>>the Programme has taken reasonable precautions to ensure no viruses are
>>>>>present in emails, it cannot accept responsibility for any loss or
>>>>>damage arising from the use of the email or attachments. Any views
>>>>>expressed in this message are those of the individual sender, except
>>>>>where the sender specifically states them to be the views of
>>>>>KEMRI-Wellcome Trust Programme.
>>>>>______________________________________________________________________
>>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>

