Williams Scott
2014-Jul-16 13:07 UTC
how to subset based on other row values and multiplicity
Hi R experts,

I have a dataset as sampled below. Values are only regarded as 'confirmed' in an individual ('id') if they occur more than once at least 30 days apart.

id  date        value
a   2000-01-01  x
a   2000-03-01  x
b   2000-11-11  w
c   2000-11-11  y
c   2000-10-01  y
c   2000-09-10  y
c   2000-12-12  z
c   2000-10-11  z
d   2000-11-11  w
d   2000-11-10  w

I wish to subset the data to retain rows where the value for the individual is confirmed more than 30 days apart. So, after deleting all rows with just one occurrence of id and value, the rest would be the earliest occurrence of each value in each case id, provided 31 or more days exist between the dates. If >1 value is present per id, each value level needs to be assessed independently.

This example would then reduce to:

id  date        value
a   2000-01-01  x
c   2000-09-10  y
c   2000-10-11  z

I can do this via some crude loops and subsetting, but I am looking for as much efficiency as possible as the dataset has around 50 million rows to assess.

Any suggestions welcomed.

Thanks in advance

Scott Williams MD
Melbourne, Australia
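
For reference, the rule described above can be sketched in base R without explicit loops. This is only one possible reading, not a tested solution for the full dataset. The names `dat`, `min_gap`, `grp` and `res` are made up for illustration. The post gives the threshold both as "at least 30 days" and as "31 or more days"; the sketch assumes 31, and either choice gives the same result on the sample. It relies on the fact that a value has two occurrences at least `min_gap` days apart exactly when its earliest and latest dates are that far apart.

```r
# Sample data from the post (dat is an illustrative name).
dat <- data.frame(
  id    = c("a", "a", "b", "c", "c", "c", "c", "c", "d", "d"),
  date  = as.Date(c("2000-01-01", "2000-03-01", "2000-11-11", "2000-11-11",
                    "2000-10-01", "2000-09-10", "2000-12-12", "2000-10-11",
                    "2000-11-11", "2000-11-10")),
  value = c("x", "x", "w", "y", "y", "y", "z", "z", "w", "w"),
  stringsAsFactors = FALSE
)

# Assumed threshold in days; change to 30 if "at least 30 days" is intended.
min_gap <- 31

# Work on day counts so that min/max stay plain numbers.
days <- as.numeric(dat$date)

# One group per id/value combination that actually occurs in the data.
grp <- interaction(dat$id, dat$value, drop = TRUE)

# Earliest and latest day within each group, repeated on every row of the group.
first_day <- ave(days, grp, FUN = min)
last_day  <- ave(days, grp, FUN = max)

# Keep the earliest row of each group whose date span reaches the threshold.
keep <- (last_day - first_day >= min_gap) & (days == first_day)
res  <- dat[keep, ]

# If the earliest date is repeated within a group, keep only one copy.
res <- res[!duplicated(res[c("id", "value")]), ]
res <- res[order(res$id, res$value), ]
rownames(res) <- NULL
print(res)
```

Single-occurrence groups drop out automatically because their span is zero days. On 50 million rows the same min/max-per-group logic may run faster with a grouped-aggregation package such as data.table, though that is untested here.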