I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date). Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS. The diagnoses may change between appointments. I want to subset the data in two ways: - define groups of patients by the first diagnosis given - define groups of patients by the last diagnosis given. The problem: Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID and (lowest or highest) DATE. The size of the dataset precludes the option of doing this by eye. I suspect there is a very elegant way of doing this in R. This is what I've come up with: - Sort by DATE then ID - Make a ragged array of DATE by ID - Remove IDs that only occur once. - Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date). - (Then do the same to get the 'last appointment' duplicates, by reversing the initial sort by DATE.) I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array. I hope someone can help. My ugly code is below, with some data for testing. Stuart Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer and Honorary Consultant Pychiatrist Institute of Mental Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 115 82 30419 stuart.leask@nottingham.ac.uk<mailto:stuart.leask@nottingham.ac.uk> Google 'Dr Stuart Leask' ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 ,20091224,20050503,19870508,19880223,19880330) id.d <- cbind (ID,DATE ) rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME # Inelegant attempt to remove IDs that only have one entry: rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000: rag.t <- rag.s [ rag.s > 21000000 ] multi.dates <- rownames ( rag.t ) # all the IDs with >1 date rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date # But now I'm stuck. # Each row of the array is rag.am$ID. # So I can't pick columns of DATEs from the ragged array. This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it. Please do not use, copy or disclose the information contained in this message or in any attachment. Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. [[alternative HTML version deleted]]
Hello, I'm not sure I understand it well, in the solution below the only returned value is ID == 814 but it's not the first nor the last DATE. how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) id.d[how.many > 1, ] See the help page for ?ave if the repetition of id.d[,1] is confusing. The first is the vector to average (to apply FUN to) and the second is one of thw two vectors defining the groups. Hope this helps, Rui Barradas Em 23-10-2012 10:37, Stuart Leask escreveu:> I have a large dataset (~1 million rows) of three variables: ID (patient's name), DATE (of appointment) and DIAGNOSIS (given on that date). > Patients may have been assigned more than one diagnosis at any one appointment - leading to two rows, same ID and DATE but different DIAGNOSIS. > The diagnoses may change between appointments. > > I want to subset the data in two ways: > > - define groups of patients by the first diagnosis given > > - define groups of patients by the last diagnosis given. > > The problem: > Unfortunately, a small number of patients have been given more than one diagnosis at their first (or last) appointment. These individuals I need to identify and remove, as it's not possible to say uniquely what their first (or last) diagnosis was. So I need to identify and remove these individuals which have pairs of rows with the same ID and (lowest or highest) DATE. The size of the dataset precludes the option of doing this by eye. > > I suspect there is a very elegant way of doing this in R. > > This is what I've come up with: > > > - Sort by DATE then ID > > - Make a ragged array of DATE by ID > > - Remove IDs that only occur once. > > - Subtract the first and second DATEs. Remove IDs for which this = zero, as this will only be true for IDs for which the appointment is recorded twice (because there were two diagnoses recorded on this date). > > - (Then do the same to get the 'last appointment' duplicates, by reversing the initial sort by DATE.) > > I am stuck at the 'Subtract dates' step: I would like to get the data out of the ragged array by columns (so e.g. I end up with a matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from the ragged array. > > I hope someone can help. My ugly code is below, with some data for testing. > > > Stuart > > > Dr Stuart John Leask DM FRCPsych MB BChir MA > Clinical Senior Lecturer and Honorary Consultant Pychiatrist > Institute of Mental Health, Innovation Park > Triumph Road, Nottingham, Notts. NG7 2TU. UK > Tel. +44 115 82 30419 stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> > Google 'Dr Stuart Leask' > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > ,1019) > > DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 > ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 > ,20091224,20050503,19870508,19880223,19880330) > > id.d <- cbind (ID,DATE ) > rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME > > # Inelegant attempt to remove IDs that only have one entry: > > rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row > # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000: > rag.t <- rag.s [ rag.s > 21000000 ] > multi.dates <- rownames ( rag.t ) # all the IDs with >1 date > rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date > > > # But now I'm stuck. > # Each row of the array is rag.am$ID. > # So I can't pick columns of DATEs from the ragged array. > > This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it. Please do not use, copy or disclose the information contained in this message or in any attachment. Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. > > This message has been checked for viruses but the contents of an attachment > may still contain software viruses which could damage your computer system: > you are advised to perform your own checks. Email communications with the > University of Nottingham may be monitored as permitted by UK legislation. > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Hi I did not check your code and rather followed your explanation. BTW, thanks for test data. small change in data frame to make DATE as Date class datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- data.frame(ID,datum ) ordering by date id.d<-id.d[order(id.d$datum),] two functions to test if first two dates are the same or last two dates are the same testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) x[length(x),2]==x[length(x)-1,2] change one last date in the data frame to be the same as previous id.d[35,2]<-id.d[36,2] and here are results sapply(split(id.d, id.d$ID), testlast) 58 167 323 547 794 814 841 910 999 1019 FALSE FALSE FALSE NA NA FALSE FALSE TRUE NA FALSE> sapply(split(id.d, id.d$ID), testfirst)58 167 323 547 794 814 841 910 999 1019 FALSE FALSE FALSE NA NA FALSE FALSE FALSE NA FALSE Now you can select ID which is true and remove it from your data which(sapply(split(id.d, id.d$ID), testlast)) and use it for your data frame to subset/remove id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast)))) [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE [37] TRUE TRUE TRUE TRUE However I am not sure if this is exactly what you want. Regards Petr> -----Original Message----- > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > project.org] On Behalf Of Stuart Leask > Sent: Tuesday, October 23, 2012 11:38 AM > To: r-help at r-project.org > Subject: [R] [r] How to pick colums from a ragged array? > > I have a large dataset (~1 million rows) of three variables: ID > (patient's name), DATE (of appointment) and DIAGNOSIS (given on that > date). > Patients may have been assigned more than one diagnosis at any one > appointment - leading to two rows, same ID and DATE but different > DIAGNOSIS. > The diagnoses may change between appointments. > > I want to subset the data in two ways: > > - define groups of patients by the first diagnosis given > > - define groups of patients by the last diagnosis given. > > The problem: > Unfortunately, a small number of patients have been given more than one > diagnosis at their first (or last) appointment. These individuals I > need to identify and remove, as it's not possible to say uniquely what > their first (or last) diagnosis was. So I need to identify and remove > these individuals which have pairs of rows with the same ID and (lowest > or highest) DATE. The size of the dataset precludes the option of doing > this by eye. > > I suspect there is a very elegant way of doing this in R. > > This is what I've come up with: > > > - Sort by DATE then ID > > - Make a ragged array of DATE by ID > > - Remove IDs that only occur once. > > - Subtract the first and second DATEs. Remove IDs for which > this = zero, as this will only be true for IDs for which the > appointment is recorded twice (because there were two diagnoses > recorded on this date). > > - (Then do the same to get the 'last appointment' duplicates, > by reversing the initial sort by DATE.) > > I am stuck at the 'Subtract dates' step: I would like to get the data > out of the ragged array by columns (so e.g. I end up with a matrix of > ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from > the ragged array. > > I hope someone can help. My ugly code is below, with some data for > testing. > > > Stuart > > > Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer > and Honorary Consultant Pychiatrist Institute of Mental Health, > Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 > 115 82 30419 > stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> > Google 'Dr Stuart Leask' > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > ,1019) > > DATE <- > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 > ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 > ,20091224,20050503,19870508,19880223,19880330) > > id.d <- cbind (ID,DATE ) > rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create > ragged array, 1-n DATES for every NAME > > # Inelegant attempt to remove IDs that only have one entry: > > rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the > dates per row > # Since DATE is in 'year mo da', if there's only one date, sum will be > less than 2100000: > rag.t <- rag.s [ rag.s > 21000000 ] > multi.dates <- rownames ( rag.t ) # all the IDs > with >1 date > rag.am <- rag.a [ multi.dates ] # rag.am only > has IDs with > 1 Date > > > # But now I'm stuck. > # Each row of the array is rag.am$ID. > # So I can't pick columns of DATEs from the ragged array. > > This message and any attachment are intended solely for the addressee > and may contain confidential information. If you have received this > message in error, please send it back to me, and immediately delete it. > Please do not use, copy or disclose the information contained in this > message or in any attachment. Any views or opinions expressed by the > author of this email do not necessarily reflect the views of the > University of Nottingham. > > This message has been checked for viruses but the contents of an > attachment may still contain software viruses which could damage your > computer system: > you are advised to perform your own checks. Email communications with > the University of Nottingham may be monitored as permitted by UK > legislation. > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting- > guide.html > and provide commented, minimal, self-contained, reproducible code.
Hi, tapply(new1[,1],new1[,2],head,1) # works. I used: id.d<-data.frame(ID,DATE) #In that case, tapply(new1$ID,new1$DATE,head,1) #works On closer look, I think you don't want "814" id.? Not sure about the logic behind that. A.K. ----- Original Message ----- From: Stuart Leask <Stuart.Leask at nottingham.ac.uk> To: arun <smartpink111 at yahoo.com> Cc: Petr PIKAL <petr.pikal at precheza.cz> Sent: Tuesday, October 23, 2012 9:15 AM Subject: RE: [R] [r] How to pick colums from a ragged array? Sorry Arun, but when I run it I get an error:> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323+ ,547,794,814,814,814,814,814,814,841,841,841,841,841 + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 + ,1019)> > DATE <-+? c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 +? ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 +? ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 +? ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 +? ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 +? ,20091224,20050503,19870508,19870508,19880330)> >? id.d <- cbind (ID,DATE ) > new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),] > > > tapply(new1$ID,new1$DATE,head,1)Error in new1$DATE : $ operator is invalid for atomic vectors -----Original Message----- From: arun [mailto:smartpink111 at yahoo.com] Sent: 23 October 2012 14:05 To: Stuart Leask Cc: R help; Petr PIKAL Subject: Re: [R] [r] How to pick colums from a ragged array? HI, I was not following the thread. May be this is what you are looking for: new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),] tapply(new1$ID,new1$DATE,head,1) #19870508 20040205 20040429 20050421 ? #? 1019????? 167????? 814????? 841 A.K. ----- Original Message ----- From: Stuart Leask <Stuart.Leask at nottingham.ac.uk> To: PIKAL Petr <petr.pikal at precheza.cz>; "r-help at r-project.org" <r-help at r-project.org> Cc: Sent: Tuesday, October 23, 2012 8:28 AM Subject: Re: [R] [r] How to pick colums from a ragged array? Hi there. Not sure I follow what you are doing. I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID. I have refined my test dataset, to include some tests (e.g. 910 has the same dup as 1019, but for 910 it's not the earliest date): ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 ,20091224,20050503,19870508,19870508,19880330) Correct output: "167"? "841"? "1019" Stuart -----Original Message----- From: PIKAL Petr [mailto:petr.pikal at precheza.cz] Sent: 23 October 2012 13:15 To: Stuart Leask; r-help at r-project.org Subject: RE: [r] How to pick colums from a ragged array? Hi Rui's answer brought me to more elaborated solution which still needs data frame to be ordered by date fff<-function(data, first=TRUE, remove=FALSE) { testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) x[length(x),2]==x[length(x)-1,2] if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]), testfirst)))) else sel <- as.numeric(names(which(sapply(split(data, data[,1]), testlast)))) if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }> fff(id.d)? ? ID? ?? DATE 31 910 20091105 32 910 20091105 33 910 20091117 34 910 20091119 35 910 20091120 36 910 20091210 37 910 20091224 38 910 20091224> fff(id.d, remove=T)? ?? ID? ?? DATE 1? ? 58 20060821 2? ? 58 20061207 3? ? 58 20080102 4? ? 58 20090904 5?? 167 20040205 6?? 167 20040323 7?? 323 20051111 8?? 323 20060111 9?? 323 20071119 10? 323 20080107 11? 323 20080407 12? 323 20080521 13? 323 20080711 14? 547 20041005 15? 794 20070905 16? 814 20020814 17? 814 20021125 18? 814 20040429 19? 814 20040429 20? 814 20071205 21? 814 20080227 22? 841 20050421 23? 841 20060130 24? 841 20060428 25? 841 20060602 26? 841 20060816 27? 841 20061025 28? 841 20061129 29? 841 20070112 30? 841 20070514 39? 999 20050503 40 1019 19870508 41 1019 19880223 42 1019 19880330 43 1019 19880330>Regards Petr> -----Original Message----- > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > project.org] On Behalf Of PIKAL Petr > Sent: Tuesday, October 23, 2012 1:49 PM > To: Stuart Leask; r-help at r-project.org > Subject: Re: [R] [r] How to pick colums from a ragged array? > > Hi > > I did not check your code and rather followed your explanation. BTW, > thanks for test data. > > small change in data frame to make DATE as Date class > > datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- > data.frame(ID,datum ) > > ordering by date > > id.d<-id.d[order(id.d$datum),] > > > two functions to test if first two dates are the same or last two > dates are the same > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > x[length(x),2]==x[length(x)-1,2] > > change one last date in the data frame to be the same as previous > > id.d[35,2]<-id.d[36,2] > > and here are results > > sapply(split(id.d, id.d$ID), testlast) >? ? 58?? 167?? 323?? 547?? 794?? 814?? 841?? 910?? 999? 1019? FALSE >FALSE FALSE? ? NA? ? NA FALSE FALSE? TRUE? ? NA FALSE > > > sapply(split(id.d, id.d$ID), testfirst) >? ? 58?? 167?? 323?? 547?? 794?? 814?? 841?? 910?? 999? 1019? FALSE >FALSE FALSE? ? NA? ? NA FALSE FALSE FALSE? ? NA FALSE > > Now you can select ID which is true and remove it from your data > which(sapply(split(id.d, id.d$ID), testlast)) > > and use it for your data frame to subset/remove id.d$ID == > as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))? [1] > FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > FALSE TRUE? TRUE [37]? TRUE? TRUE? TRUE? TRUE > > However I am not sure if this is exactly what you want. > > Regards > Petr > > > -----Original Message----- > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > > project.org] On Behalf Of Stuart Leask > > Sent: Tuesday, October 23, 2012 11:38 AM > > To: r-help at r-project.org > > Subject: [R] [r] How to pick colums from a ragged array? > > > > I have a large dataset (~1 million rows) of three variables: ID > > (patient's name), DATE (of appointment) and DIAGNOSIS (given on that > > date). > > Patients may have been assigned more than one diagnosis at any one > > appointment - leading to two rows, same ID and DATE but different > > DIAGNOSIS. > > The diagnoses may change between appointments. > > > > I want to subset the data in two ways: > > > > -? ? ? ? ? define groups of patients by the first diagnosis given > > > > -? ? ? ? ? define groups of patients by the last diagnosis given. > > > > The problem: > > Unfortunately, a small number of patients have been given more than > > one diagnosis at their first (or last) appointment. These > > individuals I need to identify and remove, as it's not possible to > > say uniquely what their first (or last) diagnosis was. So I need to > > identify and remove these individuals which have pairs of rows with > > the same ID > and > > (lowest or highest) DATE. The size of the dataset precludes the > option > > of doing this by eye. > > > > I suspect there is a very elegant way of doing this in R. > > > > This is what I've come up with: > > > > > > -? ? ? ? ? Sort by DATE then ID > > > > -? ? ? ? ? Make a ragged array of DATE by ID > > > > -? ? ? ? ? Remove IDs that only occur once. > > > > -? ? ? ? ? Subtract the first and second DATEs. Remove IDs for which > > this = zero, as this will only be true for IDs for which the > > appointment is recorded twice (because there were two diagnoses > > recorded on this date). > > > > -? ? ? ? ? (Then do the same to get the 'last appointment' > duplicates, > > by reversing the initial sort by DATE.) > > > > I am stuck at the 'Subtract dates' step: I would like to get the > > data out of the ragged array by columns (so e.g. I end up with a > > matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by > > column from the ragged array. > > > > I hope someone can help. My ugly code is below, with some data for > > testing. > > > > > > Stuart > > > > > > Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior > > Lecturer and Honorary Consultant Pychiatrist Institute of Mental > > Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK > > Tel. +44 > > 115 82 30419 > > stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> > > Google 'Dr Stuart Leask' > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > ,1019) > > > > DATE <- > > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 > > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 > > ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 > > ,20091224,20050503,19870508,19880223,19880330) > > > > id.d <- cbind (ID,DATE ) > > rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ?? # create > > ragged array, 1-n DATES for every NAME > > > > # Inelegant attempt to remove IDs that only have one entry: > > > > rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum)? ? ? ? ? ?? #add up > > the dates per row # Since DATE is in 'year mo da', if there's only > > one date, sum will > be > > less than 2100000: > > rag.t <- rag.s [ rag.s > 21000000 ] > > multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ?? # all the > IDs > > with >1 date > > rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ?? # rag.am > only > > has IDs with > 1 Date > > > > > > # But now I'm stuck. > > # Each row of the array is rag.am$ID. > > # So I can't pick columns of DATEs from the ragged array. > > > > This message and any attachment are intended solely for the > > addressee and may contain confidential information. If you have > > received this message in error, please send it back to me, and > > immediately delete > it. > > Please do not use, copy or disclose the information contained in > > this message or in any attachment.? Any views or opinions expressed > > by the author of this email do not necessarily reflect the views of > > the University of Nottingham. > > > > This message has been checked for viruses but the contents of an > > attachment may still contain software viruses which could damage > > your computer system: > > you are advised to perform your own checks. Email communications > > with the University of Nottingham may be monitored as permitted by > > UK legislation. > > ??? [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help at r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting- > > guide.html and provide commented, minimal, self-contained, > > reproducible code. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting- > guide.html and provide commented, minimal, self-contained, > reproducible code.______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it.? Please do not use, copy or disclose the information contained in this message or in any attachment.? Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation.
Hi, Also one more thing: This should get the dates which are duplicated.? In my first reply, I was looking for the duplicated rows. Sorry for that! id.d<-data.frame(ID,DATE) new1<-id.d[duplicated(id.d$DATE)|duplicated(id.d$DATE,fromLast=TRUE),] new2<-new1[order(new1$ID,new1$DATE),] ?tapply(new2$ID,new2$DATE,head,1) #19870508 20040205 20040429 20050421 ? # ? 910????? 167????? 814????? 841 But, still the result is not that you wanted, because 910's date is the earliest date when compared to 1019. new1[order(new1$ID,new1$DATE),] #???? ID???? DATE #5?? 167 20040205 #6?? 167 20040205 #18? 814 20040429 #19? 814 20040429 #22? 841 20050421 #23? 841 20050421 #31? 910 19870508 #32? 910 20040205 #33? 910 20040205 #38 1019 19870508 #39 1019 19870508 A.K. ----- Original Message ----- From: Stuart Leask <Stuart.Leask at nottingham.ac.uk> To: arun <smartpink111 at yahoo.com> Cc: Petr PIKAL <petr.pikal at precheza.cz> Sent: Tuesday, October 23, 2012 9:15 AM Subject: RE: [R] [r] How to pick colums from a ragged array? Sorry Arun, but when I run it I get an error:> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323+ ,547,794,814,814,814,814,814,814,841,841,841,841,841 + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 + ,1019)> > DATE <-+? c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 +? ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 +? ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 +? ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 +? ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 +? ,20091224,20050503,19870508,19870508,19880330)> >? id.d <- cbind (ID,DATE ) > new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),] > > > tapply(new1$ID,new1$DATE,head,1)Error in new1$DATE : $ operator is invalid for atomic vectors -----Original Message----- From: arun [mailto:smartpink111 at yahoo.com] Sent: 23 October 2012 14:05 To: Stuart Leask Cc: R help; Petr PIKAL Subject: Re: [R] [r] How to pick colums from a ragged array? HI, I was not following the thread. May be this is what you are looking for: new1<-id.d[duplicated(id.d)|duplicated(id.d,fromLast=TRUE),] tapply(new1$ID,new1$DATE,head,1) #19870508 20040205 20040429 20050421 ? #? 1019????? 167????? 814????? 841 A.K. ----- Original Message ----- From: Stuart Leask <Stuart.Leask at nottingham.ac.uk> To: PIKAL Petr <petr.pikal at precheza.cz>; "r-help at r-project.org" <r-help at r-project.org> Cc: Sent: Tuesday, October 23, 2012 8:28 AM Subject: Re: [R] [r] How to pick colums from a ragged array? Hi there. Not sure I follow what you are doing. I want a list of all the IDs that have duplicate DATE entries, only when the DATE is the earliest (or last) date for that ID. I have refined my test dataset, to include some tests (e.g. 910 has the same dup as 1019, but for 910 it's not the earliest date): ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 ,20091224,20050503,19870508,19870508,19880330) Correct output: "167"? "841"? "1019" Stuart -----Original Message----- From: PIKAL Petr [mailto:petr.pikal at precheza.cz] Sent: 23 October 2012 13:15 To: Stuart Leask; r-help at r-project.org Subject: RE: [r] How to pick colums from a ragged array? Hi Rui's answer brought me to more elaborated solution which still needs data frame to be ordered by date fff<-function(data, first=TRUE, remove=FALSE) { testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) x[length(x),2]==x[length(x)-1,2] if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]), testfirst)))) else sel <- as.numeric(names(which(sapply(split(data, data[,1]), testlast)))) if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }> fff(id.d)? ? ID? ?? DATE 31 910 20091105 32 910 20091105 33 910 20091117 34 910 20091119 35 910 20091120 36 910 20091210 37 910 20091224 38 910 20091224> fff(id.d, remove=T)? ?? ID? ?? DATE 1? ? 58 20060821 2? ? 58 20061207 3? ? 58 20080102 4? ? 58 20090904 5?? 167 20040205 6?? 167 20040323 7?? 323 20051111 8?? 323 20060111 9?? 323 20071119 10? 323 20080107 11? 323 20080407 12? 323 20080521 13? 323 20080711 14? 547 20041005 15? 794 20070905 16? 814 20020814 17? 814 20021125 18? 814 20040429 19? 814 20040429 20? 814 20071205 21? 814 20080227 22? 841 20050421 23? 841 20060130 24? 841 20060428 25? 841 20060602 26? 841 20060816 27? 841 20061025 28? 841 20061129 29? 841 20070112 30? 841 20070514 39? 999 20050503 40 1019 19870508 41 1019 19880223 42 1019 19880330 43 1019 19880330>Regards Petr> -----Original Message----- > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > project.org] On Behalf Of PIKAL Petr > Sent: Tuesday, October 23, 2012 1:49 PM > To: Stuart Leask; r-help at r-project.org > Subject: Re: [R] [r] How to pick colums from a ragged array? > > Hi > > I did not check your code and rather followed your explanation. BTW, > thanks for test data. > > small change in data frame to make DATE as Date class > > datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- > data.frame(ID,datum ) > > ordering by date > > id.d<-id.d[order(id.d$datum),] > > > two functions to test if first two dates are the same or last two > dates are the same > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > x[length(x),2]==x[length(x)-1,2] > > change one last date in the data frame to be the same as previous > > id.d[35,2]<-id.d[36,2] > > and here are results > > sapply(split(id.d, id.d$ID), testlast) >? ? 58?? 167?? 323?? 547?? 794?? 814?? 841?? 910?? 999? 1019? FALSE >FALSE FALSE? ? NA? ? NA FALSE FALSE? TRUE? ? NA FALSE > > > sapply(split(id.d, id.d$ID), testfirst) >? ? 58?? 167?? 323?? 547?? 794?? 814?? 841?? 910?? 999? 1019? FALSE >FALSE FALSE? ? NA? ? NA FALSE FALSE FALSE? ? NA FALSE > > Now you can select ID which is true and remove it from your data > which(sapply(split(id.d, id.d$ID), testlast)) > > and use it for your data frame to subset/remove id.d$ID == > as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))? [1] > FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > FALSE TRUE? TRUE [37]? TRUE? TRUE? TRUE? TRUE > > However I am not sure if this is exactly what you want. > > Regards > Petr > > > -----Original Message----- > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > > project.org] On Behalf Of Stuart Leask > > Sent: Tuesday, October 23, 2012 11:38 AM > > To: r-help at r-project.org > > Subject: [R] [r] How to pick colums from a ragged array? > > > > I have a large dataset (~1 million rows) of three variables: ID > > (patient's name), DATE (of appointment) and DIAGNOSIS (given on that > > date). > > Patients may have been assigned more than one diagnosis at any one > > appointment - leading to two rows, same ID and DATE but different > > DIAGNOSIS. > > The diagnoses may change between appointments. > > > > I want to subset the data in two ways: > > > > -? ? ? ? ? define groups of patients by the first diagnosis given > > > > -? ? ? ? ? define groups of patients by the last diagnosis given. > > > > The problem: > > Unfortunately, a small number of patients have been given more than > > one diagnosis at their first (or last) appointment. These > > individuals I need to identify and remove, as it's not possible to > > say uniquely what their first (or last) diagnosis was. So I need to > > identify and remove these individuals which have pairs of rows with > > the same ID > and > > (lowest or highest) DATE. The size of the dataset precludes the > option > > of doing this by eye. > > > > I suspect there is a very elegant way of doing this in R. > > > > This is what I've come up with: > > > > > > -? ? ? ? ? Sort by DATE then ID > > > > -? ? ? ? ? Make a ragged array of DATE by ID > > > > -? ? ? ? ? Remove IDs that only occur once. > > > > -? ? ? ? ? Subtract the first and second DATEs. Remove IDs for which > > this = zero, as this will only be true for IDs for which the > > appointment is recorded twice (because there were two diagnoses > > recorded on this date). > > > > -? ? ? ? ? (Then do the same to get the 'last appointment' > duplicates, > > by reversing the initial sort by DATE.) > > > > I am stuck at the 'Subtract dates' step: I would like to get the > > data out of the ragged array by columns (so e.g. I end up with a > > matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by > > column from the ragged array. > > > > I hope someone can help. My ugly code is below, with some data for > > testing. > > > > > > Stuart > > > > > > Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior > > Lecturer and Honorary Consultant Pychiatrist Institute of Mental > > Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK > > Tel. +44 > > 115 82 30419 > > stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk> > > Google 'Dr Stuart Leask' > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > ,1019) > > > > DATE <- > > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 > > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 > > ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 > > ,20091224,20050503,19870508,19880223,19880330) > > > > id.d <- cbind (ID,DATE ) > > rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ?? # create > > ragged array, 1-n DATES for every NAME > > > > # Inelegant attempt to remove IDs that only have one entry: > > > > rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum)? ? ? ? ? ?? #add up > > the dates per row # Since DATE is in 'year mo da', if there's only > > one date, sum will > be > > less than 2100000: > > rag.t <- rag.s [ rag.s > 21000000 ] > > multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ?? # all the > IDs > > with >1 date > > rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ?? # rag.am > only > > has IDs with > 1 Date > > > > > > # But now I'm stuck. > > # Each row of the array is rag.am$ID. > > # So I can't pick columns of DATEs from the ragged array. > > > > This message and any attachment are intended solely for the > > addressee and may contain confidential information. If you have > > received this message in error, please send it back to me, and > > immediately delete > it. > > Please do not use, copy or disclose the information contained in > > this message or in any attachment.? Any views or opinions expressed > > by the author of this email do not necessarily reflect the views of > > the University of Nottingham. > > > > This message has been checked for viruses but the contents of an > > attachment may still contain software viruses which could damage > > your computer system: > > you are advised to perform your own checks. Email communications > > with the University of Nottingham may be monitored as permitted by > > UK legislation. > > ??? [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help at r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting- > > guide.html and provide commented, minimal, self-contained, > > reproducible code. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting- > guide.html and provide commented, minimal, self-contained, > reproducible code.______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. This message and any attachment are intended solely for the addressee and may contain confidential information. If you have received this message in error, please send it back to me, and immediately delete it.? Please do not use, copy or disclose the information contained in this message or in any attachment.? Any views or opinions expressed by the author of this email do not necessarily reflect the views of the University of Nottingham. This message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation.
HI Stuart, Just a doubt: When I run Rui's function: I got results as: g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),] #???? ID???? DATE DG #22? 841 20050421? 1 #23? 841 20050421? 2 #38 1019 19870508? 2 #39 1019 19870508? 1 But, when I look at the id.d 16? 814 20020814? 2 17? 814 20021125? 2 18? 814 20040429? 2 19? 814 20040429? 2 20? 814 20071205? 2 21? 814 20071205? 2 ------------------------- ?? 323 20051111? 3 8?? 323 20060111? 2 9?? 323 20071119? 3 10? 323 20080107? 2 11? 323 20080407? 1 12? 323 20080521? 2 13? 323 20080521? 3 According to the? function I wrote earlier, I get res1 ????? flag 58?? FALSE 167?? TRUE 323?? TRUE 547? FALSE 794? FALSE 814?? TRUE 841?? TRUE 910? FALSE 999? FALSE 1019? TRUE If I understand correctly, you wanted to eliminate 814 and 167 because the DGs are the same.? That can be done.? I also want to make it clear that why 323 is not included.? It is duplicated in the last visit. Now, to eliminate the ones with duplicate DGs: res1<- data.frame(flag=tapply(id.d[,2],id.d[,1],FUN=function(x) head(duplicated(x)|duplicated(x,fromLast=TRUE),1)|tail(duplicated(x)|duplicated(x,fromLast=TRUE),1))) res2<-id.d[id.d[,1]%in%names(res1[res1$flag==TRUE,])&(duplicated(id.d[,1:2])|duplicated(id.d[,1:2],fromLast=TRUE)),] res2[!res2$ID%in% res2[duplicated(res2)|duplicated(res2,fromLast=TRUE),]$ID,] #???? ID???? DATE DG #12? 323 20080521? 2 #13? 323 20080521? 3 #22? 841 20050421? 1 #23? 841 20050421? 2 #38 1019 19870508? 2 #39 1019 19870508? 1 A.K. ? ? ----- Original Message ----- From: Stuart Leask <Stuart.Leask at nottingham.ac.uk> To: "arun (smartpink111 at yahoo.com)" <smartpink111 at yahoo.com>; PIKAL Petr <petr.pikal at precheza.cz>; "Rui Barradas (ruipbarradas at sapo.pt)" <ruipbarradas at sapo.pt> Cc: Sent: Wednesday, October 24, 2012 11:41 AM Subject: RE: [r] How to pick colums from a ragged array? (And, considering? the real application, the functions ideally should probably output a variable INCLUDE, the same length as the original data, with TRUE and FALSE for whether or not that row should be included...) -----Original Message----- From: Leask Stuart Sent: 24 October 2012 16:25 To: arun (smartpink111 at yahoo.com); 'PIKAL Petr'; Rui Barradas (ruipbarradas at sapo.pt) Subject: RE: [r] How to pick colums from a ragged array? Arun, Petr, Rui, many thanks for your help, and the functions you have written. You'll recall I wanted to remove these first (or last) duplicates, because they represented instances where two different diagnoses (in this case, variable DG, value 1, 2, 3, 4 or 5) had been recorded on the same day - so I can't say which was 'first' (or 'last'). Your functions have revealed something I wasn't expecting: In some cases, the diagnoses recorded on the duplicated DATEs are the same! This is a surprise to me, but probably reflects someone going to two different departments in a clinic, and both departments submit data. I have to say that crazy things like this are often a feature of real data, which I'm sure you've come across yourselves. Of course, I don't want to remove records in which I can determine an unambiguous 'first diagnosis'. You have all put in so much effort on my behalf, I'm ashamed to ask, but I wonder if any of the functions you've written could do this with a little more Indexing and the 'duplicate' function So the function should only exclude an ID, having identified a first (or last) DATE duplicate, the DGs for these two dates are different. Test dataset: ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 ,20060111,20071119,20080107,20080407,20080521,20080521,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20071205 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521 ,20091224,20050503,19870508,19870508,19880330) DG<- c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3,2,2,2,1,1) id.d<-data.frame(ID,DATE,DG) id.d # Considering Ruis? getRepeat function: g.r<-getRepeat(id.d)? ? # defaults to first = TRUE getRepeat(id.d, first = FALSE)? to get the last ones g.rr<-do.call(rbind, g.r) # put the data into a matrix # I can remove the date duplicates with: g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),] I'm not sure how to add this to your suggestions, Arun & Petr... Stuart -----Original Message----- From: PIKAL Petr [mailto:petr.pikal at precheza.cz] Sent: 23 October 2012 15:24 To: Stuart Leask Subject: RE: [r] How to pick colums from a ragged array? Hi I assumed that id.d is data frame id.d <- data.frame (ID,DATE ) and fff(id.d) works for me Petr> -----Original Message----- > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk] > Sent: Tuesday, October 23, 2012 3:13 PM > To: PIKAL Petr > Subject: RE: [r] How to pick colums from a ragged array? > > Hi Petr. > I see what you mean it should do, but when I run it I get an error > (see below). > Stuart > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > + ,547,794,814,814,814,814,814,814,841,841,841,841,841 > + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > + ,1019) > > > > DATE <- > +? c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > +? ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > +? ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > +? ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > +? ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 > +? ,20091224,20050503,19870508,19870508,19880330) > > > >? id.d <- cbind (ID,DATE ) > > fff<-function(data, first=TRUE, remove=FALSE) { > + > + testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > + x[nrow(x),2]==x[nrow(x)-1,2] > + > + if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, > + data[,1]), testfirst))))) else sel <- > + as.numeric(names(which(unlist(sapply(split(data, data[,1]), > + testlast))))) > + > + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,] > + } > > > > fff(id.d) > Error in x[1, 2] : incorrect number of dimensions > > > > -----Original Message----- > From: PIKAL Petr [mailto:petr.pikal at precheza.cz] > Sent: 23 October 2012 13:51 > To: Stuart Leask; r-help at r-project.org > Subject: RE: [r] How to pick colums from a ragged array? > > Hi > > > -----Original Message----- > > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk] > > Sent: Tuesday, October 23, 2012 2:29 PM > > To: PIKAL Petr; r-help at r-project.org > > Subject: RE: [r] How to pick colums from a ragged array? > > > > Hi there. > > > > Not sure I follow what you are doing. > > > > I want a list of all the IDs that have duplicate DATE entries, only > > when the DATE is the earliest (or last) date for that ID. > > And that is what the function (with 3 small modifications) does > > > fff<-function(data, first=TRUE, remove=FALSE) { > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > x[nrow(x),2]==x[nrow(x)-1,2] > > if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, > data[,1]), testfirst))))) else sel <- > as.numeric(names(which(unlist(sapply(split(data, data[,1]), > testlast))))) > > if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,] } > > See the result of your refined data > > fff(id.d) >? ? ? ID? ? ? DATE > 5? 167 2004-02-05 > 6? 167 2004-02-05 > 22? 841 2005-04-21 > 23? 841 2005-04-21 > 24? 841 2006-04-28 > 25? 841 2006-06-02 > 26? 841 2006-08-16 > 27? 841 2006-10-25 > 28? 841 2006-11-29 > 29? 841 2007-01-12 > 30? 841 2007-05-14 > 38 1019 1987-05-08 > 39 1019 1987-05-08 > 40 1019 1988-03-30 > > fff(id.d, first=F) >? ? ID? ? ? DATE > 5 167 2004-02-05 > 6 167 2004-02-05 > > fff(id.d, remove=T) >? ? ID? ? ? DATE > 1? 58 2006-08-21 > 2? 58 2006-12-07 > 3? 58 2008-01-02 > 4? 58 2009-09-04 > 7? 323 2005-11-11 > 8? 323 2006-01-11 > 9? 323 2007-11-19 > 10 323 2008-01-07 > 11 323 2008-04-07 > 12 323 2008-05-21 > 13 323 2008-07-11 > 14 547 2004-10-05 > 15 794 2007-09-05 > 16 814 2002-08-14 > 17 814 2002-11-25 > 18 814 2004-04-29 > 19 814 2004-04-29 > 20 814 2007-12-05 > 21 814 2008-02-27 > 31 910 1987-05-08 > 32 910 2004-02-05 > 33 910 2004-02-05 > 34 910 2009-11-20 > 35 910 2009-12-10 > 36 910 2009-12-24 > 37 999 2005-05-03 > > > > You can do surgery on fff function to see what result comes from some > piece of the function e.g. > > sapply(split(id.d, id.d[,1]), testlast) > > Regards > Petr > > > > > I have refined my test dataset, to include some tests (e.g. 910 has > > the same dup as 1019, but for 910 it's not the earliest date): > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > ,1019) > > > > DATE <- > >? c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > >? ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > >? ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > >? ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > >? ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 > >? ,20091224,20050503,19870508,19870508,19880330) > > > > Correct output: > > "167"? "841"? "1019" > > > > Stuart > > > > -----Original Message----- > > From: PIKAL Petr [mailto:petr.pikal at precheza.cz] > > Sent: 23 October 2012 13:15 > > To: Stuart Leask; r-help at r-project.org > > Subject: RE: [r] How to pick colums from a ragged array? > > > > Hi > > > > Rui's answer brought me to more elaborated solution which still > > needs data frame to be ordered by date > > > > fff<-function(data, first=TRUE, remove=FALSE) { > > > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > > x[length(x),2]==x[length(x)-1,2] > > > > if(first) sel <- as.numeric(names(which(sapply(split(data, > > data[,1]), > > testfirst)))) else sel <- as.numeric(names(which(sapply(split(data, > > data[,1]), testlast)))) > > > > if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] } > > > > > > > fff(id.d) > >? ? ID? ? DATE > > 31 910 20091105 > > 32 910 20091105 > > 33 910 20091117 > > 34 910 20091119 > > 35 910 20091120 > > 36 910 20091210 > > 37 910 20091224 > > 38 910 20091224 > > > > > fff(id.d, remove=T) > >? ? ? ID? ? DATE > > 1? ? 58 20060821 > > 2? ? 58 20061207 > > 3? ? 58 20080102 > > 4? ? 58 20090904 > > 5? 167 20040205 > > 6? 167 20040323 > > 7? 323 20051111 > > 8? 323 20060111 > > 9? 323 20071119 > > 10? 323 20080107 > > 11? 323 20080407 > > 12? 323 20080521 > > 13? 323 20080711 > > 14? 547 20041005 > > 15? 794 20070905 > > 16? 814 20020814 > > 17? 814 20021125 > > 18? 814 20040429 > > 19? 814 20040429 > > 20? 814 20071205 > > 21? 814 20080227 > > 22? 841 20050421 > > 23? 841 20060130 > > 24? 841 20060428 > > 25? 841 20060602 > > 26? 841 20060816 > > 27? 841 20061025 > > 28? 841 20061129 > > 29? 841 20070112 > > 30? 841 20070514 > > 39? 999 20050503 > > 40 1019 19870508 > > 41 1019 19880223 > > 42 1019 19880330 > > 43 1019 19880330 > > > > > > > Regards > > Petr > > > > > > > -----Original Message----- > > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > > > project.org] On Behalf Of PIKAL Petr > > > Sent: Tuesday, October 23, 2012 1:49 PM > > > To: Stuart Leask; r-help at r-project.org > > > Subject: Re: [R] [r] How to pick colums from a ragged array? > > > > > > Hi > > > > > > I did not check your code and rather followed your explanation. > BTW, > > > thanks for test data. > > > > > > small change in data frame to make DATE as Date class > > > > > > datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- > > > data.frame(ID,datum ) > > > > > > ordering by date > > > > > > id.d<-id.d[order(id.d$datum),] > > > > > > > > > two functions to test if first two dates are the same or last two > > > dates are the same > > > > > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > > > x[length(x),2]==x[length(x)-1,2] > > > > > > change one last date in the data frame to be the same as previous > > > > > > id.d[35,2]<-id.d[36,2] > > > > > > and here are results > > > > > > sapply(split(id.d, id.d$ID), testlast) > > >? ? 58? 167? 323? 547? 794? 814? 841? 910? 999? 1019 > > > FALSE FALSE FALSE? ? NA? ? NA FALSE FALSE? TRUE? ? NA FALSE > > > > > > > sapply(split(id.d, id.d$ID), testfirst) > > >? ? 58? 167? 323? 547? 794? 814? 841? 910? 999? 1019 > > > FALSE FALSE FALSE? ? NA? ? NA FALSE FALSE FALSE? ? NA FALSE > > > > > > Now you can select ID which is true and remove it from your data > > > which(sapply(split(id.d, id.d$ID), testlast)) > > > > > > and use it for your data frame to subset/remove id.d$ID => > > as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast)))) > [1] > > > FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > > FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > FALSE > > > FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > FALSE > > > FALSE TRUE? TRUE [37]? TRUE? TRUE? TRUE? TRUE > > > > > > However I am not sure if this is exactly what you want. > > > > > > Regards > > > Petr > > > > > > > -----Original Message----- > > > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > > > > project.org] On Behalf Of Stuart Leask > > > > Sent: Tuesday, October 23, 2012 11:38 AM > > > > To: r-help at r-project.org > > > > Subject: [R] [r] How to pick colums from a ragged array? > > > > > > > > I have a large dataset (~1 million rows) of three variables: ID > > > > (patient's name), DATE (of appointment) and DIAGNOSIS (given on > > that > > > > date). > > > > Patients may have been assigned more than one diagnosis at any > one > > > > appointment - leading to two rows, same ID and DATE but > > > > different DIAGNOSIS. > > > > The diagnoses may change between appointments. > > > > > > > > I want to subset the data in two ways: > > > > > > > > -? ? ? ? ? define groups of patients by the first diagnosis given > > > > > > > > -? ? ? ? ? define groups of patients by the last diagnosis given. > > > > > > > > The problem: > > > > Unfortunately, a small number of patients have been given more > > > > than one diagnosis at their first (or last) appointment. These > > > > individuals I need to identify and remove, as it's not possible > to > > > > say uniquely what their first (or last) diagnosis was. So I need > > > > to identify and remove these individuals which have pairs of > > > > rows with the same ID > > > and > > > > (lowest or highest) DATE. The size of the dataset precludes the > > > option > > > > of doing this by eye. > > > > > > > > I suspect there is a very elegant way of doing this in R. > > > > > > > > This is what I've come up with: > > > > > > > > > > > > -? ? ? ? ? Sort by DATE then ID > > > > > > > > -? ? ? ? ? Make a ragged array of DATE by ID > > > > > > > > -? ? ? ? ? Remove IDs that only occur once. > > > > > > > > -? ? ? ? ? Subtract the first and second DATEs. Remove IDs for > > which > > > > this = zero, as this will only be true for IDs for which the > > > > appointment is recorded twice (because there were two diagnoses > > > > recorded on this date). > > > > > > > > -? ? ? ? ? (Then do the same to get the 'last appointment' > > > duplicates, > > > > by reversing the initial sort by DATE.) > > > > > > > > I am stuck at the 'Subtract dates' step: I would like to get the > > > > data out of the ragged array by columns (so e.g. I end up with a > > > > matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out > > > > by column from the ragged array. > > > > > > > > I hope someone can help. My ugly code is below, with some data > for > > > > testing. > > > > > > > > > > > > Stuart > > > > > > > > > > > > Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior > > > > Lecturer and Honorary Consultant Pychiatrist Institute of Mental > > > > Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. > > UK > > > > Tel. +44 > > > > 115 82 30419 > > > > > stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk > > > > > > > > > Google 'Dr Stuart Leask' > > > > > > > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > > > ,1019) > > > > > > > > DATE <- > > > > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 > > > > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > > > > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > > > > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 > > > > ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 > > > > ,20091224,20050503,19870508,19880223,19880330) > > > > > > > > id.d <- cbind (ID,DATE ) > > > > rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ? # > create > > > > ragged array, 1-n DATES for every NAME > > > > > > > > # Inelegant attempt to remove IDs that only have one entry: > > > > > > > > rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum)? ? ? ? ? ? #add up > > the > > > > dates per row > > > > # Since DATE is in 'year mo da', if there's only one date, sum > > > > will > > > be > > > > less than 2100000: > > > > rag.t <- rag.s [ rag.s > 21000000 ] > > > > multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ? # all > the > > > IDs > > > > with >1 date > > > > rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ? # > rag.am > > > only > > > > has IDs with > 1 Date > > > > > > > > > > > > # But now I'm stuck. > > > > # Each row of the array is rag.am$ID. > > > > # So I can't pick columns of DATEs from the ragged array. > > > > > > > > This message and any attachment are intended solely for the > > > > addressee and may contain confidential information. If you have > > > > received this message in error, please send it back to me, and > > > > immediately delete > > > it. > > > > Please do not use, copy or disclose the information contained in > > > > this message or in any attachment.? Any views or opinions > > > > expressed by the author of this email do not necessarily reflect > > > > the views of the University of Nottingham. > > > > > > > > This message has been checked for viruses but the contents of an > > > > attachment may still contain software viruses which could damage > > > > your computer system: > > > > you are advised to perform your own checks. Email communications > > > > with the University of Nottingham may be monitored as permitted > by > > > > UK legislation. > > > >? ? ? ? [[alternative HTML version deleted]] > > > > > > > > ______________________________________________ > > > > R-help at r-project.org mailing list > > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > > PLEASE do read the posting guide http://www.R- > project.org/posting- > > > > guide.html and provide commented, minimal, self-contained, > > > > reproducible code. > > > > > > ______________________________________________ > > > R-help at r-project.org mailing list > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > PLEASE do read the posting guide http://www.R-project.org/posting- > > > guide.html and provide commented, minimal, self-contained, > > > reproducible code.
Hi, With the dataset that you provided, and considering that my solution works (ID? 323 included), the below code outputs a variable INCLUDE. res1<- data.frame(flag=tapply(id.d[,2],id.d[,1],FUN=function(x) head(duplicated(x)|duplicated(x,fromLast=TRUE),1)|tail(duplicated(x)|duplicated(x,fromLast=TRUE),1))) res2<-id.d[id.d[,1]%in%names(res1[res1$flag==TRUE,])&(duplicated(id.d[,1:2])|duplicated(id.d[,1:2],fromLast=TRUE)),] res3<-res2[!res2$ID%in% res2[duplicated(res2)|duplicated(res2,fromLast=TRUE),]$ID,] id.d1<-id.d bad<-id.d1[id.d1$ID%in%res3$ID,] id.d1$id.d_INCLUDE<-TRUE bad$INCLUDE<-FALSE res4<-merge(id.d1,bad,all=TRUE) res4$INCLUDE[is.na(res4$INCLUDE)]<-TRUE res5<-res4[,-4] ?head(res5) #?? ID???? DATE DG INCLUDE #1? 58 20060821? 1??? TRUE #2? 58 20061207? 2??? TRUE #3? 58 20080102? 1??? TRUE #4? 58 20090904? 1??? TRUE #5 167 20040205? 4??? TRUE #6 167 20040205? 4??? TRUE ?tail(res5) #???? ID???? DATE DG INCLUDE #35? 910 20080521? 4??? TRUE #36? 910 20091224? 2??? TRUE #37? 999 20050503? 2??? TRUE #38 1019 19870508? 1?? FALSE #39 1019 19870508? 2?? FALSE #40 1019 19880330? 1?? FALSE A.K. ----- Original Message ----- From: Stuart Leask <Stuart.Leask at nottingham.ac.uk> To: "arun (smartpink111 at yahoo.com)" <smartpink111 at yahoo.com>; PIKAL Petr <petr.pikal at precheza.cz>; "Rui Barradas (ruipbarradas at sapo.pt)" <ruipbarradas at sapo.pt> Cc: Sent: Wednesday, October 24, 2012 11:41 AM Subject: RE: [r] How to pick colums from a ragged array? (And, considering? the real application, the functions ideally should probably output a variable INCLUDE, the same length as the original data, with TRUE and FALSE for whether or not that row should be included...) -----Original Message----- From: Leask Stuart Sent: 24 October 2012 16:25 To: arun (smartpink111 at yahoo.com); 'PIKAL Petr'; Rui Barradas (ruipbarradas at sapo.pt) Subject: RE: [r] How to pick colums from a ragged array? Arun, Petr, Rui, many thanks for your help, and the functions you have written. You'll recall I wanted to remove these first (or last) duplicates, because they represented instances where two different diagnoses (in this case, variable DG, value 1, 2, 3, 4 or 5) had been recorded on the same day - so I can't say which was 'first' (or 'last'). Your functions have revealed something I wasn't expecting: In some cases, the diagnoses recorded on the duplicated DATEs are the same! This is a surprise to me, but probably reflects someone going to two different departments in a clinic, and both departments submit data. I have to say that crazy things like this are often a feature of real data, which I'm sure you've come across yourselves. Of course, I don't want to remove records in which I can determine an unambiguous 'first diagnosis'. You have all put in so much effort on my behalf, I'm ashamed to ask, but I wonder if any of the functions you've written could do this with a little more Indexing and the 'duplicate' function So the function should only exclude an ID, having identified a first (or last) DATE duplicate, the DGs for these two dates are different. Test dataset: ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 ,20060111,20071119,20080107,20080407,20080521,20080521,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20071205 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521 ,20091224,20050503,19870508,19870508,19880330) DG<- c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3,2,2,2,1,1) id.d<-data.frame(ID,DATE,DG) id.d # Considering Ruis? getRepeat function: g.r<-getRepeat(id.d)? ? # defaults to first = TRUE getRepeat(id.d, first = FALSE)? to get the last ones g.rr<-do.call(rbind, g.r) # put the data into a matrix # I can remove the date duplicates with: g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),] I'm not sure how to add this to your suggestions, Arun & Petr... Stuart -----Original Message----- From: PIKAL Petr [mailto:petr.pikal at precheza.cz] Sent: 23 October 2012 15:24 To: Stuart Leask Subject: RE: [r] How to pick colums from a ragged array? Hi I assumed that id.d is data frame id.d <- data.frame (ID,DATE ) and fff(id.d) works for me Petr> -----Original Message----- > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk] > Sent: Tuesday, October 23, 2012 3:13 PM > To: PIKAL Petr > Subject: RE: [r] How to pick colums from a ragged array? > > Hi Petr. > I see what you mean it should do, but when I run it I get an error > (see below). > Stuart > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > + ,547,794,814,814,814,814,814,814,841,841,841,841,841 > + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > + ,1019) > > > > DATE <- > +? c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > +? ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > +? ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > +? ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > +? ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 > +? ,20091224,20050503,19870508,19870508,19880330) > > > >? id.d <- cbind (ID,DATE ) > > fff<-function(data, first=TRUE, remove=FALSE) { > + > + testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > + x[nrow(x),2]==x[nrow(x)-1,2] > + > + if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, > + data[,1]), testfirst))))) else sel <- > + as.numeric(names(which(unlist(sapply(split(data, data[,1]), > + testlast))))) > + > + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,] > + } > > > > fff(id.d) > Error in x[1, 2] : incorrect number of dimensions > > > > -----Original Message----- > From: PIKAL Petr [mailto:petr.pikal at precheza.cz] > Sent: 23 October 2012 13:51 > To: Stuart Leask; r-help at r-project.org > Subject: RE: [r] How to pick colums from a ragged array? > > Hi > > > -----Original Message----- > > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk] > > Sent: Tuesday, October 23, 2012 2:29 PM > > To: PIKAL Petr; r-help at r-project.org > > Subject: RE: [r] How to pick colums from a ragged array? > > > > Hi there. > > > > Not sure I follow what you are doing. > > > > I want a list of all the IDs that have duplicate DATE entries, only > > when the DATE is the earliest (or last) date for that ID. > > And that is what the function (with 3 small modifications) does > > > fff<-function(data, first=TRUE, remove=FALSE) { > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > x[nrow(x),2]==x[nrow(x)-1,2] > > if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, > data[,1]), testfirst))))) else sel <- > as.numeric(names(which(unlist(sapply(split(data, data[,1]), > testlast))))) > > if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,] } > > See the result of your refined data > > fff(id.d) >? ? ? ID? ? ? DATE > 5? 167 2004-02-05 > 6? 167 2004-02-05 > 22? 841 2005-04-21 > 23? 841 2005-04-21 > 24? 841 2006-04-28 > 25? 841 2006-06-02 > 26? 841 2006-08-16 > 27? 841 2006-10-25 > 28? 841 2006-11-29 > 29? 841 2007-01-12 > 30? 841 2007-05-14 > 38 1019 1987-05-08 > 39 1019 1987-05-08 > 40 1019 1988-03-30 > > fff(id.d, first=F) >? ? ID? ? ? DATE > 5 167 2004-02-05 > 6 167 2004-02-05 > > fff(id.d, remove=T) >? ? ID? ? ? DATE > 1? 58 2006-08-21 > 2? 58 2006-12-07 > 3? 58 2008-01-02 > 4? 58 2009-09-04 > 7? 323 2005-11-11 > 8? 323 2006-01-11 > 9? 323 2007-11-19 > 10 323 2008-01-07 > 11 323 2008-04-07 > 12 323 2008-05-21 > 13 323 2008-07-11 > 14 547 2004-10-05 > 15 794 2007-09-05 > 16 814 2002-08-14 > 17 814 2002-11-25 > 18 814 2004-04-29 > 19 814 2004-04-29 > 20 814 2007-12-05 > 21 814 2008-02-27 > 31 910 1987-05-08 > 32 910 2004-02-05 > 33 910 2004-02-05 > 34 910 2009-11-20 > 35 910 2009-12-10 > 36 910 2009-12-24 > 37 999 2005-05-03 > > > > You can do surgery on fff function to see what result comes from some > piece of the function e.g. > > sapply(split(id.d, id.d[,1]), testlast) > > Regards > Petr > > > > > I have refined my test dataset, to include some tests (e.g. 910 has > > the same dup as 1019, but for 910 it's not the earliest date): > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > ,1019) > > > > DATE <- > >? c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > >? ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > >? ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > >? ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > >? ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 > >? ,20091224,20050503,19870508,19870508,19880330) > > > > Correct output: > > "167"? "841"? "1019" > > > > Stuart > > > > -----Original Message----- > > From: PIKAL Petr [mailto:petr.pikal at precheza.cz] > > Sent: 23 October 2012 13:15 > > To: Stuart Leask; r-help at r-project.org > > Subject: RE: [r] How to pick colums from a ragged array? > > > > Hi > > > > Rui's answer brought me to more elaborated solution which still > > needs data frame to be ordered by date > > > > fff<-function(data, first=TRUE, remove=FALSE) { > > > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > > x[length(x),2]==x[length(x)-1,2] > > > > if(first) sel <- as.numeric(names(which(sapply(split(data, > > data[,1]), > > testfirst)))) else sel <- as.numeric(names(which(sapply(split(data, > > data[,1]), testlast)))) > > > > if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] } > > > > > > > fff(id.d) > >? ? ID? ? DATE > > 31 910 20091105 > > 32 910 20091105 > > 33 910 20091117 > > 34 910 20091119 > > 35 910 20091120 > > 36 910 20091210 > > 37 910 20091224 > > 38 910 20091224 > > > > > fff(id.d, remove=T) > >? ? ? ID? ? DATE > > 1? ? 58 20060821 > > 2? ? 58 20061207 > > 3? ? 58 20080102 > > 4? ? 58 20090904 > > 5? 167 20040205 > > 6? 167 20040323 > > 7? 323 20051111 > > 8? 323 20060111 > > 9? 323 20071119 > > 10? 323 20080107 > > 11? 323 20080407 > > 12? 323 20080521 > > 13? 323 20080711 > > 14? 547 20041005 > > 15? 794 20070905 > > 16? 814 20020814 > > 17? 814 20021125 > > 18? 814 20040429 > > 19? 814 20040429 > > 20? 814 20071205 > > 21? 814 20080227 > > 22? 841 20050421 > > 23? 841 20060130 > > 24? 841 20060428 > > 25? 841 20060602 > > 26? 841 20060816 > > 27? 841 20061025 > > 28? 841 20061129 > > 29? 841 20070112 > > 30? 841 20070514 > > 39? 999 20050503 > > 40 1019 19870508 > > 41 1019 19880223 > > 42 1019 19880330 > > 43 1019 19880330 > > > > > > > Regards > > Petr > > > > > > > -----Original Message----- > > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > > > project.org] On Behalf Of PIKAL Petr > > > Sent: Tuesday, October 23, 2012 1:49 PM > > > To: Stuart Leask; r-help at r-project.org > > > Subject: Re: [R] [r] How to pick colums from a ragged array? > > > > > > Hi > > > > > > I did not check your code and rather followed your explanation. > BTW, > > > thanks for test data. > > > > > > small change in data frame to make DATE as Date class > > > > > > datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- > > > data.frame(ID,datum ) > > > > > > ordering by date > > > > > > id.d<-id.d[order(id.d$datum),] > > > > > > > > > two functions to test if first two dates are the same or last two > > > dates are the same > > > > > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > > > x[length(x),2]==x[length(x)-1,2] > > > > > > change one last date in the data frame to be the same as previous > > > > > > id.d[35,2]<-id.d[36,2] > > > > > > and here are results > > > > > > sapply(split(id.d, id.d$ID), testlast) > > >? ? 58? 167? 323? 547? 794? 814? 841? 910? 999? 1019 > > > FALSE FALSE FALSE? ? NA? ? NA FALSE FALSE? TRUE? ? NA FALSE > > > > > > > sapply(split(id.d, id.d$ID), testfirst) > > >? ? 58? 167? 323? 547? 794? 814? 841? 910? 999? 1019 > > > FALSE FALSE FALSE? ? NA? ? NA FALSE FALSE FALSE? ? NA FALSE > > > > > > Now you can select ID which is true and remove it from your data > > > which(sapply(split(id.d, id.d$ID), testlast)) > > > > > > and use it for your data frame to subset/remove id.d$ID => > > as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast)))) > [1] > > > FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > > FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > FALSE > > > FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > FALSE > > > FALSE TRUE? TRUE [37]? TRUE? TRUE? TRUE? TRUE > > > > > > However I am not sure if this is exactly what you want. > > > > > > Regards > > > Petr > > > > > > > -----Original Message----- > > > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > > > > project.org] On Behalf Of Stuart Leask > > > > Sent: Tuesday, October 23, 2012 11:38 AM > > > > To: r-help at r-project.org > > > > Subject: [R] [r] How to pick colums from a ragged array? > > > > > > > > I have a large dataset (~1 million rows) of three variables: ID > > > > (patient's name), DATE (of appointment) and DIAGNOSIS (given on > > that > > > > date). > > > > Patients may have been assigned more than one diagnosis at any > one > > > > appointment - leading to two rows, same ID and DATE but > > > > different DIAGNOSIS. > > > > The diagnoses may change between appointments. > > > > > > > > I want to subset the data in two ways: > > > > > > > > -? ? ? ? ? define groups of patients by the first diagnosis given > > > > > > > > -? ? ? ? ? define groups of patients by the last diagnosis given. > > > > > > > > The problem: > > > > Unfortunately, a small number of patients have been given more > > > > than one diagnosis at their first (or last) appointment. These > > > > individuals I need to identify and remove, as it's not possible > to > > > > say uniquely what their first (or last) diagnosis was. So I need > > > > to identify and remove these individuals which have pairs of > > > > rows with the same ID > > > and > > > > (lowest or highest) DATE. The size of the dataset precludes the > > > option > > > > of doing this by eye. > > > > > > > > I suspect there is a very elegant way of doing this in R. > > > > > > > > This is what I've come up with: > > > > > > > > > > > > -? ? ? ? ? Sort by DATE then ID > > > > > > > > -? ? ? ? ? Make a ragged array of DATE by ID > > > > > > > > -? ? ? ? ? Remove IDs that only occur once. > > > > > > > > -? ? ? ? ? Subtract the first and second DATEs. Remove IDs for > > which > > > > this = zero, as this will only be true for IDs for which the > > > > appointment is recorded twice (because there were two diagnoses > > > > recorded on this date). > > > > > > > > -? ? ? ? ? (Then do the same to get the 'last appointment' > > > duplicates, > > > > by reversing the initial sort by DATE.) > > > > > > > > I am stuck at the 'Subtract dates' step: I would like to get the > > > > data out of the ragged array by columns (so e.g. I end up with a > > > > matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out > > > > by column from the ragged array. > > > > > > > > I hope someone can help. My ugly code is below, with some data > for > > > > testing. > > > > > > > > > > > > Stuart > > > > > > > > > > > > Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior > > > > Lecturer and Honorary Consultant Pychiatrist Institute of Mental > > > > Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. > > UK > > > > Tel. +44 > > > > 115 82 30419 > > > > > stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk > > > > > > > > > Google 'Dr Stuart Leask' > > > > > > > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > > > ,1019) > > > > > > > > DATE <- > > > > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 > > > > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > > > > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > > > > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 > > > > ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 > > > > ,20091224,20050503,19870508,19880223,19880330) > > > > > > > > id.d <- cbind (ID,DATE ) > > > > rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ? # > create > > > > ragged array, 1-n DATES for every NAME > > > > > > > > # Inelegant attempt to remove IDs that only have one entry: > > > > > > > > rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum)? ? ? ? ? ? #add up > > the > > > > dates per row > > > > # Since DATE is in 'year mo da', if there's only one date, sum > > > > will > > > be > > > > less than 2100000: > > > > rag.t <- rag.s [ rag.s > 21000000 ] > > > > multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ? # all > the > > > IDs > > > > with >1 date > > > > rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ? # > rag.am > > > only > > > > has IDs with > 1 Date > > > > > > > > > > > > # But now I'm stuck. > > > > # Each row of the array is rag.am$ID. > > > > # So I can't pick columns of DATEs from the ragged array. > > > > > > > > This message and any attachment are intended solely for the > > > > addressee and may contain confidential information. If you have > > > > received this message in error, please send it back to me, and > > > > immediately delete > > > it. > > > > Please do not use, copy or disclose the information contained in > > > > this message or in any attachment.? Any views or opinions > > > > expressed by the author of this email do not necessarily reflect > > > > the views of the University of Nottingham. > > > > > > > > This message has been checked for viruses but the contents of an > > > > attachment may still contain software viruses which could damage > > > > your computer system: > > > > you are advised to perform your own checks. Email communications > > > > with the University of Nottingham may be monitored as permitted > by > > > > UK legislation. > > > >? ? ? ? [[alternative HTML version deleted]] > > > > > > > > ______________________________________________ > > > > R-help at r-project.org mailing list > > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > > PLEASE do read the posting guide http://www.R- > project.org/posting- > > > > guide.html and provide commented, minimal, self-contained, > > > > reproducible code. > > > > > > ______________________________________________ > > > R-help at r-project.org mailing list > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > PLEASE do read the posting guide http://www.R-project.org/posting- > > > guide.html and provide commented, minimal, self-contained, > > > reproducible code.
HI Stuart, Just a small comment: id.d1$id.d_INCLUDE<-TRUE #is not needed. res1<- data.frame(flag=tapply(id.d[,2],id.d[,1],FUN=function(x) head(duplicated(x)|duplicated(x,fromLast=TRUE),1)|tail(duplicated(x)|duplicated(x,fromLast=TRUE),1))) res2<-id.d[id.d[,1]%in%names(res1[res1$flag==TRUE,])&(duplicated(id.d[,1:2])|duplicated(id.d[,1:2],fromLast=TRUE)),] res3<-res2[!res2$ID%in% res2[duplicated(res2)|duplicated(res2,fromLast=TRUE),]$ID,] id.d1<-id.d bad<-id.d1[id.d1$ID%in%res3$ID,] bad$INCLUDE<-FALSE res4<-merge(id.d1,bad,all=TRUE) res4$INCLUDE[is.na(res4$INCLUDE)]<-TRUE tail(res4) #?? ID???? DATE DG INCLUDE #35? 910 20080521? 4??? TRUE #36? 910 20091224? 2??? TRUE #37? 999 20050503? 2??? TRUE #38 1019 19870508? 1?? FALSE #39 1019 19870508? 2?? FALSE #40 1019 19880330? 1?? FALSE A.K. ----- Original Message ----- From: Stuart Leask <Stuart.Leask at nottingham.ac.uk> To: "arun (smartpink111 at yahoo.com)" <smartpink111 at yahoo.com>; PIKAL Petr <petr.pikal at precheza.cz>; "Rui Barradas (ruipbarradas at sapo.pt)" <ruipbarradas at sapo.pt> Cc: Sent: Wednesday, October 24, 2012 11:41 AM Subject: RE: [r] How to pick colums from a ragged array? (And, considering? the real application, the functions ideally should probably output a variable INCLUDE, the same length as the original data, with TRUE and FALSE for whether or not that row should be included...) -----Original Message----- From: Leask Stuart Sent: 24 October 2012 16:25 To: arun (smartpink111 at yahoo.com); 'PIKAL Petr'; Rui Barradas (ruipbarradas at sapo.pt) Subject: RE: [r] How to pick colums from a ragged array? Arun, Petr, Rui, many thanks for your help, and the functions you have written. You'll recall I wanted to remove these first (or last) duplicates, because they represented instances where two different diagnoses (in this case, variable DG, value 1, 2, 3, 4 or 5) had been recorded on the same day - so I can't say which was 'first' (or 'last'). Your functions have revealed something I wasn't expecting: In some cases, the diagnoses recorded on the duplicated DATEs are the same! This is a surprise to me, but probably reflects someone going to two different departments in a clinic, and both departments submit data. I have to say that crazy things like this are often a feature of real data, which I'm sure you've come across yourselves. Of course, I don't want to remove records in which I can determine an unambiguous 'first diagnosis'. You have all put in so much effort on my behalf, I'm ashamed to ask, but I wonder if any of the functions you've written could do this with a little more Indexing and the 'duplicate' function So the function should only exclude an ID, having identified a first (or last) DATE duplicate, the DGs for these two dates are different. Test dataset: ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 ,547,794,814,814,814,814,814,814,841,841,841,841,841 ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 ,1019) DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 ,20060111,20071119,20080107,20080407,20080521,20080521,20041005 ,20070905,20020814,20021125,20040429,20040429,20071205,20071205 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 ,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521 ,20091224,20050503,19870508,19870508,19880330) DG<- c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3,2,2,2,1,1) id.d<-data.frame(ID,DATE,DG) id.d # Considering Ruis? getRepeat function: g.r<-getRepeat(id.d)? ? # defaults to first = TRUE getRepeat(id.d, first = FALSE)? to get the last ones g.rr<-do.call(rbind, g.r) # put the data into a matrix # I can remove the date duplicates with: g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),] I'm not sure how to add this to your suggestions, Arun & Petr... Stuart -----Original Message----- From: PIKAL Petr [mailto:petr.pikal at precheza.cz] Sent: 23 October 2012 15:24 To: Stuart Leask Subject: RE: [r] How to pick colums from a ragged array? Hi I assumed that id.d is data frame id.d <- data.frame (ID,DATE ) and fff(id.d) works for me Petr> -----Original Message----- > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk] > Sent: Tuesday, October 23, 2012 3:13 PM > To: PIKAL Petr > Subject: RE: [r] How to pick colums from a ragged array? > > Hi Petr. > I see what you mean it should do, but when I run it I get an error > (see below). > Stuart > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > + ,547,794,814,814,814,814,814,814,841,841,841,841,841 > + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > + ,1019) > > > > DATE <- > +? c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > +? ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > +? ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > +? ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > +? ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 > +? ,20091224,20050503,19870508,19870508,19880330) > > > >? id.d <- cbind (ID,DATE ) > > fff<-function(data, first=TRUE, remove=FALSE) { > + > + testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > + x[nrow(x),2]==x[nrow(x)-1,2] > + > + if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, > + data[,1]), testfirst))))) else sel <- > + as.numeric(names(which(unlist(sapply(split(data, data[,1]), > + testlast))))) > + > + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,] > + } > > > > fff(id.d) > Error in x[1, 2] : incorrect number of dimensions > > > > -----Original Message----- > From: PIKAL Petr [mailto:petr.pikal at precheza.cz] > Sent: 23 October 2012 13:51 > To: Stuart Leask; r-help at r-project.org > Subject: RE: [r] How to pick colums from a ragged array? > > Hi > > > -----Original Message----- > > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk] > > Sent: Tuesday, October 23, 2012 2:29 PM > > To: PIKAL Petr; r-help at r-project.org > > Subject: RE: [r] How to pick colums from a ragged array? > > > > Hi there. > > > > Not sure I follow what you are doing. > > > > I want a list of all the IDs that have duplicate DATE entries, only > > when the DATE is the earliest (or last) date for that ID. > > And that is what the function (with 3 small modifications) does > > > fff<-function(data, first=TRUE, remove=FALSE) { > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > x[nrow(x),2]==x[nrow(x)-1,2] > > if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, > data[,1]), testfirst))))) else sel <- > as.numeric(names(which(unlist(sapply(split(data, data[,1]), > testlast))))) > > if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,] } > > See the result of your refined data > > fff(id.d) >? ? ? ID? ? ? DATE > 5? 167 2004-02-05 > 6? 167 2004-02-05 > 22? 841 2005-04-21 > 23? 841 2005-04-21 > 24? 841 2006-04-28 > 25? 841 2006-06-02 > 26? 841 2006-08-16 > 27? 841 2006-10-25 > 28? 841 2006-11-29 > 29? 841 2007-01-12 > 30? 841 2007-05-14 > 38 1019 1987-05-08 > 39 1019 1987-05-08 > 40 1019 1988-03-30 > > fff(id.d, first=F) >? ? ID? ? ? DATE > 5 167 2004-02-05 > 6 167 2004-02-05 > > fff(id.d, remove=T) >? ? ID? ? ? DATE > 1? 58 2006-08-21 > 2? 58 2006-12-07 > 3? 58 2008-01-02 > 4? 58 2009-09-04 > 7? 323 2005-11-11 > 8? 323 2006-01-11 > 9? 323 2007-11-19 > 10 323 2008-01-07 > 11 323 2008-04-07 > 12 323 2008-05-21 > 13 323 2008-07-11 > 14 547 2004-10-05 > 15 794 2007-09-05 > 16 814 2002-08-14 > 17 814 2002-11-25 > 18 814 2004-04-29 > 19 814 2004-04-29 > 20 814 2007-12-05 > 21 814 2008-02-27 > 31 910 1987-05-08 > 32 910 2004-02-05 > 33 910 2004-02-05 > 34 910 2009-11-20 > 35 910 2009-12-10 > 36 910 2009-12-24 > 37 999 2005-05-03 > > > > You can do surgery on fff function to see what result comes from some > piece of the function e.g. > > sapply(split(id.d, id.d[,1]), testlast) > > Regards > Petr > > > > > I have refined my test dataset, to include some tests (e.g. 910 has > > the same dup as 1019, but for 910 it's not the earliest date): > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > ,1019) > > > > DATE <- > >? c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > >? ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > >? ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > >? ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > >? ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210 > >? ,20091224,20050503,19870508,19870508,19880330) > > > > Correct output: > > "167"? "841"? "1019" > > > > Stuart > > > > -----Original Message----- > > From: PIKAL Petr [mailto:petr.pikal at precheza.cz] > > Sent: 23 October 2012 13:15 > > To: Stuart Leask; r-help at r-project.org > > Subject: RE: [r] How to pick colums from a ragged array? > > > > Hi > > > > Rui's answer brought me to more elaborated solution which still > > needs data frame to be ordered by date > > > > fff<-function(data, first=TRUE, remove=FALSE) { > > > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > > x[length(x),2]==x[length(x)-1,2] > > > > if(first) sel <- as.numeric(names(which(sapply(split(data, > > data[,1]), > > testfirst)))) else sel <- as.numeric(names(which(sapply(split(data, > > data[,1]), testlast)))) > > > > if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] } > > > > > > > fff(id.d) > >? ? ID? ? DATE > > 31 910 20091105 > > 32 910 20091105 > > 33 910 20091117 > > 34 910 20091119 > > 35 910 20091120 > > 36 910 20091210 > > 37 910 20091224 > > 38 910 20091224 > > > > > fff(id.d, remove=T) > >? ? ? ID? ? DATE > > 1? ? 58 20060821 > > 2? ? 58 20061207 > > 3? ? 58 20080102 > > 4? ? 58 20090904 > > 5? 167 20040205 > > 6? 167 20040323 > > 7? 323 20051111 > > 8? 323 20060111 > > 9? 323 20071119 > > 10? 323 20080107 > > 11? 323 20080407 > > 12? 323 20080521 > > 13? 323 20080711 > > 14? 547 20041005 > > 15? 794 20070905 > > 16? 814 20020814 > > 17? 814 20021125 > > 18? 814 20040429 > > 19? 814 20040429 > > 20? 814 20071205 > > 21? 814 20080227 > > 22? 841 20050421 > > 23? 841 20060130 > > 24? 841 20060428 > > 25? 841 20060602 > > 26? 841 20060816 > > 27? 841 20061025 > > 28? 841 20061129 > > 29? 841 20070112 > > 30? 841 20070514 > > 39? 999 20050503 > > 40 1019 19870508 > > 41 1019 19880223 > > 42 1019 19880330 > > 43 1019 19880330 > > > > > > > Regards > > Petr > > > > > > > -----Original Message----- > > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > > > project.org] On Behalf Of PIKAL Petr > > > Sent: Tuesday, October 23, 2012 1:49 PM > > > To: Stuart Leask; r-help at r-project.org > > > Subject: Re: [R] [r] How to pick colums from a ragged array? > > > > > > Hi > > > > > > I did not check your code and rather followed your explanation. > BTW, > > > thanks for test data. > > > > > > small change in data frame to make DATE as Date class > > > > > > datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- > > > data.frame(ID,datum ) > > > > > > ordering by date > > > > > > id.d<-id.d[order(id.d$datum),] > > > > > > > > > two functions to test if first two dates are the same or last two > > > dates are the same > > > > > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > > > x[length(x),2]==x[length(x)-1,2] > > > > > > change one last date in the data frame to be the same as previous > > > > > > id.d[35,2]<-id.d[36,2] > > > > > > and here are results > > > > > > sapply(split(id.d, id.d$ID), testlast) > > >? ? 58? 167? 323? 547? 794? 814? 841? 910? 999? 1019 > > > FALSE FALSE FALSE? ? NA? ? NA FALSE FALSE? TRUE? ? NA FALSE > > > > > > > sapply(split(id.d, id.d$ID), testfirst) > > >? ? 58? 167? 323? 547? 794? 814? 841? 910? 999? 1019 > > > FALSE FALSE FALSE? ? NA? ? NA FALSE FALSE FALSE? ? NA FALSE > > > > > > Now you can select ID which is true and remove it from your data > > > which(sapply(split(id.d, id.d$ID), testlast)) > > > > > > and use it for your data frame to subset/remove id.d$ID => > > as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast)))) > [1] > > > FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > > FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > FALSE > > > FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > FALSE > > > FALSE TRUE? TRUE [37]? TRUE? TRUE? TRUE? TRUE > > > > > > However I am not sure if this is exactly what you want. > > > > > > Regards > > > Petr > > > > > > > -----Original Message----- > > > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > > > > project.org] On Behalf Of Stuart Leask > > > > Sent: Tuesday, October 23, 2012 11:38 AM > > > > To: r-help at r-project.org > > > > Subject: [R] [r] How to pick colums from a ragged array? > > > > > > > > I have a large dataset (~1 million rows) of three variables: ID > > > > (patient's name), DATE (of appointment) and DIAGNOSIS (given on > > that > > > > date). > > > > Patients may have been assigned more than one diagnosis at any > one > > > > appointment - leading to two rows, same ID and DATE but > > > > different DIAGNOSIS. > > > > The diagnoses may change between appointments. > > > > > > > > I want to subset the data in two ways: > > > > > > > > -? ? ? ? ? define groups of patients by the first diagnosis given > > > > > > > > -? ? ? ? ? define groups of patients by the last diagnosis given. > > > > > > > > The problem: > > > > Unfortunately, a small number of patients have been given more > > > > than one diagnosis at their first (or last) appointment. These > > > > individuals I need to identify and remove, as it's not possible > to > > > > say uniquely what their first (or last) diagnosis was. So I need > > > > to identify and remove these individuals which have pairs of > > > > rows with the same ID > > > and > > > > (lowest or highest) DATE. The size of the dataset precludes the > > > option > > > > of doing this by eye. > > > > > > > > I suspect there is a very elegant way of doing this in R. > > > > > > > > This is what I've come up with: > > > > > > > > > > > > -? ? ? ? ? Sort by DATE then ID > > > > > > > > -? ? ? ? ? Make a ragged array of DATE by ID > > > > > > > > -? ? ? ? ? Remove IDs that only occur once. > > > > > > > > -? ? ? ? ? Subtract the first and second DATEs. Remove IDs for > > which > > > > this = zero, as this will only be true for IDs for which the > > > > appointment is recorded twice (because there were two diagnoses > > > > recorded on this date). > > > > > > > > -? ? ? ? ? (Then do the same to get the 'last appointment' > > > duplicates, > > > > by reversing the initial sort by DATE.) > > > > > > > > I am stuck at the 'Subtract dates' step: I would like to get the > > > > data out of the ragged array by columns (so e.g. I end up with a > > > > matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out > > > > by column from the ragged array. > > > > > > > > I hope someone can help. My ugly code is below, with some data > for > > > > testing. > > > > > > > > > > > > Stuart > > > > > > > > > > > > Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior > > > > Lecturer and Honorary Consultant Pychiatrist Institute of Mental > > > > Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. > > UK > > > > Tel. +44 > > > > 115 82 30419 > > > > > stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk > > > > > > > > > Google 'Dr Stuart Leask' > > > > > > > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > > > ,1019) > > > > > > > > DATE <- > > > > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 > > > > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > > > > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > > > > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129 > > > > ,20070112,20070514,20091105,20091117,20091119,20091120,20091210 > > > > ,20091224,20050503,19870508,19880223,19880330) > > > > > > > > id.d <- cbind (ID,DATE ) > > > > rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ? ? ? # > create > > > > ragged array, 1-n DATES for every NAME > > > > > > > > # Inelegant attempt to remove IDs that only have one entry: > > > > > > > > rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum)? ? ? ? ? ? #add up > > the > > > > dates per row > > > > # Since DATE is in 'year mo da', if there's only one date, sum > > > > will > > > be > > > > less than 2100000: > > > > rag.t <- rag.s [ rag.s > 21000000 ] > > > > multi.dates <- rownames ( rag.t )? ? ? ? ? ? ? ? ? ? ? ? # all > the > > > IDs > > > > with >1 date > > > > rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ? ? ? # > rag.am > > > only > > > > has IDs with > 1 Date > > > > > > > > > > > > # But now I'm stuck. > > > > # Each row of the array is rag.am$ID. > > > > # So I can't pick columns of DATEs from the ragged array. > > > > > > > > This message and any attachment are intended solely for the > > > > addressee and may contain confidential information. If you have > > > > received this message in error, please send it back to me, and > > > > immediately delete > > > it. > > > > Please do not use, copy or disclose the information contained in > > > > this message or in any attachment.? Any views or opinions > > > > expressed by the author of this email do not necessarily reflect > > > > the views of the University of Nottingham. > > > > > > > > This message has been checked for viruses but the contents of an > > > > attachment may still contain software viruses which could damage > > > > your computer system: > > > > you are advised to perform your own checks. Email communications > > > > with the University of Nottingham may be monitored as permitted > by > > > > UK legislation. > > > >? ? ? ? [[alternative HTML version deleted]] > > > > > > > > ______________________________________________ > > > > R-help at r-project.org mailing list > > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > > PLEASE do read the posting guide http://www.R- > project.org/posting- > > > > guide.html and provide commented, minimal, self-contained, > > > > reproducible code. > > > > > > ______________________________________________ > > > R-help at r-project.org mailing list > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > PLEASE do read the posting guide http://www.R-project.org/posting- > > > guide.html and provide commented, minimal, self-contained, > > > reproducible code.
Hello,
Using one of Arun's ideas, some post ago, this new function returns a
logical index into id.d of the rows that should be _removed_, hence rm1
and rm2. I think
getRepLogical <- function(x, first = TRUE){
fun <- if(first) head else tail
dte <- tapply(x[,2], x[,1], FUN = function(x) duplicated(fun(x, 2)))
len <- tapply(x[,2], x[,1], FUN = length)
lst <- lapply(seq_along(dte), function(i) c(dte[[i]], rep(FALSE,
if(len[[i]] > 2) len[[i]] - 2 else 0)))
lst <- if(first) lst else lapply(lst, rev)
i1 <- unlist(lst)
dg <- tapply(x[,3], x[,1], FUN = function(x) !duplicated(fun(x, 2)))
lst <- lapply(seq_along(dte), function(i) c(dg[[i]], rep(FALSE,
if(len[[i]] > 2) len[[i]] - 2 else 0)))
lst <- if(first) lst else lapply(lst, rev)
i2 <- unlist(lst)
i1 & i2
}
rm1 <- getRepLogical(id.d)
rm2 <- getRepLogical(id.d, first = FALSE)
id.d[rm1, ]
id.d[rm2, ]
id.d$INCLUDE <- !(rm1 | rm2)
Hope this helps,
Rui Barradas
Em 24-10-2012 16:41, Stuart Leask escreveu:> (And, considering the real application, the functions ideally should
probably output a variable INCLUDE, the same length as the original data, with
TRUE and FALSE for whether or not that row should be included...)
>
> -----Original Message-----
> From: Leask Stuart
> Sent: 24 October 2012 16:25
> To: arun (smartpink111 at yahoo.com); 'PIKAL Petr'; Rui Barradas
(ruipbarradas at sapo.pt)
> Subject: RE: [r] How to pick colums from a ragged array?
>
> Arun, Petr, Rui, many thanks for your help, and the functions you have
written.
>
> You'll recall I wanted to remove these first (or last) duplicates,
because they represented instances where two different diagnoses (in this case,
variable DG, value 1, 2, 3, 4 or 5) had been recorded on the same day - so I
can't say which was 'first' (or 'last').
>
> Your functions have revealed something I wasn't expecting: In some
cases, the diagnoses recorded on the duplicated DATEs are the same!
> This is a surprise to me, but probably reflects someone going to two
different departments in a clinic, and both departments submit data. I have to
say that crazy things like this are often a feature of real data, which I'm
sure you've come across yourselves.
>
> Of course, I don't want to remove records in which I can determine an
unambiguous 'first diagnosis'.
>
> You have all put in so much effort on my behalf, I'm ashamed to ask,
but I wonder if any of the functions you've written could do this with a
little more
> Indexing and the 'duplicate' function
> So the function should only exclude an ID, having identified a first (or
last) DATE duplicate, the DGs for these two dates are different.
>
> Test dataset:
>
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> ,1019)
>
> DATE <-
> c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> ,20060111,20071119,20080107,20080407,20080521,20080521,20041005
> ,20070905,20020814,20021125,20040429,20040429,20071205,20071205
> ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> ,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521
> ,20091224,20050503,19870508,19870508,19880330)
>
> DG<-
>
c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3,2,2,2,1,1)
>
> id.d<-data.frame(ID,DATE,DG)
> id.d
>
> # Considering Ruis getRepeat function:
>
> g.r<-getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, first
= FALSE) to get the last ones
> g.rr<-do.call(rbind, g.r) # put the data into a matrix
>
> # I can remove the date duplicates with:
> g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),]
>
> I'm not sure how to add this to your suggestions, Arun & Petr...
>
>
> Stuart
>
>
> -----Original Message-----
> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
> Sent: 23 October 2012 15:24
> To: Stuart Leask
> Subject: RE: [r] How to pick colums from a ragged array?
>
> Hi
>
> I assumed that id.d is data frame
>
> id.d <- data.frame (ID,DATE )
>
> and
>
> fff(id.d)
>
> works for me
>
> Petr
>
>
>> -----Original Message-----
>> From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
>> Sent: Tuesday, October 23, 2012 3:13 PM
>> To: PIKAL Petr
>> Subject: RE: [r] How to pick colums from a ragged array?
>>
>> Hi Petr.
>> I see what you mean it should do, but when I run it I get an error
>> (see below).
>> Stuart
>>
>>
>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>> + ,547,794,814,814,814,814,814,814,841,841,841,841,841
>> + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>> + ,1019)
>>> DATE <-
>> + c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>> + ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>> + ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>> + ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>> + ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
>> + ,20091224,20050503,19870508,19870508,19880330)
>>> id.d <- cbind (ID,DATE )
>>> fff<-function(data, first=TRUE, remove=FALSE) {
>> +
>> + testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x)
>> + x[nrow(x),2]==x[nrow(x)-1,2]
>> +
>> + if(first) sel <- as.numeric(names(which(unlist(sapply(split(data,
>> + data[,1]), testfirst))))) else sel <-
>> + as.numeric(names(which(unlist(sapply(split(data, data[,1]),
>> + testlast)))))
>> +
>> + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
>> + }
>>> fff(id.d)
>> Error in x[1, 2] : incorrect number of dimensions
>> -----Original Message-----
>> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
>> Sent: 23 October 2012 13:51
>> To: Stuart Leask; r-help at r-project.org
>> Subject: RE: [r] How to pick colums from a ragged array?
>>
>> Hi
>>
>>> -----Original Message-----
>>> From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
>>> Sent: Tuesday, October 23, 2012 2:29 PM
>>> To: PIKAL Petr; r-help at r-project.org
>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>
>>> Hi there.
>>>
>>> Not sure I follow what you are doing.
>>>
>>> I want a list of all the IDs that have duplicate DATE entries, only
>>> when the DATE is the earliest (or last) date for that ID.
>> And that is what the function (with 3 small modifications) does
>>
>>
>> fff<-function(data, first=TRUE, remove=FALSE) {
>>
>> testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x)
>> x[nrow(x),2]==x[nrow(x)-1,2]
>>
>> if(first) sel <- as.numeric(names(which(unlist(sapply(split(data,
>> data[,1]), testfirst))))) else sel <-
>> as.numeric(names(which(unlist(sapply(split(data, data[,1]),
>> testlast)))))
>>
>> if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,] }
>>
>> See the result of your refined data
>>
>> fff(id.d)
>> ID DATE
>> 5 167 2004-02-05
>> 6 167 2004-02-05
>> 22 841 2005-04-21
>> 23 841 2005-04-21
>> 24 841 2006-04-28
>> 25 841 2006-06-02
>> 26 841 2006-08-16
>> 27 841 2006-10-25
>> 28 841 2006-11-29
>> 29 841 2007-01-12
>> 30 841 2007-05-14
>> 38 1019 1987-05-08
>> 39 1019 1987-05-08
>> 40 1019 1988-03-30
>>> fff(id.d, first=F)
>> ID DATE
>> 5 167 2004-02-05
>> 6 167 2004-02-05
>>> fff(id.d, remove=T)
>> ID DATE
>> 1 58 2006-08-21
>> 2 58 2006-12-07
>> 3 58 2008-01-02
>> 4 58 2009-09-04
>> 7 323 2005-11-11
>> 8 323 2006-01-11
>> 9 323 2007-11-19
>> 10 323 2008-01-07
>> 11 323 2008-04-07
>> 12 323 2008-05-21
>> 13 323 2008-07-11
>> 14 547 2004-10-05
>> 15 794 2007-09-05
>> 16 814 2002-08-14
>> 17 814 2002-11-25
>> 18 814 2004-04-29
>> 19 814 2004-04-29
>> 20 814 2007-12-05
>> 21 814 2008-02-27
>> 31 910 1987-05-08
>> 32 910 2004-02-05
>> 33 910 2004-02-05
>> 34 910 2009-11-20
>> 35 910 2009-12-10
>> 36 910 2009-12-24
>> 37 999 2005-05-03
>> You can do surgery on fff function to see what result comes from some
>> piece of the function e.g.
>>
>> sapply(split(id.d, id.d[,1]), testlast)
>>
>> Regards
>> Petr
>>
>>> I have refined my test dataset, to include some tests (e.g. 910 has
>>> the same dup as 1019, but for 910 it's not the earliest date):
>>>
>>>
>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>> ,1019)
>>>
>>> DATE <-
>>> c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>>> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>> ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>>> ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
>>> ,20091224,20050503,19870508,19870508,19880330)
>>>
>>> Correct output:
>>> "167" "841" "1019"
>>>
>>> Stuart
>>>
>>> -----Original Message-----
>>> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
>>> Sent: 23 October 2012 13:15
>>> To: Stuart Leask; r-help at r-project.org
>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>
>>> Hi
>>>
>>> Rui's answer brought me to more elaborated solution which still
>>> needs data frame to be ordered by date
>>>
>>> fff<-function(data, first=TRUE, remove=FALSE) {
>>>
>>> testfirst <- function(x) x[1,2]==x[2,2] testlast <-
function(x)
>>> x[length(x),2]==x[length(x)-1,2]
>>>
>>> if(first) sel <- as.numeric(names(which(sapply(split(data,
>>> data[,1]),
>>> testfirst)))) else sel <-
as.numeric(names(which(sapply(split(data,
>>> data[,1]), testlast))))
>>>
>>> if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }
>>>
>>>
>>>> fff(id.d)
>>> ID DATE
>>> 31 910 20091105
>>> 32 910 20091105
>>> 33 910 20091117
>>> 34 910 20091119
>>> 35 910 20091120
>>> 36 910 20091210
>>> 37 910 20091224
>>> 38 910 20091224
>>>
>>>> fff(id.d, remove=T)
>>> ID DATE
>>> 1 58 20060821
>>> 2 58 20061207
>>> 3 58 20080102
>>> 4 58 20090904
>>> 5 167 20040205
>>> 6 167 20040323
>>> 7 323 20051111
>>> 8 323 20060111
>>> 9 323 20071119
>>> 10 323 20080107
>>> 11 323 20080407
>>> 12 323 20080521
>>> 13 323 20080711
>>> 14 547 20041005
>>> 15 794 20070905
>>> 16 814 20020814
>>> 17 814 20021125
>>> 18 814 20040429
>>> 19 814 20040429
>>> 20 814 20071205
>>> 21 814 20080227
>>> 22 841 20050421
>>> 23 841 20060130
>>> 24 841 20060428
>>> 25 841 20060602
>>> 26 841 20060816
>>> 27 841 20061025
>>> 28 841 20061129
>>> 29 841 20070112
>>> 30 841 20070514
>>> 39 999 20050503
>>> 40 1019 19870508
>>> 41 1019 19880223
>>> 42 1019 19880330
>>> 43 1019 19880330
>>> Regards
>>> Petr
>>>
>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at
r-
>>>> project.org] On Behalf Of PIKAL Petr
>>>> Sent: Tuesday, October 23, 2012 1:49 PM
>>>> To: Stuart Leask; r-help at r-project.org
>>>> Subject: Re: [R] [r] How to pick colums from a ragged array?
>>>>
>>>> Hi
>>>>
>>>> I did not check your code and rather followed your explanation.
>> BTW,
>>>> thanks for test data.
>>>>
>>>> small change in data frame to make DATE as Date class
>>>>
>>>> datum<-as.Date(as.character(DATE),
format="%Y%m%d") id.d <-
>>>> data.frame(ID,datum )
>>>>
>>>> ordering by date
>>>>
>>>> id.d<-id.d[order(id.d$datum),]
>>>>
>>>>
>>>> two functions to test if first two dates are the same or last
two
>>>> dates are the same
>>>>
>>>> testfirst <- function(x) x[1,2]==x[2,2] testlast <-
function(x)
>>>> x[length(x),2]==x[length(x)-1,2]
>>>>
>>>> change one last date in the data frame to be the same as
previous
>>>>
>>>> id.d[35,2]<-id.d[36,2]
>>>>
>>>> and here are results
>>>>
>>>> sapply(split(id.d, id.d$ID), testlast)
>>>> 58 167 323 547 794 814 841 910 999 1019
>>>> FALSE FALSE FALSE NA NA FALSE FALSE TRUE NA FALSE
>>>>
>>>>> sapply(split(id.d, id.d$ID), testfirst)
>>>> 58 167 323 547 794 814 841 910 999 1019
>>>> FALSE FALSE FALSE NA NA FALSE FALSE FALSE NA FALSE
>>>>
>>>> Now you can select ID which is true and remove it from your
data
>>>> which(sapply(split(id.d, id.d$ID), testlast))
>>>>
>>>> and use it for your data frame to subset/remove id.d$ID
=>>>> as.numeric(names(which(sapply(split(id.d, id.d$ID),
testlast))))
>> [1]
>>>> FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
>>>> FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
>>> FALSE
>>>> FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
>>> FALSE
>>>> FALSE TRUE TRUE [37] TRUE TRUE TRUE TRUE
>>>>
>>>> However I am not sure if this is exactly what you want.
>>>>
>>>> Regards
>>>> Petr
>>>>
>>>>> -----Original Message-----
>>>>> From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-
>>>>> project.org] On Behalf Of Stuart Leask
>>>>> Sent: Tuesday, October 23, 2012 11:38 AM
>>>>> To: r-help at r-project.org
>>>>> Subject: [R] [r] How to pick colums from a ragged array?
>>>>>
>>>>> I have a large dataset (~1 million rows) of three
variables: ID
>>>>> (patient's name), DATE (of appointment) and DIAGNOSIS
(given on
>>> that
>>>>> date).
>>>>> Patients may have been assigned more than one diagnosis at
any
>> one
>>>>> appointment - leading to two rows, same ID and DATE but
>>>>> different DIAGNOSIS.
>>>>> The diagnoses may change between appointments.
>>>>>
>>>>> I want to subset the data in two ways:
>>>>>
>>>>> - define groups of patients by the first diagnosis
given
>>>>>
>>>>> - define groups of patients by the last diagnosis
given.
>>>>>
>>>>> The problem:
>>>>> Unfortunately, a small number of patients have been given
more
>>>>> than one diagnosis at their first (or last) appointment.
These
>>>>> individuals I need to identify and remove, as it's not
possible
>> to
>>>>> say uniquely what their first (or last) diagnosis was. So I
need
>>>>> to identify and remove these individuals which have pairs
of
>>>>> rows with the same ID
>>>> and
>>>>> (lowest or highest) DATE. The size of the dataset precludes
the
>>>> option
>>>>> of doing this by eye.
>>>>>
>>>>> I suspect there is a very elegant way of doing this in R.
>>>>>
>>>>> This is what I've come up with:
>>>>>
>>>>>
>>>>> - Sort by DATE then ID
>>>>>
>>>>> - Make a ragged array of DATE by ID
>>>>>
>>>>> - Remove IDs that only occur once.
>>>>>
>>>>> - Subtract the first and second DATEs. Remove IDs
for
>>> which
>>>>> this = zero, as this will only be true for IDs for which
the
>>>>> appointment is recorded twice (because there were two
diagnoses
>>>>> recorded on this date).
>>>>>
>>>>> - (Then do the same to get the 'last
appointment'
>>>> duplicates,
>>>>> by reversing the initial sort by DATE.)
>>>>>
>>>>> I am stuck at the 'Subtract dates' step: I would
like to get the
>>>>> data out of the ragged array by columns (so e.g. I end up
with a
>>>>> matrix of ID, 1st DATE, 2nd DATE). But I can't get the
dates out
>>>>> by column from the ragged array.
>>>>>
>>>>> I hope someone can help. My ugly code is below, with some
data
>> for
>>>>> testing.
>>>>>
>>>>>
>>>>> Stuart
>>>>>
>>>>>
>>>>> Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical
Senior
>>>>> Lecturer and Honorary Consultant Pychiatrist Institute of
Mental
>>>>> Health, Innovation Park Triumph Road, Nottingham, Notts.
NG7 2TU.
>>> UK
>>>>> Tel. +44
>>>>> 115 82 30419
>>>>>
>> stuart.leask at nottingham.ac.uk<mailto:stuart.leask at
nottingham.ac.uk
>>>>> Google 'Dr Stuart Leask'
>>>>>
>>>>>
>>>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>>>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>>>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>>>> ,1019)
>>>>>
>>>>> DATE <-
>>>>>
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
>>>>>
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>>>>
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>>>>
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
>>>>>
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
>>>>> ,20091224,20050503,19870508,19880223,19880330)
>>>>>
>>>>> id.d <- cbind (ID,DATE )
>>>>> rag.a <- split ( id.d [ ,2 ], id.d [ ,1])
#
>> create
>>>>> ragged array, 1-n DATES for every NAME
>>>>>
>>>>> # Inelegant attempt to remove IDs that only have one entry:
>>>>>
>>>>> rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum)
#add up
>>> the
>>>>> dates per row
>>>>> # Since DATE is in 'year mo da', if there's
only one date, sum
>>>>> will
>>>> be
>>>>> less than 2100000:
>>>>> rag.t <- rag.s [ rag.s > 21000000 ]
>>>>> multi.dates <- rownames ( rag.t )
# all
>> the
>>>> IDs
>>>>> with >1 date
>>>>> rag.am <- rag.a [ multi.dates ]
#
>> rag.am
>>>> only
>>>>> has IDs with > 1 Date
>>>>>
>>>>>
>>>>> # But now I'm stuck.
>>>>> # Each row of the array is rag.am$ID.
>>>>> # So I can't pick columns of DATEs from the ragged
array.
>>>>>
>>>>> This message and any attachment are intended solely for the
>>>>> addressee and may contain confidential information. If you
have
>>>>> received this message in error, please send it back to me,
and
>>>>> immediately delete
>>>> it.
>>>>> Please do not use, copy or disclose the information
contained in
>>>>> this message or in any attachment. Any views or opinions
>>>>> expressed by the author of this email do not necessarily
reflect
>>>>> the views of the University of Nottingham.
>>>>>
>>>>> This message has been checked for viruses but the contents
of an
>>>>> attachment may still contain software viruses which could
damage
>>>>> your computer system:
>>>>> you are advised to perform your own checks. Email
communications
>>>>> with the University of Nottingham may be monitored as
permitted
>> by
>>>>> UK legislation.
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-
>> project.org/posting-
>>>>> guide.html and provide commented, minimal, self-contained,
>>>>> reproducible code.
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
http://www.R-project.org/posting-
>>>> guide.html and provide commented, minimal, self-contained,
>>>> reproducible code.
Sorry, forgot to cc to rhelp Petr> -----Original Message----- > From: PIKAL Petr > Sent: Thursday, October 25, 2012 11:19 AM > To: 'Stuart Leask'; arun (smartpink111 at yahoo.com) > Subject: RE: [r] How to pick colums from a ragged array? > > Hi > > If I understand correctly you now want only to identify rows for which > for a given ID, two or more first or last DATEs are same but DG is > different and put TRUE/FALSE to new column > > fff<-function(data) { > > data$Identify <- FALSE > > testfirst <- function(x) (x[1,"DATE"]==x[2,"DATE"]) & > (x[1,"DG"]!=x[2,"DG"]) testlast <- function(x) { > (x[nrow(x),"DATE"]==x[nrow(x)-1,"DATE"]) & (x[nrow(x),"DG"]!=x[nrow(x)- > 1,"DG"]) > } > > > sel <- as.numeric(names(which(unlist(sapply(split(data,data[,1]), > testfirst))))) > > sel <- c(sel, as.numeric(names(which(unlist(sapply(split(data, > data[,1]), testlast)))))) > > data[data[,1] %in% sel,"Identify"] <- TRUE data } > > I slightly modified my code to get rid of necessary user selection of > first or last variant and put both together, add a new column and > extended testing functions to evaluate DG and look if they are the same > or different. > > Does it suit to your purpose? > > Regards > Petr > > > > > -----Original Message----- > > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk] > > Sent: Wednesday, October 24, 2012 5:25 PM > > To: arun (smartpink111 at yahoo.com); PIKAL Petr; Rui Barradas > > (ruipbarradas at sapo.pt) > > Subject: RE: [r] How to pick colums from a ragged array? > > > > Arun, Petr, Rui, many thanks for your help, and the functions you > have > > written. > > > > You'll recall I wanted to remove these first (or last) duplicates, > > because they represented instances where two different diagnoses (in > > this case, variable DG, value 1, 2, 3, 4 or 5) had been recorded on > > the same day - so I can't say which was 'first' (or 'last'). > > > > Your functions have revealed something I wasn't expecting: In some > > cases, the diagnoses recorded on the duplicated DATEs are the same! > > This is a surprise to me, but probably reflects someone going to two > > different departments in a clinic, and both departments submit data. > I > > have to say that crazy things like this are often a feature of real > > data, which I'm sure you've come across yourselves. > > > > Of course, I don't want to remove records in which I can determine an > > unambiguous 'first diagnosis'. > > > > You have all put in so much effort on my behalf, I'm ashamed to ask, > > but I wonder if any of the functions you've written could do this > with > > a little more Indexing and the 'duplicate' function So the function > > should only exclude an ID, having identified a first (or last) DATE > > duplicate, the DGs for these two dates are different. > > > > Test dataset: > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > ,1019) > > > > DATE <- > > c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > > ,20060111,20071119,20080107,20080407,20080521,20080521,20041005 > > ,20070905,20020814,20021125,20040429,20040429,20071205,20071205 > > ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > > ,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521 > > ,20091224,20050503,19870508,19870508,19880330) > > > > DG<- > > > c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4, > > 3 > > ,2,2,2,1,1) > > > > id.d<-data.frame(ID,DATE,DG) > > id.d > > > > # Considering Ruis getRepeat function: > > > > g.r<-getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, > > first = FALSE) to get the last ones > > g.rr<-do.call(rbind, g.r) # put the data into a matrix > > > > # I can remove the date duplicates with: > > g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),] > > > > I'm not sure how to add this to your suggestions, Arun & Petr... > > > > > > Stuart > > > > > > -----Original Message----- > > From: PIKAL Petr [mailto:petr.pikal at precheza.cz] > > Sent: 23 October 2012 15:24 > > To: Stuart Leask > > Subject: RE: [r] How to pick colums from a ragged array? > > > > Hi > > > > I assumed that id.d is data frame > > > > id.d <- data.frame (ID,DATE ) > > > > and > > > > fff(id.d) > > > > works for me > > > > Petr > > > > > > > -----Original Message----- > > > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk] > > > Sent: Tuesday, October 23, 2012 3:13 PM > > > To: PIKAL Petr > > > Subject: RE: [r] How to pick colums from a ragged array? > > > > > > Hi Petr. > > > I see what you mean it should do, but when I run it I get an error > > > (see below). > > > Stuart > > > > > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > > + ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > > + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > > + ,1019) > > > > > > > > DATE <- > > > + c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > > > + ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > > > + ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > > > + ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > > > + ,20070112,20070514, 19870508,20040205,20040205, > 20091120,20091210 > > > + ,20091224,20050503,19870508,19870508,19880330) > > > > > > > > id.d <- cbind (ID,DATE ) > > > > fff<-function(data, first=TRUE, remove=FALSE) { > > > + > > > + testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > > > + x[nrow(x),2]==x[nrow(x)-1,2] > > > + > > > + if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, > > > + data[,1]), testfirst))))) else sel <- > > > + as.numeric(names(which(unlist(sapply(split(data, data[,1]), > > > + testlast))))) > > > + > > > + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% > > > + sel,] } > > > > > > > > fff(id.d) > > > Error in x[1, 2] : incorrect number of dimensions > > > > > > > > > > -----Original Message----- > > > From: PIKAL Petr [mailto:petr.pikal at precheza.cz] > > > Sent: 23 October 2012 13:51 > > > To: Stuart Leask; r-help at r-project.org > > > Subject: RE: [r] How to pick colums from a ragged array? > > > > > > Hi > > > > > > > -----Original Message----- > > > > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk] > > > > Sent: Tuesday, October 23, 2012 2:29 PM > > > > To: PIKAL Petr; r-help at r-project.org > > > > Subject: RE: [r] How to pick colums from a ragged array? > > > > > > > > Hi there. > > > > > > > > Not sure I follow what you are doing. > > > > > > > > I want a list of all the IDs that have duplicate DATE entries, > > > > only when the DATE is the earliest (or last) date for that ID. > > > > > > And that is what the function (with 3 small modifications) does > > > > > > > > > fff<-function(data, first=TRUE, remove=FALSE) { > > > > > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > > > x[nrow(x),2]==x[nrow(x)-1,2] > > > > > > if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, > > > data[,1]), testfirst))))) else sel <- > > > as.numeric(names(which(unlist(sapply(split(data, data[,1]), > > > testlast))))) > > > > > > if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,] > > > } > > > > > > See the result of your refined data > > > > > > fff(id.d) > > > ID DATE > > > 5 167 2004-02-05 > > > 6 167 2004-02-05 > > > 22 841 2005-04-21 > > > 23 841 2005-04-21 > > > 24 841 2006-04-28 > > > 25 841 2006-06-02 > > > 26 841 2006-08-16 > > > 27 841 2006-10-25 > > > 28 841 2006-11-29 > > > 29 841 2007-01-12 > > > 30 841 2007-05-14 > > > 38 1019 1987-05-08 > > > 39 1019 1987-05-08 > > > 40 1019 1988-03-30 > > > > fff(id.d, first=F) > > > ID DATE > > > 5 167 2004-02-05 > > > 6 167 2004-02-05 > > > > fff(id.d, remove=T) > > > ID DATE > > > 1 58 2006-08-21 > > > 2 58 2006-12-07 > > > 3 58 2008-01-02 > > > 4 58 2009-09-04 > > > 7 323 2005-11-11 > > > 8 323 2006-01-11 > > > 9 323 2007-11-19 > > > 10 323 2008-01-07 > > > 11 323 2008-04-07 > > > 12 323 2008-05-21 > > > 13 323 2008-07-11 > > > 14 547 2004-10-05 > > > 15 794 2007-09-05 > > > 16 814 2002-08-14 > > > 17 814 2002-11-25 > > > 18 814 2004-04-29 > > > 19 814 2004-04-29 > > > 20 814 2007-12-05 > > > 21 814 2008-02-27 > > > 31 910 1987-05-08 > > > 32 910 2004-02-05 > > > 33 910 2004-02-05 > > > 34 910 2009-11-20 > > > 35 910 2009-12-10 > > > 36 910 2009-12-24 > > > 37 999 2005-05-03 > > > > > > > > > > You can do surgery on fff function to see what result comes from > > > some piece of the function e.g. > > > > > > sapply(split(id.d, id.d[,1]), testlast) > > > > > > Regards > > > Petr > > > > > > > > > > > I have refined my test dataset, to include some tests (e.g. 910 > > > > has the same dup as 1019, but for 910 it's not the earliest > date): > > > > > > > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > > > ,1019) > > > > > > > > DATE <- > > > > c(20060821,20061207,20080102,20090904,20040205,20040205,20051111 > > > > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005 > > > > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227 > > > > ,20050421,20050421,20060428,20060602,20060816,20061025,20061129 > > > > ,20070112,20070514, 19870508,20040205,20040205, > 20091120,20091210 > > > > ,20091224,20050503,19870508,19870508,19880330) > > > > > > > > Correct output: > > > > "167" "841" "1019" > > > > > > > > Stuart > > > > > > > > -----Original Message----- > > > > From: PIKAL Petr [mailto:petr.pikal at precheza.cz] > > > > Sent: 23 October 2012 13:15 > > > > To: Stuart Leask; r-help at r-project.org > > > > Subject: RE: [r] How to pick colums from a ragged array? > > > > > > > > Hi > > > > > > > > Rui's answer brought me to more elaborated solution which still > > > > needs data frame to be ordered by date > > > > > > > > fff<-function(data, first=TRUE, remove=FALSE) { > > > > > > > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > > > > x[length(x),2]==x[length(x)-1,2] > > > > > > > > if(first) sel <- as.numeric(names(which(sapply(split(data, > > > > data[,1]), > > > > testfirst)))) else sel <- > > > > as.numeric(names(which(sapply(split(data, > > > > data[,1]), testlast)))) > > > > > > > > if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] } > > > > > > > > > > > > > fff(id.d) > > > > ID DATE > > > > 31 910 20091105 > > > > 32 910 20091105 > > > > 33 910 20091117 > > > > 34 910 20091119 > > > > 35 910 20091120 > > > > 36 910 20091210 > > > > 37 910 20091224 > > > > 38 910 20091224 > > > > > > > > > fff(id.d, remove=T) > > > > ID DATE > > > > 1 58 20060821 > > > > 2 58 20061207 > > > > 3 58 20080102 > > > > 4 58 20090904 > > > > 5 167 20040205 > > > > 6 167 20040323 > > > > 7 323 20051111 > > > > 8 323 20060111 > > > > 9 323 20071119 > > > > 10 323 20080107 > > > > 11 323 20080407 > > > > 12 323 20080521 > > > > 13 323 20080711 > > > > 14 547 20041005 > > > > 15 794 20070905 > > > > 16 814 20020814 > > > > 17 814 20021125 > > > > 18 814 20040429 > > > > 19 814 20040429 > > > > 20 814 20071205 > > > > 21 814 20080227 > > > > 22 841 20050421 > > > > 23 841 20060130 > > > > 24 841 20060428 > > > > 25 841 20060602 > > > > 26 841 20060816 > > > > 27 841 20061025 > > > > 28 841 20061129 > > > > 29 841 20070112 > > > > 30 841 20070514 > > > > 39 999 20050503 > > > > 40 1019 19870508 > > > > 41 1019 19880223 > > > > 42 1019 19880330 > > > > 43 1019 19880330 > > > > > > > > > > > > > Regards > > > > Petr > > > > > > > > > > > > > -----Original Message----- > > > > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > > > > > project.org] On Behalf Of PIKAL Petr > > > > > Sent: Tuesday, October 23, 2012 1:49 PM > > > > > To: Stuart Leask; r-help at r-project.org > > > > > Subject: Re: [R] [r] How to pick colums from a ragged array? > > > > > > > > > > Hi > > > > > > > > > > I did not check your code and rather followed your explanation. > > > BTW, > > > > > thanks for test data. > > > > > > > > > > small change in data frame to make DATE as Date class > > > > > > > > > > datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <- > > > > > data.frame(ID,datum ) > > > > > > > > > > ordering by date > > > > > > > > > > id.d<-id.d[order(id.d$datum),] > > > > > > > > > > > > > > > two functions to test if first two dates are the same or last > > > > > two dates are the same > > > > > > > > > > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x) > > > > > x[length(x),2]==x[length(x)-1,2] > > > > > > > > > > change one last date in the data frame to be the same as > > > > > previous > > > > > > > > > > id.d[35,2]<-id.d[36,2] > > > > > > > > > > and here are results > > > > > > > > > > sapply(split(id.d, id.d$ID), testlast) > > > > > 58 167 323 547 794 814 841 910 999 1019 > > > > > FALSE FALSE FALSE NA NA FALSE FALSE TRUE NA FALSE > > > > > > > > > > > sapply(split(id.d, id.d$ID), testfirst) > > > > > 58 167 323 547 794 814 841 910 999 1019 > > > > > FALSE FALSE FALSE NA NA FALSE FALSE FALSE NA FALSE > > > > > > > > > > Now you can select ID which is true and remove it from your > data > > > > > which(sapply(split(id.d, id.d$ID), testlast)) > > > > > > > > > > and use it for your data frame to subset/remove id.d$ID => > > > > as.numeric(names(which(sapply(split(id.d, id.d$ID), > testlast)))) > > > [1] > > > > > FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > > > > > FALSE FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE > FALSE > > > > > FALSE > > > > FALSE > > > > > FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE > FALSE > > > > FALSE > > > > > FALSE TRUE TRUE [37] TRUE TRUE TRUE TRUE > > > > > > > > > > However I am not sure if this is exactly what you want. > > > > > > > > > > Regards > > > > > Petr > > > > > > > > > > > -----Original Message----- > > > > > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > > > > > > project.org] On Behalf Of Stuart Leask > > > > > > Sent: Tuesday, October 23, 2012 11:38 AM > > > > > > To: r-help at r-project.org > > > > > > Subject: [R] [r] How to pick colums from a ragged array? > > > > > > > > > > > > I have a large dataset (~1 million rows) of three variables: > > > > > > ID (patient's name), DATE (of appointment) and DIAGNOSIS > > > > > > (given on > > > > that > > > > > > date). > > > > > > Patients may have been assigned more than one diagnosis at > any > > > one > > > > > > appointment - leading to two rows, same ID and DATE but > > > > > > different DIAGNOSIS. > > > > > > The diagnoses may change between appointments. > > > > > > > > > > > > I want to subset the data in two ways: > > > > > > > > > > > > - define groups of patients by the first diagnosis > > given > > > > > > > > > > > > - define groups of patients by the last diagnosis > > given. > > > > > > > > > > > > The problem: > > > > > > Unfortunately, a small number of patients have been given > more > > > > > > than one diagnosis at their first (or last) appointment. > These > > > > > > individuals I need to identify and remove, as it's not > > > > > > possible > > > to > > > > > > say uniquely what their first (or last) diagnosis was. So I > > need > > > > > > to identify and remove these individuals which have pairs of > > > > > > rows with the same ID > > > > > and > > > > > > (lowest or highest) DATE. The size of the dataset precludes > > > > > > the > > > > > option > > > > > > of doing this by eye. > > > > > > > > > > > > I suspect there is a very elegant way of doing this in R. > > > > > > > > > > > > This is what I've come up with: > > > > > > > > > > > > > > > > > > - Sort by DATE then ID > > > > > > > > > > > > - Make a ragged array of DATE by ID > > > > > > > > > > > > - Remove IDs that only occur once. > > > > > > > > > > > > - Subtract the first and second DATEs. Remove IDs > for > > > > which > > > > > > this = zero, as this will only be true for IDs for which the > > > > > > appointment is recorded twice (because there were two > > > > > > diagnoses recorded on this date). > > > > > > > > > > > > - (Then do the same to get the 'last appointment' > > > > > duplicates, > > > > > > by reversing the initial sort by DATE.) > > > > > > > > > > > > I am stuck at the 'Subtract dates' step: I would like to get > > the > > > > > > data out of the ragged array by columns (so e.g. I end up > with > > a > > > > > > matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates > > out > > > > > > by column from the ragged array. > > > > > > > > > > > > I hope someone can help. My ugly code is below, with some > data > > > for > > > > > > testing. > > > > > > > > > > > > > > > > > > Stuart > > > > > > > > > > > > > > > > > > Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior > > > > > > Lecturer and Honorary Consultant Pychiatrist Institute of > > Mental > > > > > > Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 > > 2TU. > > > > UK > > > > > > Tel. +44 > > > > > > 115 82 30419 > > > > > > > > > stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk > > > > > > > > > > > > > Google 'Dr Stuart Leask' > > > > > > > > > > > > > > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323 > > > > > > ,547,794,814,814,814,814,814,814,841,841,841,841,841 > > > > > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019 > > > > > > ,1019) > > > > > > > > > > > > DATE <- > > > > > > > > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111 > > > > > > > ,20060111,20071119,20080107,20080407,20080521,20080711,2004100 > > > > > > 5 > > > > > > > ,20070905,20020814,20021125,20040429,20040429,20071205,2008022 > > > > > > 7 > > > > > > > ,20050421,20060130,20060428,20060602,20060816,20061025,2006112 > > > > > > 9 > > > > > > > ,20070112,20070514,20091105,20091117,20091119,20091120,2009121 > > > > > > 0 > > > > > > ,20091224,20050503,19870508,19880223,19880330) > > > > > > > > > > > > id.d <- cbind (ID,DATE ) > > > > > > rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # > > > create > > > > > > ragged array, 1-n DATES for every NAME > > > > > > > > > > > > # Inelegant attempt to remove IDs that only have one entry: > > > > > > > > > > > > rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) > #add > > up > > > > the > > > > > > dates per row > > > > > > # Since DATE is in 'year mo da', if there's only one date, > sum > > > > > > will > > > > > be > > > > > > less than 2100000: > > > > > > rag.t <- rag.s [ rag.s > 21000000 ] > > > > > > multi.dates <- rownames ( rag.t ) # > all > > > the > > > > > IDs > > > > > > with >1 date > > > > > > rag.am <- rag.a [ multi.dates ] # > > > rag.am > > > > > only > > > > > > has IDs with > 1 Date > > > > > > > > > > > > > > > > > > # But now I'm stuck. > > > > > > # Each row of the array is rag.am$ID. > > > > > > # So I can't pick columns of DATEs from the ragged array. > > > > > > > > > > > > This message and any attachment are intended solely for the > > > > > > addressee and may contain confidential information. If you > > > > > > have received this message in error, please send it back to > > > > > > me, and immediately delete > > > > > it. > > > > > > Please do not use, copy or disclose the information contained > > in > > > > > > this message or in any attachment. Any views or opinions > > > > > > expressed by the author of this email do not necessarily > > reflect > > > > > > the views of the University of Nottingham. > > > > > > > > > > > > This message has been checked for viruses but the contents of > > an > > > > > > attachment may still contain software viruses which could > > damage > > > > > > your computer system: > > > > > > you are advised to perform your own checks. Email > > communications > > > > > > with the University of Nottingham may be monitored as > > > > > > permitted > > > by > > > > > > UK legislation. > > > > > > [[alternative HTML version deleted]] > > > > > > > > > > > > ______________________________________________ > > > > > > R-help at r-project.org mailing list > > > > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > > > > PLEASE do read the posting guide http://www.R- > > > project.org/posting- > > > > > > guide.html and provide commented, minimal, self-contained, > > > > > > reproducible code. > > > > > > > > > > ______________________________________________ > > > > > R-help at r-project.org mailing list > > > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > > > PLEASE do read the posting guide http://www.R- > > project.org/posting- > > > > > guide.html and provide commented, minimal, self-contained, > > > > > reproducible code.
Hi Stuart,
So, I guess my result (below) serves the purpose!
A.K.
----- Original Message -----
From: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
To: arun <smartpink111 at yahoo.com>
Cc:
Sent: Thursday, October 25, 2012 3:13 AM
Subject: RE: [r] How to pick colums from a ragged array?
Even confusing myself now, serves me right for replying late at night!
** If DGs are the same, then the first (or last) diagnosis is unambiguous even
if date is duplicated - so I can use the data.**
Consider we want INCLUDE.FIRST to look at first dates.
Duplicate dates: 167, 323,814, 841, 910 1019
AND This dup is the first date: 167, 841, 1019
AND This dup has different DGs: 841 1019
= give all rows of 841 and 1019? FALSE.
(All other rows TRUE)
Now consider we want INCLUDE.LAST to look at last dates.
Duplicate dates: 167, 323,814, 841, 910 1019
AND This dup is the last date: 167, 323, 814
AND This dup has different DGs: 323
= give all rows of 323 FALSE.
(All others TRUE)
Of course, I'm happy to run a function twice, either one with a
'first/last' switch, or one that
assumes initial order of sort by DATE determines whether you end up with first
or last date duplicates.
-----Original Message-----
From: arun [mailto:smartpink111 at yahoo.com]
Sent: 24 October 2012 22:59
To: Stuart Leask
Subject: Re: [r] How to pick colums from a ragged array?
Hi Stuart,
So, 167 should be FALSE eventhough DG is same because it comes under
earliest/first date, but TRUE for 814 because it comes under latest/last date.?
167 comes under both cases.
Let me try to make sense of that:
I am just pasting my earlier solution and its results again to see whether we
are on the same page:
res1<- data.frame(flag=tapply(id.d[,2],id.d[,1],FUN=function(x)
head(duplicated(x)|duplicated(x,fromLast=TRUE),1)|tail(duplicated(x)|duplicated(x,fromLast=TRUE),1)))
res2<-id.d[id.d[,1]%in%names(res1[res1$flag==TRUE,])&(duplicated(id.d[,1:2])|duplicated(id.d[,1:2],fromLast=TRUE)),]
res3<-res2[!res2$ID%in%
res2[duplicated(res2)|duplicated(res2,fromLast=TRUE),]$ID,]
id.d1<-id.d
bad<-id.d1[id.d1$ID%in%res3$ID,]
bad$INCLUDE<-FALSE
res4<-merge(id.d1,bad,all=TRUE)
res4$INCLUDE[is.na(res4$INCLUDE)]<-TRUE
res4
? ? ID? ? DATE DG INCLUDE
1? ? 58 20060821? 1? ? TRUE
2? ? 58 20061207? 2? ? TRUE
3? ? 58 20080102? 1? ? TRUE
4? ? 58 20090904? 1? ? TRUE
5? 167 20040205? 4? ? TRUE
6? 167 20040205? 4? ? TRUE
7? 323 20051111? 3? FALSE
8? 323 20060111? 2? FALSE
9? 323 20071119? 3? FALSE
10? 323 20080107? 2? FALSE
11? 323 20080407? 1? FALSE
12? 323 20080521? 2? FALSE
13? 323 20080521? 3? FALSE
14? 547 20041005? 2? ? TRUE
15? 794 20070905? 1? ? TRUE
16? 814 20020814? 2? ? TRUE
17? 814 20021125? 2? ? TRUE
18? 814 20040429? 2? ? TRUE
19? 814 20040429? 2? ? TRUE
20? 814 20071205? 2? ? TRUE
21? 814 20071205? 2? ? TRUE
22? 841 20050421? 1? FALSE
23? 841 20050421? 2? FALSE
24? 841 20060428? 1? FALSE
25? 841 20060602? 1? FALSE
26? 841 20060816? 1? FALSE
27? 841 20061025? 1? FALSE
28? 841 20061129? 1? FALSE
29? 841 20070112? 1? FALSE
30? 841 20070514? 4? FALSE
31? 910 19870508? 3? ? TRUE
32? 910 20040205? 3? ? TRUE
33? 910 20040205? 3? ? TRUE
34? 910 20080521? 3? ? TRUE
35? 910 20080521? 4? ? TRUE
36? 910 20091224? 2? ? TRUE
37? 999 20050503? 2? ? TRUE
38 1019 19870508? 1? FALSE
39 1019 19870508? 2? FALSE
40 1019 19880330? 1? FALSE
A.K.
----- Original Message -----
From: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
To: arun <smartpink111 at yahoo.com>; Rui Barradas <ruipbarradas at
sapo.pt>
Cc: R help <r-help at r-project.org>
Sent: Wednesday, October 24, 2012 5:40 PM
Subject: RE: [r] How to pick colums from a ragged array?
I mis-typed, missing an if. I think you've got it, but let me try again:
"The function should:
-? put FALSE in a column for every instance of an ID IF ( that ID has a first
(or last) DATE duplicated ) AND IF (the DGs for the duplicated dates are
different)."
So for the earliest/first date function, INCLUDE should be TRUE, apart from
FALSE for _all_ the instances of IDs 167, 841 and 1019 For the latest/last date
function, INCLUDE should be TRUE, apart from FALSE for all the instances of ID?
323.
Stuart
-----Original Message-----
From: arun [mailto:smartpink111 at yahoo.com]
Sent: 24 October 2012 21:30
To: Rui Barradas
Cc: R help; Stuart Leask
Subject: Re: [r] How to pick colums from a ragged array?
Hi,
According to the OP "So the function should only exclude an ID, having
identified a first (or last) DATE duplicate, the DGs for these two dates are
different."
Rui:
By running your modified function (using dte <- tapply(x[,2], x[,1], FUN =
function(x) duplicated(fun(x, 2),fromLast = TRUE))),
id.d$INCLUDE <- !(rm1 | rm2)
head(id.d)
#? ? ID? ? DATE DG INCLUDE
#1? ? 58 20060821? 1? ? TRUE
#2? ? 58 20061207? 2? ? TRUE
#3? ? 58 20080102? 1? ? TRUE
#4? ? 58 20090904? 1? ? TRUE
#5? 167 20040205? 4? FALSE
#6? 167 20040205? 4? FALSE
For #167, DGs are same.? Not sure whether to exclude it or not.
My modified solution is similar but I am excluding 167 and 814.
fun1<-function(dat){
res1first<- data.frame(flag=tapply(dat[,2],dat[,1],FUN=function(x)
head(duplicated(x)|duplicated(x,fromLast=TRUE),1)))
res1last<- data.frame(flag=tapply(dat[,2],dat[,1],FUN=function(x)
tail(duplicated(x)|duplicated(x,fromLast=TRUE),1)))
res2first<-dat[dat[,1]%in%names(res1first[res1first$flag==TRUE,])&(duplicated(dat[,1:2])|duplicated(dat[,1:2],fromLast=TRUE)),]
res2last<-dat[dat[,1]%in%names(res1last[res1last$flag==TRUE,])&(duplicated(dat[,1:2])|duplicated(dat[,1:2],fromLast=TRUE)),]
res3first<-res2first[!res2first$ID%in%
res2first[duplicated(res2first)|duplicated(res2first,fromLast=TRUE),]$ID,]
res3last<-res2last[!res2last$ID%in%
res2last[duplicated(res2last)|duplicated(res2last,fromLast=TRUE),]$ID,]
res3firstsubset<-do.call(rbind,lapply(split(res3first,res3first$ID),head,1))
res3firstsubset$INCLUDE<-FALSE
res3lastsubset<-do.call(rbind,lapply(split(res3last,res3last$ID),tail,1))
res3lastsubset$INCLUDE<-FALSE
res4<-merge(dat,merge(res3first,merge(res3firstsubset,merge(res3lastsubset,res3last,all=TRUE),all=TRUE),all=TRUE),all=TRUE)
res4$INCLUDE[is.na(res4$INCLUDE)]<-TRUE
res4
}
tail(fun1(id.d))
#? ? ID? ? DATE DG INCLUDE
#35? 910 20080521? 4? ? TRUE
#36? 910 20091224? 2? ? TRUE
#37? 999 20050503? 2? ? TRUE
#38 1019 19870508? 1? ? TRUE
#39 1019 19870508? 2? FALSE
#40 1019 19880330? 1? ? TRUE
A.K.
----- Original Message -----
From: Rui Barradas <ruipbarradas at sapo.pt>
To: arun <smartpink111 at yahoo.com>
Cc: R help <r-help at r-project.org>; Stuart Leask <Stuart.Leask at
nottingham.ac.uk>
Sent: Wednesday, October 24, 2012 2:50 PM
Subject: Re: [r] How to pick colums from a ragged array?
Hello,
Inline.
Em 24-10-2012 19:05, arun escreveu:> Hi Rui,
>
> I think now our results are matching except in the INCLUDE column
>
> id.d[c(11:13,22:24,38:40),]
> #? ? ID? ? DATE DG INCLUDE
> #11? 323 20080407? 1? ? TRUE
> #12? 323 20080521? 2? FALSE
> #13? 323 20080521? 3? ? TRUE
> #22? 841 20050421? 1? ? TRUE
> #23? 841 20050421? 2? FALSE
> #24? 841 20060428? 1? ? TRUE
> #38 1019 19870508? 2? ? TRUE
> #39 1019 19870508? 1? FALSE
> #40 1019 19880330? 1? ? TRUE
>
>
> I thought all the rows with the above IDS would be FALSE
Why? Look at the last ID, 1019. The last of all must be included, the date
doesn't repeat. And one of the first must also be included, if not we would
be completely excluding that date. Or at least this is how I'm understanding
the problem.
Rui Barradas>? (from my solution):
>
> res4[c(11:13,22:24,38:40),]
>? ? ? ID? ? DATE DG INCLUDE
> #11? 323 20080407? 1? FALSE
> #12? 323 20080521? 2? FALSE
> #13? 323 20080521? 3? FALSE
> #22? 841 20050421? 1? FALSE
> #23? 841 20050421? 2? FALSE
> #24? 841 20060428? 1? FALSE
> #38 1019 19870508? 1? FALSE
> #39 1019 19870508? 2? FALSE
> #40 1019 19880330? 1? FALSE
>
> A.K.
>
>
>
>
> ----- Original Message -----
> From: Rui Barradas <ruipbarradas at sapo.pt>
> To: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
> Cc: "arun (smartpink111 at yahoo.com)" <smartpink111 at
yahoo.com>; PIKAL
> Petr <petr.pikal at precheza.cz>; r-help <r-help at
r-project.org>
> Sent: Wednesday, October 24, 2012 1:41 PM
> Subject: Re: [r] How to pick colums from a ragged array?
>
> Hello,
>
> Using one of Arun's ideas, some post ago, this new function returns a
> logical index into id.d of the rows that should be _removed_, hence
> rm1 and rm2. I think
>
>
>
> getRepLogical <- function(x, first = TRUE){
>? ? ? fun <- if(first) head else tail
>? ? ? dte <- tapply(x[,2], x[,1], FUN = function(x) duplicated(fun(x,
>2)))
>? ? ? len <- tapply(x[,2], x[,1], FUN = length)
>? ? ? lst <- lapply(seq_along(dte), function(i) c(dte[[i]], rep(FALSE,
>if(len[[i]] > 2) len[[i]] - 2 else 0)))
>? ? ? lst <- if(first) lst else lapply(lst, rev)
>? ? ? i1 <- unlist(lst)
>? ? ? dg <- tapply(x[,3], x[,1], FUN = function(x) !duplicated(fun(x,
>2)))
>? ? ? lst <- lapply(seq_along(dte), function(i) c(dg[[i]], rep(FALSE,
>if(len[[i]] > 2) len[[i]] - 2 else 0)))
>? ? ? lst <- if(first) lst else lapply(lst, rev)
>? ? ? i2 <- unlist(lst)
>? ? ? i1 & i2
> }
>
> rm1 <- getRepLogical(id.d)
> rm2 <- getRepLogical(id.d, first = FALSE)
>
> id.d[rm1, ]
> id.d[rm2, ]
>
> id.d$INCLUDE <- !(rm1 | rm2)
>
>
> Hope this helps,
>
> Rui Barradas
> Em 24-10-2012 16:41, Stuart Leask escreveu:
>> (And, considering? the real application, the functions ideally should
>> probably output a variable INCLUDE, the same length as the original
>> data, with TRUE and FALSE for whether or not that row should be
>> included...)
>>
>> -----Original Message-----
>> From: Leask Stuart
>> Sent: 24 October 2012 16:25
>> To: arun (smartpink111 at yahoo.com); 'PIKAL Petr'; Rui
Barradas
>> (ruipbarradas at sapo.pt)
>> Subject: RE: [r] How to pick colums from a ragged array?
>>
>> Arun, Petr, Rui, many thanks for your help, and the functions you have
written.
>>
>> You'll recall I wanted to remove these first (or last) duplicates,
because they represented instances where two different diagnoses (in this case,
variable DG, value 1, 2, 3, 4 or 5) had been recorded on the same day - so I
can't say which was 'first' (or 'last').
>>
>> Your functions have revealed something I wasn't expecting: In some
cases, the diagnoses recorded on the duplicated DATEs are the same!
>> This is a surprise to me, but probably reflects someone going to two
different departments in a clinic, and both departments submit data. I have to
say that crazy things like this are often a feature of real data, which I'm
sure you've come across yourselves.
>>
>> Of course, I don't want to remove records in which I can determine
an unambiguous 'first diagnosis'.
>>
>> You have all put in so much effort on my behalf, I'm ashamed to
ask,
>> but I wonder if any of the functions you've written could do this
>> with a little more Indexing and the 'duplicate' function So the
function should only exclude an ID, having identified a first (or last) DATE
duplicate, the DGs for these two dates are different.
>>
>> Test dataset:
>>
>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>> ,1019)
>>
>> DATE <-
>>? ? c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>>? ? ,20060111,20071119,20080107,20080407,20080521,20080521,20041005
>>? ? ,20070905,20020814,20021125,20040429,20040429,20071205,20071205
>>? ? ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>>? ? ,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521
>>? ? ,20091224,20050503,19870508,19870508,19880330)
>>
>> DG<-
>> c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4
>> ,3,2,2,2,1,1)
>>
>> id.d<-data.frame(ID,DATE,DG)
>> id.d
>>
>> # Considering Ruis? getRepeat function:
>>
>> g.r<-getRepeat(id.d)? ? # defaults to first = TRUE getRepeat(id.d,
>> first = FALSE)? to get the last ones g.rr<-do.call(rbind, g.r) # put
>> the data into a matrix
>>
>> # I can remove the date duplicates with:
>> g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),]
>>
>> I'm not sure how to add this to your suggestions, Arun &
Petr...
>>
>>
>> Stuart
>>
>>
>> -----Original Message-----
>> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
>> Sent: 23 October 2012 15:24
>> To: Stuart Leask
>> Subject: RE: [r] How to pick colums from a ragged array?
>>
>> Hi
>>
>> I assumed that id.d is data frame
>>
>> id.d <- data.frame (ID,DATE )
>>
>> and
>>
>> fff(id.d)
>>
>> works for me
>>
>> Petr
>>
>>
>>> -----Original Message-----
>>> From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
>>> Sent: Tuesday, October 23, 2012 3:13 PM
>>> To: PIKAL Petr
>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>
>>> Hi Petr.
>>> I see what you mean it should do, but when I run it I get an error
>>> (see below).
>>> Stuart
>>>
>>>
>>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>> + ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>> + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>> + ,1019)
>>>> DATE <-
>>> +? c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>>> +? ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>> +? ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>> +? ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>>> +? ,20070112,20070514, 19870508,20040205,20040205,
20091120,20091210
>>> +? ,20091224,20050503,19870508,19870508,19880330)
>>>>? ? id.d <- cbind (ID,DATE )
>>>> fff<-function(data, first=TRUE, remove=FALSE) {
>>> +
>>> + testfirst <- function(x) x[1,2]==x[2,2] testlast <-
function(x)
>>> + x[nrow(x),2]==x[nrow(x)-1,2]
>>> +
>>> + if(first) sel <-
as.numeric(names(which(unlist(sapply(split(data,
>>> + data[,1]), testfirst))))) else sel <-
>>> + as.numeric(names(which(unlist(sapply(split(data, data[,1]),
>>> + testlast)))))
>>> +
>>> + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in%
>>> + sel,] }
>>>> fff(id.d)
>>> Error in x[1, 2] : incorrect number of dimensions -----Original
>>> Message-----
>>> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
>>> Sent: 23 October 2012 13:51
>>> To: Stuart Leask; r-help at r-project.org
>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>
>>> Hi
>>>
>>>> -----Original Message-----
>>>> From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
>>>> Sent: Tuesday, October 23, 2012 2:29 PM
>>>> To: PIKAL Petr; r-help at r-project.org
>>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>>
>>>> Hi there.
>>>>
>>>> Not sure I follow what you are doing.
>>>>
>>>> I want a list of all the IDs that have duplicate DATE entries,
only
>>>> when the DATE is the earliest (or last) date for that ID.
>>> And that is what the function (with 3 small modifications) does
>>>
>>>
>>> fff<-function(data, first=TRUE, remove=FALSE) {
>>>
>>> testfirst <- function(x) x[1,2]==x[2,2] testlast <-
function(x)
>>> x[nrow(x),2]==x[nrow(x)-1,2]
>>>
>>> if(first) sel <-
as.numeric(names(which(unlist(sapply(split(data,
>>> data[,1]), testfirst))))) else sel <-
>>> as.numeric(names(which(unlist(sapply(split(data, data[,1]),
>>> testlast)))))
>>>
>>> if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
>>> }
>>>
>>> See the result of your refined data
>>>
>>> fff(id.d)
>>>? ? ? ? ID? ? ? DATE
>>> 5? 167 2004-02-05
>>> 6? 167 2004-02-05
>>> 22? 841 2005-04-21
>>> 23? 841 2005-04-21
>>> 24? 841 2006-04-28
>>> 25? 841 2006-06-02
>>> 26? 841 2006-08-16
>>> 27? 841 2006-10-25
>>> 28? 841 2006-11-29
>>> 29? 841 2007-01-12
>>> 30? 841 2007-05-14
>>> 38 1019 1987-05-08
>>> 39 1019 1987-05-08
>>> 40 1019 1988-03-30
>>>> fff(id.d, first=F)
>>>? ? ? ID? ? ? DATE
>>> 5 167 2004-02-05
>>> 6 167 2004-02-05
>>>> fff(id.d, remove=T)
>>>? ? ? ? ID? ? ? DATE
>>> 1? 58 2006-08-21
>>> 2? 58 2006-12-07
>>> 3? 58 2008-01-02
>>> 4? 58 2009-09-04
>>> 7? 323 2005-11-11
>>> 8? 323 2006-01-11
>>> 9? 323 2007-11-19
>>> 10 323 2008-01-07
>>> 11 323 2008-04-07
>>> 12 323 2008-05-21
>>> 13 323 2008-07-11
>>> 14 547 2004-10-05
>>> 15 794 2007-09-05
>>> 16 814 2002-08-14
>>> 17 814 2002-11-25
>>> 18 814 2004-04-29
>>> 19 814 2004-04-29
>>> 20 814 2007-12-05
>>> 21 814 2008-02-27
>>> 31 910 1987-05-08
>>> 32 910 2004-02-05
>>> 33 910 2004-02-05
>>> 34 910 2009-11-20
>>> 35 910 2009-12-10
>>> 36 910 2009-12-24
>>> 37 999 2005-05-03
>>> You can do surgery on fff function to see what result comes from
>>>some? piece of the function e.g.
>>>
>>> sapply(split(id.d, id.d[,1]), testlast)
>>>
>>> Regards
>>> Petr
>>>
>>>> I have refined my test dataset, to include some tests (e.g. 910
has
>>>> the same dup as 1019, but for 910 it's not the earliest
date):
>>>>
>>>>
>>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>>> ,1019)
>>>>
>>>> DATE <-
>>>>
>>>>c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>>>>? ?
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>>>? ?
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>>>? ?
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>>>>? ? ,20070112,20070514, 19870508,20040205,20040205,
>>>>20091120,20091210
>>>>? ? ,20091224,20050503,19870508,19870508,19880330)
>>>>
>>>> Correct output:
>>>> "167"? "841"? "1019"
>>>>
>>>> Stuart
>>>>
>>>> -----Original Message-----
>>>> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
>>>> Sent: 23 October 2012 13:15
>>>> To: Stuart Leask; r-help at r-project.org
>>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>>
>>>> Hi
>>>>
>>>> Rui's answer brought me to more elaborated solution which
still
>>>> needs data frame to be ordered by date
>>>>
>>>> fff<-function(data, first=TRUE, remove=FALSE) {
>>>>
>>>> testfirst <- function(x) x[1,2]==x[2,2] testlast <-
function(x)
>>>> x[length(x),2]==x[length(x)-1,2]
>>>>
>>>> if(first) sel <- as.numeric(names(which(sapply(split(data,
>>>> data[,1]),
>>>> testfirst)))) else sel <-
as.numeric(names(which(sapply(split(data,
>>>> data[,1]), testlast))))
>>>>
>>>> if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }
>>>>
>>>>
>>>>> fff(id.d)
>>>>? ? ? ? ID? ? DATE
>>>> 31 910 20091105
>>>> 32 910 20091105
>>>> 33 910 20091117
>>>> 34 910 20091119
>>>> 35 910 20091120
>>>> 36 910 20091210
>>>> 37 910 20091224
>>>> 38 910 20091224
>>>>
>>>>> fff(id.d, remove=T)
>>>>? ? ? ? ID? ? DATE
>>>> 1? ? 58 20060821
>>>> 2? ? 58 20061207
>>>> 3? ? 58 20080102
>>>> 4? ? 58 20090904
>>>> 5? 167 20040205
>>>> 6? 167 20040323
>>>> 7? 323 20051111
>>>> 8? 323 20060111
>>>> 9? 323 20071119
>>>> 10? 323 20080107
>>>> 11? 323 20080407
>>>> 12? 323 20080521
>>>> 13? 323 20080711
>>>> 14? 547 20041005
>>>> 15? 794 20070905
>>>> 16? 814 20020814
>>>> 17? 814 20021125
>>>> 18? 814 20040429
>>>> 19? 814 20040429
>>>> 20? 814 20071205
>>>> 21? 814 20080227
>>>> 22? 841 20050421
>>>> 23? 841 20060130
>>>> 24? 841 20060428
>>>> 25? 841 20060602
>>>> 26? 841 20060816
>>>> 27? 841 20061025
>>>> 28? 841 20061129
>>>> 29? 841 20070112
>>>> 30? 841 20070514
>>>> 39? 999 20050503
>>>> 40 1019 19870508
>>>> 41 1019 19880223
>>>> 42 1019 19880330
>>>> 43 1019 19880330
>>>> Regards
>>>> Petr
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-
>>>>> project.org] On Behalf Of PIKAL Petr
>>>>> Sent: Tuesday, October 23, 2012 1:49 PM
>>>>> To: Stuart Leask; r-help at r-project.org
>>>>> Subject: Re: [R] [r] How to pick colums from a ragged
array?
>>>>>
>>>>> Hi
>>>>>
>>>>> I did not check your code and rather followed your
explanation.
>>> BTW,
>>>>> thanks for test data.
>>>>>
>>>>> small change in data frame to make DATE as Date class
>>>>>
>>>>> datum<-as.Date(as.character(DATE),
format="%Y%m%d") id.d <-
>>>>> data.frame(ID,datum )
>>>>>
>>>>> ordering by date
>>>>>
>>>>> id.d<-id.d[order(id.d$datum),]
>>>>>
>>>>>
>>>>> two functions to test if first two dates are the same or
last two
>>>>> dates are the same
>>>>>
>>>>> testfirst <- function(x) x[1,2]==x[2,2] testlast <-
function(x)
>>>>> x[length(x),2]==x[length(x)-1,2]
>>>>>
>>>>> change one last date in the data frame to be the same as
previous
>>>>>
>>>>> id.d[35,2]<-id.d[36,2]
>>>>>
>>>>> and here are results
>>>>>
>>>>> sapply(split(id.d, id.d$ID), testlast)
>>>>>? ? ? 58? 167? 323? 547? 794? 814? 841? 910? 999?
1019
>>>>>FALSE FALSE FALSE? ? NA? ? NA FALSE FALSE? TRUE? ? NA FALSE
>>>>>
>>>>>> sapply(split(id.d, id.d$ID), testfirst)
>>>>>? ? ? 58? 167? 323? 547? 794? 814? 841? 910? 999?
1019
>>>>>FALSE FALSE FALSE? ? NA? ? NA FALSE FALSE FALSE? ? NA FALSE
>>>>>
>>>>> Now you can select ID which is true and remove it from your
data
>>>>> which(sapply(split(id.d, id.d$ID), testlast))
>>>>>
>>>>> and use it for your data frame to subset/remove id.d$ID
=>>>>> as.numeric(names(which(sapply(split(id.d, id.d$ID),
testlast))))
>>> [1]
>>>>> FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
>>>>> FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
>>>> FALSE
>>>>> FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE
>>>> FALSE
>>>>> FALSE TRUE? TRUE [37]? TRUE? TRUE? TRUE? TRUE
>>>>>
>>>>> However I am not sure if this is exactly what you want.
>>>>>
>>>>> Regards
>>>>> Petr
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-
>>>>>> project.org] On Behalf Of Stuart Leask
>>>>>> Sent: Tuesday, October 23, 2012 11:38 AM
>>>>>> To: r-help at r-project.org
>>>>>> Subject: [R] [r] How to pick colums from a ragged
array?
>>>>>>
>>>>>> I have a large dataset (~1 million rows) of three
variables: ID
>>>>>> (patient's name), DATE (of appointment) and
DIAGNOSIS (given on
>>>> that
>>>>>> date).
>>>>>> Patients may have been assigned more than one diagnosis
at any
>>> one
>>>>>> appointment - leading to two rows, same ID and DATE but
different
>>>>>> DIAGNOSIS.
>>>>>> The diagnoses may change between appointments.
>>>>>>
>>>>>> I want to subset the data in two ways:
>>>>>>
>>>>>> -? ? ? ? ? define groups of patients by the first
diagnosis given
>>>>>>
>>>>>> -? ? ? ? ? define groups of patients by the last
diagnosis given.
>>>>>>
>>>>>> The problem:
>>>>>> Unfortunately, a small number of patients have been
given more
>>>>>> than one diagnosis at their first (or last)
appointment. These
>>>>>> individuals I need to identify and remove, as it's
not possible
>>> to
>>>>>> say uniquely what their first (or last) diagnosis was.
So I need
>>>>>> to identify and remove these individuals which have
pairs of rows
>>>>>> with the same ID
>>>>> and
>>>>>> (lowest or highest) DATE. The size of the dataset
precludes the
>>>>> option
>>>>>> of doing this by eye.
>>>>>>
>>>>>> I suspect there is a very elegant way of doing this in
R.
>>>>>>
>>>>>> This is what I've come up with:
>>>>>>
>>>>>>
>>>>>> -? ? ? ? ? Sort by DATE then ID
>>>>>>
>>>>>> -? ? ? ? ? Make a ragged array of DATE by ID
>>>>>>
>>>>>> -? ? ? ? ? Remove IDs that only occur once.
>>>>>>
>>>>>> -? ? ? ? ? Subtract the first and second DATEs. Remove
IDs for
>>>> which
>>>>>> this = zero, as this will only be true for IDs for
which the
>>>>>> appointment is recorded twice (because there were two
diagnoses
>>>>>> recorded on this date).
>>>>>>
>>>>>> -? ? ? ? ? (Then do the same to get the 'last
appointment'
>>>>> duplicates,
>>>>>> by reversing the initial sort by DATE.)
>>>>>>
>>>>>> I am stuck at the 'Subtract dates' step: I
would like to get the
>>>>>> data out of the ragged array by columns (so e.g. I end
up with a
>>>>>> matrix of ID, 1st DATE, 2nd DATE). But I can't get
the dates out
>>>>>> by column from the ragged array.
>>>>>>
>>>>>> I hope someone can help. My ugly code is below, with
some data
>>> for
>>>>>> testing.
>>>>>>
>>>>>>
>>>>>> Stuart
>>>>>>
>>>>>>
>>>>>> Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical
Senior
>>>>>> Lecturer and Honorary Consultant Pychiatrist Institute
of Mental
>>>>>> Health, Innovation Park Triumph Road, Nottingham,
Notts. NG7 2TU.
>>>> UK
>>>>>> Tel. +44
>>>>>> 115 82 30419
>>>>>>
>>> stuart.leask at nottingham.ac.uk<mailto:stuart.leask at
nottingham.ac.uk
>>>>>> Google 'Dr Stuart Leask'
>>>>>>
>>>>>>
>>>>>> ID <-
c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>>>>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>>>>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>>>>> ,1019)
>>>>>>
>>>>>> DATE <-
>>>>>>
c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
>>>>>>
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>>>>>
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>>>>>
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
>>>>>>
,20070112,20070514,20091105,20091117,20091119,20091120,20091210
>>>>>> ,20091224,20050503,19870508,19880223,19880330)
>>>>>>
>>>>>> id.d <- cbind (ID,DATE )
>>>>>> rag.a? <-? split ( id.d [ ,2 ], id.d [ ,1])? ? ? ? ?
? ? #
>>> create
>>>>>> ragged array, 1-n DATES for every NAME
>>>>>>
>>>>>> # Inelegant attempt to remove IDs that only have one
entry:
>>>>>>
>>>>>> rag.s <-tapply? (id.d [ ,2], id.d [ ,1], sum)? ? ? ?
? ? #add up
>>>> the
>>>>>> dates per row
>>>>>> # Since DATE is in 'year mo da', if there's
only one date, sum
>>>>>> will
>>>>> be
>>>>>> less than 2100000:
>>>>>> rag.t <- rag.s [ rag.s > 21000000 ] multi.dates
<- rownames (
>>>>>> rag.t )? ? ? ? ? ? ? ? ? ? ? ? # all
>>> the
>>>>> IDs
>>>>>> with >1 date
>>>>>> rag.am <- rag.a [ multi.dates ]? ? ? ? ? ? ? ? ? ? ?
? ? #
>>> rag.am
>>>>> only
>>>>>> has IDs with > 1 Date
>>>>>>
>>>>>>
>>>>>> # But now I'm stuck.
>>>>>> # Each row of the array is rag.am$ID.
>>>>>> # So I can't pick columns of DATEs from the ragged
array.
>>>>>>
>>>>>> This message and any attachment are intended solely for
the
>>>>>> addressee and may contain confidential information. If
you have
>>>>>> received this message in error, please send it back to
me, and
>>>>>> immediately delete
>>>>> it.
>>>>>> Please do not use, copy or disclose the information
contained in
>>>>>> this message or in any attachment.? Any views or
opinions
>>>>>> expressed by the author of this email do not
necessarily reflect
>>>>>> the views of the University of Nottingham.
>>>>>>
>>>>>> This message has been checked for viruses but the
contents of an
>>>>>> attachment may still contain software viruses which
could damage
>>>>>> your computer system:
>>>>>> you are advised to perform your own checks. Email
communications
>>>>>> with the University of Nottingham may be monitored as
permitted
>>> by
>>>>>> UK legislation.
>>>>>>? ? ? ? ? ? [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-
>>> project.org/posting-
>>>>>> guide.html and provide commented, minimal,
self-contained,
>>>>>> reproducible code.
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
http://www.R-project.org/posting-
>>>>> guide.html and provide commented, minimal, self-contained,
>>>>> reproducible code.