arun
2013-Jun-08 23:47 UTC
[R] Subset dataframe with loop searching for unique values in two columns
Hi,

You could try this:

dat2 <- read.table(text='
case pin some_data
"A" "1" "data"
"A" "2" "data"
"A" "1" "data"
"A" "2" "data"
"B" "1" "data"
"B" "2" "data"
', sep="", header=TRUE, stringsAsFactors=FALSE)

dat2[!duplicated(dat2[,1:2]),]
#  case pin some_data
#1    A   1      data
#2    A   2      data
#5    B   1      data
#6    B   2      data

#or
dat2[row.names(unique(dat2[,1:2])),] ##assuming that the third column is different for the duplicated `case` and `pin`
#  case pin some_data
#1    A   1      data
#2    A   2      data
#5    B   1      data
#6    B   2      data

#If `some_data` is same for duplicated rows:
unique(dat2)
#  case pin some_data
#1    A   1      data
#2    A   2      data
#5    B   1      data
#6    B   2      data

A.K.


Hello,

First off, I'm sure that this is posted somewhere, but I've not been able to find what I'm looking for. Please forgive the duplication and thank you for your help!

I have a crime dataset of over 500k observations in one file. To simplify my problem, I have a dataframe that has a "case" ID in one column, a personal ID number (pin) in another, and associated "data" in subsequent columns.

Example:

     case pin some_data
[1,] "A"  "1" "data"
[2,] "A"  "2" "data"
[3,] "A"  "1" "data"
[4,] "A"  "2" "data"
[5,] "B"  "1" "data"
[6,] "B"  "2" "data"

I would like to subset the data so that only unique PINs and CASES are left, along with the subsequent data:

     case pin some_data
[1,] "A"  "1" "data"
[2,] "A"  "2" "data"
[5,] "B"  "1" "data"
[6,] "B"  "2" "data"

I'm teaching myself how to program in R and I'm thinking that I want a loop to say something like:

- find and keep first row of unique PIN & CASE
- if PIN is duplicate but CASE is different, keep first row of dupe PIN & new CASE

Longer Explanation: The PIN identifies an arrested offender. I want to check and see if there was recidivism (repeat offenses and arrests) for each offender/PIN.
The way I can do that is by checking whether a PIN has multiple CASE numbers. I also want to keep the single arrests in the dataset. I have over 6 million cases spanning several years. I hope this makes sense. I've been banging my head on this one for a while and would really appreciate the help!
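
The recidivism check described above (does a PIN appear under more than one CASE?) could be sketched in base R roughly as follows. This is only a minimal sketch and not part of the original reply: the sample data, including the extra case "C", and the names `pairs`, `cases_per_pin`, `repeat_pins`, and `repeat_offender` are made up for illustration.

```r
# Hypothetical sample data in the same shape as the example above,
# with one extra single-arrest row (case "C", pin 3) added for illustration
dat <- data.frame(
  case = c("A", "A", "A", "A", "B", "B", "C"),
  pin  = c(1, 2, 1, 2, 1, 2, 3),
  some_data = "data",
  stringsAsFactors = FALSE
)

# Keep the first row of each unique case/pin pair
pairs <- dat[!duplicated(dat[, c("case", "pin")]), ]

# After deduplication, each row is one distinct case for a pin,
# so a simple count per pin gives the number of cases per offender
cases_per_pin <- table(pairs$pin)

# Pins linked to more than one case (possible repeat offenders)
repeat_pins <- names(cases_per_pin)[cases_per_pin > 1]

# Flag every remaining row; single arrests stay in the data as FALSE
pairs$repeat_offender <- as.character(pairs$pin) %in% repeat_pins
```

No explicit loop is needed here, since `duplicated()` and `table()` work on whole columns at once, which is likely to matter on a dataset with millions of rows.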