Daniel E. Bunker
2005-Feb-03 16:42 UTC
[R] Efficient selection and alteration of dataframe records
Hi All, I am writing a simulation that examines the effects of species extinctions on ecological communties by sequentially removing individuals of a given species (sometimes using weighted probabilities) and replacing the lost individuals with species identities randomly sampled from the remaining individuals. Thus I use two dataframes. One contains all the individuals and their species identities (plotdf). The other contains each species and their associated weights (traitdf). While I have code that works, it runs slowly. I suspect there is a more efficient way. First, I 'sample' one species from the species file (traitdf), then I use that result to 'subset' the individuals dataframe (plotdf) into two new files: individuals of the extincted species (plotdf.del) and retained individuals (plotdf.old). I then use a 'for' loop to run through each record in plotdf.del and randomly sample a new species identity from plotdf.old. (Note that I also need one species specific variable from the traitdf dataframe, which I have been attaching using 'merge.') When all are replaced, I simply 'rbind' plotdf.old and plotdf.del back together. I then delete another species, etc, etc. My guess is that there is a way to replace the lost individuals using a 'sample' that simply excludes the lost individuals (records). This would avoid splitting the data frame and 'rbind'ing it back together. If I could also inlcude a second variable from the 'sample'd records, this would eliminate the need for the 'merge'. I am running R2.0.0 on windows 2000. Simplified code is below. Any suggestions would be greatly appreciated. Thanks for your time, Dan plotdf=data.frame( tag=1:100, pspp=c(rep("Sp1",40),rep("Sp2",30),rep("Sp3",20),rep("Sp4",5),rep("Sp5",5)), dim=runif(100)*100) plotdf[1,] abun.table=as.data.frame(table(plotdf$pspp)) #2.1 calculate Smax (count of species) Smax=length(abun.table$Freq[abun.table$Freq>0]) Smax traitdf=data.frame( tspp=c("Sp1","Sp2","Sp3","Sp4","Sp5"), width=runif(5), abun=abun.table$Freq) traitdf[1,] rm(abun.table) #3. merge plotdf and traitdf plotdft=merge(plotdf, traitdf, by.x ="pspp", by.y="tspp") #4 define summary dataframe sumdf sumdf=data.frame(s.n=NA, s.S=NA, s.crop=NA) #reset all data to raw data. #b. calculate crop in plotdft with all species present plotdft$crop=plotdft$width*exp(-2.0+2.42*(log(plotdft$dim))) #c. sum crop sumcrop=sum(plotdft$crop) #d. write n, S, crop to sumdf sumdflength=length(sumdf$s.n) sumdf[sumdflength+1,1]=1; sumdf[sumdflength+1,2]=Smax; sumdf[sumdflength+1,3]=sumcrop; #6. SPECIES DELETION LOOP. This is the species deletion loop. #a. repeat from n=1:Smax-1 (S=Smax-n+1) for(n in 1:(Smax-1)) { S=Smax-n+1; #b. remove and replace one species #1. sample one species based on weight (e.g., abundance) #delsp = sample(traitdf$tspp, size=1);delsp delsp = sample(traitdf$tspp, size=1, prob=traitdf[,3]); #2. select traitdf records that match delsp traitdf.del = subset(traitdf, tspp==delsp);traitdf.del[1,] #3. and delete that species from trait data traitdf = subset(traitdf, tspp!=delsp[1]); #4. split that species from plot data into new df plotdf.old = subset(plotdf, plotdf$pspp!=delsp);plotdf.old[1,] plotdf.del = subset(plotdf, plotdf$pspp==delsp);plotdf.del[1,] #5. replace delsp params with params randomly selected from remaining spp: for (x in 1:length(plotdf.del$pspp)){ newsp = sample(plotdf.old$pspp, size=1);#print(newsp[1]) plotdf.del$pspp[x]=newsp[1] } #6. rbind plotdf and splitdf into plotdf, plotdf=rbind(plotdf.old,plotdf.del);plotdf[1,] #b. calculate standing crop,etc #1. merge plotdf and traitdf plotdft=merge(plotdf, traitdf, by.x ="pspp", by.y="tspp") #2. calculate crop in plotdft plotdft$crop=plotdft$width*exp(-2.0+2.42*log(plotdft$dim)) #3. sum crop sumcrop=sum(plotdft$crop) #4. calculate S abun.table=as.data.frame(table(plotdf$pspp)) S=length(abun.table$Freq[abun.table$Freq>0]) #c. write n, S, crop to sumdf sumdflength=length(sumdf$s.n) sumdf[sumdflength+1,1]=n+1; sumdf[sumdflength+1,2]=S; sumdf[sumdflength+1,3]=sumcrop; }#d. REPEAT SPECIES DELETION LOOP #housekeeping rm(delsp, plotdf, plotdf.del, plotdf.old, plotdft, traitdf.del) gc() #8. plot results, fit line print(sumdf) traitdf plot(sumdf$s.S, sumdf$s.crop) -- Daniel E. Bunker Associate Coordinator - BioMERGE Post-Doctoral Research Scientist Columbia University Department of Ecology, Evolution and Environmental Biology 1020 Schermerhorn Extension 1200 Amsterdam Avenue New York, NY 10027-5557 212-854-9881 212-854-8188 fax deb37 at columbia.edu
Gabor Grothendieck
2005-Feb-03 21:29 UTC
[R] Efficient selection and alteration of dataframe records
I did not attempt to follow your code or discussion but you could try these: 1. try to pin down what part of your code is taking the time 2. try to eliminate the loop, if possible 3. use matrices rather than data frames -- matrices are faster Daniel E. Bunker <deb37 <at> columbia.edu> writes: : : Hi All, : : I am writing a simulation that examines the effects of species : extinctions on ecological communties by sequentially removing : individuals of a given species (sometimes using weighted probabilities) : and replacing the lost individuals with species identities randomly : sampled from the remaining individuals. Thus I use two dataframes. One : contains all the individuals and their species identities (plotdf). The : other contains each species and their associated weights (traitdf). : : While I have code that works, it runs slowly. I suspect there is a more : efficient way. : : First, I 'sample' one species from the species file (traitdf), then I : use that result to 'subset' the individuals dataframe (plotdf) into two : new files: individuals of the extincted species (plotdf.del) and : retained individuals (plotdf.old). : : I then use a 'for' loop to run through each record in plotdf.del and : randomly sample a new species identity from plotdf.old. (Note that I : also need one species specific variable from the traitdf dataframe, : which I have been attaching using 'merge.') When all are replaced, I : simply 'rbind' plotdf.old and plotdf.del back together. I then delete : another species, etc, etc. : : My guess is that there is a way to replace the lost individuals using a : 'sample' that simply excludes the lost individuals (records). This : would avoid splitting the data frame and 'rbind'ing it back together. : If I could also inlcude a second variable from the 'sample'd records, : this would eliminate the need for the 'merge'. : : I am running R2.0.0 on windows 2000. : : Simplified code is below. : : Any suggestions would be greatly appreciated. : : Thanks for your time, Dan : : plotdf=data.frame( : tag=1:100, : : pspp=c(rep("Sp1",40),rep("Sp2",30),rep("Sp3",20),rep("Sp4",5),rep("Sp5",5)), : dim=runif(100)*100) : plotdf[1,] : : abun.table=as.data.frame(table(plotdf$pspp)) : : #2.1 calculate Smax (count of species) : Smax=length(abun.table$Freq[abun.table$Freq>0]) : Smax : : traitdf=data.frame( : tspp=c("Sp1","Sp2","Sp3","Sp4","Sp5"), : width=runif(5), : abun=abun.table$Freq) : traitdf[1,] : : rm(abun.table) : : #3. merge plotdf and traitdf : plotdft=merge(plotdf, traitdf, by.x ="pspp", by.y="tspp") : : #4 define summary dataframe sumdf : sumdf=data.frame(s.n=NA, s.S=NA, s.crop=NA) : : #reset all data to raw data. : #b. calculate crop in plotdft with all species present : plotdft$crop=plotdft$width*exp(-2.0+2.42*(log(plotdft$dim))) : #c. sum crop : sumcrop=sum(plotdft$crop) : #d. write n, S, crop to sumdf : sumdflength=length(sumdf$s.n) : sumdf[sumdflength+1,1]=1; : sumdf[sumdflength+1,2]=Smax; : sumdf[sumdflength+1,3]=sumcrop; : : #6. SPECIES DELETION LOOP. This is the species deletion loop. : #a. repeat from n=1:Smax-1 (S=Smax-n+1) : for(n in 1:(Smax-1)) { : S=Smax-n+1; : : #b. remove and replace one species : #1. sample one species based on weight (e.g., abundance) : #delsp = sample(traitdf$tspp, size=1);delsp : delsp = sample(traitdf$tspp, size=1, prob=traitdf[,3]); : : #2. select traitdf records that match delsp : traitdf.del = subset(traitdf, tspp==delsp);traitdf.del[1,] : : #3. and delete that species from trait data : traitdf = subset(traitdf, tspp!=delsp[1]); : : #4. split that species from plot data into new df : plotdf.old = subset(plotdf, plotdf$pspp!=delsp);plotdf.old[1,] : plotdf.del = subset(plotdf, plotdf$pspp==delsp);plotdf.del[1,] : : #5. replace delsp params with params randomly selected from : remaining spp: : for (x in 1:length(plotdf.del$pspp)){ : newsp = sample(plotdf.old$pspp, size=1);#print(newsp[1]) : plotdf.del$pspp[x]=newsp[1] : } : #6. rbind plotdf and splitdf into plotdf, : plotdf=rbind(plotdf.old,plotdf.del);plotdf[1,] : : #b. calculate standing crop,etc : #1. merge plotdf and traitdf : plotdft=merge(plotdf, traitdf, by.x ="pspp", by.y="tspp") : : #2. calculate crop in plotdft : plotdft$crop=plotdft$width*exp(-2.0+2.42*log(plotdft$dim)) : : #3. sum crop : sumcrop=sum(plotdft$crop) : : #4. calculate S : abun.table=as.data.frame(table(plotdf$pspp)) : S=length(abun.table$Freq[abun.table$Freq>0]) : : #c. write n, S, crop to sumdf : sumdflength=length(sumdf$s.n) : sumdf[sumdflength+1,1]=n+1; : sumdf[sumdflength+1,2]=S; : sumdf[sumdflength+1,3]=sumcrop; : }#d. REPEAT SPECIES DELETION LOOP : #housekeeping : rm(delsp, plotdf, plotdf.del, plotdf.old, plotdft, traitdf.del) : gc() : : #8. plot results, fit line : print(sumdf) : traitdf : : plot(sumdf$s.S, sumdf$s.crop) :