thr3ads.net - R help - [R] Efficient selection and alteration of dataframe records [Feb 2005]

Daniel E. Bunker

2005-Feb-03 16:42 UTC

[R] Efficient selection and alteration of dataframe records

Hi All,

I am writing a simulation that examines the effects of species 
extinctions on ecological communities by sequentially removing 
individuals of a given species (sometimes using weighted probabilities) 
and replacing the lost individuals with species identities randomly 
sampled from the remaining individuals. To do this I use two data frames. 
One contains all the individuals and their species identities (plotdf). 
The other contains each species and its associated weight (traitdf).

While I have code that works, it runs slowly.  I suspect there is a more 
efficient way.

First, I 'sample' one species from the species data frame (traitdf), then 
I use that result to 'subset' the individuals data frame (plotdf) into two 
new data frames: individuals of the extinct species (plotdf.del) and 
retained individuals (plotdf.old).

I then use a 'for' loop to run through each record in plotdf.del and 
randomly sample a new species identity from plotdf.old.  (Note that I 
also need one species specific variable from the traitdf dataframe, 
which I have been attaching using 'merge.') When all are replaced, I 
simply 'rbind' plotdf.old and plotdf.del back together.  I then delete 
another species, etc, etc.

My guess is that there is a way to replace the lost individuals using a 
'sample' that simply excludes the lost individuals (records). This 
would avoid splitting the data frame and 'rbind'ing it back together. 
If I could also include a second variable from the 'sample'd records, 
this would eliminate the need for the 'merge'.
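
The idea in the paragraph above can be sketched roughly as follows. This is a minimal illustration, not the poster's code: the helper name replace_species is hypothetical, the data are the toy data from the post rebuilt with character columns, and match() is swapped in for merge() as one way to pull the species-level width onto each individual. It assumes at least one individual survives the deletion.

```r
# Hypothetical helper (not in the original post): overwrite every record of
# species `delsp` in place with identities drawn from the surviving records.
replace_species <- function(pspp, delsp) {
  lost <- pspp == delsp                 # logical index of records to overwrite
  survivors <- pspp[!lost]              # pool of identities to draw from
  # Draw one replacement per lost record, with replacement, in a single call.
  # Assumes length(survivors) > 0.
  draws <- sample.int(length(survivors), size = sum(lost), replace = TRUE)
  pspp[lost] <- survivors[draws]
  pspp
}

set.seed(1)
plotdf <- data.frame(
  tag  = 1:100,
  pspp = rep(c("Sp1", "Sp2", "Sp3", "Sp4", "Sp5"), times = c(40, 30, 20, 5, 5)),
  dim  = runif(100) * 100,
  stringsAsFactors = FALSE)
traitdf <- data.frame(
  tspp  = c("Sp1", "Sp2", "Sp3", "Sp4", "Sp5"),
  width = runif(5),
  stringsAsFactors = FALSE)

# No subset() / rbind(): the species column is edited in place.
plotdf$pspp <- replace_species(plotdf$pspp, "Sp2")
traitdf <- traitdf[traitdf$tspp != "Sp2", ]

# No merge(): match() gives, for each individual, its row in traitdf.
width   <- traitdf$width[match(plotdf$pspp, traitdf$tspp)]
sumcrop <- sum(width * exp(-2.0 + 2.42 * log(plotdf$dim)))
```

Because the loop in the post draws each replacement independently from an unchanged plotdf.old, a single sample call with replace = TRUE should give draws from the same distribution. Row order is also preserved here, which the subset/rbind approach does not do.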

I am running R 2.0.0 on Windows 2000.

Simplified code is below.

Any suggestions would be greatly appreciated.

Thanks for your time, Dan

plotdf=data.frame(
    tag=1:100,
    pspp=c(rep("Sp1",40),rep("Sp2",30),rep("Sp3",20),rep("Sp4",5),rep("Sp5",5)),
    dim=runif(100)*100)
plotdf[1,]

abun.table=as.data.frame(table(plotdf$pspp))

#2.1 calculate Smax (count of species)
Smax=length(abun.table$Freq[abun.table$Freq>0])
Smax

traitdf=data.frame(
    tspp=c("Sp1","Sp2","Sp3","Sp4","Sp5"),
    width=runif(5),
    abun=abun.table$Freq)
traitdf[1,]

rm(abun.table)
   
#3. merge plotdf and traitdf
plotdft=merge(plotdf, traitdf, by.x ="pspp", by.y="tspp")

#4 define summary dataframe sumdf
sumdf=data.frame(s.n=NA, s.S=NA, s.crop=NA)

    #reset all data to raw data.
    #b. calculate crop in plotdft with all species present
    plotdft$crop=plotdft$width*exp(-2.0+2.42*(log(plotdft$dim)))
    #c. sum crop
    sumcrop=sum(plotdft$crop)
    #d. write n, S, crop to sumdf
    sumdflength=length(sumdf$s.n)       
    sumdf[sumdflength+1,1]=1;
    sumdf[sumdflength+1,2]=Smax;
    sumdf[sumdflength+1,3]=sumcrop;

    #6. SPECIES DELETION LOOP. This is the species deletion loop.
    #a. repeat from n=1:Smax-1 (S=Smax-n+1)
    for(n in 1:(Smax-1)) {
        S=Smax-n+1;

        #b. remove and replace one species
        #1. sample one species based on weight (e.g., abundance)
        #delsp = sample(traitdf$tspp, size=1);delsp
        delsp = sample(traitdf$tspp, size=1, prob=traitdf[,3]);

        #2. select traitdf records that match delsp
        traitdf.del = subset(traitdf, tspp==delsp);traitdf.del[1,]

        #3. and delete that species from trait data
        traitdf = subset(traitdf, tspp!=delsp[1]);

        #4. split that species from plot data into new df
        plotdf.old = subset(plotdf, plotdf$pspp!=delsp);plotdf.old[1,]
        plotdf.del = subset(plotdf, plotdf$pspp==delsp);plotdf.del[1,]

            #5. replace delsp params with params randomly selected from remaining spp:
            for (x in 1:length(plotdf.del$pspp)){
                newsp = sample(plotdf.old$pspp, size=1);#print(newsp[1])
                plotdf.del$pspp[x]=newsp[1]
            }
            #6. rbind plotdf and splitdf into plotdf,
            plotdf=rbind(plotdf.old,plotdf.del);plotdf[1,]

    #b. calculate standing crop,etc
        #1. merge plotdf and traitdf
        plotdft=merge(plotdf, traitdf, by.x="pspp", by.y="tspp")

        #2. calculate crop in plotdft
        plotdft$crop=plotdft$width*exp(-2.0+2.42*log(plotdft$dim))

        #3. sum crop
        sumcrop=sum(plotdft$crop)

        #4. calculate S
        abun.table=as.data.frame(table(plotdf$pspp))
        S=length(abun.table$Freq[abun.table$Freq>0])

    #c. write  n, S, crop to sumdf
        sumdflength=length(sumdf$s.n)
        sumdf[sumdflength+1,1]=n+1;
        sumdf[sumdflength+1,2]=S;
        sumdf[sumdflength+1,3]=sumcrop;   
    }#d. REPEAT SPECIES DELETION LOOP
    #housekeeping
    rm(delsp, plotdf, plotdf.del, plotdf.old, plotdft, traitdf.del)
    gc()

#8. plot results, fit line
print(sumdf)
traitdf

plot(sumdf$s.S, sumdf$s.crop)






-- 

Daniel E. Bunker
Associate Coordinator - BioMERGE
Post-Doctoral Research Scientist
Columbia University
Department of Ecology, Evolution and Environmental Biology
1020 Schermerhorn Extension
1200 Amsterdam Avenue
New York, NY 10027-5557

212-854-9881
212-854-8188 fax
deb37 at columbia.edu

Gabor Grothendieck

2005-Feb-03 21:29 UTC

[R] Efficient selection and alteration of dataframe records

I did not attempt to follow your code or discussion but you could
try these:

1. try to pin down what part of your code is taking the time
2. try to eliminate the loop, if possible
3. use matrices rather than data frames -- matrices are faster
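
As a rough illustration of points 1 and 2, the sketch below times a record-by-record replacement against a single vectorised sample call using system.time(). The data are synthetic and the vector size n is an arbitrary assumption, so the timings only indicate the general pattern on a given machine. Rprof() is the built-in profiler if a finer breakdown of a whole script is needed.

```r
set.seed(42)
n    <- 20000                                   # arbitrary size, for illustration only
pspp <- sample(c("Sp1", "Sp2", "Sp3", "Sp4", "Sp5"), n, replace = TRUE)
lost <- pspp == "Sp1"                           # records of the deleted species
survivors <- pspp[!lost]

# Loop version: one sample() call per lost record, as in the original post.
loop_time <- system.time({
  out_loop <- pspp
  for (i in which(lost)) out_loop[i] <- sample(survivors, size = 1)
})

# Vectorised version: all replacements drawn in one call.
vec_time <- system.time({
  out_vec <- pspp
  out_vec[lost] <- sample(survivors, size = sum(lost), replace = TRUE)
})

# Compare user, system and elapsed times side by side.
print(rbind(loop = loop_time[1:3], vectorised = vec_time[1:3]))
```

Both versions work on a plain character vector rather than a data frame column, which is in the spirit of point 3: indexing an atomic vector avoids the overhead of data frame assignment inside a loop.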


