Thomas Lumley
2003-Apr-11 21:14 UTC
[R] Can I improve the efficiency of my scan() command?
On Sat, 12 Apr 2003, Ko-Kang Kevin Wang wrote:

> Hi,
>
> Suppose I use the following code to read in a data set.
>
> ###############################################
>
> rating <- scan("../Data/Rating.csv",
> +              what = list(
> +                usage = "",
> +                mileage = 0,
[...]
> +                minagen = 0,
> +                primagen = 0),
> +              sep = ",", quiet = TRUE, skip = 1)
>
> rating.df <- as.data.frame(rating)
> rating.df <- rating.df[, c(-6, -7, -22)]
> attach(rating.df)
> summary(rating.df)

<snip>

> #########################################################################
>
> It worked all right, but I'm just wondering if there is a more efficient
> way (it takes about 10 minutes to run the above script on my 300,000 x
> 25 CSV file)?

It should be quicker not to convert to a data frame.  You can just keep
the data as a list of vectors and lapply() the summary() function.

        -thomas
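A minimal sketch of that suggestion, assuming rating is the list
returned by the scan() call above (in which columns 6, 7, and 22 are
primage, minage, and record):

    ## Drop the three unwanted components by name, keeping a plain list:
    rating <- rating[setdiff(names(rating),
                             c("primage", "minage", "record"))]

    ## One summary per column, with no data-frame conversion:
    rating.summaries <- lapply(rating, summary)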
Ko-Kang Kevin Wang
2003-Apr-11 21:23 UTC
[R] Can I improve the efficiency of my scan() command?
Hi,

Suppose I use the following code to read in a data set.

###############################################
> rating <- scan("../Data/Rating.csv",
+                what = list(
+                  usage = "",
+                  mileage = 0,
+                  sex = "",
+                  excess = "",
+                  ncd = "",
+                  primage = "",
+                  minage = "",
+                  drivers = "",
+                  district = "",
+                  cargroup = "",
+                  car.age = 0,
+                  wsclms = "",
+                  adclms = "",
+                  ftclms = "",
+                  pdclms = "",
+                  piclms = "",
+                  adincur = 0,
+                  pdincur = 0,
+                  wsincur = 0,
+                  ftincur = 0,
+                  piincur = 0,
+                  record = 0,
+                  days = 0,
+                  minagen = 0,
+                  primagen = 0),
+                sep = ",", quiet = TRUE, skip = 1)
> rating.df <- as.data.frame(rating)
> rating.df <- rating.df[, c(-6, -7, -22)]
> attach(rating.df)
> summary(rating.df)
  usage          mileage        sex          excess         ncd        drivers
 S :125788   Min.   :  288   F: 82208   0  :  4744   0:   880   1:100791
 SB: 12581   1st Qu.: 5000   M:217792   100:161311   1:  2819   2:175100
 SC:161524   Median : 8000              75 :133945   2:  5245   3: 19146
 ST:   107   Mean   : 7640                           3:  5230   4:  4156
             3rd Qu.:10000                           4:285826   5:   515
             Max.   :40000                                      6:    69
                                                                7:   223
    district        cargroup         car.age         wsclms      adclms
 6      :59053   8      :44524   Min.   :-1.000   0:294521   0:292852
 5      :57113   6      :39171   1st Qu.: 4.000   1:  5267   1:  6720
 7      :51166   9      :38965   Median : 7.000   2:   201   2:   405
 4      :50643   7      :35139   Mean   : 7.234   3:    11   3:    23
 3      :33041   10     :31091   3rd Qu.:10.000
 8      :16437   5      :27456   Max.   :30.000
 (Other):32547   (Other):83654
 ftclms       pdclms      piclms        adincur             pdincur
 0:298661    :281056     :281056   Min.   :    0.00   Min.   : -4985.2
 1:  1316   0: 15277   0:  18131   1st Qu.:    0.00   1st Qu.:     0.0
 2:    22   1:  3587   1:    809   Median :    0.00   Median :     0.0
 3:     1   2:    79   2:      4   Mean   :   21.25   Mean   :   225.4
            3:     1               3rd Qu.:    0.00   3rd Qu.:     0.0
                                   Max.   :13779.55   Max.   : 25050.0
                                                      NA's   :281056.0
    wsincur           ftincur             piincur             days
 Min.   :   0.00   Min.   :    0.000   Min.   :     0.0   Min.   :  0.0
 1st Qu.:   0.00   1st Qu.:    0.000   1st Qu.:     0.0   1st Qu.:123.0
 Median :   0.00   Median :    0.000   Median :     0.0   Median :340.0
 Mean   :   2.07   Mean   :    5.183   Mean   :   345.8   Mean   :248.7
 3rd Qu.:   0.00   3rd Qu.:    0.000   3rd Qu.:     0.0   3rd Qu.:364.0
 Max.   :2004.64   Max.   :25082.910   Max.   :484550.1   Max.   :365.0
                                       NA's   :281056.0
    minagen         primagen
 Min.   :17.00   Min.   :17.00
 1st Qu.:41.00   1st Qu.:43.00
 Median :56.00   Median :53.00
 Mean   :63.81   Mean   :53.25
 3rd Qu.:99.00   3rd Qu.:64.00
 Max.   :99.00   Max.   :93.00
#########################################################################

It worked all right, but I'm just wondering if there is a more efficient
way (it takes about 10 minutes to run the above script on my 300,000 x
25 CSV file)?

For example, the CSV file has 25 columns, but I don't need 3 of them
(6, 7, and 22).  What I have done is to scan them in anyway, convert the
list into a data frame, and then remove the 3 columns.  I just wonder if
it is possible to simply ignore them in scan() to make the process
faster?

--
Cheers,

Kevin

------------------------------------------------------------------------------
/* Time is the greatest teacher, unfortunately it kills its students */

Ko-Kang Kevin Wang
Master of Science (MSc) Student
SLC Tutor and Lab Demonstrator
Department of Statistics
University of Auckland
New Zealand
Homepage: http://www.stat.auckland.ac.nz/~kwan022
Ph: 373-7599 x88475 (City)
             x88480 (Tamaki)
Pierre Kleiber
2003-Apr-11 22:07 UTC
[R] Can I improve the efficiency of my scan() command?
Ko-Kang Kevin Wang wrote:

> Hi,
>
> Suppose I use the following code to read in a data set.
>
> ###############################################
>
> rating <- scan("../Data/Rating.csv",
> +              what = list(
> +                usage = "",
> +                mileage = 0,
> +                sex = "",
> +                excess = "",
> +                ncd = "",
> +                primage = "",
> +                minage = "",
> +                drivers = "",
> +                district = "",
> +                cargroup = "",
> +                car.age = 0,
> +                wsclms = "",
[...]
>
> #########################################################################
>
> It worked all right, but I'm just wondering if there is a more efficient
> way (it takes about 10 minutes to run the above script on my 300,000 x
> 25 CSV file)?
>
> For example, the CSV file has 25 columns, but I don't need 3 of them
> (6, 7, and 22).  What I have done is to scan them in anyway, convert the
> list into a data frame, and then remove the 3 columns.  I just wonder if
> it is possible to simply ignore them in scan() to make the process
> faster?

It might not make a lot of difference in your case, where you are
reading many fields and want to ignore a few, but if you want to read a
few out of many, it would help to preprocess the input file using, for
example, awk, as in the following, which picks up fields 1, 2, and 4:

> con <- pipe("awk -F, '{print $1, $2, $4}' ../Data/Rating.csv")
> rating <- scan(con,
+                what = list(usage = "",
+                            mileage = 0,
+                            excess = ""),
+                quiet = TRUE, skip = 1)
> close(con)

I do this sort of thing a lot using various utilities, so I've defined
the following function to take care of opening and closing the
connection:

scanpipe <- function(x, ...) {
    con <- pipe(x)
    out <- scan(con, ...)
    close(con)
    out
}

--
-----------------------------------------------------------------
Pierre Kleiber                          Email: pkleiber at honlab.nmfs.hawaii.edu
Fishery Biologist                       Tel: 808 983-5399/737-7544
NOAA FISHERIES - Honolulu Laboratory    Fax: 808 983-2902
2570 Dole St., Honolulu, HI 96822-2396
-----------------------------------------------------------------
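With that helper, the awk example above becomes a one-liner (a sketch
reusing the same column names; note that awk's default output separator
is a space, so this assumes none of the kept fields contain embedded
spaces):

    rating <- scanpipe("awk -F, '{print $1, $2, $4}' ../Data/Rating.csv",
                       what = list(usage = "", mileage = 0, excess = ""),
                       quiet = TRUE, skip = 1)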
> From: Pierre Kleiber [mailto:pkleiber at honlab.nmfs.hawaii.edu]
>
> Ko-Kang Kevin Wang wrote:

[snipped]

> > It worked all right, but I'm just wondering if there is a more
> > efficient way (it takes about 10 minutes to run the above script on
> > my 300,000 x 25 CSV file)?
> >
> > For example, the CSV file has 25 columns, but I don't need 3 of them
> > (6, 7, and 22).  What I have done is to scan them in anyway, convert
> > the list into a data frame, and then remove the 3 columns.  I just
> > wonder if it is possible to simply ignore them in scan() to make the
> > process faster?
>
> It might not make a lot of difference in your case, where you are
> reading many fields and want to ignore a few, but if you want to read a
> few out of many, it would help to preprocess the input file using, for
> example, awk, as in the following, which picks up fields 1, 2, and 4:
>
> con <- pipe("awk -F, '{print $1, $2, $4}' ../Data/Rating.csv")
> rating <- scan(con,
>                what = list(usage = "", mileage = 0, excess = ""),
>                quiet = TRUE, skip = 1)
> close(con)

Or even pipe("cut -d, -f1-2,4 ...")

Andy

[...]
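Spelled out, the cut variant looks like this (a sketch for the same
three fields; unlike the awk version above, cut keeps the comma as the
output delimiter, so scan() still needs sep = ","):

    con <- pipe("cut -d, -f1-2,4 ../Data/Rating.csv")
    rating <- scan(con,
                   what = list(usage = "", mileage = 0, excess = ""),
                   sep = ",", quiet = TRUE, skip = 1)
    close(con)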
Prof Brian Ripley
2003-Apr-12 07:14 UTC
[R] Can I improve the efficiency of my scan() command?
On Sat, 12 Apr 2003, Ko-Kang Kevin Wang wrote:

[...]

> For example, the CSV file has 25 columns, but I don't need 3 of them
> (6, 7, and 22).  What I have done is to scan them in anyway, convert
> the list into a data frame, and then remove the 3 columns.  I just
> wonder if it is possible to simply ignore them in scan() to make the
> process faster?

Yes: see the help page:

     If any of the types is `NULL', the corresponding field is
     skipped (but a `NULL' component appears in the result).

If you don't need a data frame, don't do the conversion.  You might
well find that read.table with colClasses set is faster than converting
with as.data.frame.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
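Applied to the file from the original post, both suggestions look
roughly like this (a sketch: the NULL entries skip fields 6, 7, and 22,
and the colClasses vector assumes the same character/numeric split as
the what list above, with the character columns read as factors):

    ## scan() with NULL entries in `what': the skipped fields still
    ## leave NULL components in the result, so drop those afterwards.
    rating <- scan("../Data/Rating.csv",
                   what = list(usage = "", mileage = 0, sex = "",
                               excess = "", ncd = "", NULL, NULL,
                               drivers = "", district = "",
                               cargroup = "", car.age = 0, wsclms = "",
                               adclms = "", ftclms = "", pdclms = "",
                               piclms = "", adincur = 0, pdincur = 0,
                               wsincur = 0, ftincur = 0, piincur = 0,
                               NULL, days = 0, minagen = 0,
                               primagen = 0),
                   sep = ",", quiet = TRUE, skip = 1)
    rating <- rating[!sapply(rating, is.null)]

    ## read.csv() with colClasses: "NULL" drops a column outright, and
    ## declaring classes up front avoids the type-guessing pass.
    classes <- rep("factor", 25)
    classes[c(2, 11, 17:21, 23:25)] <- "numeric"  # columns read as 0 above
    classes[c(6, 7, 22)] <- "NULL"                # skip these entirely
    rating.df <- read.csv("../Data/Rating.csv", colClasses = classes)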