Thomas Lumley
2003-Apr-11 21:14 UTC
[R] Can I improve the efficiency of my scan() command?
On Sat, 12 Apr 2003, Ko-Kang Kevin Wang wrote:

> Hi,
>
> Suppose I use the following code to read in a data set.
>
> ###############################################
>
> rating <- scan("../Data/Rating.csv",
> +              what = list(
> +                usage = "",
> +                mileage = 0,
[...]
> +                minagen = 0,
> +                primagen = 0),
> +              sep = ",", quiet = TRUE, skip = 1)
>
> rating.df <- as.data.frame(rating)
> rating.df <- rating.df[, c(-6, -7, -22)]
> attach(rating.df)
> summary(rating.df)

<snip>

> #########################################################################
>
> It worked all right, but I'm just wondering if there is a more efficient
> way (it takes about 10 minutes to run the above script on my 300,000 x
> 25 CSV file)?

It should be quicker not to convert to a data frame.  You can just keep
the data as a list of vectors and lapply() the summary() function.

        -thomas
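A minimal sketch of that suggestion, assuming rating is the list
returned by the scan() call above (in which columns 6, 7, and 22 are
primage, minage, and record):

    ## Drop the three unwanted components by name, keeping a plain list:
    rating <- rating[setdiff(names(rating),
                             c("primage", "minage", "record"))]

    ## One summary per column, with no data-frame conversion:
    rating.summaries <- lapply(rating, summary)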
Ko-Kang Kevin Wang
2003-Apr-11 21:23 UTC
[R] Can I improve the efficiency of my scan() command?
Hi,

Suppose I use the following code to read in a data set.

###############################################
> rating <- scan("../Data/Rating.csv",
+                what = list(
+                  usage = "",
+                  mileage = 0,
+                  sex = "",
+                  excess = "",
+                  ncd = "",
+                  primage = "",
+                  minage = "",
+                  drivers = "",
+                  district = "",
+                  cargroup = "",
+                  car.age = 0,
+                  wsclms = "",
+                  adclms = "",
+                  ftclms = "",
+                  pdclms = "",
+                  piclms = "",
+                  adincur = 0,
+                  pdincur = 0,
+                  wsincur = 0,
+                  ftincur = 0,
+                  piincur = 0,
+                  record = 0,
+                  days = 0,
+                  minagen = 0,
+                  primagen = 0),
+                sep = ",", quiet = TRUE, skip = 1)
> rating.df <- as.data.frame(rating)
> rating.df <- rating.df[, c(-6, -7, -22)]
> attach(rating.df)
> summary(rating.df)
  usage          mileage        sex          excess         ncd        drivers
 S :125788   Min.   :  288   F: 82208   0  :  4744   0:   880   1:100791
 SB: 12581   1st Qu.: 5000   M:217792   100:161311   1:  2819   2:175100
 SC:161524   Median : 8000              75 :133945   2:  5245   3: 19146
 ST:   107   Mean   : 7640                           3:  5230   4:  4156
             3rd Qu.:10000                           4:285826   5:   515
             Max.   :40000                                      6:    69
                                                                7:   223
    district        cargroup         car.age         wsclms      adclms
 6      :59053   8      :44524   Min.   :-1.000   0:294521   0:292852
 5      :57113   6      :39171   1st Qu.: 4.000   1:  5267   1:  6720
 7      :51166   9      :38965   Median : 7.000   2:   201   2:   405
 4      :50643   7      :35139   Mean   : 7.234   3:    11   3:    23
 3      :33041   10     :31091   3rd Qu.:10.000
 8      :16437   5      :27456   Max.   :30.000
 (Other):32547   (Other):83654
 ftclms       pdclms      piclms        adincur             pdincur
 0:298661    :281056     :281056   Min.   :    0.00   Min.   : -4985.2
 1:  1316   0: 15277   0:  18131   1st Qu.:    0.00   1st Qu.:     0.0
 2:    22   1:  3587   1:    809   Median :    0.00   Median :     0.0
 3:     1   2:    79   2:      4   Mean   :   21.25   Mean   :   225.4
            3:     1               3rd Qu.:    0.00   3rd Qu.:     0.0
                                   Max.   :13779.55   Max.   : 25050.0
                                                      NA's   :281056.0
    wsincur           ftincur             piincur             days
 Min.   :   0.00   Min.   :    0.000   Min.   :     0.0   Min.   :  0.0
 1st Qu.:   0.00   1st Qu.:    0.000   1st Qu.:     0.0   1st Qu.:123.0
 Median :   0.00   Median :    0.000   Median :     0.0   Median :340.0
 Mean   :   2.07   Mean   :    5.183   Mean   :   345.8   Mean   :248.7
 3rd Qu.:   0.00   3rd Qu.:    0.000   3rd Qu.:     0.0   3rd Qu.:364.0
 Max.   :2004.64   Max.   :25082.910   Max.   :484550.1   Max.   :365.0
                                       NA's   :281056.0
    minagen         primagen
 Min.   :17.00   Min.   :17.00
 1st Qu.:41.00   1st Qu.:43.00
 Median :56.00   Median :53.00
 Mean   :63.81   Mean   :53.25
 3rd Qu.:99.00   3rd Qu.:64.00
 Max.   :99.00   Max.   :93.00
#########################################################################

It worked all right, but I'm just wondering if there is a more efficient
way (it takes about 10 minutes to run the above script on my 300,000 x
25 CSV file)?

For example, the CSV file has 25 columns, but I don't need 3 of them
(6, 7, and 22).  What I have done is to scan them in anyway, convert the
list into a data frame, and then remove the 3 columns.  I just wonder if
it is possible to simply ignore them in scan() to make the process
faster?

--
Cheers,

Kevin

------------------------------------------------------------------------------
/* Time is the greatest teacher, unfortunately it kills its students */

Ko-Kang Kevin Wang
Master of Science (MSc) Student
SLC Tutor and Lab Demonstrator
Department of Statistics
University of Auckland
New Zealand
Homepage: http://www.stat.auckland.ac.nz/~kwan022
Ph: 373-7599 x88475 (City)
             x88480 (Tamaki)
Pierre Kleiber
2003-Apr-11 22:07 UTC
[R] Can I improve the efficiency of my scan() command?
Ko-Kang Kevin Wang wrote:

> Hi,
>
> Suppose I use the following code to read in a data set.
>
> ###############################################
>
> rating <- scan("../Data/Rating.csv",
> +              what = list(
> +                usage = "",
> +                mileage = 0,
> +                sex = "",
> +                excess = "",
> +                ncd = "",
> +                primage = "",
> +                minage = "",
> +                drivers = "",
> +                district = "",
> +                cargroup = "",
> +                car.age = 0,
> +                wsclms = "",
[...]
>
> #########################################################################
>
> It worked all right, but I'm just wondering if there is a more efficient
> way (it takes about 10 minutes to run the above script on my 300,000 x
> 25 CSV file)?
>
> For example, the CSV file has 25 columns, but I don't need 3 of them
> (6, 7, and 22).  What I have done is to scan them in anyway, convert the
> list into a data frame, and then remove the 3 columns.  I just wonder if
> it is possible to simply ignore them in scan() to make the process
> faster?

It might not make a lot of difference in your case, where you are
reading many fields and want to ignore a few, but if you want to read a
few out of many, it would help to preprocess the input file using, for
example, awk, as in the following, which picks up fields 1, 2, and 4:

> con <- pipe("awk -F, '{print $1, $2, $4}' ../Data/Rating.csv")
> rating <- scan(con,
+                what = list(usage = "",
+                            mileage = 0,
+                            excess = ""),
+                quiet = TRUE, skip = 1)
> close(con)

I do this sort of thing a lot using various utilities, so I've defined
the following function to take care of opening and closing the
connection:

scanpipe <- function(x, ...) {
    con <- pipe(x)
    out <- scan(con, ...)
    close(con)
    out
}

--
-----------------------------------------------------------------
Pierre Kleiber                          Email: pkleiber at honlab.nmfs.hawaii.edu
Fishery Biologist                       Tel: 808 983-5399/737-7544
NOAA FISHERIES - Honolulu Laboratory    Fax: 808 983-2902
2570 Dole St., Honolulu, HI 96822-2396
-----------------------------------------------------------------
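With that helper, the awk example above becomes a one-liner (a sketch
reusing the same column names; note that awk's default output separator
is a space, so this assumes none of the kept fields contain embedded
spaces):

    rating <- scanpipe("awk -F, '{print $1, $2, $4}' ../Data/Rating.csv",
                       what = list(usage = "", mileage = 0, excess = ""),
                       quiet = TRUE, skip = 1)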
> From: Pierre Kleiber [mailto:pkleiber at honlab.nmfs.hawaii.edu]
>
> Ko-Kang Kevin Wang wrote:

[snipped]

> > It worked all right, but I'm just wondering if there is a more
> > efficient way (it takes about 10 minutes to run the above script on
> > my 300,000 x 25 CSV file)?
> >
> > For example, the CSV file has 25 columns, but I don't need 3 of them
> > (6, 7, and 22).  What I have done is to scan them in anyway, convert
> > the list into a data frame, and then remove the 3 columns.  I just
> > wonder if it is possible to simply ignore them in scan() to make the
> > process faster?
>
> It might not make a lot of difference in your case, where you are
> reading many fields and want to ignore a few, but if you want to read a
> few out of many, it would help to preprocess the input file using, for
> example, awk, as in the following, which picks up fields 1, 2, and 4:
>
> con <- pipe("awk -F, '{print $1, $2, $4}' ../Data/Rating.csv")
> rating <- scan(con,
>                what = list(usage = "", mileage = 0, excess = ""),
>                quiet = TRUE, skip = 1)
> close(con)

Or even pipe("cut -d, -f1-2,4 ...")

Andy

[...]
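Spelled out, the cut variant looks like this (a sketch for the same
three fields; unlike the awk version above, cut keeps the comma as the
output delimiter, so scan() still needs sep = ","):

    con <- pipe("cut -d, -f1-2,4 ../Data/Rating.csv")
    rating <- scan(con,
                   what = list(usage = "", mileage = 0, excess = ""),
                   sep = ",", quiet = TRUE, skip = 1)
    close(con)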
Prof Brian Ripley
2003-Apr-12 07:14 UTC
[R] Can I improve the efficiency of my scan() command?
On Sat, 12 Apr 2003, Ko-Kang Kevin Wang wrote:

[...]

> For example, the CSV file has 25 columns, but I don't need 3 of them
> (6, 7, and 22).  What I have done is to scan them in anyway, convert
> the list into a data frame, and then remove the 3 columns.  I just
> wonder if it is possible to simply ignore them in scan() to make the
> process faster?

Yes: see the help page:

     If any of the types is `NULL', the corresponding field is
     skipped (but a `NULL' component appears in the result).

If you don't need a data frame, don't do the conversion.  You might
well find that read.table with colClasses set is faster than converting
with as.data.frame.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
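Applied to the file from the original post, both suggestions look
roughly like this (a sketch: the NULL entries skip fields 6, 7, and 22,
and the colClasses vector assumes the same character/numeric split as
the what list above, with the character columns read as factors):

    ## scan() with NULL entries in `what': the skipped fields still
    ## leave NULL components in the result, so drop those afterwards.
    rating <- scan("../Data/Rating.csv",
                   what = list(usage = "", mileage = 0, sex = "",
                               excess = "", ncd = "", NULL, NULL,
                               drivers = "", district = "",
                               cargroup = "", car.age = 0, wsclms = "",
                               adclms = "", ftclms = "", pdclms = "",
                               piclms = "", adincur = 0, pdincur = 0,
                               wsincur = 0, ftincur = 0, piincur = 0,
                               NULL, days = 0, minagen = 0,
                               primagen = 0),
                   sep = ",", quiet = TRUE, skip = 1)
    rating <- rating[!sapply(rating, is.null)]

    ## read.csv() with colClasses: "NULL" drops a column outright, and
    ## declaring classes up front avoids the type-guessing pass.
    classes <- rep("factor", 25)
    classes[c(2, 11, 17:21, 23:25)] <- "numeric"  # columns read as 0 above
    classes[c(6, 7, 22)] <- "NULL"                # skip these entirely
    rating.df <- read.csv("../Data/Rating.csv", colClasses = classes)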