Hello
I've got a large CSV file (>500 MB) with statistical data. It's divided into
12 columns and I don't know how many lines.
The first column is the date and the second is a unique code for the
location; the rest is, let's say, different weather data. See the example
below.
070704, 25, --,--,--,temperature, 22, --,--,30, 20,Y
070705, 25, --,--,--,temperature, 22, --,--,30, 20,Y
070705, 25, --,--,--,pressure, 1200, --,--,1000, 1100,N
070705, 26, --,--,--,temperature, 22, --,--,30, 20,Y
First I tried data <- read.csv(...), and of course the memory filled up.
Then I found in the archives that you can use scan(). So I wrote the
lines below to search for location codes and store one location, with
all its different data, in one variable.
# collect the different pnc's
b <- 2                # next free slot; compare from the second number
alike <- TRUE         # flag: has this code been seen before?
stored <- 910286609   # the first location code is known
for (i in 1:100) {    # start counting and scanning
  data_final <- matrix(unlist(scan(
      "C:/Documents and Settings/modiglar/Desktop/temp/et.csv",
      sep = ",",
      what = as.list(rep("", 12)),  # 12 character fields
      skip = i,
      n = 12)), ncol = 12, byrow = TRUE)
  a <- 1              # compare against the stored codes, first to last
  while (a < b) {
    if (as.numeric(data_final[2]) != stored[a]) {  # no match yet
      a <- a + 1
      alike <- FALSE
    } else {
      alike <- TRUE
      break
    }
  }
  if (!alike) {
    stored[b] <- as.numeric(data_final[2])  # store the new code
    b <- b + 1
  }
}
#------------------------------------------------------------
# save 1 pnc at a time
d <- 1
saved_data <- 1:1200; dim(saved_data) <- c(12, 100)  # 12 x 100 result matrix
save_data_nr <- 1    # index of the stored code to extract
for (i in 1:100) {   # start counting and scanning
  data_final <- matrix(unlist(scan(
      "C:/Documents and Settings/modiglar/Desktop/temp/et.csv",
      sep = ",",
      what = as.list(rep("", 12)),  # 12 character fields
      skip = i,
      n = 12)), ncol = 12, byrow = TRUE)
  if (as.numeric(data_final[2]) == stored[save_data_nr]) {  # compare
    saved_data[, d] <- data_final  # store the matching row
    d <- d + 1
  }
}
As you can see, I'm not so familiar with R, so I have probably done this
the wrong way.
As I understand it, each call to scan() opens the file, counts down to the
line that should be read, reads it, and closes the file again. Since every
call rescans from the start, reading N lines this way takes time roughly
proportional to N squared, so by the time I reach line 10000 it starts to
take a long time. I let the computer run overnight, but it was still far
from finished when I stopped the loop.
So how should I do this? Maybe I also need to sort on the date; the file is
hopefully already in date order, so it should be possible to cut it every
time a new month starts, but that will also take time if I do it this way.
Thank you for your help in advance.
Lars
Hi Lars,

I haven't tried this, but I believe there were a couple of messages on the
list recently about reading large files that basically used scan() with
connections, reading in by blocks. See ?scan and ?connections.

HTH,
steve
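Here is a minimal, untested sketch of that block-reading approach, reusing
the file path and 12 character columns from Lars's code; the block size and
the example location code "25" are placeholders to adjust:

con <- file("C:/Documents and Settings/modiglar/Desktop/temp/et.csv", open = "r")
block_size <- 1000    # lines per scan() call; tune to available memory
stored <- numeric(0)  # all distinct location codes seen so far
kept <- NULL          # rows for one location of interest, e.g. code 25
repeat {
  block <- scan(con, sep = ",", what = as.list(rep("", 12)),
                nlines = block_size, strip.white = TRUE, quiet = TRUE)
  if (length(block[[1]]) == 0) break        # end of file reached
  dat <- matrix(unlist(block), ncol = 12)   # one column per CSV field
  stored <- unique(c(stored, as.numeric(dat[, 2])))
  kept <- rbind(kept, dat[dat[, 2] == "25", , drop = FALSE])
}
close(con)

Because the connection stays open between calls, each scan() continues from
where the previous block ended instead of re-reading the file from the
start, so the whole file is passed over exactly once and both loops above
collapse into a single pass.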