Hello
I've got a large CSV file (>500 MB) with statistical data. It's divided into
12 columns and I don't know how many lines.
The first column is the date and the second is a unique code for the
location; the rest is, let's say, different weather data. See the example
below.
070704, 25, --,--,--,temperature, 22, --,--,30, 20,Y
070705, 25, --,--,--,temperature, 22, --,--,30, 20,Y
070705, 25, --,--,--,pressure, 1200, --,--,1000, 1100,N
070705, 26, --,--,--,temperature, 22, --,--,30, 20,Y
First I tried data <- read.csv(...), and of course the memory filled up.
Then I found in the archives that you can use scan(). So I wrote the
lines below to search for location codes and store one location, with
all its different data, in one variable.
# collect the different pnc's
b <- 2                # next free slot; compare from the second number
alike <- TRUE         # flag: has this code been seen before?
stored <- 910286609   # the first location code is known
for (i in 1:100) {    # start counting and scanning
  data_final <- matrix(unlist(scan(
      "C:/Documents and Settings/modiglar/Desktop/temp/et.csv",
      sep = ",",
      what = as.list(rep("", 12)),  # 12 character fields
      skip = i,
      n = 12)), ncol = 12, byrow = TRUE)
  a <- 1              # compare against the stored codes, first to last
  while (a < b) {
    if (as.numeric(data_final[2]) != stored[a]) {  # no match yet
      a <- a + 1
      alike <- FALSE
    } else {
      alike <- TRUE
      break
    }
  }
  if (!alike) {
    stored[b] <- as.numeric(data_final[2])  # store the new code
    b <- b + 1
  }
}
#------------------------------------------------------------
# save 1 pnc at a time
d <- 1
saved_data <- 1:1200; dim(saved_data) <- c(12, 100)  # 12 x 100 result matrix
save_data_nr <- 1    # index of the stored code to extract
for (i in 1:100) {   # start counting and scanning
  data_final <- matrix(unlist(scan(
      "C:/Documents and Settings/modiglar/Desktop/temp/et.csv",
      sep = ",",
      what = as.list(rep("", 12)),  # 12 character fields
      skip = i,
      n = 12)), ncol = 12, byrow = TRUE)
  if (as.numeric(data_final[2]) == stored[save_data_nr]) {  # compare
    saved_data[, d] <- data_final  # store the matching row
    d <- d + 1
  }
}
As you can see, I'm not so familiar with R, so I have probably done this
the wrong way.
As I understand it, each call to scan() opens the file, counts down to the
line that should be read, reads it, and closes the file again. Since every
call rescans from the start, reading N lines this way takes time roughly
proportional to N squared, so by the time I reach line 10000 it starts to
take a long time. I let the computer run overnight, but it was still far
from finished when I stopped the loop.
So how should I do this? Maybe I also need to sort on the date; the file is
hopefully already in date order, so it should be possible to cut it every
time a new month starts, but that will also take time if I do it this way.
Thank you for your help in advance.
Lars
Hi Lars,

I haven't tried this, but I believe there were a couple of messages on the
list recently about reading large files that basically used scan() with
connections, reading in by blocks. See ?scan and ?connections.

HTH,
steve
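Here is a minimal, untested sketch of that block-reading approach, reusing
the file path and 12 character columns from Lars's code; the block size and
the example location code "25" are placeholders to adjust:

con <- file("C:/Documents and Settings/modiglar/Desktop/temp/et.csv", open = "r")
block_size <- 1000    # lines per scan() call; tune to available memory
stored <- numeric(0)  # all distinct location codes seen so far
kept <- NULL          # rows for one location of interest, e.g. code 25
repeat {
  block <- scan(con, sep = ",", what = as.list(rep("", 12)),
                nlines = block_size, strip.white = TRUE, quiet = TRUE)
  if (length(block[[1]]) == 0) break        # end of file reached
  dat <- matrix(unlist(block), ncol = 12)   # one column per CSV field
  stored <- unique(c(stored, as.numeric(dat[, 2])))
  kept <- rbind(kept, dat[dat[, 2] == "25", , drop = FALSE])
}
close(con)

Because the connection stays open between calls, each scan() continues from
where the previous block ended instead of re-reading the file from the
start, so the whole file is passed over exactly once and both loops above
collapse into a single pass.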