Dear all,

I have a repository file (let's call it repo.txt) that contains two columns, like this:

# tag value
AAA 0.2
AAT 0.3
AAC 0.02
AAG 0.02
ATA 0.3
ATT 0.7

Given another query vector

> qr <- c("AAC", "ATT")

I would like to find the corresponding value for each query above, yielding:

0.02
0.7

However, I want to avoid slurping the whole of repo.txt into an object (e.g. a hash). Is there any way to do that?

The reason I want to do that is that repo.txt is very, very large (millions of lines, with tag length > 30 bp), and my PC's memory is too small to hold it.

- Gundala Viswanath
Jakarta - Indonesia
On Fri, 2009-01-16 at 18:02 +0900, Gundala Viswanath wrote:
> Dear all,
>
> I have a repository file (let's call it repo.txt)
> that contains two columns like this:
> [quoted text trimmed]

Hello,

You can always store your repo.txt in a database, say, SQLite, and select only the values you want via an SQL query. That way you avoid loading the full file into memory.

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com
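A minimal sketch of what Carlos suggests, assuming the RSQLite package is installed; the database file name, table name, column names, and batch size are all illustrative, and the one-time import is done in batches so the whole file never has to sit in RAM:

```r
library(DBI)
library(RSQLite)

# Open (or create) an on-disk SQLite database.
con <- dbConnect(SQLite(), "repo.db")

# One-time import: read repo.txt in batches and append each batch.
inp <- file("repo.txt", open = "rt")
repeat {
  lines <- readLines(inp, n = 100000L)
  if (length(lines) == 0L) break
  lines <- lines[!grepl("^#", lines)]          # skip the "# tag value" header
  parts <- strsplit(lines, "[[:space:]]+")
  chunk <- data.frame(tag   = vapply(parts, `[`, "", 1L),
                      value = as.numeric(vapply(parts, `[`, "", 2L)))
  dbWriteTable(con, "repo", chunk, append = TRUE)
}
close(inp)

# An index makes repeated lookups by tag fast.
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_tag ON repo (tag)")

# Only the matching rows ever come back into R.
qr  <- c("AAC", "ATT")
sql <- sprintf("SELECT value FROM repo WHERE tag IN (%s)",
               paste(sprintf("'%s'", qr), collapse = ", "))
dbGetQuery(con, sql)$value

dbDisconnect(con)
```

The import cost is paid once; after that, every query touches the disk index rather than R's memory.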
You might try to iteratively read a limited number of lines in batches using readLines:

# filename: the name of your file
# n: the maximal count of lines to read in a batch
connection = file(filename, open = "rt")
while (length(lines <- readLines(con = connection, n = n))) {
    # do your stuff here
}
close(connection)

?file
?readLines

vQ

Gundala Viswanath wrote:
> Dear all,
>
> I have a repository file (let's call it repo.txt)
> that contains two columns like this:
> [quoted text trimmed]
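Filling in the "do your stuff here" part for the original problem, a sketch might look like this (the function name and the batch size are illustrative):

```r
# Scan a whitespace-separated "tag value" file in batches and return the
# value for each queried tag, without ever holding the whole file in RAM.
lookup_values <- function(filename, qr, n = 100000L) {
  out <- setNames(rep(NA_real_, length(qr)), qr)
  con <- file(filename, open = "rt")
  on.exit(close(con))
  while (length(lines <- readLines(con, n = n))) {
    lines <- lines[!grepl("^#", lines)]          # drop comment lines
    parts <- strsplit(lines, "[[:space:]]+")
    tags  <- vapply(parts, `[`, "", 1L)
    hit   <- tags %in% qr
    if (any(hit)) {
      out[tags[hit]] <- as.numeric(vapply(parts[hit], `[`, "", 2L))
    }
  }
  out
}

lookup_values("repo.txt", c("AAC", "ATT"))
#  AAC  ATT
# 0.02 0.70
```

Only one batch of lines is in memory at a time, so n trades memory for the number of readLines calls.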
I agree on the database solution. Databases are the right tool for this kind of problem. Just consider the start-up cost of setting up the database: this could be a very time-consuming task for someone who is not familiar with database technology.

Using file() does not really read the whole file; it simply opens a connection to the file without reading it. countLines should do something like "wc -l" in a bash shell.

I would say that if this is a one-time job, this solution should work even though it is not the fastest. If the job is a repetitive one, then a database solution is surely better.

A.

Wacek Kusnierczyk wrote:
> if the file is really large, reading it twice may add a considerable penalty:
>
> r at quantide.com wrote:
>> Something like this should work:
>>
>> library(R.utils)
>> out = numeric()
>> qr = c("AAC", "ATT")
>> n = countLines("test.txt")           # 1st pass
>> file = file("test.txt", "r")
>> for (i in 1:n) {
>>     line = readLines(file, n = 1)    # 2nd pass
>>     A = strsplit(line, split = " ")[[1]][1]
>>     if (is.element(A, qr)) {
>>         value = as.numeric(strsplit(line, split = " ")[[1]][2])
>>         out = c(out, value)
>>     }
>> }
>
> if this is a one-go task, counting the lines does not pay, and why bother. if this is a repetitive task, a database-based solution will probably be a better idea.
>
> vQ
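For what it's worth, the double pass Wacek points out can be avoided entirely: instead of counting lines first, loop until readLines returns nothing. A single-pass variant of the quoted code might look like this:

```r
# Single pass: stop when readLines returns zero lines, so countLines
# (and the R.utils dependency) is no longer needed.
out <- numeric()
qr  <- c("AAC", "ATT")
con <- file("repo.txt", open = "rt")
repeat {
  line <- readLines(con, n = 1L)
  if (length(line) == 0L) break            # end of file reached
  fields <- strsplit(line, "[[:space:]]+")[[1]]
  if (fields[1] %in% qr) {
    out <- c(out, as.numeric(fields[2]))
  }
}
close(con)
out
# [1] 0.02 0.70
```

Reading one line at a time is slow in R, though; batching with readLines(con, n = <large>) as suggested earlier in the thread is the same idea with far fewer calls.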
On Fri, Jan 16, 2009 at 5:52 AM, r at quantide.com <r at quantide.com> wrote:
> I agree on the database solution.
> Databases are the right tool for this kind of problem.
> Just consider the start-up cost of setting up the database. This could be a
> very time-consuming task for someone who is not familiar with database
> technology.

Using sqldf, as mentioned previously in this thread, allows one to use the SQLite database with no setup at all. sqldf automatically creates the database, generates the record layout, loads the file (not going through R but outside of R, so R does not slow it down), extracts the portion you want into R by issuing the appropriate calls to RSQLite/DBI, and destroys the database afterwards, all automatically. When you install sqldf it automatically installs RSQLite and the SQLite database itself, so the entire installation is just one line:

install.packages("sqldf")
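A sketch of this route for the original problem, assuming the header row of repo.txt is plain `tag value` (without the leading `#`) and the separator is a single space; `read.csv.sql` is the sqldf helper that loads a file straight into SQLite and returns only the rows the SQL selects:

```r
install.packages("sqldf")   # one-time setup; pulls in RSQLite and SQLite
library(sqldf)

# The file is loaded into a temporary SQLite database outside of R;
# in the SQL, the file is referred to by the table name "file".
res <- read.csv.sql("repo.txt",
                    sql    = "select value from file where tag in ('AAC', 'ATT')",
                    header = TRUE,
                    sep    = " ")
res$value
```

Only the selected rows ever become an R data frame, which is the point for a multi-gigabyte file.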
Hi,

> Unless you specify an in-memory database the database is stored on disk.

Thanks for your explanation. I just downloaded 'sqldf'. Where can I find the option for that? I can't see the command in sqldf. I looked at:

envir = parent.frame()

but that doesn't appear to be the one.

- Gundala Viswanath
Jakarta - Indonesia

> On Fri, Jan 16, 2009 at 10:59 AM, Gundala Viswanath <gundalav at gmail.com> wrote:
>> Hi Gabor,
>>
>>> the file itself is read into a database
>>
>> The above doesn't use RAM memory?
>>
>>> without ever going through R so your memory requirements correspond to what
>>> you extract, not the size of the file.
>>>
>>> On Fri, Jan 16, 2009 at 10:49 AM, Gundala Viswanath <gundalav at gmail.com> wrote:
>>>> Hi Gabor,
>>>>
>>>> Do you mean storing data in 'sqldf' doesn't take memory?
>>>> For example, I have a 3GB data file. With a standard R object using read.table()
>>>> the object size will explode to about twice that, ~6GB. My current 4GB RAM
>>>> cannot handle that.
>>>>
>>>> Do you mean that with 'sqldf' this is not an issue? Why is that?
>>>>
>>>> Sorry for my naive question.
>>>>
>>>> - Gundala Viswanath
>>>> Jakarta - Indonesia
>>>>
>>>> [earlier quoted text trimmed]
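If I read the sqldf documentation correctly, the option being asked about is the `dbname` argument rather than `envir`: sqldf() defaults to an in-memory SQLite database, while passing a file path makes it disk-backed, and read.csv.sql already defaults to a temporary on-disk database for exactly this reason. A sketch, with the same assumptions about repo.txt's layout as above:

```r
library(sqldf)

# Explicitly disk-backed: the temporary SQLite database lives in this
# file, so the loaded table never has to fit in RAM.
res <- read.csv.sql("repo.txt",
                    sql    = "select value from file where tag in ('AAC', 'ATT')",
                    header = TRUE,
                    sep    = " ",
                    dbname = tempfile())
res$value
```

The `tempfile()` path is illustrative; any writable file name would do, and the database is cleaned up automatically afterwards.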