thr3ads.net - R help - [R] iterators : checkFunc with ireadLines [May 2020]

Jeff Newmiller
2020-May-20 03:01 UTC
[R] iterators : checkFunc with ireadLines

There is also apparently a package called disk.frame that you might consider.

On May 19, 2020 12:07:38 AM PDT, Laurent Rhelp <LaurentRHelp at free.fr> wrote:
>OK, thank you for the advice. I will take some time to look at these
>packages in detail.
>
>
>On 19/05/2020 at 05:44, Jeff Newmiller wrote:
>> Laurent... Bill is suggesting building your own indexed database...
>> but this has been done before, so re-inventing the wheel seems
>> inefficient and risky. It is actually impossible to create such a beast
>> without reading the entire file into memory at least temporarily
>> anyway, so you are better off looking at ways to process the entire
>> file efficiently.
>>
>> For example, you could load the data into a SQLite database in a
>> couple of lines of code and use SQL directly, or use the sqldf data
>> frame interface, or use dplyr to query the database.
>>
>> Or you could look at read_csv_chunked from the readr package.
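A minimal sketch of the chunked-read suggestion, assuming the readr package is installed and the file is tab-separated with the sensor name in the first column. It uses read_delim_chunked (the delimiter-generic sibling of read_csv_chunked); the stand-in file, the tiny chunk size, and the callback name keep_sensors are illustrative assumptions, not taken from the thread.

```r
library(readr)

sensors <- c("N053", "N163")

# Small stand-in for the large file (tab-separated, sensor name first).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Time\t0\t0.000999\t0.001999",
  "N023\t-0.031323\t-0.035026\t-0.029759",
  "N053\t-0.014083\t-0.004741\t0.001443",
  "N123\t-0.019008\t-0.013494\t-0.01318",
  "N163\t-0.054023\t-0.049345\t-0.037158",
  "N193\t-0.022171\t-0.022384\t-0.022338"
), path)

# Callback applied to each chunk: keep only the rows for the wanted sensors.
# (keep_sensors is a hypothetical name chosen for this sketch.)
keep_sensors <- function(chunk, pos) chunk[chunk$X1 %in% sensors, ]

selected <- read_delim_chunked(
  path,
  callback   = DataFrameCallback$new(keep_sensors),
  delim      = "\t",
  chunk_size = 2,       # tiny here so that several chunks are exercised
  col_names  = FALSE,   # columns are then named X1, X2, ...
  col_types  = "cddd"   # one character column, three numeric columns
)
print(selected)
```

Only one chunk is held in memory at a time, and the filtered rows from each chunk are bound together into the returned data frame.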
>>
>> On May 18, 2020 11:37:46 AM PDT, William Michels via R-help
>> <r-help at r-project.org> wrote:
>>> Hi Laurent,
>>>
>>> Thank you for explaining your size limitations. Below is an example
>>> using the read.fwf() function to grab the first column of your input
>>> file (in 2000 row chunks). This column is converted to an index, and
>>> the index is used to create an iterator useful for skipping lines when
>>> reading input with scan(). (You could try processing your large file
>>> in successive 2000 line chunks, or whatever number of lines fits into
>>> memory). Maybe not as elegant as the approach you were going for, but
>>> read.fwf() should be pretty efficient:
>>>
>>> > sensors <-  c("N053", "N163")
>>> > read.fwf("test2.txt", widths=c(4), as.is=TRUE, flush=TRUE, n=2000, skip=0)
>>>     V1
>>> 1 Time
>>> 2 N023
>>> 3 N053
>>> 4 N123
>>> 5 N163
>>> 6 N193
>>> > first_col <- read.fwf("test2.txt", widths=c(4), as.is=TRUE, flush=TRUE, n=2000, skip=0)
>>> > which(first_col$V1 %in% sensors)
>>> [1] 3 5
>>> > index1 <- which(first_col$V1 %in% sensors)
>>> > iter_index1 <- iter(1:2000, checkFunc= function(n) {n %in% index1})
>>> > unlist(scan(file="test2.txt", what=list("","","","","","","","","",""),
>>>     flush=TRUE, multi.line=FALSE, skip=nextElem(iter_index1)-1, nlines=1, quiet=TRUE))
>>> [1] "N053"      "-0.014083" "-0.004741" "0.001443"  "-0.010152"
>>> "-0.012996" "-0.005337" "-0.008738" "-0.015094" "-0.012104"
>>> > unlist(scan(file="test2.txt", what=list("","","","","","","","","",""),
>>>     flush=TRUE, multi.line=FALSE, skip=nextElem(iter_index1)-1, nlines=1, quiet=TRUE))
>>> [1] "N163"      "-0.054023" "-0.049345" "-0.037158" "-0.04112"
>>> "-0.044612" "-0.036953" "-0.036061" "-0.044516" "-0.046436"
>>> (Note for this email and the previous one, I've deleted the first
>>> "hash" character from each line of your test file for clarity).
>>>
>>> HTH, Bill.
>>>
>>> W. Michels, Ph.D.
>>>
>>>
>>>
>>>
>>>
>>> On Mon, May 18, 2020 at 3:35 AM Laurent Rhelp <LaurentRHelp at free.fr> wrote:
>>>> Dear William,
>>>>    Thank you for your answer.
>>>> My file is very large, so I cannot read it into memory (I cannot use
>>>> read.table). So I want to put in memory only the lines I need to process.
>>>> With readLines, as I did, it works, but I would like to use an iterator
>>>> and a foreach loop to understand this way of doing it, because I thought
>>>> that it was a better way to write clean code.
>>>>
>>>>
>>>> On 18/05/2020 at 04:54, William Michels wrote:
>>>>> Apologies, Laurent, for this two-part answer. I misunderstood your
>>>>> post where you stated you wanted to "filter(ing) some
>>>>> selected lines according to the line name... ." I thought that meant
>>>>> you had a separate index (like a series of primes) that you wanted to
>>>>> use to only read-in selected line numbers from a file (test file below
>>>>> with numbers 1:1000 each on a separate line):
>>>>>
>>>>> > library(gmp)
>>>>> > library(iterators)
>>>>> > iprime <- iter(1:100, checkFunc = function(n) isprime(n))
>>>>> > scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>>>>> Read 1 item
>>>>> [1] 2
>>>>> > scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>>>>> Read 1 item
>>>>> [1] 3
>>>>> > scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>>>>> Read 1 item
>>>>> [1] 5
>>>>> > scan(file="one_thou_lines.txt", skip=nextElem(iprime)-1, nlines=1)
>>>>> Read 1 item
>>>>> [1] 7
>>>>> However, what it really seems that you want to do is read each line of
>>>>> a (possibly enormous) file, test each line "string-wise" to keep or
>>>>> discard, and if you're keeping it, append the line to a list. I can
>>>>> certainly see the advantage of this strategy for reading in very, very
>>>>> large files, but it's not clear to me how the "ireadLines" function
>>>>> (in the "iterators" package) will help you, since it doesn't seem to
>>>>> generate anything but a sequential index.
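The read, test, keep-or-discard strategy described above can be sketched in base R alone, without an iterator. This is only an illustration under stated assumptions: the helper name read_selected_lines, the tab separator, and the small stand-in file are hypothetical, and the chunk size is deliberately tiny.

```r
# Hypothetical helper: read a file in chunks and keep only the lines whose
# first field (split on `sep`) is one of `keys`.
read_selected_lines <- function(path, keys, sep = "\t", chunk_size = 2000) {
  con <- file(path, "r")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0) break          # end of file reached
    first_field <- vapply(strsplit(chunk, sep, fixed = TRUE),
                          `[`, character(1), 1)
    kept <- c(kept, chunk[first_field %in% keys])
  }
  kept
}

# Small stand-in for the large file.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Time\t0\t0.000999\t0.001999",
  "N023\t-0.031323\t-0.035026\t-0.029759",
  "N053\t-0.014083\t-0.004741\t0.001443",
  "N123\t-0.019008\t-0.013494\t-0.01318",
  "N163\t-0.054023\t-0.049345\t-0.037158"
), path)

sensors <- c("N053", "N163")
kept <- read_selected_lines(path, sensors, chunk_size = 2)

# Parse only the retained lines into a data frame.
selected <- read.table(text = kept, sep = "\t", stringsAsFactors = FALSE)
print(selected)
```

Memory use is bounded by one chunk plus the retained lines, so the approach only helps when the selected lines are a small fraction of the file.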
>>>>>
>>>>> Anyway, below is an absolutely standard read-in of your data using
>>>>> read.table(). Hopefully some of the code I've posted has been useful
>>>>> to you.
>>>>>
>>>>> > sensors <-  c("N053", "N163")
>>>>> > read.table("test2.txt")
>>>>>     V1        V2        V3        V4        V5        V6        V7        V8        V9       V10
>>>>> 1 Time  0.000000  0.000999  0.001999  0.002998  0.003998  0.004997  0.005997  0.006996  0.007996
>>>>> 2 N023 -0.031323 -0.035026 -0.029759 -0.024886 -0.024464 -0.026816 -0.033690 -0.041067 -0.038747
>>>>> 3 N053 -0.014083 -0.004741  0.001443 -0.010152 -0.012996 -0.005337 -0.008738 -0.015094 -0.012104
>>>>> 4 N123 -0.019008 -0.013494 -0.013180 -0.029208 -0.032748 -0.020243 -0.015089 -0.014439 -0.011681
>>>>> 5 N163 -0.054023 -0.049345 -0.037158 -0.041120 -0.044612 -0.036953 -0.036061 -0.044516 -0.046436
>>>>> 6 N193 -0.022171 -0.022384 -0.022338 -0.023304 -0.022569 -0.021827 -0.021996 -0.021755 -0.021846
>>>>> > Laurent_data <- read.table("test2.txt")
>>>>> > Laurent_data[Laurent_data$V1 %in% sensors, ]
>>>>>     V1        V2        V3        V4        V5        V6        V7        V8        V9       V10
>>>>> 3 N053 -0.014083 -0.004741  0.001443 -0.010152 -0.012996 -0.005337 -0.008738 -0.015094 -0.012104
>>>>> 5 N163 -0.054023 -0.049345 -0.037158 -0.041120 -0.044612 -0.036953 -0.036061 -0.044516 -0.046436
>>>>>
>>>>> Best, Bill.
>>>>>
>>>>> W. Michels, Ph.D.
>>>>>
>>>>>
>>>>> On Sun, May 17, 2020 at 5:43 PM Laurent Rhelp <LaurentRHelp at free.fr> wrote:
>>>>>> Dear R-Help List,
>>>>>>
>>>>>>       I would like to use an iterator to read a file, filtering some
>>>>>> selected lines according to the line name, in order to use it afterwards
>>>>>> in a foreach loop. I wanted to use the checkFunc argument, as in the
>>>>>> following example found on the internet, to select only prime numbers:
>>>>>>
>>>>>> iprime <- iter(1:100, checkFunc = function(n) isprime(n))
>>>>>>
>>>>>> (https://datawookie.netlify.app/blog/2013/11/iterators-in-r/)
>>>>>>
>>>>>> but the checkFunc argument does not seem to be available with the
>>>>>> function ireadLines (package iterators). So I wrote the code below to
>>>>>> solve my problem, but I am sure that I am missing something about using
>>>>>> iterators with files.
>>>>>> Since I found nothing on the web about ireadLines and the checkFunc
>>>>>> argument, could somebody help me understand how to use an iterator
>>>>>> (and a foreach loop) on files, keeping only selected lines?
>>>>>>
>>>>>> Thank you very much
>>>>>> Laurent
>>>>>>
>>>>>> Presently here is my code:
>>>>>>
>>>>>> ##        mock file to read: test.txt
>>>>>> ##
>>>>>> # Time    0    0.000999    0.001999    0.002998    0.003998    0.004997    0.005997    0.006996    0.007996
>>>>>> # N023    -0.031323    -0.035026    -0.029759    -0.024886    -0.024464    -0.026816    -0.03369    -0.041067    -0.038747
>>>>>> # N053    -0.014083    -0.004741    0.001443    -0.010152    -0.012996    -0.005337    -0.008738    -0.015094    -0.012104
>>>>>> # N123    -0.019008    -0.013494    -0.01318    -0.029208    -0.032748    -0.020243    -0.015089    -0.014439    -0.011681
>>>>>> # N163    -0.054023    -0.049345    -0.037158    -0.04112    -0.044612    -0.036953    -0.036061    -0.044516    -0.046436
>>>>>> # N193    -0.022171    -0.022384    -0.022338    -0.023304    -0.022569    -0.021827    -0.021996    -0.021755    -0.021846
>>>>>>
>>>>>>
>>>>>> # sensors to keep
>>>>>>
>>>>>> sensors <-  c("N053", "N163")
>>>>>>
>>>>>>
>>>>>> library(iterators)
>>>>>>
>>>>>> library(rlist)
>>>>>>
>>>>>>
>>>>>> file_name <- "test.txt"
>>>>>>
>>>>>> con_obj <- file( file_name , "r")
>>>>>> ifile <- ireadLines( con_obj , n = 1 )
>>>>>>
>>>>>>
>>>>>> ## I do not do a loop for the example
>>>>>>
>>>>>> res <- list()
>>>>>>
>>>>>> r <- get_Lines_iter( ifile , sensors)
>>>>>> res <- list.append( res , r )
>>>>>> res
>>>>>> r <- get_Lines_iter( ifile , sensors)
>>>>>> res <- list.append( res , r )
>>>>>> res
>>>>>> r <- get_Lines_iter( ifile , sensors)
>>>>>> do.call("cbind",res)
>>>>>>
>>>>>> ## the function get_Lines_iter to select and process the line
>>>>>>
>>>>>> get_Lines_iter  <-  function( iter , sensors, sep = '\t', quiet = FALSE){
>>>>>>      ## read the next record in the iterator
>>>>>>      r = try( nextElem(iter) )
>>>>>>     while(  TRUE ){
>>>>>>        if( class(r) == "try-error") {
>>>>>>              return( stop("The iterator is empty") )
>>>>>>       } else {
>>>>>>       ## split the read line according to the separator
>>>>>>        r_txt <- textConnection(r)
>>>>>>        fields <- scan(file = r_txt, what = "character", sep = sep,
>>>>>>                       quiet = quiet)
>>>>>>         ## test if we have to keep the line
>>>>>>         if( fields[1] %in% sensors){
>>>>>>           ## data processing for the selected line (for the example
>>>>>>           ## transformation in dataframe)
>>>>>>           n <- length(fields)
>>>>>>           x <- data.frame( as.numeric(fields[2:n]) )
>>>>>>           names(x) <- fields[1]
>>>>>>           ## We return the values
>>>>>>           print(paste0("sensor ",fields[1]," ok"))
>>>>>>           return( x )
>>>>>>         }else{
>>>>>>          print(paste0("Sensor ", fields[1] ," not selected"))
>>>>>>          r = try(nextElem(iter) )}
>>>>>>       }
>>>>>> }# end while loop
>>>>>> }
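One possible way to get checkFunc-like behaviour on top of ireadLines is to wrap it in a small custom iterator. The sketch below assumes the iterators package is installed and follows the package's documented pattern for custom iterators (a list with a nextElem function and the classes "abstractiter" and "iter"). The wrapper name ifilter_lines, the helper next_or_null, and the stand-in file are hypothetical names introduced for illustration.

```r
library(iterators)

# Hypothetical wrapper: yields only the elements of `it` for which
# keep(x) is TRUE, skipping all others.
ifilter_lines <- function(it, keep) {
  nextEl <- function() {
    repeat {
      x <- nextElem(it)   # signals a "StopIteration" error at end of input
      if (keep(x)) return(x)
    }
  }
  obj <- list(nextElem = nextEl)
  class(obj) <- c("ifilter_lines", "abstractiter", "iter")
  obj
}

# Hypothetical helper: return NULL at end of input, re-raise other errors.
next_or_null <- function(it) {
  tryCatch(nextElem(it), error = function(e) {
    if (identical(conditionMessage(e), "StopIteration")) NULL else stop(e)
  })
}

sensors <- c("N053", "N163")
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Time\t0\t0.000999",
  "N023\t-0.031323\t-0.035026",
  "N053\t-0.014083\t-0.004741",
  "N123\t-0.019008\t-0.013494",
  "N163\t-0.054023\t-0.049345"
), path)

con <- file(path, "r")
it <- ifilter_lines(
  ireadLines(con, n = 1),
  function(line) strsplit(line, "\t", fixed = TRUE)[[1]][1] %in% sensors
)

# Collect one numeric column per selected sensor.
res <- list()
repeat {
  line <- next_or_null(it)
  if (is.null(line)) break
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  res[[fields[1]]] <- as.numeric(fields[-1])
}
close(con)

out <- as.data.frame(res)
print(out)
```

Because the wrapper inherits from "abstractiter", it should also be usable as the iteration variable of a foreach loop, although that is not exercised here.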
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> This email has been checked for viruses by Avast antivirus software.
>>>>>> https://www.avast.com/antivirus
>>>>>>
>>>>>>           [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
-- 
Sent from my phone. Please excuse my brevity.