Hi Ivan,

Indeed, it is a good idea. I am under MS Windows, but I can use the bash
commands I use with git. I will see how to do that with the Unix command
lines.

On 20/05/2020 at 09:46, Ivan Krylov wrote:
> Hi Laurent,
>
> I am not saying this will work every time and I do recognise that this
> is very different from a more general solution that you had envisioned,
> but if you are on a UNIX-like system, or have the relevant utilities
> installed and on the %PATH% on Windows, you can filter the input file
> line-by-line using a pipe and an external program:
>
> On Sun, 17 May 2020 15:52:30 +0200
> Laurent Rhelp <LaurentRHelp at free.fr> wrote:
>
>> # sensors to keep
>> sensors <- c("N053", "N163")
>
> # filter on the beginning of the line
> i <- pipe("grep -E '^(N053|N163)' test.txt")
> # or:
> # filter on the beginning of the given column
> # (use $2 for the second column, etc.)
> i <- pipe("awk '($1 ~ \"^(N053|N163)\")' test.txt")
> # or:
> # since your message is full of Unicode non-breaking spaces, I have to
> # bring in heavier machinery to handle those correctly;
> # only this solution manages to match full column values
> # (here you can also use $F[1] for the second column and so on)
> i <- pipe("perl -CSD -F'\\s+' -lE \\
>   'print join qq{\\t}, @F if $F[0] =~ /^(N053|N163)$/' \\
>   test.txt
> ")
> lines <- read.table(i)  # closes i when done
>
> The downside of this approach is having to shell-escape the command
> lines, which can become complicated, and choosing between the use of
> regular expressions and more wordy programs (Unicode whitespace in the
> input doesn't help, either).

--
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus
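Ivan's grep and awk filters can be tried end-to-end on a throwaway file. The sketch below uses hypothetical three-column data standing in for the real test.txt from the thread:

```shell
# Create a small stand-in for test.txt (hypothetical data):
printf 'N053 -0.01 0.02\nN100 0.30 0.40\nN163 -0.05 0.06\n' > test.txt

# Keep only lines that begin with N053 or N163 (anchored at line start):
grep -E '^(N053|N163)' test.txt

# Same filter, but matched against the first whitespace-separated field:
awk '($1 ~ "^(N053|N163)")' test.txt
```

Either command prints only the N053 and N163 rows, so a subsequent `read.table(pipe(...))` in R sees two lines instead of the whole file.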
Hi Laurent,

Seeking to give you an "R-only" solution, I thought the read.fwf()
function might be useful (to read in your first column of data only).
However, Jeff is correct that this is a poor strategy, since read.fwf()
reads the entire file into R (documented in "Fixed-width-format files",
Section 2.2 of the R Data Import/Export manual).

Jeff has suggested a number of packages, as well as using a database.
Ivan Krylov has posted answers using grep, awk and perl (perl5, to
disambiguate). [In point of fact, the R Data Import/Export manual
suggests using perl.] Similar to Ivan, I've posted code below using the
Raku programming language (the language formerly known as Perl 6).
Regexes are claimed to be more readable, but are currently very slow in
Raku. However, on the plus side, the language is designed to handle
Unicode gracefully:

> # pipe() using raku-grep on Laurent's data (sep = multiple whitespace):
> con_obj1 <- pipe(paste("raku -e '.put for lines.grep( / ^^N053 | ^^N163 /, :p );' ", "Laurents.txt"), open="rt")
> p6_import_a <- scan(file=con_obj1, what=list("","","","","","","","","",""), flush=TRUE, multi.line=FALSE, quiet=TRUE)
> close(con_obj1)
> as.data.frame(sapply(p6_import_a, t), stringsAsFactors=FALSE)
  V1   V2        V3        V4        V5        V6        V7        V8        V9       V10
1  2 N053 -0.014083 -0.004741  0.001443 -0.010152 -0.012996 -0.005337 -0.008738 -0.015094
2  4 N163 -0.054023 -0.049345 -0.037158  -0.04112 -0.044612 -0.036953 -0.036061 -0.044516

> # pipe() using raku "starts-with" to find a genbank ID in a >3GB TSV file.
> # "lines[0..5]" restricts raku to reading the first 6 lines!
> # Change "lines[0..5]" to "lines" to run the raku code on the whole file:
> con_obj2 <- pipe(paste("raku -e '.put for lines[0..5].grep( *.starts-with(q[A00145]), :p);' ", "genbankIDs_3GB.tsv"), "rt")
> p6_import_b <- read.table(con_obj2, sep="\t")
> close(con_obj2)
> p6_import_b
  V1     V2       V3          V4 V5
1  4 A00145 A00145.1 IFN-alpha A NA

> # unicode test using R's system() function:
> try(system("raku -ne '.grep( / ?? | ????? | ????? | ?????? /, :v ).put;' hello_7lang.txt", intern = TRUE, ignore.stderr = FALSE))
[1] ""                ""                ""                "?? Chinese"
[5] "????? Japanese"  "????? Arabic"    "?????? Russian"

[Special thanks to Brad Gilbert, Joseph Brenner and others on the
perl6-users mailing list. All errors above are my own.]

HTH, Bill.

W. Michels, Ph.D.

On Fri, May 22, 2020 at 4:48 AM Laurent Rhelp <LaurentRHelp at free.fr> wrote:
> Hi Ivan,
> Indeed, it is a good idea. I am under MS Windows, but I can use the
> bash commands I use with git. I will see how to do that with the Unix
> command lines.
>
> On 20/05/2020 at 09:46, Ivan Krylov wrote:
> > [Ivan's message of 20/05/2020, quoted in full above, trimmed]

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Strike that one sentence in brackets, "[In point of fact, the R Data
Import/Export manual suggests using perl]", about pre-processing data
before loading it into R. The manual's recommendation only pertains to
large fixed-width-format files [see #1], whereas Laurent's data is
whitespace-delimited:

> read.table("Laurents.txt")
> read.delim("Laurents.txt", sep="")

Best Regards, Bill.

W. Michels, Ph.D.

Citation:
[#1] https://cran.r-project.org/doc/manuals/r-release/R-data.html#Fixed_002dwidth_002dformat-files
I installed raku on my PC to test your solution. The command

raku -e '.put for lines.grep( / ^^N053 | ^^N163 /, :p );' Laurents.txt

works fine when I type it at the bash prompt, but when I use the pipe()
command in R as you describe, there is nothing in lines after

lines <- read.table(i)

There is the same problem with Ivan's solution: the command

grep -E '^(N053|N163)' test.txt

works fine at the bash prompt, but not

i <- pipe("grep -E '^(N053|N163)' test.txt"); lines <- read.table(i)

Maybe it is because I work with MS Windows?

thx
LP

On 24/05/2020 at 04:34, William Michels wrote:
> Hi Laurent,
>
> Seeking to give you an "R-only" solution, I thought the read.fwf()
> function might be useful (to read in your first column of data only).
> However, Jeff is correct that this is a poor strategy, since read.fwf()
> reads the entire file into R (documented in "Fixed-width-format files",
> Section 2.2 of the R Data Import/Export manual).
>
> Jeff has suggested a number of packages, as well as using a database.
> Ivan Krylov has posted answers using grep, awk and perl (perl5, to
> disambiguate). [In point of fact, the R Data Import/Export manual
> suggests using perl.] Similar to Ivan, I've posted code below using the
> Raku programming language (the language formerly known as Perl 6).
> Regexes are claimed to be more readable, but are currently very slow
> in Raku.
> [remainder of quoted message trimmed]
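Laurent's symptom, a command that works at the bash prompt but yields nothing through pipe(), is consistent with a quoting difference: on Windows, pipe() hands the command line to cmd.exe, which, unlike bash, does not treat single quotes as quoting characters, so grep receives the literal quote marks and matches nothing. A minimal sketch of the workaround, assuming grep is on the %PATH%: use double quotes around the pattern, which bash and cmd.exe both accept here.

```shell
# Hypothetical stand-in for test.txt:
printf 'N053 -0.01 0.02\nN100 0.30 0.40\nN163 -0.05 0.06\n' > test.txt

# Double-quote the pattern instead of single-quoting it; both bash and
# cmd.exe pass it to grep intact in this form:
grep -E "^(N053|N163)" test.txt
```

In R that would read `i <- pipe('grep -E "^(N053|N163)" test.txt'); lines <- read.table(i)`, with the outer single quotes now consumed by R itself rather than by the shell (untested on Windows; a sketch, not a confirmed fix).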