luke-tierney at uiowa.edu
2024-Jan-26 21:05 UTC
[Rd] [External] readChar() could read the whole file by default?
On Fri, 26 Jan 2024, Michael Chirico wrote:

> I am curious why readLines() has a default (n=-1L) to read the full
> file while readChar() has no default for nchars= (i.e., readChar(file)
> is an error). Is there a technical reason for this?
>
> I often[1] see code like paste(readLines(f), collapse="\n") which
> would be better served by readChar(), especially given issues with the
> global string cache I've come across[2]. But lacking the default, the
> replacement might come across less clean.

The string cache seems like a very dark pink herring to me. The fact
that the lines are allocated on the heap might create an issue; the
cache isn't likely to add much to that. In any case I would need to
see a realistic example to convince me this is worth addressing on
performance grounds.

I don't see any reason in principle not to have readChar and readBin
read the entire file if n = -1 (others might) but someone would need
to write a patch to implement that.

Best,

luke

> For my own purposes the incantation readChar(file, file.size(file)) is
> ubiquitous. Taking CRAN code[3] as a sample[4], 41% of readChar()
> calls use either readChar(f, file.info(f)$size) or readChar(f,
> file.size(f))[5].
>
> Thanks for the consideration and feedback,
> Mike C
>
> [1] e.g. a quick search shows O(100) usages in CRAN packages:
> https://github.com/search?q=org%3Acran+%2Fpaste%5B%28%5D%5Cs*readLines%5B%28%5D.*%5B%29%5D%2C%5Cs*collapse%5Cs*%3D%5Cs*%5B%27%22%5D%5B%5C%5C%5D%2F+lang%3AR&type=code,
> and O(1000) usages generally on GitHub:
> https://github.com/search?q=lang%3AR+%2Fpaste%5B%28%5D%5Cs*readLines%5B%28%5D.*%5B%29%5D%2C%5Cs*collapse%5Cs*%3D%5Cs*%5B%27%22%5D%5B%5C%5C%5D%2F+lang%3AR&type=code
> [2] AIUI the readLines() approach "pollutes" the global string cache
> with potentially 1000s/10000s of strings for each line, only to get
> them gc()'d after combining everything with paste(collapse="\n")
> [3] The mirror on GitHub, which includes archived packages as well as
> current (well, eventually-consistent) versions.
> [4] Note that usage in packages is likely not representative of usage
> in scripts, e.g. I often saw readChar(f, 1), or eol-finders like
> readChar(f, 500) + grep("[\n\r]"), which makes more sense to me as
> something to find in package internals than in analysis scripts. FWIW
> I searched an internal codebase (scripts and packages) and found 70%
> of usages reading the full file.
> [5] repro: https://gist.github.com/MichaelChirico/247ea9500460dca239f031e74bdcf76b
> requires GitHub PAT in env GITHUB_PAT for API permissions.
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tierney at uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
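For reference, the two idioms compared in this message can be put side by side. This is only a minimal sketch, assuming a small ASCII file with Unix line endings; the temporary file and its contents are made up for illustration and do not come from the thread.

```r
# Create a small temporary file (illustrative contents, not from the thread).
# A binary-mode connection keeps "\n" line endings on every platform.
f <- tempfile(fileext = ".txt")
con <- file(f, "wb")
writeLines(c("alpha", "beta", "gamma"), con, sep = "\n")
close(con)

# Idiom 1: read one string per line, then join them back together.
via_lines <- paste(readLines(f), collapse = "\n")

# Idiom 2: read the whole file in one call, passing the file size in bytes
# as the character count (an upper bound on the number of characters).
via_char <- readChar(f, file.size(f))

# readLines() drops the line terminators, so the joined result lacks the
# trailing newline that readChar() preserves.
same_modulo_newline <- identical(via_char, paste0(via_lines, "\n"))
print(same_modulo_newline)
```

The two results differ only in the trailing newline here, which is worth keeping in mind when replacing one idiom with the other.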
Toby Hocking
2024-Jan-29 18:09 UTC
[Rd] [External] readChar() could read the whole file by default?
My opinion is that the proposed feature would be greatly appreciated by
users. I had always wondered if I was the only one doing
paste(readLines(f), collapse="\n") all the time. It would be great to
have the proposed, more straightforward way to read the whole file as
a string: readChar("my_file.txt", -1) or even better
readChar("my_file.txt")

Thanks for your detailed analysis Michael.
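Until such a default exists, the requested behaviour can be approximated in user code. The sketch below is an assumption-laden illustration, not a patch to base R: read_whole_file() is a hypothetical helper name, and it simply treats a negative nchars as "use file.size()", which only makes sense for regular files given by path.

```r
# Hypothetical helper (not part of base R): emulates the proposed default
# by treating a negative nchars as "read the entire file".
read_whole_file <- function(path, nchars = -1L) {
  if (nchars < 0L) {
    # The file size in bytes is an upper bound on the number of characters.
    nchars <- file.size(path)
  }
  readChar(path, nchars)
}

# Usage on a small temporary file (illustrative contents).
f2 <- tempfile(fileext = ".txt")
con2 <- file(f2, "wb")
writeLines(c("first line", "second line"), con2, sep = "\n")
close(con2)

whole  <- read_whole_file(f2)      # entire file as one string
prefix <- read_whole_file(f2, 5L)  # explicit count still works
```

Connections without a known size (pipes, URLs, compressed files) would need a different strategy, which is presumably part of what a real patch would have to address.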