Gabor Grothendieck
2003-Nov-26 15:36 UTC
[Rd] Question about Unix file paths (and proposal for new regexp class)
> Date: Wed, 26 Nov 2003 13:52:44 +0100
> From: Martin Maechler <maechler@stat.math.ethz.ch>
> To: <Kurt.Hornik@wu-wien.ac.at>
> Cc: <r-devel@stat.math.ethz.ch>
> Subject: Re: [Rd] Question about Unix file paths
>
>
> >>>>> "Kurt" == Kurt Hornik <Kurt.Hornik@wu-wien.ac.at>
> >>>>> on Wed, 26 Nov 2003 10:05:42 +0100 writes:
>
> >>>>> Prof Brian Ripley writes:
> >> On Mon, 24 Nov 2003, Duncan Murdoch wrote:
> >>> >Duncan Murdoch <dmurdoch@pair.com> writes:
> >>> >
> >>> >> Gabor Grothendieck pointed out a bug to me in
> >>> >> list.files(..., full.name=TRUE), that essentially
> >>> >> comes down to the fact that in Windows it's not
> >>> >> always valid to add a path separator (slash or
> >>> >> backslash) between a path specifier and a filename. For
> >>> >> example,
> >>> >>
> >>> >> c:foo
> >>> >>
> >>> >> is different from
> >>> >>
> >>> >> c:\foo
> >>> >>
> >>> >> and there are other examples.
> >>>
> >>> I've committed a change to r-patched to fix this in
> >>> Windows only. Sounds like it's not an issue elsewhere.
>
> >> I think there are some potential issues with doubling
> >> separators and final separators on dirs. On Unix file
> >> systems /part1//part2 and /path/to/dir/ are valid.
> >> However, file systems on Unix may not be Unix file
> >> systems: examples are earlier MacOS systems on MacOS X
> >> and mounted Windows and Novell systems on Linux. I would
> >> not want to assume that all of these combinations worked.
>
> >>> Gabor also suggested an option to use shell globbing
> >>> instead of regular expressions to select the files in
> >>> the list, e.g.
> >>>
> >>> list.files(dir="/", pattern="a*.dat", glob=T)
> >>>
> >>> This would be easy to do in Windows, but from the little
> >>> I know about Unix programming, would not be so easy
> >>> there, so I haven't done anything about it.
>
> >> It would be shell-dependent and OS-dependent as well as a
> >> retrograde step, as those who wanted to use regular
> >> expressions no longer would be able to.
>
> Kurt> Right. In any case, an explicit glob() function
> Kurt> seems preferable to me ...
>
> Good idea!
>
> More than 12 years ago, I had a similar one, and wrote a
> "pat2grep()" {pattern to grep regular expression} function
> --- for S-plus on Unix --- which I have now renamed to glob2regexp():
> -- still not really usable outside unix (or windows with the
> 'sed' tool in the path), nor perfect, but maybe a good start:
>
> sys <- function(...) system(paste(..., sep = ""))
>
> glob2regexp <- function(pattern)
> {
>     ## Purpose: Change "ls pattern" to "grep regular expression" pattern.
>     ## -------------------------------------------------------------------------
>     ## Author: Martin Maechler ETH Zurich, ~ 1991
>     sys("echo '", pattern, "'| sed ",
>         "'s/\\./\\\\./g;s/*/.*/g;s/?/./g; s/^/^/;s/$/$/; s/\\.\\*\\$$//'")
> }
>
> E.g.,
>
> > glob2regexp("a*.dat")
> ^a.*\.dat$
>
> > pat2grep("a?bc*.t??")
> ^a.bc.*\.t..$
>
> and one could use it as
>
> list.files(...., pattern = glob2regexp("a*.dat"))
>
> Of course, the function needs to be changed to simply use things like
> sub() and gsub() --- another minor exercise for our audience ...
>
> Martin

This is quite nifty. One advantage is that glob2regexp does not need
to know the directory.

Perhaps what is needed is a regexp class which stores the type of
regexp in the object itself: basic, extended, perl or glob. This would
clean up and unify various extra arguments floating around in a number
of functions.
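
To make both ideas concrete, here is a rough sketch, not a finished
proposal. It assumes that only "*" and "?" need translating (the same
cases the sed script covers) and that grepl() is available. The names
glob2regexp_sketch(), regexp() and regexp_matches() are made up for
illustration. The "basic" type is left out of the sketch.

```r
## Hypothetical pure-R variant of the sed-based glob2regexp() quoted above.
## Only '*' and '?' are translated; brackets, braces and escapes are not.
glob2regexp_sketch <- function(pattern)
{
    p <- gsub("\\.", "\\\\.", pattern)   # literal "."  ->  "\."
    p <- gsub("\\*", ".*", p)            # glob "*"     ->  ".*"
    p <- gsub("\\?", ".", p)             # glob "?"     ->  "."
    p <- paste("^", p, "$", sep = "")    # anchor at both ends
    sub("\\.\\*\\$$", "", p)             # drop a redundant trailing ".*$"
}

## Hypothetical constructor: the object itself records which kind of
## pattern it holds, so callers need no extra 'perl=' or 'glob=' argument.
regexp <- function(pattern, type = c("extended", "perl", "glob"))
{
    type <- match.arg(type)
    structure(list(pattern = pattern, type = type), class = "regexp")
}

## Hypothetical matcher: dispatch on the stored type and return a logical
## vector, one element per element of 'text'.
regexp_matches <- function(x, text)
{
    stopifnot(inherits(x, "regexp"))
    switch(x$type,
           glob     = grepl(glob2regexp_sketch(x$pattern), text),
           perl     = grepl(x$pattern, text, perl = TRUE),
           extended = grepl(x$pattern, text))
}

files <- c("a1.dat", "b1.dat", "a.data", "abc.dat")
files[regexp_matches(regexp("a*.dat", "glob"), files)]
## [1] "a1.dat"  "abc.dat"
```

A function such as list.files() could then accept one of these objects
as its pattern and would not need a separate glob argument.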