bioinformatics_guy
2008-Sep-15 16:29 UTC
[R] Scripting in R -- pattern matching, logic, system calls, the works!
I'm very new to R, so this might be a very simple question. First I'll lay out the hierarchy of my directories and my goals.

I have, say, 5 directories of the form "Coverage_(some number)", and in each one of these I have text files of the form "Length_(some number)", each of which contains about 30 numbers. The Length files are incremented by 5 from 0 to 100 (Length_(0,5,10,15,20,...)). Each file is to be averaged, where the average is the y-value and the length is the x-value in a linear regression.

What I want to do is write a script that looks in each of the Coverage directories, reads in each of the files, takes the means, and plots them in the form I specified above. The catch is, what if I only want to plot, say, Length_(20-50)? And what command/method is best for a linear regression? I've looked at m1(), but have not gotten it to work properly.

Below is some of the code I've put together:

topdir="~"

setwd(topdir)

### Took this function from a friend so I'm not sure what it's doing besides grep-ing a directory?
ll<-function(string)
{
    grep(string,dir(),value=T)
}

### I believe this is looking for all files of form below
subdir = ll("Coverage_[1-9][0-9]$")

### A for loop iterating through each of the sub directories.
for (i in subdir)
{
    #not sure what this line is doing as I found it on the internet on a similar function
    setwd(paste(getwd(),i,sep="/"))
    #This makes a vector of all the file names
    filelist=ll("Length_")

Can I use a regex or logic to take only the filelist entries I want?
And can I now get the mean of each Length_* file and put the results in a matrix (length x mean)?

Then finally, how do I do a linear regression of this?

--
View this message in context: http://www.nabble.com/Scripting-in-R----pattern-matching%2C-logic%2C-system-calls%2C-the-works%21-tp19496451p19496451.html
Sent from the R help mailing list archive at Nabble.com.
Don MacQueen
2008-Sep-15 22:06 UTC
[R] Scripting in R -- pattern matching, logic, system calls, the works!
I can't go through all the details, but hopefully this will help get you started.

If you look at the help page for the list.files() function, you will see this:

    list.files(path = ".", pattern = NULL, all.files = FALSE,
               full.names = FALSE, recursive = FALSE,
               ignore.case = FALSE)

The "." in path means to start at your current working directory. Assuming your 5 Coverage directories are subdirectories of your current working directory, that's what you want.

Then, setting recursive to TRUE will cause it to also list the contents of all subdirectories. Since your Length files are in the Coverage subdirectories, that's what you want.

Finally, the pattern argument returns only files that match the pattern, so something like pattern="Length" should get you just the files you want.

The result is a character vector containing the names of all your Length files. Try it and see.

Then, a simple loop over the vector of filenames, with an appropriate scan() or read.table() command for each, will read the data in.

If you need to restrict the files, say to Length_20, Length_25, Length_30, etc., then you'll have to do some more work. Look at

    as.numeric(gsub( 'Length_', '', filename))

to get just the number part of the filename, as a number, and then you can use numeric inequalities to identify whether or not any particular file is to be processed. Note that with recursive=TRUE each name includes its subdirectory (for example Coverage_10/Length_20), so strip the directory with basename() first.

Since you haven't shown what the contents of your files look like (two columns of numbers or what), I have no idea what to suggest for the part having to do with reading them in, plotting or doing linear regression.

The basic function for linear regression is lm().

Here is a summary:

    ## full.names=TRUE keeps the leading path, so the files can be
    ## read no matter what the current working directory is
    files <- list.files( '~' , pattern='Length', recursive=TRUE, full.names=TRUE)

    for (fl in files) {

      ## optional, to restrict to only certain files
      ## basename() drops the Coverage_*/ directory part of the path;
      ## a file extension, if there is one, would also have to be removed
      filenum <- as.numeric(gsub( 'Length_', '', basename(fl)))

      ## skip to next file if it isn't in the correct number range
      if (is.na(filenum) || filenum > 50 || filenum < 20) next

      ## a command to read the current file. perhaps:
      ##   tmp <- read.table(fl)

      ## commands to do statistics on the data in the current file. perhaps:
      ##   fit <- lm( y ~ x, data=tmp)

      ## some output
      cat('------ file =',fl,'-----\n')
      ##   print(fit)   # once fit has been created above
    }

This example doesn't restrict only to certain Coverage subdirectories.

-Don

--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062
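Putting the pieces of this thread together (list the files, filter on the number in the file name, average each file, regress mean on length), one possible sketch in R follows. It is only an illustration under stated assumptions: the helper name length_means() is hypothetical, each Length file is assumed to hold nothing but whitespace-separated numbers, and the demo data are made-up values written to a temporary directory so the example runs on its own.

```r
# Sketch only: function name, file layout and demo numbers are assumptions.

# Return one row per Length_<n> file found directly inside a Coverage_<n>
# directory below topdir, keeping lengths in [min_len, max_len].
length_means <- function(topdir = ".", min_len = 20, max_len = 50) {
  files <- list.files(topdir, recursive = TRUE, full.names = TRUE)
  # Match on the file name and on the name of its parent directory.
  ok <- grepl("^Length_[0-9]+(\\.[A-Za-z]+)?$", basename(files)) &
        grepl("^Coverage_[0-9]+$", basename(dirname(files)))
  files <- files[ok]
  # Extract the number from the file name and apply the range filter.
  len  <- as.numeric(sub("^Length_([0-9]+).*$", "\\1", basename(files)))
  keep <- len >= min_len & len <= max_len
  files <- files[keep]
  len   <- len[keep]
  # Mean of all numbers in each file.
  means <- vapply(files, function(f) mean(scan(f, quiet = TRUE)), numeric(1))
  data.frame(coverage = basename(dirname(files)),
             length   = len,
             mean     = unname(means),
             stringsAsFactors = FALSE)
}

# --- Demo with synthetic data (not real measurements) ---
demo_dir <- file.path(tempdir(), "coverage_demo")
for (cv in c(10, 20)) {
  d <- file.path(demo_dir, paste0("Coverage_", cv))
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  for (n in seq(0, 100, by = 5)) {
    # 30 identical made-up values, so the file mean is exactly 2 * n + cv
    writeLines(as.character(rep(2 * n + cv, 30)),
               file.path(d, paste0("Length_", n)))
  }
}

res <- length_means(demo_dir, min_len = 20, max_len = 50)

# One linear regression (mean as y, length as x) per Coverage directory.
fits <- lapply(split(res, res$coverage),
               function(d) lm(mean ~ length, data = d))

# Plot the points and fitted lines when run in an interactive session.
if (interactive()) {
  plot(mean ~ length, data = res, xlab = "Length", ylab = "Mean")
  for (f in fits) abline(f)
}
```

To use this on real data, call length_means() with the directory that contains the Coverage_* folders instead of demo_dir. Matching on basename() and dirname() rather than calling setwd() inside a loop avoids changing the working directory as a side effect.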