Dear R users,

At the moment I have a script and a function that calculate correlation matrices between all my data files. For each file, the function then picks the best-correlated file and uses it to fill the missing values (NAs) of the file being analysed: the data from the best-correlated file are copied automatically into the gaps. If the best-correlated file has no data at that position, the data are taken from the second-best-correlated file.

The problem is that, at the moment, raw data are taken from the best-correlated file, so I need to adapt these raw data to the file being filled. I would therefore like to automate a linear regression between the two files (once the best or second-best correlated file has been selected). Instead of filling the first file with raw data from the best-correlated file, the function should fill it with the values estimated by the regression, so that the filled data are more accurate.

The idea is to run lm() between the two files, extract the coefficients of the fitted line, compute the estimated values for the whole file (NAs included), and finally fill the gaps with these estimates. I hope my problem is clear.
Here's the function:

    process.all <- function(df.list, mat){
        # first pass: fill each station from its best-correlated station
        f <- function(station)
            na.fill(df.list[[ station ]], df.list[[ max.cor[station] ]])

        # second pass: fill the remaining gaps from the next-best stations
        g <- function(station){
            x <- df.list[[station]]
            if(any(is.na(x$data))){
                mat[row(mat) == col(mat)] <- -Inf
                nas <- which(is.na(x$data))
                ord <- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
                for(i in nas){
                    for(y in ord){
                        if(!is.na(df.list[[y]]$data[i])){
                            x$data[i] <- df.list[[y]]$data[i]
                            break
                        }
                    }
                }
            }
            x
        }

        n <- length(df.list)
        nms <- names(df.list)
        max.cor <- sapply(seq.int(n), get.max.cor, corhiver2008capt1)
        df.list <- lapply(seq.int(n), f)
        df.list <- lapply(seq.int(n), g)
        names(df.list) <- nms
        df.list
    }

I managed to do this for a small data.frame I created, but I don't know how to do it in this particular case.

Thanks a lot for your help!

--
View this message in context: http://r.789695.n4.nabble.com/add-an-automatized-linear-regression-in-a-function-tp4606047.html
Sent from the R help mailing list archive at Nabble.com.
On 04-05-2012 11:00, jeff6868 <geoffrey_klein at etu.u-bourgogne.fr> wrote:
> Date: Thu, 3 May 2012 06:45:59 -0700 (PDT)
> From: jeff6868 <geoffrey_klein at etu.u-bourgogne.fr>
> To: r-help at r-project.org
> Subject: [R] add an automatized linear regression in a function
> Message-ID: <1336052759474-4606047.post at n4.nabble.com>
> Content-Type: text/plain; charset=us-ascii
>
> [snip]

Statistically speaking, I don't believe in what you want to do, but a solution could be:

    na.fill <- function(x, y){
        i <- is.na(x$data)                  # gaps in the file to be filled
        xx <- y$data                        # predictor: the correlated file
        new <- data.frame(xx=xx)
        # regress x on y, then keep the predictions at the gaps only
        x$data[i] <- predict(lm(x$data~xx, na.action=na.exclude), new)[i]
        x
    }

and in process.all, change function g() to:

    g <- function(station){
        x <- df.list[[station]]
        if(any(is.na(x$data))){
            mat[row(mat) == col(mat)] <- -Inf
            nas <- which(is.na(x$data))
            ord <- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
            for(y in ord){
                # use the first candidate that has data at every remaining gap
                if(all(!is.na(df.list[[y]]$data[nas]))){
                    xx <- df.list[[y]]$data
                    new <- data.frame(xx=xx)
                    x$data[nas] <- predict(lm(x$data~xx, na.action=na.exclude), new)[nas]
                    break
                }
            }
        }
        x
    }

Hope this helps,

Rui Barradas
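
For readers who want to try the suggestion in isolation, here is a minimal sketch on toy data. The data frames `target`, `donor` and `donor.gap` are invented for illustration and are not from the thread; `na.fill()` is copied from the reply above. The sketch also checks that the filled values agree with what one would get by hand from the regression coefficients, which is the procedure described in the question.

```r
# Hypothetical toy data: 'target' has two gaps, 'donor' is complete.
target <- data.frame(data = c(2.1, 3.9, NA, 8.2, NA, 12.1))
donor  <- data.frame(data = c(1, 2, 3, 4, 5, 6))

# na.fill() as proposed in the reply
na.fill <- function(x, y){
    i <- is.na(x$data)
    xx <- y$data
    new <- data.frame(xx = xx)
    x$data[i] <- predict(lm(x$data ~ xx, na.action = na.exclude), new)[i]
    x
}

filled <- na.fill(target, donor)
gaps <- is.na(target$data)

# Same estimates computed by hand: intercept + slope * donor value
fit <- lm(target$data ~ donor$data)
by.hand <- unname(coef(fit)[1] + coef(fit)[2] * donor$data[gaps])
same <- isTRUE(all.equal(unname(filled$data[gaps]), by.hand))

# If the donor is itself missing at a gap, the prediction there is NA,
# so that gap stays unfilled after this pass.
donor.gap <- data.frame(data = c(1, 2, NA, 4, 5, 6))
filled.gap <- na.fill(target, donor.gap)
```

Observed values are left untouched; only the positions that were NA in `target` receive regression estimates. The last case shows why a second pass such as g() is still needed: a gap stays NA when the donor has no value at that position.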