Hello everyone, I am learning R since recently, and as a small exercise I wanted to write a recursive mergesort. I was extremely surprised to discover that my sorting, although operational, is deeply inefficient in time. Here is my code :> merge <- function(x,y){ > if (is.na(x[1])) return(y) > else if (is.na(y[1])) return(x) > else if (x[1]<y[1]) return(c(x[1],merge(x[-1],y))) > else return(c(y[1],merge(x,y[-1]))) > } > > division <- function(x){ > if (is.na(x[3])) return(cbind(x[1],x[2])) > else > return(cbind(c(x[1],division(x[-c(1,2)])[,1]),c(x[2],division(x[-c(1,2)])[,2]))) > } > > mergesort <- function(x){ > if (is.na(x[2])) return(x) > else{ > print(x) > t=division(x) > return(merge(mergesort(t[,1]),mergesort(t[,2]))) > } > }I tried my best to write it "the R-way", but apparently I failed. I suppose some of the functions I used are quite heavy. I would be grateful if you could give a hint on how to change that! I hope I made myself clear and wish you a nice day, Cheers, Gaston [[alternative HTML version deleted]]
On 19/04/2016 3:39 PM, Gaston wrote:> Hello everyone, > > I am learning R since recently, and as a small exercise I wanted to > write a recursive mergesort. I was extremely surprised to discover that > my sorting, although operational, is deeply inefficient in time. Here is > my code : > >> merge <- function(x,y){ >> if (is.na(x[1])) return(y) >> else if (is.na(y[1])) return(x) >> else if (x[1]<y[1]) return(c(x[1],merge(x[-1],y))) >> else return(c(y[1],merge(x,y[-1]))) >> } >> >> division <- function(x){ >> if (is.na(x[3])) return(cbind(x[1],x[2])) >> else >> return(cbind(c(x[1],division(x[-c(1,2)])[,1]),c(x[2],division(x[-c(1,2)])[,2]))) >> } >> >> mergesort <- function(x){ >> if (is.na(x[2])) return(x) >> else{ >> print(x) >> t=division(x) >> return(merge(mergesort(t[,1]),mergesort(t[,2]))) >> } >> } > > I tried my best to write it "the R-way", but apparently I failed. I > suppose some of the functions I used are quite heavy. I would be > grateful if you could give a hint on how to change that! > > I hope I made myself clear and wish you a nice day,Your use of is.na() looks strange. I don't understand why you are testing element 2 in mergesort(), and element 1 in merge(), and element 3 in division. Are you using it to test the length? It's better to use the length() function for that. The division() function returns a matrix. It would make more R-sense to return a list containing the two parts, because they might not be the same length. Generally speaking, function calls are expensive in R, so the recursive merge you're using looks like it would be the bottleneck. You'd almost certainly be better off to allocate something of length(x) + length(y), and do the assignments in a loop. Here's a merge sort I wrote as an illustration in a class. It's designed for clarity rather than speed, but I'd guess it would be faster than yours: mergesort <- function(x) { n <- length(x) if (n < 2) return(x) # split x into two pieces of approximately equal size, x1 and x2 x1 <- x[1:(n %/% 2)] x2 <- x[(n %/% 2 + 1):n] # sort each of the pieces x1 <- mergesort(x1) x2 <- mergesort(x2) # merge them back together result <- c() i <- 0 while (length(x1) > 0 && length(x2) > 0) { # compare the first values if (x1[1] < x2[1]) { result[i + 1] <- x1[1] x1 <- x1[-1] } else { result[i + 1] <- x2[1] x2 <- x2[-1] } i <- i + 1 } # put the smaller one into the result # delete it from whichever vector it came from # repeat until one of x1 or x2 is empty # copy both vectors (one is empty!) onto the end of the results result <- c(result, x1, x2) result } If I were going for speed, I wouldn't modify the x1 and x2 vectors, and I'd pre-allocate result to the appropriate length, rather than growing it in the while loop. But that was a different class! Duncan Murdoch
I indeed used is.na() to check length, as I was not sure weather lenght() was a simple query or would go through the whole vector to count the elements. So to sum up, function calls are expensive, therefore recursion should be avoided, and growing the size of a vector (which is probably reassigning and copying?) is also expensive. Thank you for your help! On 04/19/2016 11:51 PM, Duncan Murdoch wrote:> On 19/04/2016 3:39 PM, Gaston wrote: >> Hello everyone, >> >> I am learning R since recently, and as a small exercise I wanted to >> write a recursive mergesort. I was extremely surprised to discover that >> my sorting, although operational, is deeply inefficient in time. Here is >> my code : >> >>> merge <- function(x,y){ >>> if (is.na(x[1])) return(y) >>> else if (is.na(y[1])) return(x) >>> else if (x[1]<y[1]) return(c(x[1],merge(x[-1],y))) >>> else return(c(y[1],merge(x,y[-1]))) >>> } >>> >>> division <- function(x){ >>> if (is.na(x[3])) return(cbind(x[1],x[2])) >>> else >>> return(cbind(c(x[1],division(x[-c(1,2)])[,1]),c(x[2],division(x[-c(1,2)])[,2]))) >>> >>> } >>> >>> mergesort <- function(x){ >>> if (is.na(x[2])) return(x) >>> else{ >>> print(x) >>> t=division(x) >>> return(merge(mergesort(t[,1]),mergesort(t[,2]))) >>> } >>> } >> >> I tried my best to write it "the R-way", but apparently I failed. I >> suppose some of the functions I used are quite heavy. I would be >> grateful if you could give a hint on how to change that! >> >> I hope I made myself clear and wish you a nice day, > > Your use of is.na() looks strange. I don't understand why you are > testing element 2 in mergesort(), and element 1 in merge(), and > element 3 in division. Are you using it to test the length? It's > better to use the length() function for that. > > The division() function returns a matrix. It would make more R-sense > to return a list containing the two parts, because they might not be > the same length. > > Generally speaking, function calls are expensive in R, so the > recursive merge you're using looks like it would be the bottleneck. > You'd almost certainly be better off to allocate something of > length(x) + length(y), and do the assignments in a loop. > > Here's a merge sort I wrote as an illustration in a class. It's > designed for clarity rather than speed, but I'd guess it would be > faster than yours: > > mergesort <- function(x) { > > n <- length(x) > if (n < 2) return(x) > > # split x into two pieces of approximately equal size, x1 and x2 > > x1 <- x[1:(n %/% 2)] > x2 <- x[(n %/% 2 + 1):n] > > # sort each of the pieces > x1 <- mergesort(x1) > x2 <- mergesort(x2) > > # merge them back together > result <- c() > i <- 0 > while (length(x1) > 0 && length(x2) > 0) { > # compare the first values > if (x1[1] < x2[1]) { > result[i + 1] <- x1[1] > x1 <- x1[-1] > } else { > result[i + 1] <- x2[1] > x2 <- x2[-1] > } > i <- i + 1 > } > > # put the smaller one into the result > # delete it from whichever vector it came from > # repeat until one of x1 or x2 is empty > # copy both vectors (one is empty!) onto the end of the results > result <- c(result, x1, x2) > result > } > > If I were going for speed, I wouldn't modify the x1 and x2 vectors, > and I'd pre-allocate result to the appropriate length, rather than > growing it in the while loop. But that was a different class! > > Duncan Murdoch