Hi there, I'm having myself a hard time writing an algorithm for finding patterns within a given melody. In a vector I'd like to find ALL sequences that occur at least twice, without having to check all possible patterns via pattern matching. I finally found a solution in a style that I'm used from C, i.e. calling one function from within another. GNU R doesn't seem to like that, it complains about too deep nesting and infinite recursion (I can't see that...). I've tried options(expressions=500000), but even then the variable a doesn't make it over 2 (my vectors have about 100-300 elements). I'm not a software engineer, so I guess that algo is rather badly designed. I'd appreciate any help on how to make it suitable for R, or about alternative approaches (I guess something like this must be used in bioinformatics, but I didn't find it implemented in R. Any hints are welcome ;-) ) Cheers and thanx in advance, John CODECODCODECODECODECODECODECODECODECODECODECODECODECODECODECODE antiphonar <- function(v) { a <- 1; b <- 2; n <- length(v); alessn(a, b, n, v, x); } alessn <- function(a, b, n, v, x) { if(a<n) {vavb(a, b, n, v, x);} else{print("That's all, folks ;-)");} } vavb <- function(a, b, n, v, x) { if(v[a]==v[b]) { x <- 1; while( v[a+x] == v[b+x] && b+x<n) {x <- x+1;} m <- 0; for(k in 0:(x-1)) {m <- v[a+k]*10^(x-1-k)+m;} p <- c(x, a, b, m); print(p); baba(a, b, n, v, x); } else baba(a, b, n, v, x); } baba <- function(a, b, n, v, x) { b <- b+1; if(b<=n) {vavb(a, b, n, v, x);} else { a <- a+1; b <- a+1; alessn(a, b, n, v, x); } } ENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDEND
Thanks for your response! Am Montag, den 17.07.2006, 17:36 -0400 schrieb jim holtman:> It would help if you could provide the calling script and the data > that you are using.The code I sent is included in .Rprofile. One of my data vectors would be v = c(1, 1, 1, 5, 6, 1, 1, 1, 1, 6, 1, 1, 6, 6, 6, 6, 2, 6, 6, 1, 5, 5, 2, 1, 1, 1, 5, 6, 1, 2, 1, 1, 6, 5, 1, 1, 6, 6, 6, 6, 1, 6, 1, 2, 6, 6, 1, 5, 1, 2, 1, 6, 1, 1, 5, 6, 6, 1, 1, 6, 5, 1, 6, 6, 1, 5, 4, 1, 6, 6, 4, 6, 5, 1, 5, 4, 6, 5, 1, 5, 6, 1, 1, 1, 5, 6, 1, 1, 6) for example. After defining it, I call antiphonar(v). This returns 53 entries, but as I said, this is quite too few, as the program breaks before a=3.> It sounds like your algorithm is not correct because it is recursing > so deep. You have recursive calls, so you are not terminating your > search correctly.Wouldn't it terminate on some call of alessn for a>=n?> Your functions are not returning values.Should they? They just perform tests on v. As I said, I'm not an expert programmer. I think in C++ there was something like a void type for functions, i.e. functions that only perform operations on variables without having a value for themselves.> Put some debug print statements to see what is happening in your code. > debug(antiphonar) > antiphonar(v)debugging in: antiphonar(v) debug: { a <- 1 b <- 2 n <- length(v) alessn(a, b, n, v, x) } Browse[1]> n debug: a <- 1 Browse[1]> n debug: b <- 2 Browse[1]> n debug: n <- length(v) Browse[1]> n debug: alessn(a, b, n, v, x) Browse[1]> n Fehler: Auswertung zu tief verschachtelt: unendliche Rekursion / options(expressions=)? Not very helpful :-( Cheers, John
Am Montag, den 17.07.2006, 19:42 -0400 schrieb jim holtman:> You were down at least 5300 levels in subroutine calls. Here is the > first couple of lines from 'traceback()': > > 5363: vavb(a, b, n, v, x)...> 5316: vavb(a, b, n, v, x)> The other thing is that there is no 'x' defined in the function; > > antiphonar <- function(v) > { > a <- 1; > b <- 2; > n <- length(v); > alessn(a, b, n, v, x); > } > > and this is the initial call to the rest of the functions.Ooops... normally I don't program THAT bad ... ;-)> What is the problem you are trying to solve?I'm analyzing a 16th century music manuscript. There is no rhythm notated there, but I can to a certain degree restore it by finding self-similarities in the melodic structure. The vector v is an ordered list of intervals. I could find them by hand, but this takes time and things are easily missed. Besides, for a 200 pages manuscript coding the vectors takes long, but finding similarities by hand takes endless... :-( The problem remains. I'm not sure I did completely understand the way R works. My plan was to jump out of one function to another. I think what R does is assigning a function's return value to an internal variable and tries to evaluate another function WITHIN a calling function, which of course leads to deep nesting and a bunch of internal variables. Too bad there is no "goto" statement in R... Isn't there a way to make R "forget" that it just evaluated e.g. alessn the moment it calles vavb? That algorithm itself isn't that hard. In the end it's just a decision diagram of the form "if THIS, goto THIS knot, otherwise goto THAT knot". Each knot has only one arrow pointing away from it, but sometimes more than one pointing toward it. I could send the diagram as an attachement, if you like. Cheers, John P.S.: Please keep messages on the list. Thank you!
> I was checking out your function and thinking of trying to make it more > efficient when I noticed your comment above.:-D> Yes, bioinformatics is full > of this stuff - have a look at Bioconductor.I'm about it, but it'll take some time. It's a vast repository... For the meantime, I've attached a graph for clarification ;-) Dashed lines are followed if the test returns FALSE, solid if TRUE. Cheers, John -------------- next part -------------- A non-text attachment was scrubbed... Name: diagramm.ps Type: application/postscript Size: 13108 bytes Desc: not available Url : https://stat.ethz.ch/pipermail/r-help/attachments/20060718/d619ac3d/attachment.ps
Please look at http://www.turbulence.org/Works/song/ This is a website by Martin Wattenberg that visually displays the types of patterns you are looking for. He gave a paper at the Joint Statistics Meetings in Minneapolis 2005.
On Mon, 17 Jul 2006, John Wiedenhoeft wrote:> Hi there, > > I'm having myself a hard time writing an algorithm for finding patterns > within a given melody. In a vector I'd like to find ALL sequences that > occur at least twice, without having to check all possible patterns via > pattern matching. >Another approach, which works for not-too-long vectors like you have is: n <- length(v) matches <- outer(v, v, "==") & outer(1:n,1:n,">=") Now matches has TRUE where v[i]==v[j]. For a longer match you would also need v[i+1]==v[j+1] and so on, making a diagonal line through the matrix. Diagonal lines are hard, so let's turn them into horizontal lines matches <- matrix(cbind(matches, FALSE), ncol=n) now row i+1 column j of matches is TRUE for a single entry match starting at position j at a separation of i. If there is a match of length 2, then column j+1 will also be TRUE, and so on. Now rle() applied to a row will return the lengths of consecutive sequences of TRUE and FALSE. The lengths of consecutive sequences of TRUE are the lengths of the matches. To get rid of trivial matches of length less than 2 do match2 <- t(apply(matches,1,function(row){ r<-rle(row) r$values[r$lengths<2]<-FALSE inverse.rle(r) })) And finally, to extract the matches results <- apply(match2, 1, function(row){ r<-rle(row) n<-length(r$lengths) ends<-cumsum(r$lengths) starts<-cumsum(c(1,r$lengths))[1:n] list(starts[r$values],ends[r$values]) }) for starts and ends of matches or results <- apply(match2, 1, function(row){ r<-rle(row) n<-length(r$lengths) ends<-cumsum(r$lengths)[r$values] starts<-cumsum(c(1,r$lengths))[1:n][r$values] mapply(function(stt,end) v[stt:end],starts,ends, SIMPLIFY=FALSE) }) to get a list of the actual matching sequences. -thomas
Am Dienstag, den 18.07.2006, 22:09 -0400 schrieb Jim Lemon:> Hi John, > > Minor bug - I zeroed the hit counter in the wrong place. > > find.replay<-function(tunestring,maxlen) {....> return(matchlist) > }Dear Jim, many, many thanks for your effords :-D!!! Your program is great and very elegant (guess I was thinking to mathematical and iterative...). I have made some minor adjustments to it. Most important, I have changed starttest to startpat+1, since patterns can overlap of course (1111 consists of 2 times 111). Hope this doesn't cause any inconsistency. Here is what I'm using now: CODECODECODECODECODECODECODECODECODECODECODECODECODECODECODE find.replay<-function(tunestring,filename,maxlen) { tunelen<-length(tunestring) if(missing(maxlen)) maxlen<-floor(tunelen/2) if(missing(filename)) filename<-"output" matchlist <- list() startpat<-1 endpat<-2 finishpos<-tunelen-maxlen pattern<-tunestring[startpat:endpat] patlen<-length(pattern) while(patlen <= maxlen) { while(endpat < tunelen-patlen) { starttest<-startpat+1 #changed from endpat+1 to detect overlapping patterns endtest<-starttest+(endpat-startpat) # step through the rest of tunestring with this pattern while(endtest <= tunelen) { testpat<-tunestring[starttest:endtest] if(identical(pattern,testpat)) { m <- 0; w <- 0; for(k in 1:patlen) {m <- pattern[k]*10^(patlen-k)+m;} for(l in 1:patlen) {w <- testpat[l]*10^(patlen-l)+w;} #just in case... ;-) if (m!=w) { warn <- paste("Unmatching patterns", m, "and", w, "detected errorneously!") print(warn) write.table(warn, file=filename, append=TRUE, sep="\t", eol="\n", row.names=FALSE, col.names=FALSE); } p <- c(patlen, m, startpat, starttest); write.table(t(p), file=filename, append=TRUE, sep="\t", eol="\n", row.names=FALSE, col.names=FALSE); # print(p); } # step to the next substring starttest<-starttest+1 endtest<-endtest+1 } # now move pattern one step along the string startpat<-startpat+1 endpat<-endpat+1 pattern<-tunestring[startpat:endpat] } # now increase the length of the pattern startpat<- 1 endpat<-startpat+patlen pattern<-tunestring[startpat:endpat] patlen<-length(pattern) } print(paste("Output written to file '", filename, "'. Have fun!", sep="")) } ENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDEND Again, many, many thanks. You brightened up my day! Cheers, John