I am hoping someone can help me with a bus stop sequencing problem in R, where I need to match counts of people getting on and off a bus to the correct stop in the bus route stop sequence. I have tried looking online/forums for sequence matching but seems to refer to numeric sequences or DNA matching and over my head. I am after a simple example if anyone can please help. I have two data series as per below (from database), that I want to combine. In this example ?stop_sequence? includes the equence (seq) of bus stops and ?stop_onoff? is a count of people getting on and off at certain stops (there is no entry if noone gets on or off). stop_sequence <- data.frame(seq=c(10,20,30,40,50,60), ref=c('A','B','C','D','B','A')) ## seq ref ## 1 10 A ## 2 20 B ## 3 30 C ## 4 40 D ## 5 50 B ## 6 60 A stop_onoff <- data.frame(ref=c('A','D','B','A'),on=c(5,0,10,0),off=c(0,2,2,6)) ## ref on off ## 1 A 5 0 ## 2 D 0 2 ## 3 B 10 2 ## 4 A 0 6 I need to match the stop_onoff numbers in the right sto sequence, with the correctly matched output as follows (load is a cumulative count of on and off) desired_output <- data.frame(seq=c(10,20,30,40,50,60), ref=c('A','B','C','D','B','A'), on=c(5,'-','-',0,10,0),off=c(0,'-','-',2,2,6), load=c(5,0,0,3,11,5)) ## seq ref on off load ## 1 10 A 5 0 5 ## 2 20 B - - 0 ## 3 30 C - - 0 ## 4 40 D 0 2 3 ## 5 50 B 10 2 11 ## 6 60 A 0 6 5 In this example the stop ?B? is matched to the second stop ?B? in the stop sequence and not the first because the onoff data is after stop ?D?. Any guidance much appreciated. Regards Adam [[alternative HTML version deleted]]
Homework? The list has a no homework policy - but perhaps I'll be forgiven por posting hints. In general terms, this is how I appraoched the problem: * Loop through the rows of stop_onoff - for (idx in ...someething...) {... * For each row, find the first of "ref" in a suitably filtered subset of stop_sequence, and keep track of these row numbers * Update columns "on" and "off" * Use cumsum to calculate the number of passengers on the bus Note the loop. Someone cleverer than I might be able to vectorise that step, but I couldn't see how. By the way, if this is homework... Are you sure you're desired_output is correct? I would expect someething like seq ref on off load 1 10 A 5 0 5 2 20 B 0 0 5 3 30 C 0 0 5 4 40 D 0 2 3 5 50 B 10 2 11 6 60 A 0 6 5 Are you aware that you're "ref" ccolumns are factors and not characters? If you use "stringsAsFactors = FALSE" or stop_onoff <- data.frame(ref=factor(c('A','D','B','A'), levels levels(stop_sequence$ref)),on=c(5,0,10,0),off=c(0,2,2,6)) it will simplify your'e analysis (or at least reduce some typing). Type the following in an R console ?data.frame ?factor and have a read. Now, if this ain't homework, or you just want someone to do it for you, e-mail me offline and I'll send you my appraoch. If it is homework, let me know - I'm happy to help anyway, but I will be trying to help you solve this for yourself. Cheers, DMcP On Sat, 30 Aug 2014 12:46:17 +1200 Adam Lawrence <alaw005 at gmail.com> wrote> I am hoping someone can help me with a bus stop sequencing problem in R, > where I need to match counts of people getting on and off a bus to the > correct stop in the bus route stop sequence. I have tried looking > online/forums for sequence matching but seems to refer to numeric sequences > or DNA matching and over my head. I am after a simple example if anyone can > please help. > > I have two data series as per below (from database), that I want to > combine. In this example ?stop_sequence? includes the equence (seq) ofbus> stops and ?stop_onoff? is a count of people getting on and off atcertain> stops (there is no entry if noone gets on or off). > > stop_sequence <- data.frame(seq=c(10,20,30,40,50,60), > ref=c('A','B','C','D','B','A')) > ## seq ref > ## 1 10 A > ## 2 20 B > ## 3 30 C > ## 4 40 D > ## 5 50 B > ## 6 60 A > stop_onoff <- > data.frame(ref=c('A','D','B','A'),on=c(5,0,10,0),off=c(0,2,2,6)) > ## ref on off > ## 1 A 5 0 > ## 2 D 0 2 > ## 3 B 10 2 > ## 4 A 0 6 > > I need to match the stop_onoff numbers in the right sto sequence, with the > correctly matched output as follows (load is a cumulative count of on and > off) > > desired_output <- data.frame(seq=c(10,20,30,40,50,60), > ref=c('A','B','C','D','B','A'), > on=c(5,'-','-',0,10,0),off=c(0,'-','-',2,2,6), load=c(5,0,0,3,11,5)) > ## seq ref on off load > ## 1 10 A 5 0 5 > ## 2 20 B - - 0 > ## 3 30 C - - 0 > ## 4 40 D 0 2 3 > ## 5 50 B 10 2 11 > ## 6 60 A 0 6 5 > > In this example the stop ?B? is matched to the second stop ?B? inthe stop> sequence and not the first because the onoff data is after stop ?D?. > > Any guidance much appreciated. > > Regards > Adam > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >____________________________________________________________ South Africas premier free email service - www.webmail.co.za Cheapest Insurance Quotes! https://www.outsurance.co.za/insurance-quote/personal/?source=msn&cr=Postit14_468x60_gif&cid=322
Try dtw. First convert ref to numeric since dtw does not handle character input. Then align using dtw and NA out repeated values in the alignment. Finally zap ugly row names and calculate loading: library(dtw) s1 <- as.numeric(stop_sequence$ref) s2 <- as.numeric(factor(as.character(stop_onoff$ref), levels(stop_sequence$ref))) a <- dtw(s1, s2) DF <- cbind(stop_sequence, stop_onoff[replace(a$index2, c(FALSE, diff(a$index2) == 0), NA), ])[-3] rownames(DF) <- NULL transform(DF, loading = cumsum(ifelse(is.na(on), 0, on)) - cumsum(ifelse(is.na(off), 0, off))) giving: seq ref on off loading 1 10 A 5 0 5 2 20 B NA NA 5 3 30 C NA NA 5 4 40 D 0 2 3 5 50 B 10 2 11 6 60 A 0 6 5 You will need to test this with more data and tweak it if necessary via the various dtw arguments. On Fri, Aug 29, 2014 at 8:46 PM, Adam Lawrence <alaw005 at gmail.com> wrote:> I am hoping someone can help me with a bus stop sequencing problem in R, > where I need to match counts of people getting on and off a bus to the > correct stop in the bus route stop sequence. I have tried looking > online/forums for sequence matching but seems to refer to numeric sequences > or DNA matching and over my head. I am after a simple example if anyone can > please help. > > I have two data series as per below (from database), that I want to > combine. In this example ?stop_sequence? includes the equence (seq) of bus > stops and ?stop_onoff? is a count of people getting on and off at certain > stops (there is no entry if noone gets on or off). > > stop_sequence <- data.frame(seq=c(10,20,30,40,50,60), > ref=c('A','B','C','D','B','A')) > ## seq ref > ## 1 10 A > ## 2 20 B > ## 3 30 C > ## 4 40 D > ## 5 50 B > ## 6 60 A > stop_onoff <- > data.frame(ref=c('A','D','B','A'),on=c(5,0,10,0),off=c(0,2,2,6)) > ## ref on off > ## 1 A 5 0 > ## 2 D 0 2 > ## 3 B 10 2 > ## 4 A 0 6 > > I need to match the stop_onoff numbers in the right sto sequence, with the > correctly matched output as follows (load is a cumulative count of on and > off) > > desired_output <- data.frame(seq=c(10,20,30,40,50,60), > ref=c('A','B','C','D','B','A'), > on=c(5,'-','-',0,10,0),off=c(0,'-','-',2,2,6), load=c(5,0,0,3,11,5)) > ## seq ref on off load > ## 1 10 A 5 0 5 > ## 2 20 B - - 0 > ## 3 30 C - - 0 > ## 4 40 D 0 2 3 > ## 5 50 B 10 2 11 > ## 6 60 A 0 6 5 > > In this example the stop ?B? is matched to the second stop ?B? in the stop > sequence and not the first because the onoff data is after stop ?D?. > > Any guidance much appreciated. > > Regards > Adam > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
Adam Lawrence <alaw005 <at> gmail.com> writes:> > I am hoping someone can help me with a bus stop sequencing problem in R, > where I need to match counts of people getting on and off a bus to the > correct stop in the bus route stop sequence. I have tried looking > online/forums for sequence matching but seems to refer to numeric > sequences > or DNA matching and over my head. I am after a simple example if anyone > can > please help. >Adam, Yet another way... See inline code. BTW, you should have mentioned that you are a transit planner or included a signature block so folks would know this is not a homework question. As others have noted/hinted, there are some unstated assumptions, so you need to try some test cases to be sure any solution always works. You only have one outbound/inbound cycle in stop_onoff, right?? If not, I think almost any approach can fail given the right sequence of 'seq's.> I have two data series as per below (from database), that I want to > combine. In this example ?stop_sequence? includes the equence (seq) of bus > stops and ?stop_onoff? is a count of people getting on and off at certain > stops (there is no entry if noone gets on or off). > > stop_sequence <- data.frame(seq=c(10,20,30,40,50,60), > ref=c('A','B','C','D','B','A')) > ## seq ref > ## 1 10 A > ## 2 20 B > ## 3 30 C > ## 4 40 D > ## 5 50 B > ## 6 60 A > stop_onoff <- > data.frame(ref=c('A','D','B','A'),on=c(5,0,10,0),off=c(0,2,2,6)) > ## ref on off > ## 1 A 5 0 > ## 2 D 0 2 > ## 3 B 10 2 > ## 4 A 0 6 > > I need to match the stop_onoff numbers in the right sto sequence, with the > correctly matched output as follows (load is a cumulative count of on and > off) > > desired_output <- data.frame(seq=c(10,20,30,40,50,60), > ref=c('A','B','C','D','B','A'), > on=c(5,'-','-',0,10,0),off=c(0,'-','-',2,2,6), load=c(5,0,0,3,11,5)) > ## seq ref on off load > ## 1 10 A 5 0 5 > ## 2 20 B - - 0 > ## 3 30 C - - 0 > ## 4 40 D 0 2 3 > ## 5 50 B 10 2 11 > ## 6 60 A 0 6 5 >Start here:> stop_onoff$load <- with(stop_onoff,cumsum(on)-cumsum(off)) > split.ref <- with(stop_sequence,split(seq,ref)) > split.ref.onoff <- split.ref[as.character(stop_onoff$ref)] > stop.mat <- sapply(split.ref.onoff,rep,length=2) > inout <- cbind(stop.mat,c(0,Inf))>cbind(c(0,Inf),stop.mat) > stop_onoff$seq <- head(stop.mat[inout],-1) > merge(stop_sequence[c("ref","seq")],stop_onoff[-1],by="seq",all.x=T)seq ref on off load 1 10 A 5 0 5 2 20 B NA NA NA 3 30 C NA NA NA 4 40 D 0 2 3 5 50 B 10 2 11 6 60 A 0 6 5 You can take care of turning the NA's to zeroes or '-'s, I think. HTH, Chuck