Kenneth Roy Cabrera Torres
2009-Oct-27 12:15 UTC
[R] Stack overflow in R 2.10.0 with sub()
Hi R developers:

Congratulations on the new R 2.10.0 version.

It is a huge effort! Thank you for your work and dedication.

I just want to ask how to make this "strip blank" function work again (it works on R 2.9.2).

    alumnos$AL_NUME_ID<-sub("(^ +)|( +$)","",alumnos$AL_NUME_ID),)

"alumnos" is a database with 900,000 rows and 72 columns, and "alumnos$AL_NUME_ID" is a character variable read from a "mysql" database.

The system shows me this message:

    Error: C produce desborde de pila en 'segfault'

It seems to be a "stack overflow" problem, but it works on R 2.9.2!

Thank you for your help, and again, thank you for your work!!!

Kenneth
On 10/27/2009 8:15 AM, Kenneth Roy Cabrera Torres wrote:
> Hi R developers:
>
> Congratulations on the new R 2.10.0 version.
>
> It is a huge effort! Thank you for your work and dedication.
>
> I just want to ask how to make this "strip blank" function
> work again (it works on R 2.9.2).
>
>     alumnos$AL_NUME_ID<-sub("(^ +)|( +$)","",alumnos$AL_NUME_ID),)
>
> "alumnos" is a database with 900,000 rows and 72 columns,
> and "alumnos$AL_NUME_ID" is a character variable read from
> a "mysql" database.
>
> The system shows me this message:
>
>     Error: C produce desborde de pila en 'segfault'
>
> It seems to be a "stack overflow" problem, but it works on R 2.9.2!
>
> Thank you for your help, and again, thank you for your work!!!

I just tried that (after fixing the typo at the end of the line) and it worked on these vectors:

    x <- c("a", " a", "a ", " a ")
    y <- rep(x, 900000)

So there is something about your dataset that is causing the problem. Can you narrow it down? Here are some tests:

1. Check that it is the value that is causing the problem, not the manner of getting it:

    x <- alumnos$AL_NUME_ID
    y <- sub("(^ +)|( +$)","",x)

2. See if it is in the first half of the data:

    x <- alumnos$AL_NUME_ID
    x <- x[seq_len(length(x)/2)]
    y <- sub("(^ +)|( +$)","",x)

3. See if it is in the second half:

    x <- alumnos$AL_NUME_ID
    x <- x[-seq_len(length(x)/2)]
    y <- sub("(^ +)|( +$)","",x)

If you can narrow it down to a particularly short vector that causes the error, that would be very helpful. It's likely to be somewhat tedious, because I imagine those segfaults will terminate R; I'd suggest using save.image() a lot when things are working, so you can restart after a crash.

Duncan Murdoch
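The halving tests suggested above can be automated. What follows is only a minimal sketch: `bisect_failure` and the `fails` predicate are hypothetical names, not part of R. It assumes the failure can be detected from within the session; a real segfault terminates R, so in that case `fails` would have to run the call in a separate R process (not shown here).

```r
# Hypothetical helper: shrink a failing input by repeated halving.
# `fails` is an assumed user-supplied predicate that returns TRUE when
# the operation of interest fails on the sub-vector it is given.
bisect_failure <- function(x, fails) {
  stopifnot(fails(x))                  # the full input must fail to begin with
  while (length(x) > 1) {
    half   <- seq_len(length(x) %/% 2)
    first  <- x[half]                  # first half, as in test 2
    second <- x[-half]                 # second half, as in test 3
    if (fails(first)) {
      x <- first
    } else if (fails(second)) {
      x <- second
    } else {
      # Neither half fails alone, so the failure depends on the span as a
      # whole (for example its length), not on one particular element.
      break
    }
  }
  x
}

# Usage with a stand-in predicate: "fail" whenever the marker value is present.
demo  <- c(rep("a", 100), "bad", rep("a", 50))
found <- bisect_failure(demo, function(v) any(v == "bad"))
```

If the loop stops with more than one element left, that is itself informative: it points at a size-dependent problem rather than at a single bad value.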
Kenneth Roy Cabrera Torres
2009-Oct-27 14:46 UTC
[R] Stack overflow in R 2.10.0 with sub()
Dr. Murdoch:

I am puzzled!

As you advised, I did this:

    x <- as.character(alumnos$AL_NUME_ID)
    x <- x[-seq_len(length(x)/2)]
    y <- gsub("(^ +)|( +$)","",x)

And it fails.

But, trying to locate the problem, I did:

    x <- as.character(alumnos$AL_NUME_ID)
    x <- x[-seq_len(length(x)/2)]
    x <- x[seq_len(length(x)/2)]
    y <- gsub("(^ +)|( +$)","",x)

This works.

    x <- as.character(alumnos$AL_NUME_ID)
    x <- x[-seq_len(length(x)/2)]
    x <- x[-seq_len(length(x)/2)]
    y <- gsub("(^ +)|( +$)","",x)

This works too.

Now both work!!! So I am puzzled!!! I cannot locate the problem.

Thank you for your advice.

Kenneth
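The message above uses gsub() where the original post used sub(). For this pattern the two are not equivalent: sub() replaces only the first match in each string, so a string with blanks on both sides keeps its trailing blanks. A small illustration on the test vector from earlier in the thread:

```r
# Test vector from the earlier reply: no blanks, leading, trailing, both.
x <- c("a", " a", "a ", " a ")

# sub() replaces only the first match, so " a " loses just its leading blank.
first_only <- sub("(^ +)|( +$)", "", x)   # "a" "a" "a" "a "

# gsub() replaces every match, stripping both ends.
both_ends <- gsub("(^ +)|( +$)", "", x)   # "a" "a" "a" "a"
```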
Kenneth Roy Cabrera Torres
2009-Oct-27 18:16 UTC
[R] Stack overflow in R 2.10.0 with sub()
On Tue, 27-10-2009 at 10:47 -0700, Phil Spector wrote:
> What happens if you type
>
>     Sys.setlocale('LC_ALL','C')
>
> before using gsub or grep?

When I do that, R hangs and doesn't show any message.

> - Phil Spector
>   Statistical Computing Facility
>   Department of Statistics
>   UC Berkeley
>   spector at stat.berkeley.edu
>
> On Tue, 27 Oct 2009, Kenneth Roy Cabrera Torres wrote:
>
> > Thank you very much for your interest.
> >
> > I did this:
> >
> >     x <- as.character(alumnos$AL_NUME_ID)
> >     x <- x[-seq_len(length(x)/2)]
> >     save(x, file="x.RData")
> >
> > I exited R, restarted it, and did this:
> >
> >     load("x.RData")
> >     y <- gsub("(^ +)|( +$)","",x)
> >
> > It shows me:
> >
> >     Error en gsub("(^ +)|( +$)", "", x) :
> >       input string 66644 is invalid in this locale
> >
> > I deleted that string (it is a string with a non usual character (?)).
> >
> > So I retyped the call without that observation:
> >
> >     y <- gsub("(^ +)|( +$)","",x[-c(66644)])
> >
> > I got this:
> >
> >     Error en gsub("(^ +)|( +$)", "", x[-c(66644)]) :
> >       input string 160689 is invalid in this locale
> >
> > I retyped it again, excluding this invalid string as well (I use
> > position 160690 because removing element 66644 shifts the indices
> > of the x vector by one):
> >
> >     y <- gsub("(^ +)|( +$)","",x[-c(66644,160690)])
> >     Error: C produce desborde de pila en 'segfault'
> >
> > And it fails.
> >
> > I also repeated the whole process with this conversion first:
> >
> >     x <- iconv(as.character(alumnos$AL_NUME_ID),"latin1","UTF-8")
> >     x <- x[-seq_len(length(x)/2)]
> >     save(x, file="x.RData")
> >
> > I exited, restarted R, and then typed:
> >
> >     load("x.RData")
> >     y <- gsub("(^ +)|( +$)","",x)
> >
> > And it fails again, this time without showing the "invalid string" errors.
> >
> > I then did this:
> >
> >     load("x.RData")
> >     y <- gsub("(^ +)|( +$)","",x[1:160690])
> >
> > and it works. Then I typed
> >
> >     y <- gsub("(^ +)|( +$)","",x[1:200000]) #(x length is 454035)
> >
> > and it works...
> >
> > But when I started a manual binary search,
> > I found something that still puzzles me.
> >
> >     y <- gsub("(^ +)|( +$)","",x[1:261570])
> >
> > works, but sometimes fails (after I restart R);
> > it always fails with an index greater than 262000.
> >
> > I see nothing unusual around 261570:
> >
> >     x[261560:261580]
> >      [1] "21444777 "   "1147585 "    "255202522 "
> >      [4] "25852100 "   "24258550 "   "A8D0251207 "
> >      [7] "34681811 "   "19121345 "   "16921329 "
> >     [10] "20442195 "   "14506482 "   "44332211 "
> >     [13] "35049122 "   "34326340 "   "35182366 "
> >     [16] "33288742 "   "34958795 "   "1017147202 "
> >     [19] "3306985 "    "33048501 "   "33295073 "
> >
> > I am sending you the x.RData file to see if you can
> > reproduce my problem.
> >
> > This information may be useful:
> >
> >     sessionInfo()
> >
> >     R version 2.10.0 (2009-10-26)
> >     x86_64-unknown-linux-gnu
> >
> >     locale:
> >      [1] LC_CTYPE=es_CO.UTF-8       LC_NUMERIC=C
> >      [3] LC_TIME=es_CO.UTF-8        LC_COLLATE=es_CO.UTF-8
> >      [5] LC_MONETARY=C              LC_MESSAGES=es_CO.UTF-8
> >      [7] LC_PAPER=es_CO.UTF-8       LC_NAME=C
> >      [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >     [11] LC_MEASUREMENT=es_CO.UTF-8 LC_IDENTIFICATION=C
> >
> >     attached base packages:
> >     [1] stats graphics grDevices utils datasets methods base
> >
> >     R.Version()
> >
> >     $platform
> >     [1] "x86_64-unknown-linux-gnu"
> >     $arch
> >     [1] "x86_64"
> >     $os
> >     [1] "linux-gnu"
> >     $system
> >     [1] "x86_64, linux-gnu"
> >     $status
> >     [1] ""
> >     $major
> >     [1] "2"
> >     $minor
> >     [1] "10.0"
> >     $year
> >     [1] "2009"
> >     $month
> >     [1] "10"
> >     $day
> >     [1] "26"
> >     $`svn rev`
> >     [1] "50208"
> >     $language
> >     [1] "R"
> >     $version.string
> >     [1] "R version 2.10.0 (2009-10-26)"
> >
> > gcc --version and g++ --version show me:
> >
> >     gcc (Ubuntu 4.3.3-5ubuntu4) 4.3.3
> >     Copyright (C) 2008 Free Software Foundation, Inc.
> >     Esto es software libre; vea el código para las condiciones de copia. NO hay
> >     garantía; ni siquiera para MERCANTIBILIDAD o IDONEIDAD PARA UN PROPÓSITO EN
> >     PARTICULAR
> >
> > When I compiled R, I used this configuration option (nothing more):
> >
> >     ./configure --enable-R-shlib
> >     make
> >     sudo make install
> >
> > At the moment I have a 22 GB swap partition (monitoring shows that
> > the system is not using it) and 4 GB of RAM.
> >
> > Again, thank you very much for your help.
> >
> > Kenneth
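The quoted message finds the invalid strings one error at a time. As a hedged sketch, they can be located in a single pass by round-tripping through iconv(), which returns NA for elements it cannot convert. The data below is made up for illustration, and the assumption that the source encoding is latin1 comes from the iconv() call in the message, not from anything confirmed about the database. This does not address the stack overflow itself; it covers only the "invalid in this locale" errors.

```r
# Illustrative data only: the second element contains a lone 0xE9 byte,
# which is a valid latin1 character but not valid UTF-8.
x <- c("123 ", rawToChar(as.raw(c(0x61, 0xe9, 0x20))), " 456")

# Converting UTF-8 to UTF-8 yields NA for strings that are not valid UTF-8,
# so this reports the positions of all invalid strings at once.
bad <- which(is.na(iconv(x, from = "UTF-8", to = "UTF-8")))

# Option 1: declare the real source encoding (assumed here to be latin1)
# and convert before stripping blanks.
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")
y1 <- gsub("(^ +)|( +$)", "", x_utf8)

# Option 2: match byte by byte; only ASCII blanks are being removed,
# so the character encoding does not matter for this pattern.
y2 <- gsub("(^ +)|( +$)", "", x, useBytes = TRUE)
```

Option 2 leaves the invalid bytes in place, so it suits cases where the values are only compared or stored and never displayed.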