Janko Thyson
2011-Jan-26 19:34 UTC
[Rd] Error handling with frozen RCurl function calls + Identification of frozen R processes
Dear list,

I'm tackling an empirical research problem that requires me to address a whole bunch of conceptual and technical details at the same time, which leaves little room for the nitty-gritty of the individual "components". In particular, I currently lack the time to dive deeply into parallel computing and HTTP requests via RCurl, so I hope you can help me out with one or two imminent issues in my crawler/scraper.

Once a day I run 'RCurl::getURIAsynchronous(x=URL.frontier.sub, multiHandle=my.multi.handle)' inside an lapply() construct in order to read chunks of deterministically composed URLs from a host. Courtesy delays are implemented between the individual HTTP requests (five times the duration of the last request to this host) so that I don't clog the host; all in all I cause about 15 minutes of traffic per day.

The problem is that getURIAsynchronous() sometimes simply freezes, and I have no clue why. I can't reproduce the error either, as it occurs completely erratically. Wrapping the call in try() or tryCatch() was of no avail. I have also experimented with a couple of curl timeout options, but honestly did not fully understand their implications, and none of them have worked so far. It simply seems that upon an error, getURIAsynchronous() never hands control back to the R process.

Additionally, for lack of deeper knowledge of parallel computing, the program is scripted to run a bunch of R processes independently. "Communication" between them takes place via variables they read from and write to disk, in order to have some sort of "shared environment" (horrible, I know ;-)).

So here are my specific questions:

1) Is it possible to catch connection or timeout errors in RCurl functions so that I can implement my own custom error handling? If so, could you guide me to some examples, please?

2) Can I somehow identify "frozen" Rterm or Rscript processes (e.g. via Sys.getpid()) in order to shut them down and reinitialize them?

You'll find my session info below, followed by two sketches that make the questions more concrete. Thanks for any hints or advice!

Janko

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] tcltk     tools     stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] RCurl_1.5-0.1    bitops_1.0-4.1   XML_3.2-0.2      RMySQL_0.7-5
 [5] filehash_2.1-1   hash_2.0.1       timeDate_2130.91 RODBC_1.3-2
 [9] MiscPsycho_1.6   statmod_1.4.8    debug_1.2.4      mvbutils_2.5.4
[13] DBI_0.2-5        cwhmisc_2.1      lattice_0.19-13

loaded via a namespace (and not attached):
[1] grid_2.12.1
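P.S. 1: To make question 1) concrete, here is a stripped-down sketch of my loop together with the kind of error handling I have been trying. 'url.chunks' is a placeholder for my real URL frontier; the timeout option names follow libcurl's CURLOPT_CONNECTTIMEOUT / CURLOPT_TIMEOUT, and I am not certain how they interact with the multi interface, which is part of what I am asking.

library(RCurl)

## Timeouts in seconds; whether these are honoured by the multi
## interface is exactly what I am unsure about.
opts <- curlOptions(connecttimeout = 10, timeout = 60)

results <- lapply(url.chunks, function(chunk) {
  t0  <- proc.time()["elapsed"]
  res <- tryCatch(
    getURIAsynchronous(chunk, .opts = opts,
                       multiHandle = getCurlMultiHandle()),
    error = function(e) {
      ## This is where my custom error handling should kick in,
      ## but sometimes the call above simply never returns.
      message("request failed: ", conditionMessage(e))
      NA
    }
  )
  ## Courtesy delay: five times the duration of the last request.
  Sys.sleep(5 * (proc.time()["elapsed"] - t0))
  res
})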
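P.S. 2: For question 2), this is the kind of heartbeat/watchdog scheme I have in mind (an untested sketch; the directory and file names are made up, 'worker.R' is a placeholder, and 'taskkill' is the Windows equivalent of 'kill -9'):

## Worker side: touch a heartbeat file keyed by Sys.getpid()
## after every processed chunk.
beat <- function(dir = "heartbeats") {
  dir.create(dir, showWarnings = FALSE)
  writeLines(format(Sys.time()),
             file.path(dir, paste("pid_", Sys.getpid(), sep = "")))
}

## Monitor side: kill workers whose heartbeat file is older than
## 'max.age' seconds and relaunch them.
reap <- function(dir = "heartbeats", max.age = 600) {
  files <- list.files(dir, pattern = "^pid_", full.names = TRUE)
  age   <- difftime(Sys.time(), file.info(files)$mtime, units = "secs")
  for (f in files[age > max.age]) {
    pid <- as.integer(sub("^pid_", "", basename(f)))
    system(paste("taskkill /F /PID", pid))    # forcibly end the frozen process
    file.remove(f)
    system("Rscript worker.R", wait = FALSE)  # reinitialize the worker
  }
}

Would something along these lines be sound, or is there an established way to supervise R child processes on Windows?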