Taylor, Ronald C
2017-Mar-14 21:53 UTC
[R] problem with spark_connect() in the sparklyr package for parallelizing R in the Spark environment - "Gateway in port (8880) did not respond"
Hi folks,

I posted the message below as a new issue on the sparklyr page on GitHub over a week ago, but have not gotten any reply. So I am posting here, in the hope that somebody on this list can provide guidance. I really want to get R working in Spark on our local Linux cluster, and I am eager to get going, but first I have to get past this spark_connect() problem. Please see below.

- Ron Taylor

%%%%%%%%%%%%%

problem with spark_connect() using sparklyr on a Cloudera CDH 5.10.0 Hadoop cluster #534

rtaylor24 <https://github.com/rtaylor24> commented 8 days ago <https://github.com/rstudio/sparklyr/issues/534>

Hello folks,

I am trying to use sparklyr for the first time on a Hadoop cluster. I have used R with sparklyr on a "local" copy of Spark on my Mac laptop, but this is the first time that I am trying to run it as a "yarn-client" on a true cluster, to actually get some parallelization out of sparklyr.

We have a small Linux cluster at our lab running Cloudera CDH 5.10.0. When I try to call spark_connect() from an R session started on a command line on the Hadoop cluster's name (master) node, I get the same message as in an earlier CLOSED issue. That is, my error message is:

"Failed while connecting to sparklyr to port (8880) for sessionid (2423): Gateway in port (8880) did not respond."

I am thus reopening that issue here, since I still need help even after reading that older issue (#394 <https://github.com/rstudio/sparklyr/issues/394>).

At bottom is the record of my R session on the Hadoop cluster's name node, with all the details that I can think of printed out to the screen.

I note that the version of Spark used by CDH is 1.6.0, which is different from what is in spark_home_dir (1.6.2). I cannot seem to change spark_home_dir by setting SPARK_HOME to the Spark location used by the CDH distribution: spark_home_dir() is not altered by my setting of SPARK_HOME (as you can see below). So one question (perhaps the critical question?)
is: how do I force sparklyr to connect to the Spark version being used by the CDH distribution?

As you can see at the Cloudera web page at

https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_package_tarball_510.html

the 5.10.0 distribution has Spark 1.6.0, not 1.6.2. So I am trying to tell sparklyr code to use the Spark 1.6.0 distribution that is located here:

/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-submit
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-shell
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark

and so I was trying to set SPARK_HOME as follows:

Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41")
Sys.getenv("SPARK_HOME")
[1] "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"

However! I note that part of the error msg (see bottom) says that the correct path was used to spark-submit:

"Path: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-submit"

So maybe sparklyr is indeed accessing the Spark 1.6.0 distribution as it should in the cluster, and the problem lies elsewhere??

One other note: there was an earlier version of sparklyr installed by support here on the Hadoop name node. I have bypassed that, installed the latest version of sparklyr (0.5.1) into /people/rtaylor/Rpackages as you can see below.

Would very much appreciate some guidance to get me over this initial hurdle.

* Ron Taylor
Pacific Northwest National Laboratory
email: ronald.taylor at pnnl.gov

%%%%%%%%%%%%%%

screen output from my failed run:

[rtaylor at bigdatann Rwork]$ R

R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
ls()
character(0)

(.packages())
[1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
[7] "base"

install.packages("sparklyr", lib="/people/rtaylor/Rpackages/")
--- Please select a CRAN mirror for use in this session ---
trying URL 'https://cran.cnr.berkeley.edu/src/contrib/sparklyr_0.5.2.tar.gz'
Content type 'application/x-gzip' length 732806 bytes (715 KB)
=================================================
downloaded 715 KB

* installing source package 'sparklyr' ...
** package 'sparklyr' successfully unpacked and MD5 sums checked
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (sparklyr)

The downloaded source packages are in
        '/tmp/RtmpQUB4IE/downloaded_packages'

library(sparklyr, lib.loc="/people/rtaylor/Rpackages/")

(.packages())
[1] "sparklyr"  "stats"     "graphics"  "grDevices" "utils"     "datasets"
[7] "methods"   "base"

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Workstation release 6.4 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] sparklyr_0.5.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9     withr_1.0.2     digest_0.6.12   dplyr_0.5.0
 [5] rprojroot_1.2   assertthat_0.1  rappdirs_0.3.1  R6_2.2.0
 [9] jsonlite_1.2    DBI_0.5-1       backports_1.0.5 magrittr_1.5
[13] httr_1.2.1      config_0.2      tools_3.3.2     parallel_3.3.2
[17] yaml_2.1.14     base64enc_0.1-3 tcltk_3.3.2     tibble_1.2

Sys.getenv("JAVA_HOME")
[1] "/usr/java/latest"

spark_installed_versions()
  spark hadoop                       dir
1 1.6.2    2.6 spark-1.6.2-bin-hadoop2.6

spark_home_dir()
[1] "/people/rtaylor/.cache/spark/spark-1.6.2-bin-hadoop2.6"

R.home(component = "home")
[1] "/share/apps/R/3.3.2/lib64/R"

path.expand("~")
[1] "/people/rtaylor"

Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41")
Sys.getenv("SPARK_HOME")
[1] "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"

spark_home_dir()
[1] "/people/rtaylor/.cache/spark/spark-1.6.2-bin-hadoop2.6"

ls()
character(0)

config <- spark_config()
ls()
[1] "config"

sc <- spark_connect(master = "yarn-client", config = config, version = "1.6.0")
Error in force(code) :
  Failed while connecting to sparklyr to port (8880) for sessionid (2423): Gateway in port (8880) did not respond.
    Path: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-submit
    Parameters: --class, sparklyr.Backend, --jars, '/share/apps/R/3.3.2/lib64/R/library/sparklyr/java/spark-csv_2.11-1.3.0.jar','/share/apps/R/3.3.2/lib64/R/library/sparklyr/java/commons-csv-1.1.jar','/share/apps/R/3.3.2/lib64/R/library/sparklyr/java/univocity-parsers-1.5.1.jar', '/share/apps/R/3.3.2/lib64/R/library/sparklyr/java/sparklyr-1.6-2.10.jar', 8880, 2423

---- Output Log ----
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: No such file or directory
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: exec: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: cannot execute: No such file or directory

---- Error Log ----

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%