Amos B. Elberg
2016-Mar-07 19:33 UTC
[R-pkgs] rZeppelin: An R notebook that makes Spark easy to use
rZeppelin is an R interpreter for Apache (incubating) Zeppelin. Zeppelin is a notebook, similar to IPython, built on top of Apache Spark. rZeppelin makes it possible, for the first time, to create a single data/ML pipeline that seamlessly mixes R, Scala, and Python code from a single interface (without breaking lazy evaluation).

For R-using data scientists, this means that you can access the full power of Spark, including ultra-fast distributed implementations of popular algorithms, using R: without having to learn Scala, without a dedicated administrator to manage a Spark or Hadoop cluster, and without spending more than minimal time reviewing the SparkR API.

You can load text data using R, quickly create an LDA model using Spark's distributed LDA package, tag the text using gensim from Python, and then visualize and take further steps from R, all from a single session using a single interface. The full range of Spark packages, including MLlib and GraphX, which used to require Scala development, can be used in the same pipeline with R. (The exception is Spark Streaming, which Zeppelin doesn't yet support.)

Beyond Spark, R data can be visualized using Zeppelin's built-in interactive visualizations. rZeppelin also leverages knitr to make available most R visualization and interactive visualization packages. Many data types are also easily moved between R, Scala, and Python: the languages share a ZeppelinContext, where variables can be added and extracted with .z.put() and .z.get().

rZeppelin is intended to make Spark part of the R data scientist's daily toolbox.

rZeppelin is available here: https://github.com/elbamos/Zeppelin-With-R
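
As a rough illustration of the shared-context idea mentioned above (variables added with .z.put() and read back with .z.get()), here is a minimal Python sketch. It does not use Zeppelin at all: the SharedContext class, its put/get methods, and the "topic_counts" data are hypothetical stand-ins invented for this example, and the real ZeppelinContext may behave differently.

```python
# Hypothetical stand-in for a shared context; NOT Zeppelin's real class.
class SharedContext:
    """Minimal key-value store that several interpreters could share."""

    def __init__(self):
        self._store = {}

    def put(self, name, value):
        # Publish a value under a name so another interpreter can read it.
        self._store[name] = value

    def get(self, name):
        # Fetch a previously published value; raises KeyError if absent.
        return self._store[name]


z = SharedContext()

# Imagine this step running in an R paragraph: publish made-up topic counts.
z.put("topic_counts", {"sports": 12, "politics": 7})

# Imagine this step running in a Python paragraph: read the counts back.
counts = z.get("topic_counts")
total = sum(counts.values())
print(total)
```

The point of the sketch is only the pattern: one language publishes a named value, and another picks it up by name from the same session, so no intermediate files are needed.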