Hi Simon,
Thanks for your feedback -- this is an observation that I hadn't
considered when I wrote this, mainly because I am, in fact, working
with rather small data sets. BTW: the code is available under the
Bitbucket link -- here's the direct link if you'd still like to look
at it:
https://bitbucket.org/CoherentLogic/jdataframe
Re "for practical purposes it doesn't seem like the most efficient
solution" and "So the JSON route is very roughly ~13x slower than
using Java directly."
I've not benchmarked this and will take a closer look at what you have
today -- in fact I may include these details on the JDataFrame page.
The JDataFrame targets the use case where there's significant
development being done in Java and data is exported into R and,
additionally, the developer intends to keep the two separated as much
as possible. I could work with Java directly, but then I potentially
end up with quite a bit of Java code taking up space in R and I don't
like this because if I need to refactor something I have to do it in
two places.
There's another use case for the JDataFrame as well and that's in an
enterprise application (you may have alluded to this when you said
"[i]f you need process separation..."). Consider a business where
users are working with R and the application that produces the data is
actually running in Tomcat. Shipping large amounts of data over the
wire in this example would be a performance destroyer, but for small
data sets it certainly would be helpful from a development perspective
to expose JSON-based web services where the R script would be able to
convert a result into a data frame gracefully.
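To make the shape of the data concrete, here is a minimal sketch of the kind of column-oriented JSON such a service could return. It is not the JDataFrame implementation: the class name ColumnJsonSketch and its methods are hypothetical and use only the JDK. The idea is one JSON array per column, which is the layout that RJSONIO::fromJSON followed by as.data.frame can turn into a data frame.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the JDataFrame API.
public class ColumnJsonSketch {

    // Escape the characters that JSON requires to be escaped inside a string.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Serialize columns as {"name":[v1,v2,...],...}.
    // Numbers and booleans are written bare, null as JSON null, and
    // everything else as a string. NaN and infinite values are not
    // handled here and would produce invalid JSON.
    static String toJson(Map<String, Object[]> columns) {
        StringBuilder sb = new StringBuilder("{");
        boolean firstColumn = true;
        for (Map.Entry<String, Object[]> e : columns.entrySet()) {
            if (!firstColumn) sb.append(',');
            firstColumn = false;
            sb.append(quote(e.getKey())).append(":[");
            Object[] values = e.getValue();
            for (int i = 0; i < values.length; i++) {
                if (i > 0) sb.append(',');
                Object v = values[i];
                if (v == null) sb.append("null");
                else if (v instanceof Number || v instanceof Boolean) sb.append(v);
                else sb.append(quote(v.toString()));
            }
            sb.append(']');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the columns in insertion order.
        Map<String, Object[]> columns = new LinkedHashMap<>();
        columns.put("Code", new Object[] {"WV", "VA"});
        columns.put("Description", new Object[] {"West Virginia", "Virginia"});
        // prints {"Code":["WV","VA"],"Description":["West Virginia","Virginia"]}
        System.out.println(toJson(columns));
    }
}
```

A servlet running in Tomcat could write a string like this to the response body, and the R script would only need to fetch it and pass it to RJSONIO::fromJSON.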
Tom
On Fri, Jan 15, 2016 at 10:58 AM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
> Tom,
>
> this may be good for embedding small data sets, but for practical purposes
> it doesn't seem like the most efficient solution.
>
> Since you didn't provide any code, I built a test case using the
> built-in Java JSON API to build a medium-sized dataset (1e6 rows) and read it in
> just to get a ballpark (see
> https://gist.github.com/s-u/4efb284e3c15c6a2db16):
>
> # generate:
> time java -cp .:javax.json-api-1.0.jar:javax.json-1.0.4.jar A > 1e6
>
> real 0m2.764s
> user 0m20.356s
> sys 0m0.962s
>
> # read:
>> system.time(temp <- RJSONIO::fromJSON("1e6"))
>    user  system elapsed
>   3.484   0.279   3.834
>> str(temp)
> List of 2
> $ V1: num [1:1000000] 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 ...
> $ V2: chr [1:1000000] "X0" "X1" "X2" "X3" ...
>
> For comparison, using Java directly (includes both generation and reading
> into R):
>
>> system.time(temp <- lapply(J("A")$direct(), .jevalArray))
>    user  system elapsed
>   0.962   0.186   0.494
>
> So the JSON route is very roughly ~13x slower than using Java directly.
> Obviously, this will vary by data set type etc. since there is R overhead
> involved as well: for example, if you have only numeric variables, the JSON
> route is 30x slower on reading alone [50x total]. String variables slow down
> everyone equally. Interestingly, the JSON encoding is using all 16 cores, so the
> 2.7s real time adds up to over 20s CPU time, so on smaller machines you may see
> more overhead.
>
> If you need process separation, it may be a different story - in principle
> it is faster to use a more native serialization than JSON, since parsing is the
> slowest part for big datasets.
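As a rough illustration of what a "more native" serialization could look like, here is a minimal sketch that writes a numeric column as a length prefix followed by raw 8-byte doubles. This is an assumption about one possible wire format, not something from Simon's gist or from JDataFrame; the class name BinaryColumnSketch and the layout are hypothetical. The reader needs no text parsing, only fixed-width reads.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical illustration only; the layout is an assumed example format.
public class BinaryColumnSketch {

    // Write a numeric column as a 4-byte length followed by raw 8-byte doubles.
    // DataOutputStream always writes in big-endian byte order.
    static byte[] encode(double[] column) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(4 + 8 * column.length);
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(column.length);
            for (double v : column) {
                out.writeDouble(v);
            }
        }
        return bytes.toByteArray();
    }

    // Read the same layout back; ByteBuffer defaults to big-endian as well.
    static double[] decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        double[] column = new double[buf.getInt()];
        for (int i = 0; i < column.length; i++) {
            column[i] = buf.getDouble();
        }
        return column;
    }

    public static void main(String[] args) throws IOException {
        double[] v1 = {0.0, 0.2, 0.4, 0.6, 0.8};
        byte[] data = encode(v1);
        // prints "44 bytes for 5 values"
        System.out.println(data.length + " bytes for " + decode(data).length + " values");
    }
}
```

On the R side, a layout like this could be read with readBin, for example readBin(con, "integer", 1, endian = "big") for the length and then readBin(con, "double", n, endian = "big") for the values. String columns would need their own encoding, which this sketch does not cover.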
>
> Cheers,
> Simon
>
>
>> On Jan 14, 2016, at 4:52 PM, Thomas Fuller <thomas.fuller at coherentlogic.com> wrote:
>>
>> Hi Folks,
>>
>> If you need to send data from Java to R you may consider using the
>> JDataFrame API -- which is used to convert data into JSON which then
>> can be converted into a data frame in R.
>>
>> Here's the project page:
>>
>> https://coherentlogic.com/middleware-development/jdataframe/
>>
>> and here's a partial example which demonstrates what the API looks like:
>>
>> String result = new JDataFrameBuilder()
>>     .addColumn("Code", new Object[] {"WV", "VA"})
>>     .addColumn("Description", new Object[] {"West Virginia", "Virginia"})
>>     .toJson();
>>
>> and in the R script we would need to do this:
>>
>> temp <- RJSONIO::fromJSON(json)
>> tempDF <- as.data.frame(temp)
>>
>> which yields a data frame that looks like this:
>>
>>> tempDF
>>     Description Code
>> 1 West Virginia   WV
>> 2      Virginia   VA
>>
>> It is my intention to deploy this project to Maven Central this week,
>> time permitting.
>>
>> Questions and comments are welcomed.
>>
>> Tom
>>
>> ______________________________________________
>> R-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>