Hello,
I am trying to read a set of JSON files containing tweets using the
following code:
json_data <- fromJSON(paste(readLines(json_file))
Unfortunately, it only reads the first record in the file. For example, in
the file below, it only reads the first record, starting with
"id":"tag:search.twitter.com,2005:3318539389". What is the best way to retrieve
these records? I have 20 such JSON files, each with a varying number of tweets.
Thank you in advance.
Best,
Mayukh
{"id":"tag:search.twitter.com
,2005:3318539389","objectType":"activity","actor":{"objectType":"person","id":"id:
twitter.com:2859421","link":"http://www.twitter.com/meetjenn","displayName":"Jenn","postedTime":"2007-01-29T17:06:00.000Z","image":"06-19-07_2010.jpg","summary":"I
say 'like' a lot. I fall down a lot. I walk into everything. Love Pgh
Pens,
NE Pats, Fundraising, Dogs & History. Craft Beer & Running
Novice.","links":[{"href":"http://meetjenn.tumblr.com","rel":"me"}],"friendsCount":0,"followersCount":0,"listedCount":0,"statusesCount":0,"twitterTimeZone":"Eastern
Time (US &
Canada)","verified":false,"utcOffset":"0","preferredUsername":"meetjenn","languages":["en"],"location":{"objectType":"place","displayName":"Pgh/Philajersey"},"favoritesCount":0},"verb":"post","postedTime":"2009-08-15T00:00:12.000Z","generator":{"displayName":"tweetdeck","link":"
http://twitter.com
"},"provider":{"objectType":"service","displayName":"Twitter","link":"
http://www.twitter.com"},"link":"
http://twitter.com/meetjenn/statuses/3318539389","body":"Cool
story about
the man who created the @Starbucks logo. Additional link at the bottom on
how it came to be: http://bit.ly/16bOJk
","object":{"objectType":"note","id":"object:search.twitter.com,2005:3318539389","summary":"Cool
story about the man who created the @Starbucks logo. Additional link at the
bottom on how it came to be: http://bit.ly/16bOJk","link":"
http://twitter.com/meetjenn/statuses/3318539389
","postedTime":"2009-08-15T00:00:12.000Z"},"twitter_entities":{"urls":[{"expanded_url":null,"indices":[111,131],"url":"
http://bit.ly/16bOJk
"}],"hashtags":[],"user_mentions":[{"id":null,"name":null,"indices":[41,51],"screen_name":"@Starbucks","id_str":null}]},"retweetCount":0,"gnip":{"matching_rules":[{"value":"Starbucks","tag":null}]}}
{"id":"tag:search.twitter.com
,2005:3318543260","objectType":"activity","actor":{"objectType":"person","id":"id:
twitter.com:61595468","link":"http://www.twitter.com/FastestFood","displayName":"FastFood
Bob","postedTime":"2009-01-30T20:51:10.000Z","image":"","summary":"Just
A
little food for
thought","links":[{"href":"http://www.TeamSantilli.com","rel":"me"}],"friendsCount":0,"followersCount":0,"listedCount":0,"statusesCount":0,"twitterTimeZone":"Pacific
Time (US &
Canada)","verified":false,"utcOffset":"0","preferredUsername":"FastestFood","languages":["en"],"location":{"objectType":"place","displayName":"eating
some
thoughts"},"favoritesCount":0},"verb":"post","postedTime":"2009-08-15T00:00:23.000Z","generator":{"displayName":"oauth:17","link":"
http://twitter.com
"},"provider":{"objectType":"service","displayName":"Twitter","link":"
http://www.twitter.com"},"link":"
http://twitter.com/FastestFood/statuses/3318543260","body":"Oregon
Biz
Report ? How Starbucks saved millions. Oregon closures ...
http://u.mavrev.com/02bdj","object":{"objectType":"note","id":"object:
search.twitter.com,2005:3318543260","summary":"Oregon Biz
Report ? How
Starbucks saved millions. Oregon closures ... http://u.mavrev.com/02bdj
","link":"http://twitter.com/FastestFood/statuses/3318543260
","postedTime":"2009-08-15T00:00:23.000Z"},"twitter_entities":{"urls":[{"expanded_url":null,"indices":[70,95],"url":"
http://u.mavrev.com/02bdj
"}],"hashtags":[],"user_mentions":[]},"retweetCount":0,"gnip":{"matching_rules":[{"value":"Starbucks","tag":null}]}}
{"info":{"message":"Replay Request
Completed","sent":"2015-02-18T00:05:15+00:00","activity_count":2}}
Mayukh,

I think you are missing an argument to paste() and a right parenthesis character.

Try

json_data <- fromJSON(paste(readLines(json_file), collapse = " "))

Mark

R. Mark Sharp, Ph.D.
msharp at TxBiomed.org
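As an aside (this is not from the thread): if the real file has one JSON object per line, i.e. newline-delimited JSON, with the wrapping above being an email artifact, the jsonlite package can read every record in one call with stream_in(). A minimal sketch, using a temporary file as a stand-in for the real one:

```r
# Sketch assuming newline-delimited JSON (one object per line); the
# file created here is an illustrative stand-in for the real data file.
library(jsonlite)

json_file <- tempfile(fileext = ".json")
writeLines(c('{"id":"tag:1","body":"first tweet"}',
             '{"id":"tag:2","body":"second tweet"}'), json_file)

# stream_in() parses each line as one record and binds them together
records <- stream_in(file(json_file), verbose = FALSE)
nrow(records)  # 2 -- one row per JSON object
```

With heterogeneous records (such as the trailing "info" object in the sample), stream_in() fills the missing fields with NA.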
Thanks, Mark. I made a mistake when I was copying the code into the email. I have the parentheses in my code.

Best,
Mayukh
Mayukh,
I apologize for taking so long to get back to your problem. I expect you may
have found a solution already; if so, I would be interested to see it. I have
developed a hack that solves the problem, but I expect that someone who knows
JSON handling or text parsing better could develop a more elegant solution.

As I understand the problem, your text file contains more than one JSON object
in text form. There are three: the first two are very similar, and the last is
a trailer indicating what was done, when it was done, and the number of JSON
objects sent. The problem is that fromJSON() only pulls off the first of the
JSON objects.
I have defined three helper functions to separate the JSON objects, read them
in, and store them in a list.
library(RJSONIO)
library(stringi, quietly = TRUE)
#library(jsonlite) # also works
#' Returns a dataframe with the ordered locations of the matching braces.
#'
#' There is almost certainly a better function to do this.
#' @param txt character vector of length one having 0 or more matching braces.
#' @import stringi
#' @examples
#' library(rmsutilityr)
#' match_braces("{123{456{78}9}10}")
#' @export
match_braces <- function(txt) {
  txt <- txt[1] # just in the case of having more than one element
  left <- stri_locate_all_regex(txt, "\\{")[[1]][ , 1]
  right <- stri_locate_all_regex(txt, "\\}")[[1]][ , 2]
  len <- length(left)
  braces <- data.frame(left = rep(0, len), right = rep(0, len))
  for (i in seq_along(right)) {
    for (j in rev(seq_along(left))) {
      if (left[j] < right[i] & left[j] != 0) {
        braces$left[i] <- left[j]
        braces$right[i] <- right[i]
        left[j] <- 0
        break
      }
    }
  }
  braces[order(braces$left), ]
}
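To see what match_braces() returns, here is a small worked example; the function is repeated verbatim so the snippet runs on its own:

```r
library(stringi)

# match_braces() as defined above, repeated so this example is standalone.
match_braces <- function(txt) {
  txt <- txt[1]
  left <- stri_locate_all_regex(txt, "\\{")[[1]][ , 1]
  right <- stri_locate_all_regex(txt, "\\}")[[1]][ , 2]
  len <- length(left)
  braces <- data.frame(left = rep(0, len), right = rep(0, len))
  for (i in seq_along(right)) {
    for (j in rev(seq_along(left))) {
      if (left[j] < right[i] & left[j] != 0) {
        braces$left[i] <- left[j]
        braces$right[i] <- right[i]
        left[j] <- 0
        break
      }
    }
  }
  braces[order(braces$left), ]
}

b <- match_braces("{123{456{78}9}10}")
# Each row pairs an opening brace with its matching closing brace,
# ordered by the position of the opening brace:
#   left = 1, 5, 9  and  right = 17, 14, 12
```

Row one spans the outermost pair, so braces$left[1] and braces$right[1] delimit the first complete JSON object, which is what get_first_json_message() relies on below.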
#' Returns a list containing two objects in the text of a character vector
#' of length one: (1) object = the first JSON object found and (2) remainder =
#' the remaining text.
#'
#' Properly formed messages are assumed. Error checking is non-existent.
#' @param json_txt character vector of length one having one or more JSON
#' objects in character form.
#' @import stringi
#' @export
get_first_json_message <- function(json_txt) {
  len <- stri_length(json_txt)
  braces <- match_braces(json_txt)
  if (braces$right[1] + 1 > len) {
    remainder <- ""
  } else {
    remainder <- stri_trim_both(stri_sub(json_txt, braces$right[1] + 1))
  }
  list(object = stri_sub(json_txt, braces$left[1], to = braces$right[1]),
       remainder = remainder)
}
#' Returns list of lists made by call to fromJSON()
#' @param json_txt character vector of length 1 having one or more
#' JSON objects in text form.
#' @import stringi
#' @export
get_json_list <- function(json_txt) {
  t_json_txt <- json_txt
  i <- 0
  json_list <- list()
  repeat {
    i <- i + 1
    message_remainder <- get_first_json_message(t_json_txt)
    json_list[i] <- list(fromJSON(message_remainder$object))
    if (message_remainder$remainder == "")
      break
    t_json_txt <- message_remainder$remainder
  }
  json_list
}
json_file <- "../data/json_file.txt"
json_txt <- stri_trim_both(stri_c(readLines(json_file), collapse = " "))
json_list <- get_json_list(json_txt)
length(json_list)
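For comparison, the same splitting can be done without the brace-matching helpers by tracking brace depth character by character. This is a base-R sketch of my own, not code from the thread, and like the helpers above it assumes no literal braces occur inside quoted JSON strings (true for this sample, not for JSON in general):

```r
# Split a string holding several concatenated JSON objects into a
# character vector with one complete object per element, by tracking
# how deeply nested in braces the current position is.
split_json_objects <- function(txt) {
  chars <- strsplit(txt, "")[[1]]
  depth <- 0L
  start <- NA_integer_
  objects <- character(0)
  for (i in seq_along(chars)) {
    if (chars[i] == "{") {
      if (depth == 0L) start <- i      # a new top-level object begins
      depth <- depth + 1L
    } else if (chars[i] == "}") {
      depth <- depth - 1L
      if (depth == 0L)                 # the top-level object just closed
        objects <- c(objects, substr(txt, start, i))
    }
  }
  objects
}

pieces <- split_json_objects('{"a":1}{"b":{"c":2}}{"info":{"activity_count":2}}')
length(pieces)  # 3
# each element of pieces can then be passed to fromJSON()
```

This avoids the nested loop over all brace positions, since only the first and last brace of each top-level object matter for splitting.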
R. Mark Sharp, Ph.D.
Director of Primate Records Database
Southwest National Primate Research Center
Texas Biomedical Research Institute
P.O. Box 760549
San Antonio, TX 78245-0549
Telephone: (210)258-9476
e-mail: msharp at TxBiomed.org