thr3ads.net - R help - [R] Regex with criteria from multiple lines [Feb 2014]

Mark Stam

2014-Feb-14 09:29 UTC

[R] Regex with criteria from multiple lines

Hello,

I am analysing JSON data from Twitter. An example of the data:

**********
"      \"id\": 433662713886429200,"
"      \"id_str\": \"433662713886429184\","
"      \"text\": \"Hond vast in water in Bargerveen bij Zwartemeer - http://t.co/FqbkOMzYd1 #Zwartemeer #bargerveen #hond #innood\","
"      \"source\": \"<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>\","
**********

I get the contents of the "text" field like this:

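# Match whole lines whose field name starts with "text" (leading spaces allowed)
r <- regexpr("^( )*\"text(.*?),$", myjsondata)
text <- regmatches(myjsondata, r)
# Strip the field name, the closing quote-comma, and any remaining quotes
txt <- gsub("\"text\":|\",|\"", "", text)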

Unfortunately, the JSON contains other fields with the same name, for
example:

**********
"      \"id\": 433662713886429200,"
"      \"id_str\": \"433662713886429184\","
"      \"text\": \"Hond vast in water in Bargerveen bij Zwartemeer - http://t.co/FqbkOMzYd1 #Zwartemeer #bargerveen #hond #innood\","
"      \"source\": \"<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>\","
...
"      \"entities\":  {"
"        \"hashtags\":  ["
"           {"
"            \"text\": \"Zwartemeer\","
...
"            \"text\": \"bargerveen\","
...
"            \"text\": \"hond\","
etc.
**********

I only want to get the data from the text field between the "id_str" and
the "source" fields. I don't want to have the data from the text fields
below "hashtags". I do understand regex, but I don't understand how to do
it with the criteria from multiple lines.
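
One way to express a criterion that spans several lines, sketched here
under assumptions: `myjsondata` is taken to be a character vector with one
JSON line per element (the sample below is a hypothetical stand-in built
from the example above), and the wanted "text" line is assumed to sit
directly between the "id_str" and "source" lines. Rather than one large
regex, each line is classified separately and the logical vectors are
shifted by one position to look at the neighbouring lines.

```r
# Hypothetical stand-in for `myjsondata`: one JSON line per element.
myjsondata <- c(
  '      "id": 433662713886429200,',
  '      "id_str": "433662713886429184",',
  '      "text": "Hond vast in water in Bargerveen bij Zwartemeer - http://t.co/FqbkOMzYd1 #Zwartemeer #bargerveen #hond #innood",',
  '      "source": "<a href=\\"https://about.twitter.com/products/tweetdeck\\" rel=\\"nofollow\\">TweetDeck</a>",',
  '      "entities":  {',
  '        "hashtags":  [',
  '           {',
  '            "text": "Zwartemeer",',
  '            "text": "bargerveen",',
  '            "text": "hond",'
)

# Classify every line by the field name it starts with.
is_text   <- grepl('^[[:space:]]*"text":',   myjsondata)
is_id_str <- grepl('^[[:space:]]*"id_str":', myjsondata)
is_source <- grepl('^[[:space:]]*"source":', myjsondata)

# Shift the vectors by one position to describe each line's neighbours.
n <- length(myjsondata)
prev_is_id_str <- c(FALSE, is_id_str[-n])  # line above is "id_str"
next_is_source <- c(is_source[-1], FALSE)  # line below is "source"

# Keep only "text" lines that have "id_str" above and "source" below.
keep <- is_text & prev_is_id_str & next_is_source

# Extract the value between the outer quotes of the kept lines.
tweet_text <- sub('^[[:space:]]*"text":[[:space:]]*"(.*)",?[[:space:]]*$',
                  '\\1', myjsondata[keep])
print(tweet_text)
```

The "text" lines under "hashtags" are dropped because their neighbours do
not match. If other fields can appear between "id_str" and "source", the
neighbour test would need loosening, for example by matching on the
indentation depth instead.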

I know it's possible to use a JSON library in R, but in my case I can't,
because I get the JSON from raw "clipboard" data.

Thanks !

Mark Stam


Jeff Newmiller

2014-Feb-14 13:58 UTC


[R] Regex with criteria from multiple lines

You need to use the JSON library or equivalent to solve this problem. I
don't understand why you think that having the data in the clipboard
prevents you from doing this since that is just another file (but I usually
avoid using the clipboard for reproducible analysis anyway).
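
A minimal sketch of that suggestion, assuming the jsonlite package is
installed (one of several JSON packages; the thread does not name a
specific one) and that the clipboard holds a complete, valid JSON object.
The `json_lines` vector is a hypothetical stand-in for the clipboard
contents so the example runs anywhere; on Windows the same lines could
come from readLines("clipboard"), and other platforms use different
connections (see ?connections).

```r
# Hypothetical stand-in for what readLines("clipboard") would return.
json_lines <- c(
  '{',
  '  "id_str": "433662713886429184",',
  '  "text": "Hond vast in water in Bargerveen bij Zwartemeer - http://t.co/FqbkOMzYd1 #Zwartemeer #bargerveen #hond #innood",',
  '  "entities": {',
  '    "hashtags": [',
  '      {"text": "Zwartemeer"},',
  '      {"text": "bargerveen"},',
  '      {"text": "hond"}',
  '    ]',
  '  }',
  '}'
)

if (requireNamespace("jsonlite", quietly = TRUE)) {
  # Collapse the lines into one string and parse it as JSON.
  tweet <- jsonlite::fromJSON(paste(json_lines, collapse = "\n"))

  # The top-level "text" field and the hashtag "text" fields live at
  # different places in the parsed structure, so they cannot be confused.
  print(tweet$text)
  print(tweet$entities$hashtags$text)
}
```

Because the parser keeps the nesting, no line-based criteria are needed to
separate the tweet text from the hashtag entries.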
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

