Shelby McIntyre
2012-Aug-05 15:16 UTC
[R] Extracting desired numbers from complicated lines of web pages
I need to extract the indicated (bold & underlined) numbers from lines coming off web pages.
Of course, I don't know ahead of time the location or length of each number. What I do know is the tag that follows it: "Friends", "Reviews", etc. In fact, it would be good to end up with
Value  Variable
108    Friends
151    Reviews
5      Review Updates
NA     First   <-- assuming here that "First" did not show up on any line
etc.
Of particular trouble is line [7], which requires extracting 3 numbers: 2022 (Useful), 1591 (Funny) and 1756 (Cool).
============== Extraction problem lines ==========
[1] "\t\t\t<li id=\"friendCount\"><a href=\"/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ\">108 Friends</a></li>"
[2] "\t\t\t<li id=\"reviewCount\"><a href=\"/user_details_reviews_self?userid=--T8djg0nrb_yMMMA3Y0jQ\">151 Reviews</a></li>"
[3] "\t\t\t\t<li id=\"updatesCount\">5 Review Updates</li>"
[4] "\t\t\t\t<li id=\"ftrCount\"><a href=\"/user_details_reviews_self?review_filter=first&userid=--T8djg0nrb_yMMMA3Y0jQ\">1 First</a></li>"
[5] "\t\t\t\t<li id=\"fanCount\">2 Fans</li>"
[6] "\t\t\t\t<li id=\"localPhotoCount\"><a href=\"/user_local_photos?userid=--T8djg0nrb_yMMMA3Y0jQ\">54 Local Photos</a></li>"
[7] <p id="review_votes" class="smaller"><img src="http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif" alt=""> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>
[[alternative HTML version deleted]]
jim holtman
2012-Aug-05 18:27 UTC
[R] Extracting desired numbers from complicated lines of web pages
Try this. Grouping the results by 'userid', which may well be needed, is left as an exercise for the reader; in that case you will also want to check for non-existent values. Also, for the last line you did not say whether there are only those three values or whether there could be more.
input <- readLines(textConnection('
[1] "\t\t\t<li id=\"friendCount\"><a href=\"/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ\">108 Friends</a></li>"

[2] "\t\t\t<li id=\"reviewCount\"><a href=\"/user_details_reviews_self?userid=--T8djg0nrb_yMMMA3Y0jQ\">151 Reviews</a></li>"

[3] "\t\t\t\t<li id=\"updatesCount\">5 Review Updates</li>"

[4] "\t\t\t\t<li id=\"ftrCount\"><a href=\"/user_details_reviews_self?review_filter=first&userid=--T8djg0nrb_yMMMA3Y0jQ\">1 First</a></li>"

[5] "\t\t\t\t<li id=\"fanCount\">2 Fans</li>"

[6] "\t\t\t\t<li id=\"localPhotoCount\"><a href=\"/user_local_photos?userid=--T8djg0nrb_yMMMA3Y0jQ\">54 Local Photos</a></li>"

[7] <p id="review_votes" class="smaller"><img src="http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif" alt=""> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>

[[alternative HTML version deleted]]'))
# extract the data by brute force and then break apart into a dataframe
# each matching line is turned into one or more "number:label" strings
count <- lapply(input, function(.line) {
    if (grepl("[0-9]+ Friends", .line))
        return(sub(".*>([0-9]+) (Friends).*", "\\1:\\2", .line))
    if (grepl("[0-9]+ Reviews", .line))
        return(sub(".*>([0-9]+) (Reviews).*", "\\1:\\2", .line))
    if (grepl("[0-9]+ Review Update", .line))
        return(sub(".*>([0-9]+) (Review Update).*", "\\1:\\2", .line))
    if (grepl("[0-9]+ First", .line))
        return(sub(".*>([0-9]+) (First).*", "\\1:\\2", .line))
    if (grepl("[0-9]+ Fans", .line))
        return(sub(".*>([0-9]+) (Fans).*", "\\1:\\2", .line))
    if (grepl("[0-9]+ Local Photos", .line))
        return(sub(".*>([0-9]+) (Local Photos).*", "\\1:\\2", .line))
    if (grepl("[0-9]+ Useful", .line))
        return(c(  # vector with multiple values
            sub(".* ([0-9]+) (Useful).*", "\\1:\\2", .line),
            sub(".* ([0-9]+) (Funny).*", "\\1:\\2", .line),
            sub(".* ([0-9]+) (Cool).*", "\\1:\\2", .line)
        ))
    return(NULL)  # lines with no recognised tag are dropped by unlist() later
})
# create dataframe
df <- data.frame(do.call(rbind, strsplit(unlist(count), ":")))
names(df) <- c("Value", "Variable")
df
#   Value      Variable
# 1   108       Friends
# 2   151       Reviews
# 3     5 Review Update
# 4     1         First
# 5     2          Fans
# 6    54  Local Photos
# 7  2022        Useful
# 8  1591         Funny
# 9  1756          Cool
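The original request also asked for an NA row when a tag such as "First" never shows up. A minimal sketch of one way to get that, assuming the list of expected labels is known in advance and that the labels contain no regex metacharacters; `extract_counts` is a hypothetical helper name, not part of the code above, and the sample lines are abbreviated from the post:

```r
# Hypothetical helper: one row per expected label, NA when the label is absent.
# Assumes labels are plain words (no regex metacharacters).
extract_counts <- function(lines, labels) {
  text <- paste(lines, collapse = " ")
  values <- vapply(labels, function(label) {
    # a run of digits, one space, then the label ending at a word boundary
    pattern <- paste0("([0-9]+) ", label, "\\b")
    hit <- regmatches(text, regexpr(pattern, text, perl = TRUE))
    if (length(hit) == 0) return(NA_integer_)
    as.integer(sub(" .*", "", hit))  # keep only the leading number
  }, integer(1), USE.NAMES = FALSE)
  data.frame(Value = values, Variable = labels, stringsAsFactors = FALSE)
}

# Sample lines taken from the post; the "First" line is left out on purpose
lines <- c(
  '<li id="friendCount"><a href="/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ">108 Friends</a></li>',
  '<li id="reviewCount"><a href="/user_details_reviews_self?userid=--T8djg0nrb_yMMMA3Y0jQ">151 Reviews</a></li>',
  '<li id="updatesCount">5 Review Updates</li>',
  '<p id="review_votes" class="smaller"> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>'
)
labels <- c("Friends", "Reviews", "Review Updates", "First",
            "Useful", "Funny", "Cool")
result <- extract_counts(lines, labels)
print(result)
```

Because the lookup is driven by the label vector rather than by one if() per tag, adding a new tag means adding one string. The trailing \\b keeps "Reviews" from matching inside "Review Updates".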
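On the open question of whether the votes line can carry more than three values: if it can, one option is to pull out every "number label" pair on that line instead of writing one sub() per label. A rough sketch using gregexpr() and regmatches(), assuming each label is a single capitalised word; `extract_pairs` is a hypothetical name used only for illustration:

```r
# Hypothetical helper: every "<digits> <Capitalised word>" pair on one line,
# however many there are. Single-word labels only.
extract_pairs <- function(line) {
  hits <- regmatches(line,
                     gregexpr("[0-9]+ [A-Z][a-z]+", line, perl = TRUE))[[1]]
  data.frame(
    Value    = as.integer(sub(" .*", "", hits)),  # part before the space
    Variable = sub("^[0-9]+ ", "", hits),         # part after the number
    stringsAsFactors = FALSE
  )
}

votes <- '<p id="review_votes" class="smaller"><img src="http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif" alt=""> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>'
vote_counts <- extract_pairs(votes)
print(vote_counts)
```

This would not pick up two-word labels such as "Local Photos", so it is only suited to lines shaped like [7].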
--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.