Shelby McIntyre
2012-Aug-05 15:16 UTC
[R] Extracting desired numbers from complicated lines of web pages
I need to extract the indicated (bold & underlined) numbers from lines coming off web pages.

Of course, I don't know ahead of time the location or length of each number. What I do know is the tag that follows it: "Friends", "Reviews", etc. In fact, it would be good to end up with

Value  Variable
108    Friends
151    Reviews
5      Review Updates
NA     First    <-- assuming here that "First" did not show up on a line
etc.

Line [7] is particularly troublesome: it requires extracting three numbers, 2022 (Useful), 1591 (Funny) and 1756 (Cool).

============== Extraction problem lines ==========

[1] "\t\t\t<li id=\"friendCount\"><a href=\"/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ\">108 Friends</a></li>"

[2] "\t\t\t<li id=\"reviewCount\"><a href=\"/user_details_reviews_self?userid=--T8djg0nrb_yMMMA3Y0jQ\">151 Reviews</a></li>"

[3] "\t\t\t\t<li id=\"updatesCount\">5 Review Updates</li>"

[4] "\t\t\t\t<li id=\"ftrCount\"><a href=\"/user_details_reviews_self?review_filter=first&userid=--T8djg0nrb_yMMMA3Y0jQ\">1 First</a></li>"

[5] "\t\t\t\t<li id=\"fanCount\">2 Fans</li>"

[6] "\t\t\t\t<li id=\"localPhotoCount\"><a href=\"/user_local_photos?userid=--T8djg0nrb_yMMMA3Y0jQ\">54 Local Photos</a></li>"

[7] <p id="review_votes" class="smaller"><img src="http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif" alt=""> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>
jim holtman
2012-Aug-05 18:27 UTC
[R] Extracting desired numbers from complicated lines of web pages
Try this. Grouping the results by 'userid' is left as an exercise for the reader; that might be necessary, in which case you might want to check for non-existent values. Also, for the last line you did not say whether there are only those three values or whether there could be more.

    input <- readLines(textConnection('
    [1] "\t\t\t<li id=\"friendCount\"><a href=\"/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ\">108 Friends</a></li>"
    [2] "\t\t\t<li id=\"reviewCount\"><a href=\"/user_details_reviews_self?userid=--T8djg0nrb_yMMMA3Y0jQ\">151 Reviews</a></li>"
    [3] "\t\t\t\t<li id=\"updatesCount\">5 Review Updates</li>"
    [4] "\t\t\t\t<li id=\"ftrCount\"><a href=\"/user_details_reviews_self?review_filter=first&userid=--T8djg0nrb_yMMMA3Y0jQ\">1 First</a></li>"
    [5] "\t\t\t\t<li id=\"fanCount\">2 Fans</li>"
    [6] "\t\t\t\t<li id=\"localPhotoCount\"><a href=\"/user_local_photos?userid=--T8djg0nrb_yMMMA3Y0jQ\">54 Local Photos</a></li>"
    [7] <p id="review_votes" class="smaller"><img src="http://s3-media2.ak.yelpcdn.com/assets/0/www/img/cf265851428e/ico/reviewVotes.gif" alt=""> Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>
    '))

    # extract the data by brute force and then break apart into a dataframe
    # each match is returned as "number:label"; lines with no match give NULL
    count <- lapply(input, function(.line) {
        if (grepl("[0-9]+ Friends", .line))
            return(sub(".*>([0-9]+) (Friends).*", "\\1:\\2", .line))
        if (grepl("[0-9]+ Reviews", .line))
            return(sub(".*>([0-9]+) (Reviews).*", "\\1:\\2", .line))
        if (grepl("[0-9]+ Review Update", .line))
            return(sub(".*>([0-9]+) (Review Update).*", "\\1:\\2", .line))
        if (grepl("[0-9]+ First", .line))
            return(sub(".*>([0-9]+) (First).*", "\\1:\\2", .line))
        if (grepl("[0-9]+ Fans", .line))
            return(sub(".*>([0-9]+) (Fans).*", "\\1:\\2", .line))
        if (grepl("[0-9]+ Local Photos", .line))
            return(sub(".*>([0-9]+) (Local Photos).*", "\\1:\\2", .line))
        if (grepl("[0-9]+ Useful", .line))
            return(c(  # vector with multiple values
                sub(".* ([0-9]+) (Useful).*", "\\1:\\2", .line),
                sub(".* ([0-9]+) (Funny).*", "\\1:\\2", .line),
                sub(".* ([0-9]+) (Cool).*", "\\1:\\2", .line)
            ))
        return(NULL)
    })

    # create dataframe
    df <- data.frame(do.call(rbind, strsplit(unlist(count), ":")))
    names(df) <- c("Value", "Variable")
    df

Output:

      Value      Variable
    1   108       Friends
    2   151       Reviews
    3     5 Review Update
    4     1         First
    5     2          Fans
    6    54  Local Photos
    7  2022        Useful
    8  1591         Funny
    9  1756          Cool

-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
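
The original question also asked for an NA row when a label such as "First" never appears, and the reply notes that non-existent values may need checking. A minimal sketch of one way to get that, assuming the set of labels is known in advance: loop over the labels rather than over the lines, and look up each label with regexpr() and regmatches(). The helper name extract_counts, the labels vector, and the shortened sample lines are illustrative assumptions, not part of the thread.

```r
# Hypothetical helper (not from the thread): for each known label, return the
# number that directly precedes it in the text, or NA if the label is absent.
extract_counts <- function(lines, labels) {
  text <- paste(lines, collapse = "\n")
  values <- vapply(labels, function(label) {
    # Digits, one space, then the literal label. The trailing \\b stops
    # "Reviews" from matching a longer word that merely starts with it.
    pattern <- paste0("[0-9]+ ", label, "\\b")
    m <- regmatches(text, regexpr(pattern, text, perl = TRUE))
    if (length(m) == 0) return(NA_integer_)   # label never appears
    as.integer(sub(" .*$", "", m))            # keep only the leading number
  }, integer(1))
  data.frame(Value = unname(values), Variable = labels,
             stringsAsFactors = FALSE)
}

# Shortened sample lines based on the question; the "First" line is left out
# on purpose so that its row comes back as NA.
lines <- c(
  "<li id=\"friendCount\"><a href=\"/user_details_friends?userid=--T8djg0nrb_yMMMA3Y0jQ\">108 Friends</a></li>",
  "<li id=\"reviewCount\"><a href=\"/user_details_reviews_self?userid=--T8djg0nrb_yMMMA3Y0jQ\">151 Reviews</a></li>",
  "<li id=\"updatesCount\">5 Review Updates</li>",
  "Review votes:<br> 2022 Useful, 1591 Funny, and 1756 Cool</p>"
)
labels <- c("Friends", "Reviews", "Review Updates", "First",
            "Useful", "Funny", "Cool")

result <- extract_counts(lines, labels)
print(result)
```

Because the lookup is per label, the three numbers on the "Review votes" line need no special case. The sketch assumes the labels contain no regular-expression metacharacters and that each label appears at most once per page; if the pages vary much, an HTML parser would likely be more robust than pattern matching on raw lines.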