Drake Gossi
2019-Feb-19 20:30 UTC
[R] help getting a research project started on regulations.gov
Hello everyone,

I will be using R to manipulate this data: <https://www.regulations.gov/docketBrowser?rpp=25&so=DESC&sb=commentDueDate&po=0&dct=PS&D=ED-2018-OCR-0064>. Specifically, it's the docket for proposed changes to Title IX, with over 11,000 publicly available comments. The end goal is to tabulate each of these 11,000 comments in a CSV file so I can begin to manipulate and visualize the data.

But I'm not there yet. I just applied for an API key and, while I have one, I'm waiting for it to be activated. After that, though, I'm a little lost. Do I need to scrape the comments from the site, or does having the API make that unnecessary? There is an interface <https://regulationsgov.github.io/developers/console/> that works with the API, but I don't know whether, through it, I can get the data I need. I'm still trying to figure out what JSON is.

Or, if I do have to scrape the comments, can I do that with R? I can't get a straight answer from the Python people; I can't tell whether I need to do this through Beautiful Soup or through Scrapy (or whether I need to do it at all, as I said). The trouble with the comments is that each one is on its own URL, so, again assuming I will have to scrape them, I don't know how to write code that grabs all of the comments from all of the URLs. I'm also trying to figure out how to isolate the text of each comment in the HTML.

From the Python people, I've heard the following:

    scrapy fetch 'url' will download the raw page you are interested in, and you can look at the raw source code. It's important to appreciate that what you see in the browser is often processed in your browser before you see it. Of course, a scraper can do the same processing, but it's complicated. So start by looking at the raw source code. Maybe you can grab what you need with simple parsing like Beautiful Soup does; maybe you need to do more. Scrapy is your friend.

    Beautiful Soup is your friend here. It can analyze the data within the HTML tags on your scraped page. But JavaScript is often used on 'modern' web pages, so the page is actually not just HTML but JavaScript that changes the HTML. For that you need another tool -- I think one is called Scrapy. Others here probably have experience with that.

I think part of my problem relates to that second piece of advice. I was saying things like:

    I think what I might be looking for is a div class = GIY1LSJIXD, since that's where the hierarchy seems to taper off in the HTML for the comment I'm looking to scrape.

What I'm trying to do here is locate the comment in the HTML so I can tell the request function to extract it.

Any help anyone could offer here would be much appreciated. I'm very lost.

Drake
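On the "what is JSON" question: JSON is just structured text -- nested name/value pairs and arrays -- and the jsonlite package (not base R) converts it to and from ordinary R objects. A minimal sketch, with made-up document IDs purely for illustration:

```r
# JSON in, R objects out. jsonlite simplifies an array of objects
# into a data frame automatically. The IDs below are invented examples.
library(jsonlite)

txt <- '{"documents": [
  {"documentId": "ED-2018-OCR-0064-0001", "title": "Comment 1"},
  {"documentId": "ED-2018-OCR-0064-0002", "title": "Comment 2"}
]}'

x <- fromJSON(txt)
x$documents$documentId   # a character vector of the two IDs
```

This is why an API is usually preferable to scraping: the server hands you the data already structured, and `fromJSON()` turns it straight into a data frame.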
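To the API-versus-scraping question: having the API should make scraping unnecessary, since the same comment data the site renders is served as JSON. A sketch of one API request from R, assuming the httr and jsonlite packages; the endpoint and the `dktid`/`dct`/`rpp`/`po` parameter names are from the public v3 API documentation as of early 2019 and should be checked against the current docs before relying on them:

```r
# Sketch: one page of comment metadata from the regulations.gov v3 API.
# Endpoint, parameter names, and response shape are assumptions based on
# the 2019-era v3 docs -- verify them for your key.
library(httr)
library(jsonlite)

api_key <- Sys.getenv("DATA_GOV_KEY")   # your activated api.data.gov key

resp <- GET(
  "https://api.data.gov/regulations/v3/documents.json",
  query = list(
    api_key = api_key,
    dktid   = "ED-2018-OCR-0064",  # the Title IX docket
    dct     = "PS",                # public submissions (comments)
    rpp     = 25,                  # results per page
    po      = 0                    # page offset
  )
)
stop_for_status(resp)

# The body is JSON; fromJSON() turns the "documents" array into a data frame
page <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(page$documents)
```

The API console linked above issues exactly this kind of request; it is a convenient way to discover the actual field names before writing R code.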
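The "each comment is on its own URL" problem disappears with the API, because you can page through the whole docket with the offset parameter instead of visiting 11,000 pages. A hedged sketch of the full download, building on the request above; `totalNumRecords` is the field name I believe the v3 API uses for the overall count, but verify it in a real response:

```r
# Sketch: page through all ~11,000 comments and write one CSV.
# Assumes httr/jsonlite and a working DATA_GOV_KEY; field and parameter
# names are assumptions to check against the live API.
library(httr)
library(jsonlite)

fetch_page <- function(offset, per_page = 1000) {
  resp <- GET(
    "https://api.data.gov/regulations/v3/documents.json",
    query = list(
      api_key = Sys.getenv("DATA_GOV_KEY"),
      dktid   = "ED-2018-OCR-0064",
      dct     = "PS",
      rpp     = per_page,
      po      = offset
    )
  )
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

first   <- fetch_page(0)
total   <- first$totalNumRecords        # assumed field name for the count
offsets <- seq(0, total - 1, by = 1000)

pages <- lapply(offsets, function(po) {
  Sys.sleep(1)                          # be polite; api.data.gov rate-limits keys
  fetch_page(po)$documents
})

comments <- do.call(rbind, pages)
write.csv(comments, "title_ix_comments.csv", row.names = FALSE)
```

That CSV is the tabulation you describe as the end goal, with no scraping involved.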
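And to the narrower question -- yes, scraping can be done in R (the rvest package plays roughly the role Beautiful Soup plays in Python), though it is the wrong tool here. regulations.gov builds its pages with JavaScript, so the raw HTML you download may not contain the comment text at all, which is the point the Python people were making. A sketch, with a hypothetical comment URL and a placeholder selector -- class names like GIY1LSJIXD are machine-generated by the site's framework and change between builds, so they should never be hard-coded:

```r
# Sketch: scraping one comment page with rvest, IF the text were in the
# raw HTML. The URL is hypothetical and the CSS selector is a placeholder
# -- inspect the page source to find a stable one.
library(rvest)

url  <- "https://www.regulations.gov/document?D=ED-2018-OCR-0064-0001"

page <- read_html(url)
comment_text <- page %>%
  html_nodes("div.GIY1LSJIXD") %>%   # placeholder selector, not stable
  html_text(trim = TRUE)
```

If `comment_text` comes back empty, the content is injected by JavaScript and plain HTML parsing cannot see it -- another reason to use the API instead.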
Bert Gunter
2019-Feb-19 23:16 UTC
[R] help getting a research project started on regulations.gov
Please search yourself first! Searching "scrape JSON from web" at the rseek.org site produced what appeared to be several relevant hits, especially this CRAN task view: https://cran.r-project.org/web/views/WebTechnologies.html

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)