Each day the daily balance at the following link

http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf

is updated.

I would like to set up an R procedure, to be run daily on a server, that reads the figures in just a couple of lines ("Industriale" and "Termoelettrico", towards the end of the balance) and puts the data in a table.

Is that possible? If yes, what R packages should I use?

Ciao
Vittorio
Vittorio,

this isn't really an R problem; you need a tool to extract text from a PDF document. I've tried pdftotext from the xpdf bundle, and it worked fine for the file you linked. On my Ubuntu Linux it is in the xpdf-utils package; if you use Windows, search for xpdf to find out whether it is available there.

If you want to call it from R you can use the 'system' function.

There may be other, better methods I'm unaware of, of course.

Best,
Gabor

On Wed, May 09, 2007 at 03:47:59PM +0100, Vittorio wrote:
> Each day the daily balance in the following link
>
> http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf
>
> is updated.
>
> I would like to set up an R procedure to be run daily in a
> server able to read the figures in a couple of lines only
> ("Industriale" and "Termoelettrico", towards the end of the balance)
> and put the data in a table.
>
> Is that possible? If yes, what R-packages should I use?
>
> Ciao
> Vittorio
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Csardi Gabor <csardi at rmki.kfki.hu>
MTA RMKI, ELTE TTK
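For what it's worth, calling pdftotext from R via 'system' might look roughly like the sketch below. It assumes the pdftotext binary is installed and on the PATH; pdftotext_command() and pdf_to_lines() are made-up helper names, not part of any package.

```r
# Sketch only: assumes the pdftotext binary (xpdf or poppler) is on the PATH.
# pdftotext_command() and pdf_to_lines() are hypothetical helper names.

# Build the shell command: pdftotext [options] PDF-file text-file
pdftotext_command <- function(pdf_path, txt_path) {
  paste("pdftotext", "-layout", shQuote(pdf_path), shQuote(txt_path))
}

# Convert a local PDF to text and return the text as a character vector
pdf_to_lines <- function(pdf_path) {
  txt_path <- tempfile(fileext = ".txt")
  on.exit(unlink(txt_path))  # remove the temporary text file afterwards
  status <- system(pdftotext_command(pdf_path, txt_path))
  if (status != 0) stop("pdftotext failed with exit status ", status)
  readLines(txt_path, warn = FALSE)
}

# Possible usage (not run here):
# pdf_file <- tempfile(fileext = ".pdf")
# download.file("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf",
#               pdf_file, mode = "wb")
# grep("Industriale|Termoelettrico", pdf_to_lines(pdf_file), value = TRUE)
```

The -layout option asks pdftotext to keep the physical layout, which should keep each label and its figure on the same output line; whether that holds for this particular file would need checking.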
You can do it with the base toolkit. Just read the PDF file in as text and then extract the data:

> # read in PDF file as text
> x.in <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")
> # find Industriale
> Ind <- grep("Industriale", x.in, value=TRUE)
> # find Termoelettrico
> Ter <- grep("Termoelettrico", x.in, value=TRUE)
> # extract the data
> Ind.data <- sub(".*\\(([\\s0-9,]*)\\).*", "\\1", Ind, perl=TRUE)
> Ter.data <- sub(".*\\(([\\s0-9,]*)\\).*", "\\1", Ter, perl=TRUE)
> Ind.data
[1] " 46,6"
> Ter.data
[1] " 99,3"

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
Vittorio,

Keep in mind that PDF files are often largely plain text, as this one is. Thus you can read it in using readLines():

PDFFile <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")

> str(PDFFile)
 chr [1:989] "%PDF-1.2" "6 0 obj" "<<" "/Length 7 0 R" ...

# Now find the lines containing the values you wish
# Use grep() with a regex for either term
Lines <- grep("(Industriale|Termoelettrico)", PDFFile)

> Lines
[1] 33 34

> PDFFile[Lines]
[1] "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj"
[2] "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"

# Now parse the values out of the lines
Vals <- sub(".*\\((.*)\\).*", "\\1", PDFFile[Lines])

> Vals
[1] " 46,6" " 99,3"

# Now convert them to numeric
# need to change the ',' to a '.' at least in my locale
> as.numeric(gsub(",", "\\.", Vals))
[1] 46.6 99.3

HTH,

Marc Schwartz
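Since the original question asks for a table filled in by a daily run, here is a rough sketch of how the steps above could be wrapped up. It reuses the same sub() pattern; balance_row() and append_balance() are hypothetical helper names, and the CSV file name in the usage comment is an assumption, not something from this thread.

```r
# Sketch only: balance_row() and append_balance() are hypothetical helpers.

# Turn the raw PDF lines into a one-row data frame for a given day
balance_row <- function(pdf_lines, day = Sys.Date()) {
  ind <- grep("Industriale", pdf_lines, value = TRUE)[1]
  ter <- grep("Termoelettrico", pdf_lines, value = TRUE)[1]
  # Keep the text inside the last pair of parentheses on the line,
  # then swap the decimal comma for a point before converting
  to_number <- function(line) {
    as.numeric(sub(",", ".", sub(".*\\((.*)\\).*", "\\1", line), fixed = TRUE))
  }
  data.frame(date = as.Date(day),
             Industriale = to_number(ind),
             Termoelettrico = to_number(ter))
}

# Append the row to a CSV file, writing the header only when the file is new
append_balance <- function(row, csv_file) {
  new_file <- !file.exists(csv_file)
  write.table(row, csv_file, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file)
}

# Possible daily usage (not run here; the file name is an assumption):
# PDFFile <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")
# append_balance(balance_row(PDFFile), "bilancio_history.csv")
```

Keeping one row per day in a plain CSV file means the history can be read back with read.csv() at any time; a database would work just as well.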
Modify this to suit. After grepping out the correct lines we use strapply to find and emit character sequences that come after a "(" but do not contain a ")". backref = -1 says to only emit the backreferences and not the entire matched expression (which would have included the leading "("):

URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
Lines.raw <- readLines(URL)
Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)
library(gsubfn)
strapply(Lines, "[(]([^)]*)", backref = -1, simplify = rbind)

which gives a character matrix whose first column is the label and second column is the number in character form. You can then manipulate it as desired.
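If gsubfn is not installed, a similar matrix can probably be built with base R alone. The sketch below is an alternative to strapply, not the same method: it uses regmatches() and gregexpr(), and paren_contents() is a made-up helper name. The sample lines are the two quoted elsewhere in this thread.

```r
# Sketch only: a base-R alternative to strapply(); paren_contents() is a
# hypothetical helper name.
paren_contents <- function(lines) {
  # Find every "(...)" group on each line
  groups <- regmatches(lines, gregexpr("[(][^)]*[)]", lines))
  # Drop the surrounding parentheses; one matrix row per input line
  # (assumes every line contains the same number of groups)
  do.call(rbind, lapply(groups, function(g) substr(g, 2, nchar(g) - 1)))
}

# The two relevant lines as quoted in this thread
Lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
)

m <- paren_contents(Lines)

# Second column as numbers, after swapping the decimal comma for a point
values <- as.numeric(sub(",", ".", m[, 2], fixed = TRUE))
```

The labels in the first column keep their trailing blank, so trimws(m[, 1]) may be wanted before using them as names.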
Here is one additional solution. This one produces a data frame. The regular expression removes:

- everything from the beginning to the first (
- everything from the last ) to the end
- everything between ) and ( in the middle

The | characters separate the three parts. Then read.table reads it in.

URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
Lines.raw <- readLines(URL)
Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)
rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"
read.table(textConnection(gsub(rx, "", Lines)), dec = ",")
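To see what the three alternatives in rx do without fetching the PDF, the same steps can be tried on the two lines quoted earlier in this thread. This is only a sketch: it uses read.table's text argument instead of textConnection, and the column names "label" and "value" are my own choice.

```r
# Sketch only: runs the regular expression on the two lines quoted in this
# thread instead of downloading the PDF.
Lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
)

rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"

# After the three kinds of match are removed, only the label and the
# figure are left on each line, separated by whitespace
stripped <- gsub(rx, "", Lines)

# dec = "," makes read.table treat the comma as the decimal mark;
# the column names are an assumption for illustration
balance <- read.table(text = stripped, dec = ",",
                      col.names = c("label", "value"),
                      stringsAsFactors = FALSE)
```

Printing stripped before calling read.table is a quick way to check that the pattern still fits if the layout of the PDF ever changes.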