Each day the daily balance at the following link is updated:

http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf

I would like to set up an R procedure, to be run daily on a
server, that reads the figures in just a couple of lines
("Industriale" and "Termoelettrico", towards the end of the
balance) and puts the data in a table.

Is that possible? If yes, what R packages should I use?

Ciao
Vittorio

On Wed, May 09, 2007 at 03:47:59PM +0100, Vittorio wrote:
> Is that possible? If yes, what R-packages
> should I use?

Vittorio,

this isn't really an R problem: you need a tool to extract text from a
PDF document. I've tried pdftotext from the xpdf bundle, and it worked
fine for the file you linked. On my Ubuntu Linux it is in the xpdf-utils
package; if you use Windows, search for xpdf to find out whether it is
available there. If you want to call it from R, you can use the 'system'
function.

There may be other, better methods I'm unaware of, of course.

Best,
Gabor

--
Csardi Gabor <csardi at rmki.kfki.hu> MTA RMKI, ELTE TTK
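
A minimal sketch of the pdftotext route, assuming the pdftotext binary is
on the PATH and that its text output puts each label and its figure on one
line (the exact layout was not shown in the thread, so that is a guess).
The helper names pdf_to_text() and extract_figure() are made up for
illustration, and system2() is used here as a close relative of the
'system' function mentioned above.

```r
# Run the external 'pdftotext' tool and return its output as lines of text.
# "-layout" keeps columns aligned; "-" sends the text to stdout.
pdf_to_text <- function(pdf_path) {
  system2("pdftotext", c("-layout", shQuote(pdf_path), "-"), stdout = TRUE)
}

# Return the last number (decimal comma, no thousands separator assumed)
# on the first line that contains 'label', or NA if no line matches.
extract_figure <- function(txt, label) {
  line <- grep(label, txt, value = TRUE, fixed = TRUE)[1]
  num <- sub(".*?(-?[0-9]+(,[0-9]+)?)\\s*$", "\\1", line, perl = TRUE)
  as.numeric(sub(",", ".", num, fixed = TRUE))
}

# Not run here: needs network access and the pdftotext binary.
# URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
# download.file(URL, "bilancio.pdf", mode = "wb")
# txt <- pdf_to_text("bilancio.pdf")
# extract_figure(txt, "Industriale")
```

Going through an external converter costs an extra dependency on the
server, but it should keep working if the PDF is ever written with
compressed content streams, where reading the raw file as text would not.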

You can do it with the base toolkit. Just read the PDF file in as text
and then extract the data:

# read in PDF file as text
x.in <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")
# find Industriale
Ind <- grep("Industriale", x.in, value=TRUE)
# find Termoelettrico
Ter <- grep("Termoelettrico", x.in, value=TRUE)
# extract the data
Ind.data <- sub(".*\\(([\\s0-9,]*)\\).*", "\\1", Ind, perl=TRUE)
Ter.data <- sub(".*\\(([\\s0-9,]*)\\).*", "\\1", Ter, perl=TRUE)
Ind.data
# [1] " 46,6"
Ter.data
# [1] " 99,3"

--
Jim Holtman
Cincinnati, OH

What is the problem you are trying to solve?

On Wed, 2007-05-09 at 15:47 +0100, Vittorio wrote:
> I would like to set up an R procedure to be run daily in a
> server able to read the figures in a couple of lines only
> ("Industriale" and "Termoelettrico", towards the end of the balance)
> and put the data in a table.

Vittorio,

Keep in mind that PDF files are often plain text, at least when their
content streams are not compressed, as is the case here. Thus you can
read this one in using readLines():

PDFFile <- readLines("http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf")

str(PDFFile)
#  chr [1:989] "%PDF-1.2" "6 0 obj" "<<" "/Length 7 0 R" ...

# Now find the lines containing the values you wish
# Use grep() with a regex for either term
Lines <- grep("(Industriale|Termoelettrico)", PDFFile)

Lines
# [1] 33 34

PDFFile[Lines]
# [1] "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj"
# [2] "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"

# Now parse the values out of the lines
Vals <- sub(".*\\((.*)\\).*", "\\1", PDFFile[Lines])

Vals
# [1] " 46,6" " 99,3"

# Now convert them to numeric
# need to change the ',' to a '.' at least in my locale
as.numeric(gsub(",", "\\.", Vals))
# [1] 46.6 99.3

HTH,

Marc Schwartz
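
To get from those two values to the table the original question asks for,
something along these lines might do. It is only a sketch: parse_balance()
and append_balance() are hypothetical helper names, the CSV layout is an
assumption, and the two input lines are copied from the output printed
above so that the example runs without network access.

```r
# The two matching lines, as printed in the message above
pdf_lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
)

# Turn the matching lines into a small data frame stamped with a date.
parse_balance <- function(lines, day = Sys.Date()) {
  hits <- grep("(Industriale|Termoelettrico)", lines, value = TRUE)
  if (length(hits) != 2L) stop("expected exactly two matching lines")
  # label: contents of the first (...) group on each line
  label <- trimws(sub("^[^(]*\\(([^)]*)\\).*", "\\1", hits))
  # value: contents of the last (...) group, decimal comma changed to a point
  raw <- sub(".*\\((.*)\\).*", "\\1", hits)
  value <- as.numeric(sub(",", ".", raw, fixed = TRUE))
  data.frame(date = day, sector = label, value = value,
             stringsAsFactors = FALSE)
}

# Append the rows to a CSV file, writing the header only on first use.
append_balance <- function(df, file) {
  new_file <- !file.exists(file)
  write.table(df, file, sep = ",", row.names = FALSE,
              col.names = new_file, append = !new_file)
}

# The date is the day of this thread, used purely as an illustration
today <- parse_balance(pdf_lines, day = as.Date("2007-05-09"))
```

On the server, a script that reads the URL, calls parse_balance() on the
result and then append_balance(today, "balance.csv") could be run once a
day, for instance with Rscript from the system scheduler. Stopping when
the number of matching lines is not two makes a change in the PDF layout
show up as an error instead of a silently wrong row.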

Modify this to suit. After grepping out the correct lines we use strapply
to find and emit character sequences that come after a "(" but do not
contain a ")". back = -1 says to only emit the backreferences and not the
entire matched expression (which would have included the leading "("):

URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
Lines.raw <- readLines(URL)
Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)
library(gsubfn)
strapply(Lines, "[(]([^)]*)", back = -1, simplify = rbind)

which gives a character matrix whose first column is the label
and second column is the number in character form. You can
then manipulate it as desired.
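
For readers who do not have gsubfn installed, roughly the same matrix can
be built with base R. This is a sketch of an alternative, not the strapply
call itself: it uses a PCRE lookbehind in place of the backreference, and
the two input lines are the ones quoted earlier in the thread.

```r
# The two lines that grep() returned for this PDF
Lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
)

# Match every run of non-")" characters that directly follows a "(".
# The lookbehind (?<=[(]) keeps the "(" itself out of the match.
pieces <- regmatches(Lines, gregexpr("(?<=[(])[^)]*", Lines, perl = TRUE))

# One row per line: column 1 is the label, column 2 the number as text
res <- do.call(rbind, pieces)

# Tidy up: strip padding and convert the decimal comma
labels <- trimws(res[, 1])
values <- as.numeric(sub(",", ".", res[, 2], fixed = TRUE))
```

The rbind step relies on every line containing the same number of
parenthesised groups, which holds for the two lines above but would need
checking if more rows of the balance were extracted.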

Here is one additional solution. This one produces a data frame. The
regular expression removes:
- everything from the beginning to the first (
- everything from the last ) to the end
- everything between ) and ( in the middle
The | characters separate the three parts. Then read.table reads it in.

URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
Lines.raw <- readLines(URL)
Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)
rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"
read.table(textConnection(gsub(rx, "", Lines)), dec = ",")
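
To see what the three alternatives leave behind, the same expression can
be applied to the two lines quoted earlier in the thread. The sketch below
differs from the code above in two small ways that are my own choices:
it passes the text through the text= argument of read.table instead of a
textConnection, and it names the columns.

```r
# The two lines that grep() returned for this PDF
Lines <- c(
  "/F3 1 Tf 9 0 0 9 204 304 Tm (Industriale )Tj 9 0 0 9 420 304 Tm ( 46,6)Tj",
  "9 0 0 9 204 283 Tm (Termoelettrico )Tj 9 0 0 9 420 283 Tm ( 99,3)Tj"
)

# start..first "("   |   last ")"..end   |   ")" .. next "("
rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"

# Only the label and the number, separated by spaces, remain on each line
cleaned <- gsub(rx, "", Lines)

# dec = "," makes read.table parse "46,6" as the number 46.6
DF <- read.table(text = cleaned, dec = ",",
                 col.names = c("sector", "value"),
                 stringsAsFactors = FALSE)
```

Because read.table splits on whitespace, this works as long as the labels
themselves contain no spaces, which is true for "Industriale" and
"Termoelettrico".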