"Biedermann, Jürgen"
2009-Oct-15 08:43 UTC
[R] "Complex?" import of pdf files (criminal records) into R table
Hi there,

I'm trying to decide whether several more or less complex PDF files can be transformed into an R table format, or whether this has to be done manually. It would be impudent to expect a complete solution, but I would be grateful for advice on how such an R program could be structured, and on whether it is possible in general.

Here is the problem: Each PDF file belongs to one person and represents that person's anonymous criminal record. Each entry should lead to one row with the person number as key. The different lines should form the columns. The criminal record looks like this:

---------------------------------------------------
Header with irrelevant text for us | Date: xx.xx.xxxx (relevant for us)

Anonymous person number: xxxxxxxxxxx

Entries in the register

1. xx.xx.1902 -City-
Be in force since: xx.xx.1902
Date of offense:xx.xx.xxxx
Elements of the offence: For example "Rape"
Section in law: §176, §178 Abs. 1
Sentenced to 5 years imprisonment
"Irrelevant text for us"
Accommodation in an forensic psychiatry
Accommodation sentenced on probation
Rest of sentence sentenced on probation until the xx.xx.xxxx

2. xx.xx.1910
Be in force since: ....
.....
---------------------------------------------------

The problem is that the entries do not always have the same structure. The first 6 lines are structurally the same in each entry (each entry has a line for the judgement date, the "be in force" date, the date of offence, the elements of the offence, the sections in law, and the sentence).

After that, depending on the sentence, different lines appear. They state whether the person was sentenced on probation, whether the probation was withdrawn again, when the person was released, etc. So I think these lines should be allocated to different columns depending on key words.

Defining the key words would not be a problem for most cases. If a certain column is not relevant in an entry (that is, the key word did not appear), NA should be put in its place. But because the entries sometimes (in rare cases) contain spelling errors, all lines of an entry that could not be allocated to a column should end up in one extra column so they can be checked manually.

In the end the table should look more or less like this:

--------------------------------------------------
"Per.Numb";"EntryNumber";"Judg.Date";"DateOffen.";...;"Probation.until";"Released";"Not allocated"
xxxx1  1  xx.xx.1902  xx.xx.1901  ...  xx.xx.1905  NA    "blablabla"
xxxx1  2  xx.xx.1910  xx.xx.1909  ...  NA          1925  "blablabla"
xxxx2  1  xx.xx.1924  xx.xx.1923  ...  NA          NA    "blablabla"
--------------------------------------------------

Could anyone help me?
Thanks

Greetings
Jürgen
Marc Schwartz
2009-Oct-15 14:28 UTC
[R] "Complex?" import of pdf files (criminal records) into R table
You don't indicate the OS you are on, but you will want to get hold of 'pdftotext', a command line application that can extract the textual content from PDF files. On most Linux systems it is already installed, but for Windows and OSX you will likely need to Google for it.

The basic approach is to loop over each PDF file, use pdftotext to get the text content and dump it into a regular text file. That file can then be read into R using ?readLines.

This can all be done within R using the ?system command. Get the names of the PDF files in a given folder by using ?list.files with a "\\.pdf" or "\\.PDF" search pattern. Then ?paste together the full command using a prefix along the lines of "pdftotext -layout -nopgbrk", presuming that the pdftotext command is in your $PATH. The suffix to be paste()d will be the name of the input PDF file and the name of the output text file.
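The steps described so far could be sketched roughly as follows. This is only an illustration, not tested against real records: the function names `build_pdftotext_cmds` and `convert_pdfs` are made up for this example, `shQuote()` is added to protect file names containing spaces, and it assumes pdftotext is on the search path and that the `-layout -nopgbrk` options suit your files.

```r
# Build one pdftotext command per PDF file (pure string handling, nothing is run).
# The function name is hypothetical; the options are the ones suggested above.
build_pdftotext_cmds <- function(pdf_files) {
  # Output file: same name with the extension swapped to .txt
  txt_files <- sub("\\.pdf$", ".txt", pdf_files, ignore.case = TRUE)
  cmds <- paste("pdftotext -layout -nopgbrk",
                shQuote(pdf_files), shQuote(txt_files))
  data.frame(pdf = pdf_files, txt = txt_files, cmd = cmds,
             stringsAsFactors = FALSE)
}

# Convert every PDF in a folder and return a named list with one
# character vector of text lines per file.
convert_pdfs <- function(pdf_dir) {
  pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$",
                          ignore.case = TRUE, full.names = TRUE)
  if (length(pdf_files) == 0) return(list())

  jobs <- build_pdftotext_cmds(pdf_files)
  records <- lapply(seq_len(nrow(jobs)), function(i) {
    status <- system(jobs$cmd[i])   # exit status, 0 means success
    if (status != 0) {
      warning("pdftotext failed for ", jobs$pdf[i])
      return(NULL)
    }
    readLines(jobs$txt[i], warn = FALSE)
  })
  names(records) <- basename(jobs$pdf)
  records
}
```

A call such as `records <- convert_pdfs("records")` (the folder name is just a placeholder) would then give one element per person, ready for line-by-line processing.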
So you end up with a command line character vector along the lines of:

  "pdftotext -layout -nopgbrk xxxxx.pdf xxxxx.txt"

where the x's are the specific file basenames. Review the pdftotext options to understand what is being done and whether you need to modify them for your particular files.

Once you have the data in R for each file, you will then need to process the content line by line, looking for the keywords that are associated with the content you require. Using ?grep is perhaps the easiest way to accomplish that. You can then use ?gsub to replace/strip the keywords, leaving you with the data only, for each line.

For multi-line scenarios, you will need to keep track of where the keyword for the first line is and then look for the subsequent keyword, or perhaps a blank line, to know when to stop aggregating the data for that initial keyword.

It then becomes a matter of reorganizing the content that you need into the format you require for subsequent work.

I have not looked for 'text processing' related packages on CRAN, so you may wish to look there first in case there is anything relevant.

HTH,

Marc Schwartz
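To make the keyword idea a little more concrete, here is a minimal sketch of the grep/sub step for one record. Everything specific in it is an assumption: the keyword patterns are taken from the translated example in the question (the real records will need their own), `parse_record` and the column names are made up for illustration, entries are assumed to start with a line like "1. 03.05.1902", and each keyword is assumed to sit on a single line, so the multi-line case mentioned above is not handled.

```r
# Keyword patterns (assumed, based on the translated example record).
keywords <- c(
  In.Force        = "^Be in force since:\\s*",
  DateOffen       = "^Date of offen[cs]e:\\s*",
  Offence         = "^Elements of the offence:\\s*",
  Section         = "^Section in law:\\s*",
  Sentence        = "^Sentenced to\\s*",
  Probation.until = "^Rest of sentence sentenced on probation until the\\s*"
)

# Turn the text lines of one record into a data frame, one row per entry.
parse_record <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]            # drop blank lines

  pn_pat <- "^Anonymous person number:\\s*"
  person <- sub(pn_pat, "", grep(pn_pat, lines, value = TRUE)[1])

  # Entry header: running number, then the judgement date.
  entry_pat <- "^([0-9]+)\\.\\s+([0-9]{2}\\.[0-9]{2}\\.[0-9]{4}).*$"
  starts <- grep(entry_pat, lines)
  if (length(starts) == 0) return(NULL)
  ends <- c(starts[-1] - 1, length(lines))

  rows <- lapply(seq_along(starts), function(i) {
    head_line <- lines[starts[i]]
    body <- if (ends[i] > starts[i]) lines[(starts[i] + 1):ends[i]] else character(0)
    row <- list(
      Per.Numb    = person,
      EntryNumber = as.integer(sub(entry_pat, "\\1", head_line)),
      Judg.Date   = sub(entry_pat, "\\2", head_line)
    )
    used <- logical(length(body))          # which body lines found a column
    for (col in names(keywords)) {
      hit <- grep(keywords[[col]], body)
      if (length(hit) > 0) {
        row[[col]] <- sub(keywords[[col]], "", body[hit[1]])
        used[hit[1]] <- TRUE
      } else {
        row[[col]] <- NA_character_        # keyword did not appear
      }
    }
    # Lines that matched no keyword are kept for manual checking.
    row$Not.allocated <- if (any(!used)) {
      paste(body[!used], collapse = " | ")
    } else {
      NA_character_
    }
    as.data.frame(row, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}
```

Applying this to every file and combining the results with `do.call(rbind, lapply(records, parse_record))` would give one table across all persons, where `records` stands for a list holding the text lines of each file.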