thr3ads.net - R help - [R] Help request: Parsing docx files for key words and appending to a spreadsheet [Dec 2023]

Roy Mendelssohn - NOAA Federal

2023-Dec-29 18:25 UTC

[R] Help request: Parsing docx files for key words and appending to a spreadsheet

Hi Andy:

I don't have an answer but I do have what I hope is some friendly advice.
Generally the more information you can provide, the more likely you will get
help that is useful. In your case you say that you tried several packages and
they didn't do what you wanted. Providing that code, as well as why they
didn't do what you wanted (be specific) would greatly facilitate things.

Happy new year,

-Roy

> On Dec 29, 2023, at 10:14 AM, Andy <phaedrusv at gmail.com> wrote:
> 
> Hello
> 
> I am trying to work through a problem, but feel like I've gone down a
> rabbit hole. I'd very much appreciate any help.
>
> The task: I have several directories of multiple (some directories, up to
> 2,500+) *.docx files (newspaper articles downloaded from Lexis+) that I want
> to iterate through to append to a spreadsheet only those articles that
> satisfy a condition (i.e., a specific keyword is present for >= 50% coverage
> of the subject matter). Lexis+ has a very specific structure and keywords are
> given in the row "Subject".
>
> I'd like to be able to accomplish the following:
>
> (1) Append the title, the month, the author, the number of words, and page
> number(s) to a spreadsheet
>
> (2) Read each article and extract keywords (in the docs, these are listed
> in the 'Subject' section as a list of keywords with a percentage showing the
> extent to which the keyword features in the article, e.g., FAST FASHION
> (72%)) and append the keyword and the % coverage to the same row in the
> spreadsheet. However, I want to ensure that the keyword coverage meets the
> threshold of >= 50%; if not, then pass on to the next article in the
> directory. Rinse and repeat for the entire directory.
>
> So far, I've tried working through some Stack Overflow-based solutions,
> but most seem to use the textreadr package, which is now deprecated; others
> use either the officer or the officedown packages. However, these packages
> don't appear to do what I want the program to do, at least not in any of the
> examples I have found, nor in the vignettes and relevant package manuals
> I've looked at.
>
> The first point is, is what I am intending to do even possible using R? If
> it is, then where do I start with this? If these docx files were converted
> to UTF-8 plain text, would that make the task easier?
>
> I am not a confident coder, and am really only just getting my head around
> R so appreciate a steep learning curve ahead, but of course, I don't know
> what I don't know, so any pointers in the right direction would be a big
> help.
>
> Many thanks in anticipation
>
> Andy
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
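
For the keyword step described in the original request (entries such as FAST FASHION (72%), kept only at >= 50% coverage), a minimal base R sketch is given here. It assumes the 'Subject' text has already been read into a single string and that entries are separated by semicolons; the separator, the function name parse_subject, and the sample string are assumptions for illustration, not confirmed Lexis+ output.

```r
# Split a Lexis-style subject string into keywords and coverage percentages,
# keeping only entries at or above the threshold.
# Assumption: entries look like "KEYWORD (NN%)" and are separated by ";".
parse_subject <- function(subject_line, threshold = 50) {
  # Drop a leading "Subject:" label if one is present
  body <- sub("^[[:space:]]*Subject:[[:space:]]*", "", subject_line)
  # Find every "KEYWORD (NN%)" chunk
  m <- regmatches(body, gregexpr("[^;()]+\\([0-9]+%\\)", body))[[1]]
  keyword <- trimws(sub("[[:space:]]*\\([0-9]+%\\)$", "", m))
  coverage <- as.integer(sub(".*\\(([0-9]+)%\\)$", "\\1", m))
  out <- data.frame(keyword = keyword, coverage = coverage,
                    stringsAsFactors = FALSE)
  out[out$coverage >= threshold, , drop = FALSE]
}

# Hypothetical input, made up for illustration
example <- "Subject: FAST FASHION (72%); RETAILERS (45%); FASHION & APPAREL (90%)"
kept <- parse_subject(example)
kept
```

An article with zero rows in the result has no keyword at or above the threshold and can be skipped; otherwise the rows can be appended to the spreadsheet data.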

CALUM POLWART

2023-Dec-29 18:50 UTC

[R] Help request: Parsing docx files for key words and appending to a spreadsheet

textreadr would be the obvious approach.

When you say it is deprecated, do you mean it's not available on CRAN?
Sometimes maintaining a package on CRAN is just a pain in the ass.

devtools::install_github("trinker/textreadr")


Should let you install it.

In theory, docx files are actually just zip files (you can unzip them), and
you may find there is a specific file in the zip that is readable with
one of R's general text file readers.

Alternatively, read_docx from qdapTools may be worth a look:
https://www.rdocumentation.org/packages/qdapTools

What platform are you on? There are certainly options to convert the files
to txt from the command line and work from there.
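
The zip-file route mentioned above can be sketched roughly as follows. This is a minimal sketch, assuming the xml2 package is installed and that the article text sits in word/document.xml inside the archive (the usual location in a .docx file); the function names docx_xml_paragraphs and read_docx_paragraphs are hypothetical.

```r
library(xml2)

# Collect the text of each paragraph (<w:p>) in a parsed document.xml.
# A paragraph's text may be split over several <w:t> runs, so they are
# pasted back together.
docx_xml_paragraphs <- function(xml_doc) {
  paras <- xml_find_all(xml_doc, "//w:p")
  vapply(
    paras,
    function(p) paste(xml_text(xml_find_all(p, ".//w:t")), collapse = ""),
    character(1)
  )
}

# Unzip only word/document.xml from a .docx file and return its paragraphs.
read_docx_paragraphs <- function(path) {
  tmp <- tempfile("docx_")
  dir.create(tmp)
  on.exit(unlink(tmp, recursive = TRUE), add = TRUE)
  unzip(path, files = "word/document.xml", exdir = tmp)
  docx_xml_paragraphs(read_xml(file.path(tmp, "word", "document.xml")))
}
```

Calling read_docx_paragraphs() on one file returns a character vector with one element per paragraph, which can then be searched for the 'Subject' line.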



Andy

2023-Dec-29 20:17 UTC

[R] Help request: Parsing docx files for key words and appending to a spreadsheet

Hi Roy (& others)

Many thanks for the advice - well taken. Thanks also to the others who 
have responded so quickly - I thought I might have to wait days!! :-)

I'm on a Linux (Mint) machine. Below, I document three attempts: two
using officer and the last using textreadr.

My attempts so far using 'officer':

##################

(1) First Attempt:

# Load libraries
library(tcltk)
library(tidyverse)
library(officer)

setwd(tk_choose.dir())

doc_path <- list.files(getwd(), pattern = ".docx", full.names = TRUE)

files <- list.files(getwd(), ".docx")
files
length(files)

## This works up to here: it lists the docx files in the directory 'TEST'
## (9 files). However, the next line
doc_in <- read_docx(files)

results in this error:
Error in filetype %in% c("docx") && grepl("^([fh]ttp)", file) : 'length = 9' in coercion to 'logical(1)'

No idea how to debug that.

Even when trying Calum's suggestion with officer:

content <- officer::docx_summary("Now they want us to charge our electric cars from litter bins.docx") # A title of one of the articles

The error returned is:
Error in x$doc_obj : $ operator is invalid for atomic vectors


##################
(2) Second Attempt:

# Load libraries
library(tcltk)
library(tidyverse)
library(officer)

setwd(tk_choose.dir())

doc_path <- list.files(getwd(), pattern = ".docx", full.names = TRUE)

files <- list.files(getwd(), ".docx")
files
length(files)

docx_summary(doc_path, preserve = FALSE)
## At this point, the error is:
## Error in x$doc_obj : $ operator is invalid for atomic vectors

So, not sure how I am passing an atomic vector or if there is something 
I am supposed to set to make this something else?
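
Both errors in the officer attempts look consistent with passing the wrong kind of object: read_docx() expects a single file path rather than a vector of nine, and officer's docx_summary() expects the object returned by officer::read_docx() rather than a file name (a character string is an atomic vector, hence the `$` message). A minimal sketch of reading the files one at a time is given here, assuming the officer package is installed; the helper name read_folder is hypothetical, and the escaped pattern "\\.docx$" is a stricter variant of the ".docx" used above.

```r
library(officer)

# Read every .docx file in a folder, one file at a time, and return a named
# list of data frames as produced by officer::docx_summary().
read_folder <- function(folder) {
  # "\\.docx$" matches a literal ".docx" at the end of the file name
  paths <- list.files(folder, pattern = "\\.docx$", full.names = TRUE)
  # read_docx() takes ONE path; docx_summary() takes the object it returns
  docs <- lapply(paths, function(p) docx_summary(read_docx(p)))
  names(docs) <- basename(paths)
  docs
}

# Usage (the folder path is a placeholder):
# summaries <- read_folder("path/to/TEST")
# summaries[[1]]$text   # paragraph text of the first article
```

Each data frame has a text column holding the paragraph text, so the 'Subject' line can be located with grep() on that column.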

##################
(3) Third attempt - now trying with textreadr (Thanks for the help on 
installing this, Calum):

# Load libraries
library(tcltk)
library(tidyverse)
library(textreadr)

folder <- setwd(tk_choose.dir())

files <- list.files(folder, ".docx")
files
length(files)

doc <- read_docx("Now they want us to charge our electric cars from litter bins.docx") # One of the 9 files in the folder

read_docx(doc, skip = 0, remove.empty = TRUE, trim = TRUE) # To test against one file

## The last line returns the following error:
## Error in filetype %in% c("docx") && grepl("^([fh]ttp)", file) : 'length = 38' in coercion to 'logical(1)'
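
In the third attempt, the first read_docx() call appears to have succeeded: the 'length = 38' in the error suggests that doc already holds the article as a character vector of 38 lines, and the error comes from passing that text back to read_docx() as if it were a file name. Working directly on such a vector, a minimal base R sketch for pulling out labelled fields might look like this. The function name extract_field, the sample lines, and the field labels ("Byline", "Length") are assumptions for illustration, not confirmed Lexis+ output.

```r
# Return the text that follows "Label:" on the first line starting with that
# label, or NA if no line starts with it. Matching ignores case.
extract_field <- function(lines, label) {
  pattern <- paste0("^[[:space:]]*", label, ":[[:space:]]*")
  hit <- grep(pattern, lines, ignore.case = TRUE)
  if (length(hit) == 0) return(NA_character_)
  trimws(sub(pattern, "", lines[hit[1]], ignore.case = TRUE))
}

# Hypothetical article text, standing in for the vector held in `doc`
sample_lines <- c(
  "Now they want us to charge our electric cars from litter bins",
  "Byline: A. Reporter",
  "Length: 512 words",
  "Subject: FAST FASHION (72%)"
)

extract_field(sample_lines, "Length")   # "512 words"
extract_field(sample_lines, "Section")  # NA, no such line in the sample
```

The same call with label "Subject" returns the keyword string, which can then be split into keywords and percentages.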

##################
And so I am going around in circles and not at all clear on how I can 
make progress.

I am sure that there must be a way, but the suggestions on-line each 
lead to the above errors.

Thanks for any further help.

Best wishes, and thanks
Andy


