thr3ads.net - R help - [R] Help request: Parsing docx files for key words and appending to a spreadsheet [Dec 2023]

CALUM POLWART

2023-Dec-29 19:01 UTC

[R] Help request: Parsing docx files for key words and appending to a spreadsheet

It sounded like he looked at officer but I would agree

content <- officer::docx_summary(officer::read_docx("filename.docx"))

Would get the text content into an object called content.

That object is a data.frame so you can then manipulate it.  To be more
specific, we might need an example of the DF
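
As a rough illustration of what "manipulating it" could look like for the keyword question: the sketch below pulls "KEYWORD (NN%)" pairs out of the text column and keeps those at or above a threshold. No sample document was posted, so the layout is an assumption: it presumes the keywords sit in a paragraph that starts with "Subject" and are separated by semicolons. parse_keywords() and keywords_from_summary() are made-up helper names, not part of officer.

```r
# Hypothetical helper: split a string such as
# "Subject: FAST FASHION (72%); TEXTILES (12%)" into keyword/coverage pairs.
parse_keywords <- function(subject_text) {
  # Drop a leading "Subject:" label if present (assumed layout).
  subject_text <- sub("^[[:space:]]*Subject:?[[:space:]]*", "", subject_text,
                      ignore.case = TRUE)
  # A keyword is any run without ';' or parentheses, followed by "(NN%)".
  pattern <- "[^;()]+\\([0-9]+%\\)"
  hits <- regmatches(subject_text, gregexpr(pattern, subject_text))[[1]]
  if (length(hits) == 0) {
    return(data.frame(keyword = character(0), coverage = integer(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(
    keyword  = trimws(sub("\\([0-9]+%\\)$", "", hits)),
    coverage = as.integer(sub("^.*\\(([0-9]+)%\\)$", "\\1", hits)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: `content` is the data.frame from docx_summary().
# Assumes the label and the keyword list share one paragraph.
keywords_from_summary <- function(content, threshold = 50) {
  subject <- content$text[grepl("^[[:space:]]*Subject", content$text)]
  kw <- parse_keywords(paste(subject, collapse = "; "))
  kw[kw$coverage >= threshold, , drop = FALSE]
}
```

If the real files put the "Subject" label and the keyword list in separate paragraphs or in a table cell, the grepl() line would need adjusting once an example of the DF is available.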

You can loop this easily with a for statement, although there are people who
prefer a non-for approach to iteration in R. A for loop can be slow. But if you
don't need to do this very quickly, I'd stick with for if you are used to
programming.
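
To make the two styles concrete, here is a minimal sketch of both, assuming all the files live in one directory. collect_articles(), collect_articles_lapply() and extract_row are hypothetical names: extract_row stands for whatever per-file logic you end up with, and should return a one-row data.frame, or NULL to skip the file.

```r
# Plain for loop over every .docx file in `dir`.
collect_articles <- function(dir, extract_row) {
  files <- list.files(dir, pattern = "\\.docx$", full.names = TRUE,
                      ignore.case = TRUE)
  # Pre-allocate a list; growing a data.frame row by row inside the loop
  # is often what makes for loops feel slow in R.
  rows <- vector("list", length(files))
  for (i in seq_along(files)) {
    # rows[i] <- list(...) keeps a NULL result in place instead of
    # deleting the list element.
    rows[i] <- list(extract_row(files[i]))
  }
  do.call(rbind, rows)  # rbind() ignores the NULL entries
}

# The same thing without an explicit for loop.
collect_articles_lapply <- function(dir, extract_row) {
  files <- list.files(dir, pattern = "\\.docx$", full.names = TRUE,
                      ignore.case = TRUE)
  do.call(rbind, lapply(files, extract_row))
}

# Example per-file function (placeholder logic only): file name and
# number of rows officer found in the document.
example_extract <- function(path) {
  content <- officer::docx_summary(officer::read_docx(path))
  data.frame(file = basename(path), n_rows = nrow(content),
             stringsAsFactors = FALSE)
}

# Usage, with a made-up directory name:
# result <- collect_articles("path/to/articles", example_extract)
# write.csv(result, "articles.csv", row.names = FALSE)
```

For a few thousand files, reading the documents will probably dominate the run time either way, so the choice between the two is mostly a matter of taste.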

On Fri, 29 Dec 2023, 18:35 jim holtman, <jholtman at gmail.com> wrote:
> checkout the 'officer' package
>
> Thanks
>
> Jim Holtman
> *Data Munger Guru*
>
>
> *What is the problem that you are trying to solve? Tell me what you want to
> do, not how you want to do it.*
>
>
> On Fri, Dec 29, 2023 at 10:14 AM Andy <phaedrusv at gmail.com> wrote:
>
> > Hello
> >
> > I am trying to work through a problem, but feel like I've gone down a
> > rabbit hole. I'd very much appreciate any help.
> >
> > The task: I have several directories of multiple (some directories, up
> > to 2,500+) *.docx files (newspaper articles downloaded from Lexis+) that
> > I want to iterate through to append to a spreadsheet only those articles
> > that satisfy a condition (i.e., a specific keyword is present for >= 50%
> > coverage of the subject matter). Lexis+ has a very specific structure
> > and keywords are given in the row "Subject".
> >
> > I'd like to be able to accomplish the following:
> >
> > (1) Append the title, the month, the author, the number of words, and
> > page number(s) to a spreadsheet
> >
> > (2) Read each article and extract keywords (in the docs, these are
> > listed in 'Subject' section as a list of keywords with a percentage
> > showing the extent to which the keyword features in the article (e.g.,
> > FAST FASHION (72%)) and to append the keyword and the % coverage to the
> > same row in the spreadsheet. However, I want to ensure that the keyword
> > coverage meets the threshold of >= 50%; if not, then pass onto the next
> > article in the directory. Rinse and repeat for the entire directory.
> >
> > So far, I've tried working through some Stack Overflow-based solutions,
> > but most seem to use the textreadr package, which is now deprecated;
> > others use either the officer or the officedown packages. However, these
> > packages don't appear to do what I want the program to do, at least not
> > in any of the examples I have found, nor in the vignettes and relevant
> > package manuals I've looked at.
> >
> > The first point is, is what I am intending to do even possible using R?
> > If it is, then where do I start with this? If these docx files were
> > converted to UTF-8 plain text, would that make the task easier?
> >
> > I am not a confident coder, and am really only just getting my head
> > around R so appreciate a steep learning curve ahead, but of course, I
> > don't know what I don't know, so any pointers in the right direction
> > would be a big help.
> >
> > Many thanks in anticipation
> >
> > Andy
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

Dr Eberhard W Lisse

2023-Dec-29 20:25 UTC

I would also look at https://pandoc.org perhaps which can
export a number of formats...
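
If you want to try the pandoc route without leaving R, something like the sketch below might do. It assumes the pandoc executable is installed and on the PATH; pandoc_args() and docx_to_text() are made-up helper names. The flags used (-f, -t, -o) are pandoc's standard input-format, output-format and output-file options.

```r
# Hypothetical helper: argument vector for a docx -> plain text conversion.
pandoc_args <- function(infile, outfile) {
  c("-f", "docx", "-t", "plain", "-o", shQuote(outfile), shQuote(infile))
}

# Hypothetical helper: convert one file and return the output path invisibly.
# By default the output sits next to the input with a .txt extension.
docx_to_text <- function(infile,
                         outfile = sub("\\.docx$", ".txt", infile,
                                       ignore.case = TRUE)) {
  status <- system2("pandoc", pandoc_args(infile, outfile))
  if (status != 0) {
    stop("pandoc failed for ", infile)
  }
  invisible(outfile)
}

# Usage, with a made-up file name:
# txt <- docx_to_text("article.docx")
# lines <- readLines(txt, encoding = "UTF-8")
```

Whether plain text is easier to work with than the officer data.frame depends on how the "Subject" section comes out after conversion, which is hard to say without a sample document.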

And for spreadsheets https://github.com/jqnatividad/qsv is my
go-to weapon.  Can also read and write XLSX and others.

A sample document or two would always be helpful...

el

On 29/12/2023 21:01, CALUM POLWART wrote:
> It sounded like he looked at officer but I would agree
> 
> content <- officer::docx_summary(officer::read_docx("filename.docx"))
> 
> Would get the text content into an object called content.
> 
> That object is a data.frame so you can then manipulate it.
> To be more specific, we might need an example of the DF
[...]
>> On Fri, Dec 29, 2023 at 10:14 AM Andy <phaedrusv at gmail.com>
>> wrote:
[...]
>>> I'd like to be able to accomplish the following:
>>>
>>> (1) Append the title, the month, the author, the number of
>>> words, and page number(s) to a spreadsheet
>>>
>>> (2) Read each article and extract keywords (in the docs,
>>> these are listed in 'Subject' section as a list of
>>> keywords with a percentage showing the extent to which the
>>> keyword features in the article (e.g., FAST FASHION (72%))
>>> and to append the keyword and the % coverage to the same
>>> row in the spreadsheet.  However, I want to ensure that
>>> the keyword coverage meets the threshold of >= 50%; if
>>> not, then pass onto the next article in the directory.
>>> Rinse and repeat for the entire directory.
[...]
