On 03/18/2021 04:30 PM, Paul Heinlein wrote:
> On Thu, 18 Mar 2021, H wrote:
>
>> I have a challenge I am interested in getting feedback on.
>>
>> I will on a regular basis download a series of data files from the web
>> where the data is in XML-format. The format is known in advance but is
>> different between the various data files. I then plan to extract the
>> various data items ("elements?") from each data file, do some light
>> formatting and then save desired parts of each original data file as a
>> formatted CSV-file for later importing into a database.
>>
>> As the plan is to use a bash shell script using curl to get the files,
>> I have begun looking at external XML parsers that I can call from my
>> script, perhaps specify which elements I want, get the data back in some
>> kind of bash data structure and finally format and save as CSV-files.
>>
>> There seems to be a number of XML parsers available but perhaps someone
>> on the list has a recommendation for which one might suit my needs best?
>> I should add that I am running CentOS 7.
>
> Will you be using an XSLT stylesheet to do the work? There's a somewhat
> steep learning curve, but in my experience it's the most reliable method
> for parsing XML except in the very simplest of cases.
>
> In that case, the libxslt stuff may be what you want:
>
>   http://xmlsoft.org/libxslt/
>
> The command-line tool is xsltproc.
>
> Again, it's not easy to use, but once you've built a toolchain, it
> will be reliable and fairly easy to modify if the source XML schema
> changes.
>
I just checked and I cannot see that the organization publishing these data
files offers any XSLT stylesheet. IOW, I am, perhaps incorrectly, assuming that
the publisher of the data would be the one to provide such a stylesheet.
(Although perhaps that is something an end-user could put together as well?)
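
For illustration: an XSLT stylesheet is an ordinary text file, so it does not
have to come from the publisher. Below is a minimal sketch of an
end-user-written stylesheet that xsltproc applies to produce CSV. The input
layout (<series>, <obs>, <date>, <value>) and the file names are made up for
the example and are not taken from the actual data files.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)

# Hypothetical sample input; the real files will use different element names.
cat > "$workdir/sample.xml" <<'EOF'
<?xml version="1.0"?>
<series>
  <obs><date>2021-03-17</date><value>1.25</value></obs>
  <obs><date>2021-03-18</date><value>1.30</value></obs>
</series>
EOF

# Hand-written XSLT 1.0 stylesheet: text output, one CSV row per <obs>.
cat > "$workdir/to_csv.xsl" <<'EOF'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:text>date,value&#10;</xsl:text>
    <xsl:for-each select="series/obs">
      <xsl:value-of select="date"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="value"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF

# Run the transform only if xsltproc is installed.
if command -v xsltproc >/dev/null 2>&1; then
    csv=$(xsltproc "$workdir/to_csv.xsl" "$workdir/sample.xml")
    printf '%s\n' "$csv"
else
    echo "xsltproc not found; skipping the transform" >&2
fi

rm -rf "$workdir"
```

In a real script the sample file would be replaced by the output of curl, with
one stylesheet per data series. The sketch does no CSV quoting, so it assumes
the values contain no commas or double quotes.
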
Although the data format of each data series is unique, it is simple and could
conceivably be parsed using grep, but I am looking for a more
"forward-looking" solution for other applications in the future.
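
To make the grep idea concrete, here is a rough sketch of line-based
extraction with grep and sed. It assumes a hypothetical layout in which every
record sits on a single line with <date> before <value>; the element names are
invented for the example.

```shell
#!/bin/sh
set -eu

# Hypothetical sample data: one <obs> record per line.
xml='<series>
  <obs><date>2021-03-17</date><value>1.25</value></obs>
  <obs><date>2021-03-18</date><value>1.30</value></obs>
</series>'

# Keep only the record lines, then pull the two fields out with a regex.
csv_rows=$(printf '%s\n' "$xml" \
    | grep '<obs>' \
    | sed -E 's|.*<date>([^<]*)</date>.*<value>([^<]*)</value>.*|\1,\2|')

printf '%s\n' "$csv_rows"
```

This works only as long as the layout stays exactly as assumed. A record split
across several lines, reordered child elements, attributes, or entities such
as &amp; would break it or pass through undecoded, which is the reason to look
for a real XML tool.
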
If XSLT stylesheets are not available, would you suggest another tool? Or
would you suggest I design the stylesheets myself, presumably one for each
data series?