On 18/03/2021 22:08, H wrote:
> On 03/18/2021 04:30 PM, Paul Heinlein wrote:
>> On Thu, 18 Mar 2021, H wrote:
>>
>>> I have a challenge I am interested in getting feedback on.
>>>
>>> I will on a regular basis download a series of data files from the web where the data is in XML format. The format is known in advance but differs between the various data files. I then plan to extract the various data items ("elements"?) from each data file, do some light formatting, and then save the desired parts of each original data file as a formatted CSV file for later import into a database.
>>>
>>> As the plan is to use a bash shell script with curl to get the files, I have begun looking at external XML parsers that I can call from my script, perhaps specify which elements I want, get the data back in some kind of bash data structure, and finally format and save the result as CSV files.
>>>
>>> There seem to be a number of XML parsers available, but perhaps someone on the list has a recommendation for which one might suit my needs best? I should add that I am running CentOS 7.
>>
>> Will you be using an XSLT stylesheet to do the work? There's a somewhat steep learning curve, but in my experience it's the most reliable method for parsing XML except in the very simplest of cases.
>>
>> In that case, the libxslt stuff may be what you want:
>>
>>   http://xmlsoft.org/libxslt/
>>
>> The command-line tool is xsltproc.
>>
>> Again, it's not easy to use, but once you've built a toolchain, it will be reliable and fairly easy to modify if the source XML schema changes.
>>
> I just checked and I cannot see that the organization publishing these data files offers any XSLT stylesheet. IOW, I am, perhaps incorrectly, assuming that the publisher of the data would be the one with said stylesheet. (Although perhaps that is something an end user could put together as well?)
>
> Although the data format of each data series is unique, it is simple and could conceivably be parsed using grep, but I am looking for a more "forward-looking" solution for other applications in the future.
>
> If XSLT stylesheets are not available, would you suggest another tool? Or would you suggest I design stylesheets, presumably one for each data series?

I have used xmlstarlet (available in EPEL) in the past for quick parsing from
within bash scripts.
For something more robust, maybe switch to Python? (YMMV)

--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
On 19.03.21 at 17:40, Fabian Arrotin wrote:
> I have used xmlstarlet (available in EPEL) in the past for quick parsing from
> within bash scripts.
> For something more robust, maybe switch to Python? (YMMV)

Just for a value grep, use xmllint (it's in the libxml2 package).

Example XML input:

<?xml version="1.0" encoding="utf-8" ?><methodResponse><params><param><value><string>OK</string></value></param></params></methodResponse>

bash var:

STATUS=$(echo "${RESPONSE}" | xmllint --format --xpath "//methodResponse/params/param/value/string/text()" - 2>/dev/null)

--
Leon
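Leon's one-liner can be stretched into a whole-file CSV extraction with the same tool. A sketch, assuming an invented sample.xml with repeated records (row, name, value are placeholder names); the per-index loop is there because older libxml2 builds, such as the one on CentOS 7, print multiple text() matches concatenated without separators:

```shell
#!/bin/sh
# Sample input standing in for a downloaded data file
# (the file and element names are hypothetical).
cat > sample.xml <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<rows>
  <row><name>alpha</name><value>1</value></row>
  <row><name>beta</name><value>2</value></row>
</rows>
EOF

# count(//row) yields the number of records as a plain number
count=$(xmllint --xpath 'count(//row)' sample.xml)

# string(...) returns the bare text content, one record at a time
i=1
while [ "$i" -le "$count" ]; do
  name=$(xmllint --xpath "string(//row[$i]/name)" sample.xml)
  value=$(xmllint --xpath "string(//row[$i]/value)" sample.xml)
  printf '%s,%s\n' "$name" "$value"
  i=$((i + 1))
done > out.csv
```

This stays within stock libxml2, at the cost of re-parsing the file once per field; for large files xmlstarlet or XSLT would be kinder.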
On 03/19/2021 12:40 PM, Fabian Arrotin wrote:
> I have used xmlstarlet (available in EPEL) in the past for quick parsing from
> within bash scripts.
> For something more robust, maybe switch to Python? (YMMV)

I wanted to do this in bash and decided on calling xsltproc, while investing in writing an XSLT stylesheet for each data file format.
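For the route H settled on, one such per-series stylesheet could start out like this minimal sketch: an XSLT 1.0 stylesheet with text output that emits one CSV line per record, run through xsltproc. The file names and the entry/date/rate element names are placeholders to be adapted to each data series:

```shell
#!/bin/sh
# Hypothetical stylesheet: one CSV line per <entry>; the element
# names (entry, date, rate) are placeholders, adjusted per series.
cat > series.xsl <<'EOF'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Emit plain text rather than XML -->
  <xsl:output method="text" encoding="utf-8"/>
  <xsl:template match="/">
    <!-- One CSV line per record; adapt the XPath to each series -->
    <xsl:for-each select="//entry">
      <xsl:value-of select="date"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="rate"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF

# Sample data file standing in for one downloaded series
cat > data.xml <<'EOF'
<?xml version="1.0"?>
<series>
  <entry><date>2021-03-18</date><rate>1.19</rate></entry>
  <entry><date>2021-03-19</date><rate>1.21</rate></entry>
</series>
EOF

xsltproc series.xsl data.xml > data.csv
cat data.csv
```

Quoting of commas or embedded newlines in field values would still need handling (e.g. extra xsl:text literals) if the real data can contain them.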