On 18/03/2021 22:08, H wrote:
> On 03/18/2021 04:30 PM, Paul Heinlein wrote:
>> On Thu, 18 Mar 2021, H wrote:
>>
>>> I have a challenge I am interested in getting feedback on.
>>>
>>> I will on a regular basis download a series of data files from the web where the data is in XML format. The format is known in advance but differs between the various data files. I then plan to extract the various data items ("elements"?) from each data file, do some light formatting, and then save the desired parts of each original data file as a formatted CSV file for later import into a database.
>>>
>>> As the plan is to use a bash shell script with curl to get the files, I have begun looking at external XML parsers that I can call from my script, perhaps specify which elements I want, get the data back in some kind of bash data structure, and finally format and save the result as CSV files.
>>>
>>> There seem to be a number of XML parsers available, but perhaps someone on the list has a recommendation for which one might suit my needs best? I should add that I am running CentOS 7.
>>
>> Will you be using an XSLT stylesheet to do the work? There's a somewhat steep learning curve, but in my experience it's the most reliable method for parsing XML except in the very simplest of cases.
>>
>> In that case, the libxslt stuff may be what you want:
>>
>>   http://xmlsoft.org/libxslt/
>>
>> The command-line tool is xsltproc.
>>
>> Again, it's not easy to use, but once you've built a toolchain, it will be reliable and fairly easy to modify if the source XML schema changes.
>>
> I just checked and I cannot see that the organization publishing these data files offers any XSLT stylesheet. IOW, I am, perhaps incorrectly, assuming that the publisher of the data would be the one with said stylesheet. (Although perhaps that is something an end user could put together as well?)
>
> Although the data format of each data series is unique, it is simple and could conceivably be parsed using grep, but I am looking for a more "forward-looking" solution for other applications in the future.
>
> If XSLT stylesheets are not available, would you suggest another tool? Or would you suggest I design stylesheets, presumably one for each data series?

I have used xmlstarlet (available in EPEL) in the past for quick parsing from
within bash scripts.
For something more robust, maybe switch to Python? (YMMV)

--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
On 19.03.21 at 17:40, Fabian Arrotin wrote:
> I have used xmlstarlet (available in EPEL) in the past for quick parsing from
> within bash scripts.
> For something more robust, maybe switch to Python? (YMMV)

Just for a value grep, use xmllint (it's in the libxml2 package).

Example XML input:

<?xml version="1.0" encoding="utf-8" ?><methodResponse><params><param><value><string>OK</string></value></param></params></methodResponse>

bash var:

STATUS=$(echo "${RESPONSE}" | xmllint --format --xpath "//methodResponse/params/param/value/string/text()" - 2>/dev/null)

--
Leon
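Leon's one-liner can be stretched into a whole-file CSV extraction with the same tool. A sketch, assuming an invented sample.xml with repeated records (row, name, value are placeholder names); the per-index loop is there because older libxml2 builds, such as the one on CentOS 7, print multiple text() matches concatenated without separators:

```shell
#!/bin/sh
# Sample input standing in for a downloaded data file
# (the file and element names are hypothetical).
cat > sample.xml <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<rows>
  <row><name>alpha</name><value>1</value></row>
  <row><name>beta</name><value>2</value></row>
</rows>
EOF

# count(//row) yields the number of records as a plain number
count=$(xmllint --xpath 'count(//row)' sample.xml)

# string(...) returns the bare text content, one record at a time
i=1
while [ "$i" -le "$count" ]; do
  name=$(xmllint --xpath "string(//row[$i]/name)" sample.xml)
  value=$(xmllint --xpath "string(//row[$i]/value)" sample.xml)
  printf '%s,%s\n' "$name" "$value"
  i=$((i + 1))
done > out.csv
```

This stays within stock libxml2, at the cost of re-parsing the file once per field; for large files xmlstarlet or XSLT would be kinder.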
On 03/19/2021 12:40 PM, Fabian Arrotin wrote:
> I have used xmlstarlet (available in EPEL) in the past for quick parsing from
> within bash scripts.
> For something more robust, maybe switch to Python? (YMMV)

I wanted to do this in bash and decided on calling xsltproc, while investing in writing an XSLT stylesheet for each data file format.
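For the route H settled on, one such per-series stylesheet could start out like this minimal sketch: an XSLT 1.0 stylesheet with text output that emits one CSV line per record, run through xsltproc. The file names and the entry/date/rate element names are placeholders to be adapted to each data series:

```shell
#!/bin/sh
# Hypothetical stylesheet: one CSV line per <entry>; the element
# names (entry, date, rate) are placeholders, adjusted per series.
cat > series.xsl <<'EOF'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Emit plain text rather than XML -->
  <xsl:output method="text" encoding="utf-8"/>
  <xsl:template match="/">
    <!-- One CSV line per record; adapt the XPath to each series -->
    <xsl:for-each select="//entry">
      <xsl:value-of select="date"/>
      <xsl:text>,</xsl:text>
      <xsl:value-of select="rate"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOF

# Sample data file standing in for one downloaded series
cat > data.xml <<'EOF'
<?xml version="1.0"?>
<series>
  <entry><date>2021-03-18</date><rate>1.19</rate></entry>
  <entry><date>2021-03-19</date><rate>1.21</rate></entry>
</series>
EOF

xsltproc series.xsl data.xml > data.csv
cat data.csv
```

Quoting of commas or embedded newlines in field values would still need handling (e.g. extra xsl:text literals) if the real data can contain them.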