Jeremy C. Reed
2007-Mar-22 20:52 UTC
converting html with \xa9 to Markdown and using iconv?
The HTML documents contain various characters such as \xa0 (non-breaking space), \xa9 (copyright symbol), and others. I tried using html2text.py, but it didn't like these characters. Any ideas on how I can use iconv or another tool to convert documents like this so that I can then convert them to Markdown? I don't want to do this manually, as I have 500+ documents.

Jeremy C. Reed
On Thursday, 22 March 2007, Jeremy C. Reed wrote:
> The HTML documents contain various characters such as \xa0,
> \xa9 (copyright symbol), and others.
>
> I tried using html2text.py but it didn't like these characters.
>
> Any ideas on how I can use iconv or another tool to convert documents like
> this so I can then convert to Markdown?
>
> I don't want to do manually as I have around 500+ documents.
>
> Jeremy C. Reed

As far as I understand, you are looking for a converter that supports UTF-8 / Unicode characters? My PHP script (ported from html2text.py) doesn't change those, so it should work in theory. Try it out at [1].

But it's PHP, so unless you have access to a command line or write a little PHP script to run locally, it will be of no use to you. The latter should be pretty easy, though: simply recurse through your files and folders, apply html2text to each, and save the output somewhere. You might want to allow longer execution times for PHP scripts in the meantime.

Another alternative would be to use one of the other converters; I know there are some, but I don't have their URLs at hand. Maybe someone will be able to help you.

[1]: http://milianw.de/projects/html2text/

-- 
Milian Wolff
http://milianw.de
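The "recurse through your files and folders, apply the converter, save the output" step described above can be sketched in Python rather than PHP. This is only a sketch under stated assumptions: the name `convert_tree`, the `.html` / `.md` suffixes, and the Latin-1 default are all hypothetical choices, not anything from the thread, and `convert` stands for whichever HTML-to-Markdown function you end up using.

```python
from pathlib import Path
from typing import Callable


def convert_tree(
    src: Path,
    dst: Path,
    convert: Callable[[str], str],
    encoding: str = "latin-1",  # assumption: the source files are Latin-1
) -> int:
    """Walk src recursively, run convert() on each .html file, and write
    the result as UTF-8 under dst, mirroring the directory layout.

    Returns the number of files written.
    """
    count = 0
    for html_file in sorted(src.rglob("*.html")):
        # Decode explicitly so bytes like \xa9 become real characters.
        text = html_file.read_bytes().decode(encoding)
        out = dst / html_file.relative_to(src).with_suffix(".md")
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(convert(text), encoding="utf-8")
        count += 1
    return count
```

Passing the converter in as a function keeps the directory walk independent of any particular tool, so the same loop works whether the conversion is done by a Python library or by a wrapper around an external command.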
John MacFarlane
2007-Mar-22 23:00 UTC
converting html with \xa9 to Markdown and using iconv?
You could try html2markdown, which uses iconv, tidy, and pandoc. It should have no trouble with these characters. It's included in the pandoc distribution:

http://sophos.berkeley.edu/macfarlane/pandoc/

JM

+++ Jeremy C. Reed [Mar 22 07 15:52 ]:
> The HTML documents contain various characters such as \xa0,
> \xa9 (copyright symbol), and others.
>
> I tried using html2text.py but it didn't like these characters.
>
> Any ideas on how I can use iconv or another tool to convert documents like
> this so I can then convert to Markdown?
>
> I don't want to do manually as I have around 500+ documents.
>
> Jeremy C. Reed
Julian Tarkhanov
2007-Mar-23 14:43 UTC
converting html with \xa9 to Markdown and using iconv?
On Mar 22, 2007, at 9:52 PM, Jeremy C. Reed wrote:
> I tried using html2text.py but it didn't like these characters.
>
> Any ideas on how I can use iconv or another tool to convert
> documents like this so I can then convert to Markdown?

This might have to do with Python and its silly-silly unicode strings. I have done hundreds of docs with all kinds of weird characters in them, in many languages; as long as they were UTF-8, the Perl Markdown, the PHP one, and the Ruby one all worked fine. You got mucho problemo if you use some odd 8-bit legacy encoding for your special chars.

Besides, you don't actually need to convert them to entities: normal browsers just render these Unicode chars verbatim if your font has them.

-- 
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl
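To illustrate the point about 8-bit legacy encodings: a lone \xa9 byte is the copyright sign in Latin-1 (ISO-8859-1) but is not valid UTF-8 on its own, which would explain a UTF-8-expecting tool choking on it. A minimal sketch follows, assuming the documents are Latin-1 (the thread never states the actual encoding); `to_utf8` is a hypothetical helper name. It does roughly what `iconv -f ISO-8859-1 -t UTF-8` does, but leaves input that is already valid UTF-8 alone.

```python
def to_utf8(data: bytes, legacy: str = "latin-1") -> bytes:
    """Return data as UTF-8 bytes.

    If data already decodes as UTF-8 it is returned unchanged; otherwise
    it is assumed to be in the legacy 8-bit encoding and re-encoded.
    """
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        return data.decode(legacy).encode("utf-8")


# \xa9 (copyright sign) and \xa0 (non-breaking space) as single Latin-1 bytes.
raw = b"Copyright \xa9 2007\xa0Example"
converted = to_utf8(raw)
```

The "try UTF-8 first" check is a heuristic, not real encoding detection: it only distinguishes valid UTF-8 from everything else, and the fallback is right only if the guess about the legacy encoding is right.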
andrew at shedside.com
2007-Mar-23 15:22 UTC
converting html with \xa9 to Markdown and using iconv?
On Thu, 22 Mar 2007 15:52:01 -0500 (CDT), Jeremy C. Reed wrote:
> I tried using html2text.py but it didn't like these characters.

If you're familiar with XSLT, another option is:

http://www.lowerelement.com/Geekery/XML/XHTML-to-Markdown.html

(You can use Perl's XML::LibXML and XML::LibXSLT to parse your HTML as XHTML and then transform it, UTF-8 intact, into Markdown using the above stylesheet.)

Cheers,
Andrew.