Hi,

In base iconv, some character sets have problems, mostly related to
single-byte JIS X 0201 Kana (aka half-width kana) and multi-byte
JIS X 0213 (the 2004 version is the newest standard for now).

The problems fall into 3 error patterns:

 1. Illegal byte sequence (in GNU iconv, "cannot convert")
 2. Invalid argument (in GNU iconv, "unsupported")
 3. Invalid characters

What I tried is:

 * Select Japanese codes from `iconv -l`. (I may have missed some codes.)
 * Group them by whether they are single-byte or multi-byte.
 * Convert a simple test string (meaningless as a sentence) from UTF-8
   to the target code using `iconv -f UTF-8 -t (target) (TestString)`,
   convert it back, and compare the round-tripped string with the
   original test string. If an error occurred, classify it by error
   pattern and record the output in hex form.

Please see the attached PDF for details (the notes are basically for
base iconv). I tested base iconv in stable/10 r258701, with GNU iconv
from ports on stable/9 for reference.

Strangely, although all targets are listed in `iconv -l` (for base
iconv; not all of them are listed by the ports GNU iconv), some targets
fail with "invalid argument" and write no output string to stdout.
This should not happen: such targets should either be properly
supported or dropped from the list.

In the other error patterns, the output strings are converted
incorrectly. In some cases characters are dropped, or converted to the
substitute character used for errors (GETA MARK). I should mention that
mapping unsupported characters to GETA MARK is normal treatment in the
multi-byte case, because not all UTF-8 characters are supported in
every code. Dropping unsupported characters, however, is really
abnormal in most cases.

Can someone confirm and fix? Looking in the src tree, the corresponding
csmapper sources seem to exist. But my knowledge of iconv internals is
insufficient, so I can't figure out why these errors occur.

I have no fix, sorry. (It's beyond my ability.)
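For anyone who wants to repeat the round-trip test described above, here is a minimal sketch in Python. It is not the script used for the attached results. The helper names (`run_iconv`, `classify`, `round_trip`) are hypothetical, and the substrings matched in the error messages are assumptions based on the wording quoted above, so they may need adjusting for a particular iconv build.

```python
import subprocess


def run_iconv(data, src, dst):
    """Pipe data through the iconv CLI; return (exit status, stdout, stderr text)."""
    proc = subprocess.run(["iconv", "-f", src, "-t", dst],
                          input=data, capture_output=True)
    return proc.returncode, proc.stdout, proc.stderr.decode(errors="replace")


def classify(original, status, stderr, result):
    """Map one conversion outcome onto the three error patterns.

    The message substrings are assumptions, not an exhaustive list of
    what base or GNU iconv may print.
    """
    message = stderr.lower()
    if status != 0:
        if "illegal" in message or "cannot convert" in message:
            return "illegal byte sequence"
        if "invalid argument" in message or "unsupported" in message:
            return "invalid argument"
        return "other error"
    # Conversion succeeded; a mismatch means characters were dropped or replaced.
    return "ok" if result == original else "invalid characters"


def round_trip(text, target):
    """Return (verdict, hex dump of the encoded bytes) for one target code."""
    original = text.encode("utf-8")
    status, encoded, err = run_iconv(original, "UTF-8", target)
    if status != 0:
        return classify(original, status, err, b""), encoded.hex(" ")
    status, decoded, err = run_iconv(encoded, target, "UTF-8")
    return classify(original, status, err, decoded), encoded.hex(" ")
```

Calling `round_trip(test_string, target)` for each name taken from `iconv -l` gives one verdict and one hex dump per target, which is the same information recorded by hand in the PDF.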

Some technical notes:

In JIS X 0201 and its variants, half-width katakana characters are
supported directly in the 8-bit encoding and via shift-out/shift-in in
the 7-bit encoding.

JIS X 0212 is an extension to JIS X 0208, not a superset of it.
JIS X 0213, on the other hand, is a modified superset of JIS X 0208.
(It includes almost all of JIS X 0208 but is not compatible with it, as
some code points are changed, subsumed or split. In addition, many of
the MS extended characters are included.)

In strict EUC-JP (equal to the EUC variant of ISO2022-JP), half-width
katakana characters are intentionally unsupported, but EUC-JP itself
can support them in a 2-byte form: a leading 0x8E followed by the
JIS X 0201 code.

In strict EUC-JP, JIS X 0212 extended characters are not supported
either, but EUC-JP itself can support them in a 3-byte form with a
leading 0x8F.

In my multi-byte test string, the code point 0xE2 0x85 0xB1 in UTF-8
(U+2171) is vendor-specific in JIS X 0208 and JIS X 0208 + 0212 (equal
to ISO2022-JP-1 excluding half-width katakana characters), including
the SHIFT_JIS variants and EUC variants. Some of these vendor-specific
characters were introduced into the standard in JIS X 0213.
Vendor-specific variants such as CP932 already had them before
JIS X 0213.

Regards.

-- 
Tomoaki AOKI    junchoon at dec.sakura.ne.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-iconv-results.pdf
Type: application/pdf
Size: 28806 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20131208/85102db3/attachment.pdf>