Hi everyone,

I have a very annoying character encoding problem. Have a look at this:

# ls -l M*mo-1.*
-rw-rw-rw- 1 root root 8417218 6 sept. 2013 Mémo-1.aif
-rwxr--r-- 1 hope hope 8417218 6 sept. 2013 Mémo-1.aif
-rw-rw-rw- 1 root root 363175 6 sept. 2013 Mémo-1.m4a
-rwxr--r-- 1 hope hope 363175 6 sept. 2013 Mémo-1.m4a

Yes, it looks like two files have exactly the same name, but actually
they're different: one has "é" encoded as 0xCC81, and the other one (the
"good one") as 0xC3A9. Of course, similar problems occur for all accented
letters.

So here's the setup: I have a very weird proprietary system (DDP
server), probably running some ancient version of Samba internally.
People copied these files to this old server from Mac workstations. So
far so good.

I have a new server, running CentOS 7.3 and Samba 4.6. I mounted the
CIFS exports from the DDP server:

# mount | grep temp
//192.168.5.150/w-rushes-temp on /mnt/w-rushes-temp type cifs
(ro,relatime,vers=1.0,cache=strict,username=admin,domain=,uid=0,noforceuid,gid=0,noforcegid,addr=192.168.5.150,soft,unix,posixpaths,serverino,mapposix,acl,rsize=1048576,wsize=65536,echo_interval=60,actimeo=1)

Listing the files on this mount, everything looks good at first glance:

# ls -l M*mo-1.*
-rw-rw-rw- 1 root root 8417218 6 sept. 2013 Mémo-1.aif
-rw-rw-rw- 1 root root 363175 6 sept. 2013 Mémo-1.m4a

Now I copy the files from the old system to the new one, using cp -a
or rsync.

Then, when connecting from the Mac to the new server over SMB, you
can't see any of the files with accented characters in the name. They
are there, but invisible in the Mac Finder (they look fine when listed
from the terminal, as you've seen above).

If I copy a file from the Mac Finder, or create a new file with
"touch héhohàhù", it appears perfectly fine, with accents and all.

What can be the cause of this weird encoding effect? You'll notice that
I didn't use the "iocharset=utf8" mount option on the new server.
However, the files with accented characters look fine (treacherously).

Bonus question: I have 327 TB of data with mangled file names. Any trick
to avoid copying everything *again* would be welcome...

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |   <eflorac at intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------
On Thu, Apr 26, 2018 at 07:29:41PM +0200, Emmanuel Florac via samba wrote:
> Hi everyone,
>
> I have a very annoying character encoding problem. Have a look at this:
>
> # ls -l M*mo-1.*
> -rw-rw-rw- 1 root root 8417218 6 sept. 2013 Mémo-1.aif
> -rwxr--r-- 1 hope hope 8417218 6 sept. 2013 Mémo-1.aif
> -rw-rw-rw- 1 root root 363175 6 sept. 2013 Mémo-1.m4a
> -rwxr--r-- 1 hope hope 363175 6 sept. 2013 Mémo-1.m4a
>
> Yes, it looks like two files have exactly the same name, but actually
> they're different: one has "é" encoded as 0xCC81, and the other one (the
> "good one") as 0xC3A9. Of course, similar problems occur for all accented
> letters.

0xC3A9 is the UTF-8 encoding of é, so that's correct.

> So here's the setup: I have a very weird proprietary system (DDP
> server), probably running some ancient version of Samba internally.
> People copied these files to this old server from Mac workstations. So
> far so good.
>
> I have a new server, running CentOS 7.3 and Samba 4.6. I mounted the
> CIFS exports from the DDP server:
>
> # mount | grep temp
>
> //192.168.5.150/w-rushes-temp on /mnt/w-rushes-temp type cifs
> (ro,relatime,vers=1.0,cache=strict,username=admin,domain=,uid=0,noforceuid,gid=0,noforcegid,addr=192.168.5.150,soft,unix,posixpaths,serverino,mapposix,acl,rsize=1048576,wsize=65536,echo_interval=60,actimeo=1)
>
> Listing the files on this mount, everything looks good at first glance:
>
> # ls -l M*mo-1.*
>
> -rw-rw-rw- 1 root root 8417218 6 sept. 2013 Mémo-1.aif
> -rw-rw-rw- 1 root root 363175 6 sept. 2013 Mémo-1.m4a
>
> Now I copy the files from the old system to the new one, using cp -a
> or rsync.
>
> Then, when connecting from the Mac to the new server over SMB, you
> can't see any of the files with accented characters in the name. They
> are there, but invisible in the Mac Finder (they look fine when listed
> from the terminal, as you've seen above).
>
> If I copy a file from the Mac Finder, or create a new file with
> "touch héhohàhù", it appears perfectly fine, with accents and all.
>
> What can be the cause of this weird encoding effect?

I'm guessing this is a compose character effect. MacOSX uses
Unicode "compose" characters to stitch together an accent
onto an existing character.

I think MacOSX is the only system that uses this as standard.

I think you should be able to fix this using iconv, although
you might want to do this carefully on the server.
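To make the two encodings concrete, here is a small Python 3 sketch (not from the thread; the file name is simply the one from the listing above). It builds the same visible name in both Unicode normalization forms and prints the UTF-8 bytes. The precomposed "é" is a single code point, U+00E9, encoded as C3 A9; the decomposed form is a plain "e" followed by the combining acute accent U+0301, encoded as CC 81.

```python
import unicodedata

# Example name from the listing; \u00e9 is the precomposed "é".
name = "M\u00e9mo-1.aif"

nfc = unicodedata.normalize("NFC", name)  # precomposed: é stays U+00E9
nfd = unicodedata.normalize("NFD", name)  # decomposed: e + U+0301

# Same glyphs on screen, different bytes on disk.
print(nfc.encode("utf-8").hex())
print(nfd.encode("utf-8").hex())

# The two strings differ, so a byte-oriented file system such as the ones
# typically used on Linux treats them as two distinct file names.
print(nfc == nfd)
```

This is why `ls` can show what looks like the same name twice in one directory.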
On Thu, Apr 26, 2018 at 11:00:17AM -0700, Jeremy Allison via samba wrote:
> I'm guessing this is a compose character effect. MacOSX uses
> Unicode "compose" characters to stitch together an accent
> onto an existing character.

Precomposed vs. decomposed, that is (NFC vs. NFD).

> I think MacOSX is the only system that uses this as standard.

macOS uses the decomposed form and is, IIRC, the only system that does.

> I think you should be able to fix this using iconv, although
> you might want to do this carefully on the server.

Björn's convmv is probably the better tool here.

-slow

-- 
Ralph Boehme, Samba Team       https://samba.org/
Samba Developer, SerNet GmbH   https://sernet.de/en/samba/
GPG Key Fingerprint:           FAE2 C608 8A24 2520 51C5  59E4 AA1E 9B71 2639 9E46
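For the bonus question, the idea behind an in-place fix is to rename the files rather than copy them again. The Python 3 sketch below only illustrates that idea and is not how convmv is implemented. The helper names `plan_nfc_renames` and `apply_renames` are hypothetical, and it assumes the names on disk are valid UTF-8 and that precomposed (NFC) is the form wanted on the server. It defaults to a dry run and skips any rename whose target already exists, which matters here because the listing shows both forms of the same name sitting in one directory.

```python
import os
import unicodedata


def plan_nfc_renames(root):
    """Return (old_path, new_path) pairs for names under root that are not NFC."""
    plan = []
    # Walk bottom-up so entries inside a directory are planned before the
    # directory itself; renaming in plan order then never invalidates a path.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            fixed = unicodedata.normalize("NFC", name)
            if fixed != name:
                plan.append((os.path.join(dirpath, name),
                             os.path.join(dirpath, fixed)))
    return plan


def apply_renames(plan, dry_run=True):
    """Rename each pair unless the target exists; return the pairs handled."""
    done = []
    for old, new in plan:
        if os.path.lexists(new):
            # Both forms are present: leave this one for a human to resolve.
            print(f"skip (target exists): {new!r}")
            continue
        if not dry_run:
            os.rename(old, new)
        done.append((old, new))
    return done
```

Calling `apply_renames(plan_nfc_renames("/path/to/share"))` (the path is a placeholder) only reports what would be renamed; passing `dry_run=False` performs the renames. On 327 TB it would be prudent to try this on a small copy first.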