[NCLUG] charset/encoding challenges & filesystems

Bob Proulx bob at proulx.com
Mon Nov 26 15:30:30 MST 2012


Luke Jones wrote:
> Thanks for the tips. I copied a few problematic files into a test directory
> -- try to focus the problem to something smaller than "I just sent 10K
> files and it didn't work"

Good plan!

> and when I rsync'ed it both ways everything appeared to work
> correctly.

Sounds good.  On what machine were you running rsync?  From the Apple?
Or from a GNU/Linux machine?  Or both?

> I got messages from rsync that looked like this:
> 
> opening connection using ssh craters.local rsync --server -vvnlogDtpr .

What options did you use?  (It looks like at least two -v options.)
What version of rsync?

> test/02 Wagner ?\#200\#224 Die Walku?\#210re ?\#200\#224 Ride Of The Valkyries.mp3 is uptodate
> test/05 Se A Vida E?\#201 (That's The Way Life Is).mp3 is uptodate
> test/08 Blue O?\#210yster Cult.mp3 is uptodate
> test/08 Sla?\#201inte Bhreagh Hiu?\#201lit (Hewlett).mp3 is uptodate

That does *look* ugly.  But that seems like something with your
terminal.  It seems to me that your terminal does not have LANG set
when it is started.  HINT: If you are setting LANG in your
$HOME/.bashrc then that will _hide_ the fact that the terminal itself
was started without it, unless you start a second terminal from a
bash that already has LANG set.
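
A rough way to see which locale a program would pick up from its
environment is to apply the usual POSIX lookup order: LC_ALL, then
LC_CTYPE, then LANG, else the C locale.  This is only a sketch in
Python.  The name effective_ctype is made up for illustration, and
real programs go through setlocale() rather than reading the
variables themselves.

```python
import os

def effective_ctype(env=None) -> str:
    """Return the locale name that would govern character encoding.

    Hypothetical helper: mimics the POSIX precedence for LC_CTYPE.
    """
    env = os.environ if env is None else env
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:              # unset or empty values are skipped
            return value
    return "C"                 # nothing set: fall back to C (aka POSIX)

# Report what this process inherited from whatever started it.
print(effective_ctype())
```

Note that this reports what the shell hands to its children, which
is not necessarily what the terminal itself was started with.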

> That has the advantage of showing what the actual bytes are in the
> filename, and showing that they're matching their counterparts on the far
> side and correctly determining what is and is not up to date.

I don't see that.  I set up a test this way:

  $ mkdir /tmp/rsynctrial
  $ cd /tmp/rsynctrial
  $ date -R > Öyster
  $ rsync -av . $OTHERMACHINE:/tmp/rsynctrial/
  sending incremental file list
  ./
  Öyster

  sent 132 bytes  received 34 bytes  332.00 bytes/sec
  total size is 29  speedup is 0.17

That looks to have worked perfectly.  File names on both ends are
displayed correctly.  File name in transit is displayed correctly.  No
special options were given other than having LANG set to en_US.UTF-8
as mentioned previously.  I didn't need to do anything interesting
with options or iconv or other such things.

If I purposefully break LANG and start up a new xterm without it
then, depending upon whether xterm and/or bash have LANG set, I get
different behavior, including your behavior.

  $ env -i HOME=$HOME PATH=$PATH DISPLAY=$DISPLAY SSH_AUTH_SOCK=$SSH_AUTH_SOCK xterm

That gives me a terminal that does not have LANG set and so will
default to the C (aka POSIX) locale setting.

  $ rsync -av . torpid:/tmp/t/
  sending incremental file list
  ./
  \#303\#226yster

  sent 132 bytes  received 34 bytes  110.67 bytes/sec
  total size is 29  speedup is 0.17

That recreates your behavior but only if the terminal doesn't
understand the locale character encoding.  But regardless of the
*display* of the characters the rsync copy of them is okay.  The sync
of the files is okay regardless of the character encoding of the
filenames.
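
The \#ooo sequences in that output look like the filename bytes
written in octal, so they can be turned back into bytes to check
what is really there.  Here is a small Python sketch, assuming each
escape is exactly three octal digits standing for one byte.  The
helper name unescape_rsync_name is invented for this example.

```python
import re

# Assumption: an escape is a backslash, '#', and three octal digits
# (000-377), i.e. exactly one byte.
ESCAPE = re.compile(rb"\\#([0-3][0-7]{2})")

def unescape_rsync_name(shown: str) -> bytes:
    """Replace each \\#ooo escape with the single byte it stands for."""
    return ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 8)]),
        shown.encode("ascii"),
    )

raw = unescape_rsync_name(r"\#303\#226yster")
print(raw)                                   # the raw filename bytes
print(raw.decode("utf-8") == "\u00d6yster")  # valid UTF-8 for the O-umlaut name?
```

The two escaped bytes are 0xC3 0x96, which is the UTF-8 encoding of
the O with two dots.  So the name on disk is fine and only the
display differs.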

Note that if you were setting LANG too late, after the terminal had
already started without it, then 'ls' and 'rsync' would see LANG and
think the terminal is UTF-8 when it isn't, and then the display would
be garbled:

  Ãyster

That hopefully displays as an "A" with a "~" on top.  Because of the
encoding mismatch the terminal shows the bytes of the multibyte
character one at a time instead of as a single character.  But my
cut-n-paste across the broken character encoding might not come
through the email right.
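
That garbling can be reproduced without any terminal at all.  The
Python sketch below assumes the non-UTF-8 terminal behaves like
ISO-8859-1 (Latin-1), which is a guess on my part.  The function
name mojibake is just for illustration.

```python
def mojibake(name: str, terminal_encoding: str = "latin-1") -> str:
    """Show how UTF-8 filename bytes look when read one byte at a time.

    Assumption: the terminal decodes with a single-byte encoding.
    """
    return name.encode("utf-8").decode(terminal_encoding)

garbled = mojibake("\u00d6yster")   # O with two dots, then "yster"
print(ascii(garbled))               # escaped so it prints in any locale
print(len("\u00d6yster"), len(garbled))
```

The first character turns into two: the A with a tilde, followed by
a control character that most terminals do not draw at all.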

> So. I still don't know what is wrong, but you were helpful if only in
> asking that I "sit down calmly, take a deep breath, and think things over"
> as it were. I'll put some more time into debugging this tomorrow and
> hopefully figure out what's breaking with the 10K music files.

I think your terminal is not set with the proper locale.  But I don't
know whether you are doing this on your Mac or on your GNU/Linux
machine.  I think if you set LANG=en_US.UTF-8 and then launch the
terminal that it will be okay.

  $ env LANG=en_US.UTF-8 xterm

But I think it doesn't really matter to rsync.  Even if you ignore
this the files will be sync'd properly.

If you run rsync twice then it doesn't copy any files the second
time, right?  If it does then that is due to a different problem of
timestamps or permissions or user or groups or one of the other
attributes.  Pretty sure anyway. :-)
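
For a file that keeps getting copied, it may help to compare the
two attributes that rsync's default quick check looks at, size and
modification time.  This Python sketch compares two local paths
only.  It is not how rsync is implemented, and the name
quick_check_same and the file names are made up for the example.

```python
import os
import tempfile

def quick_check_same(src: str, dst: str) -> bool:
    """True if size and whole-second mtime match (illustrative only)."""
    a, b = os.stat(src), os.stat(dst)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.mp3")
    dst = os.path.join(tmp, "dst.mp3")
    for path in (src, dst):
        with open(path, "wb") as f:
            f.write(b"same bytes")
    # Give both files the same access and modification time.
    os.utime(src, (1353969030, 1353969030))
    os.utime(dst, (1353969030, 1353969030))
    print(quick_check_same(src, dst))   # sizes and mtimes match
    # Move the copy's mtime forward by a minute.
    os.utime(dst, (1353969030, 1353969090))
    print(quick_check_same(src, dst))   # mtime now differs
```

If the sizes match but the times never do, the timestamps are not
surviving the trip, and that is a separate problem from the
character encoding.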

Bob


