Discussion:
[RFC PATCH] Windows: Assume all file names to be UTF-8 encoded.
Peter Krefting
2009-03-02 08:47:22 UTC
Permalink
When opening a file through open() or fopen(), the path passed is
UTF-8 encoded. To handle this on Windows, we need to convert the
path string to UTF-16 and use the Unicode-based interface.
---
Windows does support file names using arbitrary Unicode characters; you just
need to use its wchar_t interfaces instead of the char ones (the char ones
just get converted into wchar_t at the API level anyway). This is the
beginning of support for UTF-8 file names in Git on Windows.

Since there is no real file system abstraction beyond using stdio (AFAIK), I
need to hack it by replacing fopen (and open). Probably opendir/readdir as
well (might be trickier), and possibly even hack around main() to parse the
wchar_t command-line instead of the char copy.

This will lose any chance of Windows 9x compatibility, but I don't know
whether there are any attempts at supporting it anyway?

Please note that MultiByteToWideChar() will reject any invalid UTF-8
strings; perhaps it should just fall back to a regular open()/fopen() in
that case?

No Signed-Off line since this is unfinished, just presenting rough sketches
of an idea.

compat/mingw.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
compat/mingw.h | 3 ++
2 files changed, 62 insertions(+), 1 deletions(-)

diff --git a/compat/mingw.c b/compat/mingw.c
index e25cb4f..8b19b80 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -9,13 +9,30 @@ int mingw_open (const char *filename, int oflags, ...)
{
va_list args;
unsigned mode;
+ wchar_t *unicode_filename;
+ int unicode_filename_len;
va_start(args, oflags);
mode = va_arg(args, int);
va_end(args);

if (!strcmp(filename, "/dev/null"))
filename = "nul";
- int fd = open(filename, oflags, mode);
+
+ unicode_filename_len = MultiByteToWideChar(CP_UTF8, 0, filename, -1, NULL, 0);
+ if (0 == unicode_filename_len) {
+ errno = EINVAL;
+ return -1;
+ }
+
+ unicode_filename = xmalloc(unicode_filename_len * sizeof (wchar_t));
+ if (NULL == unicode_filename) {
+ errno = ENOMEM;
+ return -1;
+ }
+ MultiByteToWideChar(CP_UTF8, 0, filename, -1, unicode_filename, unicode_filename_len);
+ int fd = _wopen(unicode_filename, oflags, mode);
+ free(unicode_filename);
+
if (fd < 0 && (oflags & O_CREAT) && errno == EACCES) {
DWORD attrs = GetFileAttributes(filename);
if (attrs != INVALID_FILE_ATTRIBUTES && (attrs & FILE_ATTRIBUTE_DIRECTORY))
@@ -24,6 +41,47 @@ int mingw_open (const char *filename, int oflags, ...)
return fd;
}

+FILE *mingw_fopen (const char *filename, const char *mode)
+{
+ wchar_t *unicode_filename, *unicode_mode;
+ int unicode_filename_len, unicode_mode_len;
+ FILE *fh;
+
+ unicode_filename_len = MultiByteToWideChar(CP_UTF8, 0, filename, -1, NULL, 0);
+ if (0 == unicode_filename_len) {
+ errno = EINVAL;
+ return NULL;
+ }
+
+ unicode_filename = xmalloc(unicode_filename_len * sizeof (wchar_t));
+ if (NULL == unicode_filename) {
+ errno = ENOMEM;
+ return NULL;
+ }
+ MultiByteToWideChar(CP_UTF8, 0, filename, -1, unicode_filename, unicode_filename_len);
+
+ unicode_mode_len = MultiByteToWideChar(CP_UTF8, 0, mode, -1, NULL, 0);
+ if (0 == unicode_mode_len) {
+ free(unicode_filename);
+ errno = EINVAL;
+ return NULL;
+ }
+
+ unicode_mode = xmalloc(unicode_mode_len * sizeof (wchar_t));
+ if (NULL == unicode_mode) {
+ free(unicode_filename);
+ errno = ENOMEM;
+ return NULL;
+ }
+ MultiByteToWideChar(CP_UTF8, 0, mode, -1, unicode_mode, unicode_mode_len);
+
+ fh = _wfopen(unicode_filename, unicode_mode);
+ free(unicode_filename);
+ free(unicode_mode);
+
+ return fh;
+}
+
static inline time_t filetime_to_time_t(const FILETIME *ft)
{
long long winTime = ((long long)ft->dwHighDateTime << 32) + ft->dwLowDateTime;
diff --git a/compat/mingw.h b/compat/mingw.h
index 4f275cb..235df0a 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -142,6 +142,9 @@ int sigaction(int sig, struct sigaction *in, struct sigaction *out);
int mingw_open (const char *filename, int oflags, ...);
#define open mingw_open

+FILE *mingw_fopen (const char *filename, const char *mode);
+#define fopen mingw_fopen
+
char *mingw_getcwd(char *pointer, int len);
#define getcwd mingw_getcwd
--
1.6.0.2.1172.ga5ed0
Johannes Sixt
2009-03-02 10:30:01 UTC
Post by Peter Krefting
When opening a file through open() or fopen(), the path passed is
UTF-8 encoded.
I don't think that this assumption is valid. Whenever the Windows API has
to convert between Unicode strings and char* strings, it uses the current
"ANSI code page". As far as I know, the UTF-8 codepage (65001) cannot be
used as the "current ANSI code page". Users will always have some code
page set that is not UTF-8.

For example, if the user specifies a file name on the command line, then
it will not enter git in UTF-8, but in the current "ANSI" or "OEM code
page" encoding. If git prints a file name under the assumption that it is
UTF-8 encoded, then it will be displayed incorrectly because the system
uses a different encoding.
Post by Peter Krefting
Since there is no real file system abstraction beyond using stdio
(AFAIK), I need to hack it by replacing fopen (and open). Probably
opendir/readdir as well (might be trickier), and possibly even hack
around main() to parse the wchar_t command-line instead of the char copy.
I think you are grossly underestimating the venture that you want to
undertake here.

Please come up with a plan how you are going to deal with the various
issues. File names enter and leave the system through different channels:

- the command line and terminal window
- object database (tree objects)
- opendir/readdir; opening files or directories for reading or writing

And there is probably some more... How do you treat encodings in these
channels? What if the file names are not valid UTF-8? Etc.

The biggest obstacle will be that git does not have a notion of "file name
encoding" - it simply treats a file name as a stream of bytes. There is no
place to write an encoding. If the byte streams are regarded as having an
encoding, then you can have ambiguities, mixed encodings, or invalid
characters. You would have to deal with this in some way.
Post by Peter Krefting
This will lose all chances of Windows 9x compatibility, but I don't know
if there are any attempts of supporting it anyway?
Windows 9x is already out of the loop. We use GetFileInformationByHandle()
that is only available since Windows 2000.

-- Hannes
Peter Krefting
2009-03-02 10:46:47 UTC
Post by Johannes Sixt
I don't think that this assumption is valid.
Depends on where you are coming from. For the files stored in the Git
repositories, I believe all file names are supposed to be UTF-8 encoded
(just like commit messages and user names are). That's the assumption I
started working from.
Post by Johannes Sixt
Users will always have some code page set that is not UTF-8.
Indeed. And as long as the char-pointer interfaces in stdio and elsewhere
work on that assumption, we have a problem.
Post by Johannes Sixt
For example, if the user specifies a file name on the command line, then
it will not enter git in UTF-8, but in the current "ANSI" or "OEM code
page" encoding.
That problem is already solved as we do have a wchar_t command line
available. If you pass a file name that is not representable in the current
"ANSI" codepage on the command line, it will come out as garbage in the
char* version, but will be correct in the wchar_t* version. Thus we need to
convert that to utf-8 and use that instead.
Post by Johannes Sixt
If git prints a file name under the assumption that it is UTF-8 encoded,
then it will be displayed incorrectly because the system uses a different
encoding.
Here setting the local codepage to UTF-8 *might* work, although I haven't
tested that. Or always use the wchar_t versions of printf and friends.
Post by Johannes Sixt
I think you are grossly underestimating the venture that you want to
undertake here.
I've done this before with other software, so, yes, I know it is quite a big
undertaking. That is also why I started out with a minimal RFC patch to see
if there was any interest in working with this.
Post by Johannes Sixt
Please come up with a plan how you are going to deal with the various
- the command line and terminal window
GetCommandLineW() as described above.
Post by Johannes Sixt
- object database (tree objects)
Those file names are supposedly always UTF-8.
Post by Johannes Sixt
- opendir/readdir; opening files or directories for reading or writing
Wrap file open and directory read to use the wchar_t versions, converting
that to UTF-8 strings at the API level.
Post by Johannes Sixt
And there is probably some more... How do you treat encodings in these
channels? What if the file names are not valid UTF-8? Etc.
Ill-formed UTF-8 should just be rejected. Invalid UTF-8 is worse. I'm not
sure what the Linux version does, when running in a UTF-8 locale. Does it
allow ill-formed or illegal UTF-8 sequences?

NTFS allows almost any sequence of wchar_t's, it doesn't even have to be
valid UTF-16.
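As for what the Linux side accepts: the kernel never looks at the encoding of
a path, only at '/' and NUL, so an ill-formed UTF-8 name goes straight through
even under a UTF-8 locale. A quick experiment sketch (not part of the patch;
try_create is a made-up helper):

```c
#include <fcntl.h>
#include <unistd.h>

/* Experiment: can a POSIX system create a file whose name is ill-formed
 * UTF-8?  The VFS only forbids '/' and NUL bytes in a name, so it can,
 * regardless of the locale.  Returns 0 on success, -1 on failure. */
static int try_create(const char *name)
{
	int fd = open(name, O_CREAT | O_WRONLY, 0600);
	if (fd < 0)
		return -1;
	close(fd);
	unlink(name);	/* remove the evidence again */
	return 0;
}
```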
Post by Johannes Sixt
The biggest obstacle will be that git does not have a notion of "file name
encoding" - it simply treats a file name as a stream of bytes.
Yeah, that is one of the major bugs in its design, IMHO. But almost everyone
seems to assume that file names are UTF-8 strings anyway, so in the absence
of any other information, it's a good assumption as any to make.
Post by Johannes Sixt
If the byte streams are regarded as having an encoding, then you can have
ambiguities, mixed encodings, or invalid characters. You would have to
deal with this in some way.
Considering we already see problems with file names that cannot properly be
represented on some file systems (case-only differences in the Linux kernel
when checked out on Windows; Mac OS' built-in Unicode normalization of file
names, etc.), this is something that will have to be dealt with in some way
regardless.
Post by Johannes Sixt
Windows 9x is already out of the loop.
Good.
--
\\// Peter - http://www.softwolves.pp.se/
Johannes Schindelin
2009-03-02 10:56:40 UTC
Hi,
Post by Peter Krefting
Post by Johannes Sixt
I don't think that this assumption is valid.
Depends on where you are coming from. For the files stored in the Git
repositories, I believe all file names are supposed to be UTF-8 encoded
(just like commit messages and user names are). That's the assumption I
started working from.
No. As far as Git is concerned, the file names are just as much blobs as
the file contents.

The fact that Windows messes with this notion just as it messes with the
file contents (think the endless story whose name is CR/LF) shows only how
"well" designed the concepts in Windows are.

And as it stands, we have at least two issues on the msysGit issue tracker
that complain that Git does not work with localized file names properly.

So no, file names are not UTF-8 at all, especially not on Windows.

Do not get me wrong, I really welcome you taking care of the issue, but I
do not think that forcing UTF-8 is a solution.

Thanks & sorry,
Dscho
Peter Krefting
2009-03-02 12:03:58 UTC
Post by Johannes Schindelin
No. As far as Git is concerned, the file names are just as much blobs as
the file contents.
I've struggled with the same problems on Linux before, since its file
systems don't have the concept of characters, either. I guess it's just
design principles, but as far as I am concerned, having file names be
constructed from characters makes a lot more sense than having them
constructed from bytes.

Git does the right thing in assuming commit messages and user names to be
UTF-8, though; it would have been nice to have file names covered by the
same constraints.
Post by Johannes Schindelin
The fact that Windows messes with this notion just as it messes with the
file contents (think the endless story whose name is CR/LF) shows only how
"well" designed the concepts in Windows are.
In this case, yes, Windows' way of doing it does make more sense, at least to
me. And as far as text files are concerned, treating text as sequences of
bytes is in most cases not a very smart thing to do, either, but it's hard
not to given how most computers are constructed.
Post by Johannes Schindelin
And as it stands, we have at least two issues on the msysGit issue tracker
that complain that Git does not work with localized file names properly.
So no, file names are not UTF-8 at all, especially not on Windows.
I am not trying to make file names *on Windows* be UTF-8. I am trying to
make file names on Windows be Windows file names, i.e. UTF-16 Unicode. It's
just that since Git internally uses the char* APIs, and from what I have
seen in most other cases assumes that char* text is UTF-8, I am trying to
convert from Windows' view of path names to Git's (UTF-16 to UTF-8) and back.

The other way would be to keep the char* APIs but convert to the Windows
locale encoding ("ANSI codepage"), but that will break horribly as not all
file names that can be used on a file system can be represented as such.
Plus, all calls to a Windows API using a char* path name *are* converted into
UTF-16 anyway, since that is what is used internally in the Windows NT
subsystems.
Post by Johannes Schindelin
Do not get me wrong, I really welcome you taking care of the issue, but I
do not think that forcing UTF-8 is a solution.
Some kind of handling of Git repositories where file names are not UTF-8
would probably need to be added, yes.
--
\\// Peter - http://www.softwolves.pp.se/
Peter Krefting
2009-03-02 13:57:52 UTC
Hi!
Makes sense too. I think the whole API would have to be changed to use
TCHAR*.
I'd rather just say wchar_t explicitly. I'm not particularly fond of macros
that change under your feet just because you fail to define a symbol
somewhere...
Then you need to do the right conversion at the right places, this will be
quite tricky, painful work, but there is probably no way around that.
In the other project I worked on we ended up wrapping all file-related calls
in our own porting interface, and then let each platform we compiled for
implement their own methods for handling Unicode paths. For Windows it's
trivial since all APIs are Unicode. For Unix-like OSes it's tricky as you
have to take the locale settings into account, but fortunately the world is
slowly moving towards UTF-8 locales, which eases the pain a bit.
Note that not only conversions will be needed but you'll also need to
adjust all routines handling filenames to use the proper Unicode version.
(strchr -> _tcschr, open -> _topen, strcpy -> _tcscpy, strlen ->
_tcslen, ...).
Not necessarily. If the code can be set up to use UTF-8 char* internally,
not everything needs to be rewritten (I've done that too, only took a
couple of years to move the codebase over to all-Unicode).
--
\\// Peter - http://www.softwolves.pp.se/
Thomas Rast
2009-03-02 14:29:54 UTC
Post by Peter Krefting
In the other project I worked on we ended up wrapping all file-related calls
in our own porting interface, and then let each platform we compiled for
implement their own methods for handling Unicode paths. For Windows it's
trivial since all APIs are Unicode. For Unix-like OSes it's tricky as you
have to take the locale settings into account, but fortunately the world is
slowly moving towards UTF-8 locales, which eases the pain a bit.
Have you thought about all the consequences this would have for the
*nix people here? [*]

Even if you pretend that Git did always enforce UTF-8 paths in its
trees, so that there's no backward compatibility to be cared for,
you're still in a world of hurt when trying to check out such paths
under a locale (or whatever setting might control this new encoding
logic) that does not support the whole range of UTF-8.

Like, say, the C locale.

Next you get to see to it that the users can spell all filenames even
if their locale doesn't let them, since they'll want to do things like
'git show $rev:$file' with them.

With backwards compatibility it's even worse as you're suddenly
imposing extra restrictions on what a valid filename in the repository
must look like.


[*] I'm _extremely_ tempted to write "people using non-broken OSes",
but let's pretend to be neutral for a second.
--
Thomas Rast
trast@{inf,student}.ethz.ch
Peter Krefting
2009-03-02 20:41:57 UTC
Have you thought about all the consequences this would have for the *nix
people here? [*]
Yeah. It will fix problems trying to check out a Git repository created by
me in an iso8859-1 locale on a machine using a utf-8 locale, where both ends
would like to have a file named "Ü".

Or, hopefully, a careful adoption of this on Windows won't affect Unixes and
other systems with pre-Unicode APIs at all, since the Windows code would be
in the "compat" directory.
you're still in a world of hurt when trying to check out such paths under
a locale (or whatever setting might control this new encoding logic) that
does not support the whole range of UTF-8.
Yeah. That would be a case similar to the casing problem on Windows.
With backwards compatibility it's even worse as you're suddenly imposing
extra restrictions on what a valid filename in the repository must look
like.
Indeed. It is unfortunate that this wasn't properly specified to start with.
It's mostly a minor issue since *most* people will not use non-ASCII file
names. At least for most of the kind of projects that Git has attracted so
far, so the problem is not that big. The problem is if Git is to attract
"the masses". Especially on Windows, where file names using non-ASCII are
common, this needs to be addressed eventually.
[*] I'm _extremely_ tempted to write "people using non-broken OSes", but
let's pretend to be neutral for a second.
In most cases, I would most definitely agree with you on calling it that,
but when it comes to Unicode support, Windows is one of the least broken
OSes (with Symbian being my favourite).

--
\\// Peter - http://www.softwolves.pp.se/
Lars Noschinski
2009-03-03 07:56:55 UTC
Indeed. It is unfortunate that this wasn't properly specified to start with.
It's mostly a minor issue since *most* people will not use non-ASCII file
names. At least for most of the kind of projects that Git has attracted so
far, so the problem is not that big. The problem is if Git is to attract
"the masses". Especially on Windows, where file names using non-ASCII are
common, this needs to be addressed eventually.
Using no encoding for filenames was the obvious (and I would argue)
correct choice. Unix filenames are specified to be a sequence of bytes,
excluding '/' and '\0'. A lot of these sequences are not valid UTF-8.
Further, the encoding needed for filenames depends on the encoding used
in the source code for referencing these files. Again, for the unix file
handling functions, this means no encoding.

Changing the filename (on checkout), so that the user sees an Ü
regardless of his or her locale (instead of an 0xDC, which only
resolves to an Ü on latin-1) would be an absolutely broken concept here.
[*] I'm _extremely_ tempted to write "people using non-broken OSes",
but let's pretend to be neutral for a second.
In most cases, I would most definitely agree with you on calling it that,
but when it comes to Unicode support, Windows is one of the least broken
OSes (with Symbian being my favourite).
IMHO having encoding specific open functions is begging for problems.

- Lars.
Peter Krefting
2009-03-03 11:54:31 UTC
Using no encoding for filenames was the obvious (and I would argue)
correct choice. Unix filenames are specified to be a sequence of bytes,
excluding '/' and '\0'.
I know the Unix way of thinking lends itself to such a design. This is one
of the few cases where I personally think Unix has got it wrong, and Windows
(NT) has got it right. But then again, Unix' design pre-dates the locale
issue by quite some time, so it is not difficult to see where it comes from.
Changing the filename (on checkout), so that the user sees an Ü regardless
of his or her locale (instead of an 0xDC, which only resolves to an Ü on
latin-1) would be an absolutely broken concept here.
Why would it? It is my view as a user on my files that defines how file names
are looked upon. If I have three machines, one Linux box using an iso8859-1
locale, an OS X box (where, I would believe, file APIs use UTF-8, someone
please correct me if I'm wrong), and a Windows box (which uses UTF-16 on the
file system layer, but does provide compatibility functions that use char
pointers), and create a file on each of these called "Ü.txt" (which would be
the sequence "DC 2E 74 78 74" on the Linux box, "C3 9C 2E 74 78 74" (or
probably something else since I believe OS X decomposes the string) on the
OS X box and "00DC 002E 0074 0078 0074" on the Windows box), I see these
three file names as equal.

If I would create a Git repo on each of the three machines and put the file
name in it, and then clone that on one of the other machines, *I* would
assume that the file names were converted to fit the host operating system.
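Since latin-1 maps 1:1 onto U+0000..U+00FF, the re-encoding such a conversion
layer would have to do for the iso8859-1 box is entirely mechanical. A sketch
(latin1_to_utf8 is a hypothetical helper, not anything in Git):

```c
#include <stddef.h>

/* Hypothetical helper: re-encode a latin-1 file name as UTF-8.
 * Latin-1 bytes map 1:1 to code points U+0000..U+00FF, so bytes below
 * 0x80 pass through unchanged and the rest become two-byte sequences.
 * Returns the number of bytes written, not counting the NUL. */
static size_t latin1_to_utf8(const char *in, char *out, size_t outlen)
{
	size_t n = 0;
	const unsigned char *s = (const unsigned char *)in;

	for (; *s; s++) {
		if (*s < 0x80) {
			if (n + 1 < outlen)
				out[n] = (char)*s;
			n++;
		} else {
			if (n + 2 < outlen) {
				out[n]     = (char)(0xc0 | (*s >> 6));   /* 0xDC -> 0xC3 */
				out[n + 1] = (char)(0x80 | (*s & 0x3f)); /*         0x9C */
			}
			n += 2;
		}
	}
	if (n < outlen)
		out[n] = '\0';
	return n;
}
```

So "Ü.txt" as the latin-1 bytes "DC 2E 74 78 74" comes out as the UTF-8 bytes
"C3 9C 2E 74 78 74", matching the sequences listed above.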
IMHO having encoding specific open functions is begging for problems.
Indeed. That's why I like Windows' wchar_t APIs, and dislike Unix' and
Linux' char APIs that, in some ways, depend on the user locale.

--
\\// Peter - http://www.softwolves.pp.se/
Lars Noschinski
2009-03-03 16:29:25 UTC
Changing the filename (on checkout), so that the user sees an Ü regardless of
his or her locale (instead of an 0xDC, which only resolves to an Ü on
latin-1) would be an absolutely broken concept here.
Why would it? It is my view as a user on my files that defines how file names
are looked upon. If I have three machines, one Linux box using an iso8859-1
locale, an OS X box (where, I would believe, file APIs use UTF-8, someone
please correct me if I'm wrong), and a Windows box (which uses UTF-16 on the
file system layer, but does provide compatibility functions that use char
pointers), and create a file on each of these called "Ü.txt" (which would be
the sequence "DC 2E 74 78 74" on the Linux box, "C3 9C 2E 74 78 74" (or
probably something else since I believe OS X decomposes the string) on the OS X
box and "00DC 002E 0074 0078 0074" on the Windows box), I see these three file
names as equal.
Because a function in the source code refers to (e.g.) "DC 2E 74 78 74",
not "C3 9C 2E 74 78 74" nor "00DC 002E 0074 0078 0074". And it does so
regardless of the locale.

The file name may look funny depending on your locale, but if you rename
the file to fit your local encoding, it would not work.
Robin Rosenberg
2009-03-03 20:59:25 UTC
Changing the filename (on checkout), so that the user sees an Ü
regardless of his or her locale (instead of an 0xDC, which only
resolves to an Ü on latin-1) would be an absolutely broken concept here.
Why would it? It is my view as a user on my files that defines how file names
are looked upon. If I have three machines, one Linux box using an iso8859-1
locale, an OS X box (where, I would believe, file APIs use UTF-8, someone
please correct me if I'm wrong), and a Windows box (which uses UTF-16 on the
file system layer, but does provide compatibility functions that use char
pointers), and create a file on each of these called "Ü.txt" (which would be
the sequence "DC 2E 74 78 74" on the Linux box, "C3 9C 2E 74 78 74" (or
probably something else since I believe OS X decomposes the string) on the
OS X box and "00DC 002E 0074 0078 0074" on the Windows box), I see these
three file names as equal.
Because a function in the source code refers to (e.g.) "DC 2E 74 78 74",
not "C3 9C 2E 74 78 74" nor "00DC 002E 0074 0078 0074". And it does so
regardless of the locale.
The only actual language I know where I've seen people use non-ascii names for
referenced files, i.e. classes, is Java, and there you specify the encoding to
the compiler. Class names are not byte sequences there. XML files are another
case where referenced files are defined in unicode. I assume this applies to
C# and other modern languages too.
The file name may look funny depending on your locale, but if you rename
the file to fit your local encoding, it would not work.
In the Java case, you /have/ to "rename" or the build will break. Build systems
like Ant or Maven require you to "rename" too regardless of what you build. A C
Git clone will produce unbuildable code, but JGit will produce a working one
for unicode-aware systems, and documentation, the case where unicode filenames
are more common than in source, will look good.

-- robin

PS. I readded the people you forgot to Cc.
Dmitry Potapov
2009-03-03 09:47:31 UTC
In most cases, I would most definitely agree with you on calling it that,
but when it comes to Unicode support, Windows is one of the least broken
OSes (with Symbian being my favourite).
The C Standard requires that the type wchar_t is capable of representing
any character in the current locale. If Windows uses UTF-16 as internal
encoding (so, it can work with symbols outside of the BMP), it means you
cannot have 16-bit wchar_t and be compliant with the C standard...

Dmitry
Peter Krefting
2009-03-03 11:48:02 UTC
Post by Dmitry Potapov
The C Standard requires that the type wchar_t is capable of representing
any character in the current locale. If Windows uses UTF-16 as internal
encoding (so, it can work with symbols outside of the BMP), it means you
cannot have 16-bit wchar_t and be compliant with the C standard...
No, that's not quite correct. wchar_t is defined to be "an integer type whose
range of values can represent distinct codes for all members of
the largest extended character set specified among the supported locales".
Since Windows defines all local character sets as Unicode-based, having
wchar_t defined as Unicode means that it can represent everything.
--
\\// Peter - http://www.softwolves.pp.se/
Dmitry Potapov
2009-03-03 17:13:58 UTC
Post by Peter Krefting
Post by Dmitry Potapov
The C Standard requires that the type wchar_t is capable of representing
any character in the current locale. If Windows uses UTF-16 as internal
encoding (so, it can work with symbols outside of the BMP), it means you
cannot have 16-bit wchar_t and be compliant with the C standard...
No, that's not quite correct. wchar_t is defined to be "an integer type
whose range of values can represent distinct codes for all members of the
largest extended character set specified among the supported locales". Since
Windows defines all local character sets as Unicode-based, having wchar_t
defined as Unicode means that it can represent everything.
No, it does not, if you have wchar_t that is only 16-bit wide, because
characters outside of the BMP have integer values in Unicode greater
than 65535...

Dmitry
Peter Krefting
2009-03-04 10:51:15 UTC
Post by Dmitry Potapov
No, it does not, if you have wchar_t that is only 16-bit wide, because
characters outside of the BMP have integer values in Unicode greater than
65535...
UTF-16 allows you to reference all of Unicode (i.e. up to U+10FFFF) using
surrogate pairs. That means that not all characters can be represented as a
single wchar_t, that is true. The problem with changing wchar_t is that it
was defined to use 16-bit values at a time where Unicode was defined to use
16-bit code points (but they soon figured out that was not enough).
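The surrogate-pair arithmetic is small enough to show. A sketch for
illustration only (utf16_encode is a made-up name, not Git code):

```c
/* How UTF-16 reaches characters outside the BMP: a code point in
 * U+10000..U+10FFFF is split across two 16-bit units, a high surrogate
 * (0xD800..0xDBFF) carrying the top 10 bits and a low surrogate
 * (0xDC00..0xDFFF) carrying the bottom 10; everything below U+10000
 * fits in a single unit.  Returns the number of units produced. */
static int utf16_encode(unsigned long cp, unsigned short unit[2])
{
	if (cp < 0x10000) {
		unit[0] = (unsigned short)cp;
		return 1;			/* one unit */
	}
	cp -= 0x10000;				/* 20 bits remain */
	unit[0] = (unsigned short)(0xd800 | (cp >> 10));	/* high 10 bits */
	unit[1] = (unsigned short)(0xdc00 | (cp & 0x3ff));	/* low 10 bits */
	return 2;				/* surrogate pair */
}
```

NTFS stores such unit sequences without checking that the pairs actually
match up, which is why unpaired surrogates are legal file names there.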

Anyway, this is getting off-topic. Please feel free to reply in private.
--
\\// Peter - http://www.softwolves.pp.se/
Dmitry Potapov
2009-03-04 14:18:41 UTC
The problem with changing wchar_t is that
it was defined to use 16-bit values at a time where Unicode was defined
to use 16-bit code points (but they soon figured out that was not
enough).
I do realize that is a problem, and unfortunately there is no easy and
quick fix to it. But you brought Windows as an example of good Unicode
support... Well, to my mind, it is not, at least, not for C programs.
You have two serious problems here:
1. wchar_t is too small to hold all Unicode characters as it is required
by C standard.
2. UTF-8 support is broken in C runtime library.

In fact, if UTF-8 were supported by the C runtime, we would not have this
thread in the first place... Now, it is possible to wrap all C functions used
by Git to make them work with UTF-8, but it is a lot of work...

Dmitry
Johannes Sixt
2009-03-02 12:34:19 UTC
Post by Peter Krefting
Post by Johannes Sixt
If git prints a file name under the assumption that it is UTF-8
encoded, then it will be displayed incorrectly because the system uses
a different encoding.
Here setting the local codepage to UTF-8 *might* work, although I
haven't tested that. Or always use the wchar_t versions of printf and
friends.
You cannot expect users to switch the locale. For example, I have to test
our software with Japanese settings: I *cannot* switch to UTF-8 just
because of git.

Can you set the local codepage per program? (I don't know.) It might help
here, but it doesn't help in all cases, particularly in certain pipelines:

git ls-files -o
git ls-files -o | git update-index --add --stdin
find . -name \*.jpg | git update-index --add --stdin

- What encoding should 'ls-files' use for its output? Certainly not always
UTF-8: stdout should use the local code page so that the file names are
interpreted correctly by the terminal window (it expects the local code page).

- What encoding should 'update-index' expect from its input? Can you be
sure that other programs generate UTF-8 output?

How do you solve that?

-- Hannes
Peter Krefting
2009-03-02 13:12:32 UTC
Post by Johannes Sixt
Can you set the local codepage per program? (I don't know.)
The locale is set per thread, and gets reset when the program exits. So
setting the codepage to UTF-8 before outputting should work. That should
also work for displaying the log to the terminal if you have UTF-8 log
messages.

Converting it to wchar_t and using wprintf and similar should be safer,
though (and I have no idea what happens if you try to pipe the output to
something else).
Post by Johannes Sixt
- What encoding should 'ls-files' use for its output? Certainly not always
UTF-8: stdout should use the local code page so that the file names are
interpreted correctly by the terminal window (it expects the local code page).
That is exactly why trying to mix "protocol" data ("plumbing" in Git's case)
and user output will always come back and bite you, one way or another. I
haven't really the faintest how pipes work with Unicode on Windows.
Somewhere along the line there will probably be some conversions, which
would cause interesting issues.

Better not use pipes, then. Heh. I sense that there is a slight problem with
the architecture of Git and trying to get it to behave on Windows... :-)
Post by Johannes Sixt
- What encoding should 'update-index' expect from its input? Can you be
sure that other programs generate UTF-8 output?
Theoretically, if all the internal stuff is hacked around to output Unicode,
and the thread codepage is set up to use UTF-8, it should "just work". And
if run directly from the shell, it should still be converted to whatever the
system is set up to emit. That would mean, however, that a Git program that
internally runs

git-foo | git-bar | git-gazonk

might behave differently compared to if a user would enter it on the
command-line.
--
\\// Peter - http://www.softwolves.pp.se/
Robin Rosenberg
2009-03-02 19:58:33 UTC
Post by Peter Krefting
Post by Johannes Sixt
Can you set the local codepage per program? (I don't know.)
The locale is set per thread, and gets reset when the program exits. So
setting the codepage to UTF-8 before outputting should work. That should
also work for displaying the log to the terminal if you have UTF-8 log
messages.
Messing with locale is probably going to break subtly. An explicit approach
is better, respecting the user's locale when necessary.
Post by Peter Krefting
Converting it to wchar_t and using wprintf and similar should be safer,
though (and I have no idea what happens if you try to pipe the output to
something else).
Post by Johannes Sixt
- What encoding should 'ls-files' use for its output? Certainly not always
UTF-8: stdout should use the local code page so that the file names are
interpreted correctly by the terminal window (it expects the local code page).
That is exactly why trying to mix "protocol" data ("plumbing" in Git's case)
and user output will always come back and bite you, one way or another. I
haven't really the faintest how pipes work with Unicode on Windows.
Somewhere along the line there will probably be some conversions, which
would cause interesting issues.
Pipes are just bytes so you have to know what you're piping by convention
or protocol. You can ask for the console output page, which may be set to
a multibyte locale or unicode and maybe trust that.... (just guessing, really).
Post by Peter Krefting
Better not use pipes, then. Heh. I sense that there is a slight problem with
the architecture of Git and trying to get it to behave on Windows... :-)
architecture? Like the "architecture" of species? No, it's evolution.
If that applies to the linux kernel, it's not so strange it applies to git too.
Post by Peter Krefting
Post by Johannes Sixt
- What encoding should 'update-index' expect from its input? Can you be
sure that other programs generate UTF-8 output?
Theoretically, if all the internal stuff is hacked around to output Unicode,
and the thread codepage is set up to use UTF-8, it should "just work". And
msys doesn't seem to understand UTF-8 at all, so depending on that to work
seems futile. Simply bypassing the locale for any internal work is probably the
most sane thing. That also won't depend on the quality of the locale support in
the runtime. Start by making the git commands work without msys bash,
and figure a way to fix msys later, unless someone has a very good idea on
how to fix msys.
Post by Peter Krefting
if run directly from the shell, it should still be converted to whatever the
system is set up to emit. That would mean, however, that a Git program that
internally runs
git-foo | git-bar | git-gazonk
might behave differently compared to if a user would enter it on the
command-line.
You might also want to check out my work in the area. See

http://www.jgit.org/cgi-bin/gitweb/gitweb.cgi?p=GIT.git;a=shortlog;h=i18n

The goal is locale neutrality yielding the "expected", in the user's eyes, result regardless
of locale as much as possible. Junio didn't want to have it for five years, so I
guess there's still three and a half to go. Hopefully he can change his mind. That branch
is heavily outdated by now, as some of the functionality has been introduced by other
means like logoutputencoding, and other parts of git have been rewritten.

Related to this, JGit assumes UTF-8 on reading. If it's not valid UTF-8 we try the user's
locale (roughly), and on writing object metadata, including any sort of identifier,
we always write UTF-8 when we have to be explicit. We let the runtime decide how
to encode file names in the file system using the user's locale.

I'd be almost happy with a solution that works when people are interacting using
the subset that is convertible between the character sets in use.

-- robin
Peter Krefting
2009-03-02 20:52:41 UTC
Permalink
Post by Robin Rosenberg
Pipes are just bytes so you have to know what you're piping by convention
or protocol. You can ask for the console output page, which may be set to
a multibyte locale or unicode and maybe trust that.... (just guessing, really).
You can get cmd.exe to write data to pipes and redirections as UTF-16
Unicode (cmd.exe /u), perhaps there is a way to capitalise on that?
"Unfortunately", the Git stuff is mostly called from a bash shell inside
msys, so it requires a "bit" more work...
Post by Robin Rosenberg
architecture? Like the "architecture" of species? No, it's evolution.
There's still an architecture there, somewhere. Perhaps not intended or
specified, but there definitely is one :-)
Post by Robin Rosenberg
http://www.jgit.org/cgi-bin/gitweb/gitweb.cgi?p=GIT.git;a=shortlog;h=i18n
The goal is locale neutrality yielding the "expected", in the users eyes,
result regardless of locale as much as possible.
Ah, yes, that looks like an interesting starting point. I already assumed
that Git on Linux would use UTF-8 for everything already, since it already
does that for the commit messages despite me using an iso8859-1 locale.
Apparently I haven't done my homework.
Post by Robin Rosenberg
We let the runtime decide on how to encode file names in the file system
using the user's locale.
That's good. That's what I'm trying to achieve. Or, rather, avoid the user
locale altogether (which is easy on Windows since the file names are always
stored in Unicode, and the user locale can be bypassed).
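The approach under discussion, bypassing the user locale by going through the
Unicode file APIs, can be sketched roughly along the lines of the RFC patch.
The name `utf8_fopen` is hypothetical, the Windows branch is an untested
sketch, and invalid UTF-8 falls back to the plain byte-oriented call, as the
patch notes suggest:

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>

/* Open a UTF-8 encoded path by converting it to UTF-16 and using the
 * wide-character CRT interface. Falls back to plain fopen() when the
 * path or mode is not valid UTF-8, so byte-oriented callers keep
 * working. */
FILE *utf8_fopen(const char *path, const char *mode)
{
	wchar_t wpath[MAX_PATH], wmode[16];

	/* MB_ERR_INVALID_CHARS makes the conversion reject malformed
	 * UTF-8 instead of silently substituting characters. */
	if (!MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
				 path, -1, wpath, MAX_PATH) ||
	    !MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
				 mode, -1, wmode, 16))
		return fopen(path, mode);	/* fall back on invalid input */

	return _wfopen(wpath, wmode);
}
#else
/* On POSIX systems file names are byte strings, so UTF-8 passes
 * straight through unchanged. */
FILE *utf8_fopen(const char *path, const char *mode)
{
	return fopen(path, mode);
}
#endif
```

On POSIX builds this is just fopen(); the conversion path only compiles on
Windows.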
Post by Robin Rosenberg
I'd be almost happy with a solution that works when people are interacting
using the subset that is convertible between the character sets in use.
You mean like the "invariant" character set? :-) Using Unicode internally
(in whatever encoding) is nice, the problem is when you have to interact
with the world around you.
--
\\// Peter - http://www.softwolves.pp.se/
Robin Rosenberg
2009-03-02 21:21:09 UTC
Permalink
Post by Peter Krefting
Post by Robin Rosenberg
I'd be almost happy with a solution that works when people are interacting
using the subset that is convertible between the character sets in use.
You mean like the "invariant" character set? :-) Using Unicode internally
(in whatever encoding) is nice, the problem is when you have to interact
with the world around you.
Not sure what that is. I mean that in a local nordic setting, people can
use iso-8859-1|15/windows-1252/UTF-8 for their needs by means of converting
the characters as-needed without loss, with very few practical restrictions.

For a larger setting that won't do, but then the need is typically less
since people tend to use ASCII only, or you jump to all Unicode.

Just because I use UTF-8 doesn't mean I start using more characters
in practice.

-- robin
Peter Krefting
2009-03-03 05:51:59 UTC
Permalink
Post by Robin Rosenberg
Not sure what that is.
"Invariant" is defined in an old RFC as the common subset of several
ASCII-like and ASCII-based encodings. This was back before the MIME days,
IIANM.
Post by Robin Rosenberg
I mean that in a local nordic setting, people can use
iso-8859-1|15/windows-1252/UTF-8 for their needs by means of converting
the characters as-needed without loss, with very few practical
restrictions.
Indeed. The trick is to have the storage (in this case, Git and its tree
objects) store the file name data in a commonly agreed-upon way. Then it
is simple to convert at the end-points.
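Since MultiByteToWideChar() with MB_ERR_INVALID_CHARS rejects malformed
input, an end-point needs a way to decide whether a name is valid UTF-8
before converting. A rough, portable statement of what "valid UTF-8" means
here (a hypothetical helper, not part of the patch):

```c
#include <stddef.h>

/* Return 1 if buf is well-formed UTF-8, 0 otherwise. This mirrors the
 * checks MultiByteToWideChar() makes with MB_ERR_INVALID_CHARS:
 * correct continuation bytes, no overlong forms, no UTF-16 surrogate
 * code points, nothing above U+10FFFF. */
int is_valid_utf8(const char *buf)
{
	const unsigned char *s = (const unsigned char *)buf;

	while (*s) {
		int len;
		unsigned cp;

		if (*s < 0x80) { s++; continue; }		/* ASCII */
		else if ((*s & 0xE0) == 0xC0) { len = 2; cp = *s & 0x1F; }
		else if ((*s & 0xF0) == 0xE0) { len = 3; cp = *s & 0x0F; }
		else if ((*s & 0xF8) == 0xF0) { len = 4; cp = *s & 0x07; }
		else return 0;				/* bad lead byte */

		for (int i = 1; i < len; i++) {
			if ((s[i] & 0xC0) != 0x80)
				return 0;		/* bad continuation */
			cp = (cp << 6) | (s[i] & 0x3F);
		}

		/* Reject overlong encodings, surrogates, out-of-range. */
		if ((len == 2 && cp < 0x80) ||
		    (len == 3 && cp < 0x800) ||
		    (len == 4 && cp < 0x10000) ||
		    (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
			return 0;
		s += len;
	}
	return 1;
}
```

A name failing this check would take whatever fallback path the wrapper
chooses (plain byte-oriented open, or an error).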
Post by Robin Rosenberg
Just because I use UTF-8 doesn't mean I start using more characters in
practice.
Most people do not, no. But using a Unicode encoding means that they at
least have the option. Sometimes, having to mangle stuff down to ASCII is a
pain.
--
\\// Peter - http://www.softwolves.pp.se/
Dmitry Potapov
2009-03-03 09:43:55 UTC
Permalink
Post by Peter Krefting
When opening a file through open() or fopen(), the path passed is
UTF-8 encoded. To handle this on Windows, we need to convert the
path string to UTF-16 and use the Unicode-based interface.
IMHO, you grossly underestimate what is needed to enable UTF-8 encoding
in Windows. AFAIK, Microsoft C runtime library does not support UTF-8,
so you have to wrap all C functions taking 'char*' as an input parameter.
For example, think about what is going to happen if Git tries to print
a simple error message:
fprintf (stderr, "unable to open %s", path);
Post by Peter Krefting
Since there is no real file system abstraction beyond using stdio
(AFAIK), I need to hack it by replacing fopen (and open). Probably
opendir/readdir as well (might be trickier), and possibly even hack
around main() to parse the wchar_t command-line instead of the char copy.
And the command-line is not the only source of file names. Some Git
commands read a list of files from stdin, usually through a pipe. In
what encoding are they going to be?

Dmitry
Peter Krefting
2009-03-03 11:56:47 UTC
Permalink
IMHO, you grossly underestimate what is needed to enable UTF-8 encoding in
Windows. AFAIK, Microsoft C runtime library does not support UTF-8, so you
have to wrap all C functions taking 'char*' as an input parameter.
I have to wrap all file-related functions, at least.
For example, think about what is going to happen if Git tries to print a
simple error message: fprintf (stderr, "unable to open %s", path);
Yeah. That's a problem. That might be solvable by setting the thread locale
to something UTF-8 based and having the console window convert to the output
codepage (that is what it does when you use wprintf and friends).
And the command-line is not the only source of file names. Some Git
commands read a list of files from stdin, usually through a pipe. In what
encoding are they going to be?
Indeed. Pipes are a problem.
--
\\// Peter - http://www.softwolves.pp.se/
John Dlugosz
2009-03-03 18:25:14 UTC
Permalink
===Re:===
The other way would be to keep the char* APIs but convert to the Windows
locale encoding ("ANSI codepage"), but that will break horribly as not all
file names that can be used on a file system can be represented as such.
===end===

Actually, UTF-8 is a valid code page on Windows. The code page ID is
65001. So, if you set the process code page to that, =and= set the file
system API's code page to follow rather than using the OEM code page
(the default), it should work just fine.

Also, there is a national code page that =will= represent all file names
on the systems and is supported: That is the Chinese GB18030, code page
54936. That has every character that Unicode does, just encoded
differently to be forward compatible with GBK. That is fully supported
by Windows, as it is required by law to sell in Chinese markets.

Let me know if I can be of help. I know character set stuff and Win32
fairly well.

--John



TradeStation Group, Inc. is a publicly-traded holding company (NASDAQ GS: TRAD) of three operating subsidiaries, TradeStation Securities, Inc. (Member NYSE, FINRA, SIPC and NFA), TradeStation Technologies, Inc., a trading software and subscription company, and TradeStation Europe Limited, a United Kingdom, FSA-authorized introducing brokerage firm. None of these companies provides trading or investment advice, recommendations or endorsements of any kind. The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited.
If you received this in error, please contact the sender and delete the material from any computer.
Peter Krefting
2009-03-04 10:53:14 UTC
Permalink
Post by John Dlugosz
Actually, UTF-8 is a valid code page on Windows.
Yes, but I am unsure whether it can be set as a thread locale for the sake
of file APIs.
Post by John Dlugosz
Also, there is a national code page that =will= represent all file names
on the systems and is supported: That is the Chinese GB18030, code page
54936.
Yeah, but unfortunately it is explicitly documented that it is only
supported in MultiByteToWideChar, WideCharToMultiByte and some text painting
APIs in Windows, i.e. the stdio functions and others may break horribly.
--
\\// Peter - http://www.softwolves.pp.se/
John Dlugosz
2009-03-04 19:34:33 UTC
Permalink
===Re:===
Yes, but I am unsure whether it [UTF-8] can be set as a thread locale
for the sake of file APIs.
===end===

Why wouldn't it? If the ANSI forms simply allocate buffers and call
WideCharToMultiByte and MultiByteToWideChar, it should work with
anything those functions handle. My only concern would be with buffer
length when converting to MultiByte, if it assumes a limit based on 2
bytes max per character. But, it works with GB18030, which can have
4-byte characters.

It's certainly easy enough to try.

===Re:===
Yeah, but unfortunately it [GB18030] is explicitly documented that it is only
supported in MultiByteToWideChar, WideCharToMultiByte and some text painting
APIs in Windows, i.e. the stdio functions and others may break horribly.
===end===

Code that works with the other multi-byte "ANSI" character sets, and GBK
in particular, will handle GB18030 "reasonably well" with no changes.
For example, printf ("xxx%sxxx", name), where each 'x' may actually be
any character, will work without problems -- it won't mis-identify the %
in the middle of a 4-byte character. But printf ("%5s",name) will count
some of the characters in 'name' as two, and print less than 5 of them;
or worse yet, break a character in half.

I can't think of anything that breaks horribly. Only situations that
involve counting them will have issues.
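The counting problem can be made concrete with precision rather than width:
`%.Ns` limits output to N *bytes*, which is exactly where multibyte text gets
split. A small portable illustration using a hypothetical helper (UTF-8 here
for convenience; the same byte-vs-character mismatch applies to GB18030):

```c
#include <stdio.h>
#include <string.h>

/* Truncate a string to at most `max` bytes the way "%.Ns" does, and
 * report whether the cut landed inside a multibyte sequence. */
int truncates_mid_character(const char *s, int max, char *out, size_t outsz)
{
	/* Precision for %s counts bytes, not characters. */
	snprintf(out, outsz, "%.*s", max, s);
	size_t n = strlen(out);
	/* If the first byte dropped from the source is a continuation
	 * byte (10xxxxxx), a character was split in half. */
	return n == (size_t)max && ((unsigned char)s[n] & 0xC0) == 0x80;
}
```

For example, "日本語" is three characters but nine bytes in UTF-8, so
truncating it to five bytes copies the first character whole plus two bytes
of the second, leaving a broken sequence, while truncating "abcdefgh" to
five bytes is harmless.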

As empirical evidence, lots of Windows software works fine in China.
You need full GB18030 support to read a newspaper on the web, because
the 4-byte characters are mostly obscure and regional words, but also
proper nouns including the names of some prominent people (Prime
Minister or something like that; I don't remember exactly). But mostly
you don't encounter them and chug along with GBK and the occasional '?'
where some character did not work.

--John

John Dlugosz
2009-03-03 19:36:45 UTC
Permalink
===Re:===
You cannot expect users to switch the locale. For example, I have to test
our software with Japanese settings: I *cannot* switch to UTF-8 just
because of git.

Can you set the local codepage per program? (I don't know.) It might help
here, but it doesn't help in all cases, particularly in certain pipelines:
===end===

Yes, you can. The code page can be set per thread. The function call is:

SetThreadLocale (lcid);

where lcid is just 65001 for UTF-8. (The other fields in the LCID are
high-order bits and all zero for no sublanguage and default sort order).

When a thread is created, it starts with the system default thread
locale. So call SetThreadLocale on every thread you create. In
particular, realize that the new thread does not inherit this from the
creating thread.

Meanwhile... the file I/O functions don't use the same code page. The
encoding of file names on a floppy disk or whatnot was historically done
using the "OEM code page", and when a different code page is used for
text editing, that shouldn't break compatibility. So, all functions
exported from Kernel32.dll that accept or return file names use a
separate setting, and setting the locale as shown above will not affect
it. This might be the source of confusion to those experimenting with
it.

So, also make a call to

SetFileApisToANSI();

This affects the entire process, not just the thread.
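Put together, the startup sequence described here might look like the sketch
below. It is untested; the 65001 value and the per-thread caveat are as
stated above, and non-Windows builds need no equivalent:

```c
#ifdef _WIN32
#include <windows.h>

/* Hypothetical process/thread setup for UTF-8, following the steps
 * described in the message above. Must be repeated on every thread,
 * since a new thread starts with the system default locale and does
 * not inherit this from its creator. */
void init_utf8_locale(void)
{
	/* 65001 is the UTF-8 code page; the other LCID fields are zero
	 * for no sublanguage and default sort order. */
	SetThreadLocale(65001);

	/* File-name functions in Kernel32 use the OEM code page by
	 * default; switch them to the "ANSI" code page instead. This
	 * call affects the whole process, not just the thread. */
	SetFileApisToANSI();
}
#else
/* No-op outside Windows: POSIX file names are byte strings, so UTF-8
 * passes straight through. */
void init_utf8_locale(void)
{
}
#endif
```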

So much for specifying UTF-8 file names in Windows. A related issue is
the console input and output of same. I don't know if the sh program
that is part of msys or Cygwin does anything to the console window it is
using, but each console window can have its own code page as well. The
default for 8-bit API (char*'s) is also the OEM character set, not the
so-called ANSI character set that is specified with SetThreadLocale.
I've not experimented with setting this (and restoring it) within a
program invoked in that console. But if you use the 16-bit API for
console I/O, it is not a problem and works regardless of how the user
chose to set it. To make it even more confusing, the console doesn't
respect the UTF-8 setting if the font is not set properly too.

--John


John Dlugosz
2009-03-03 20:39:07 UTC
Permalink
Re: AFAIK, Microsoft C runtime library does not support UTF-8,

Actually, here is a clip from the runtime library source code:

tmode = _textmode(fh);

switch (tmode) {
case __IOINFO_TM_UTF8 :
        /* For a UTF-8 file, we need 2 buffers, because after reading we
           need to convert it into UNICODE - MultiByteToWideChar doesn't
           do in-place conversions. */

        /* MultiByte To WideChar conversion may double the size of the
           buffer required & hence we divide cnt by 2 */

        /*
         * Since we are reading UTF8 stream, cnt bytes read may vary
         * from cnt wchar_t characters to cnt/4 wchar_t characters. For
         * this reason if we need to read cnt characters, we will
         * allocate MBCS buffer of cnt. In case cnt is 0, we will
         * have 4 as minimum value. This will make sure we don't
         * overflow for reading from pipe case.
         *
         * In this case the numbers of wchar_t characters that we can
         * read is cnt/2. This means that the buffer size that we will
         * require is cnt/2.
         */

        /* For UTF8 we want the count to be an even number */
This is in the _read(fd, buffer, count) function, and shows that it will
in fact read UTF-8 and automatically transform it to UTF-16LE
transparently. The documentation for _open explains this feature.

Meanwhile, a quick look at _mbslen() etc. shows that they are
implemented, and will handle UTF-8 encoded text as variable-length char*
just fine as long as suitable tables are loaded in its locale. An
internal header shows macros for generating the lead-byte information as
needed by that table.

Now, the default when a program starts is to use the "C" locale. The
locale argument to setlocale can take a form ".code_page", so calling

setlocale (LC_CTYPE, ".65001");

should do the trick. Assuming, that is, that you don't hit macros that
assume that characters are never multibyte. So define the preprocessor
symbol _MBCS when you compile.
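A cautious way to attempt this is to check setlocale()'s return value, since
runtimes that reject the ".code_page" form (or the UTF-8 code page
specifically, as discussed later in the thread) return NULL. A sketch with a
hypothetical helper name:

```c
#include <locale.h>

/* Try to select a UTF-8 C-runtime locale. The ".65001" form is the
 * Windows codepage syntax described above; runtimes that reject it
 * (older MSVCRTs, and POSIX libcs which use a different naming scheme)
 * make setlocale() return NULL, so fall back rather than continue with
 * a half-configured locale. */
const char *select_utf8_locale(void)
{
	const char *loc = setlocale(LC_CTYPE, ".65001");
	if (!loc)
		loc = setlocale(LC_CTYPE, "C");	/* conservative fallback */
	return loc;
}
```

The returned string names whichever locale actually took effect, so callers
can tell whether UTF-8 mode was obtained.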

Older versions might not work right because MBCS (multibyte character
strings) was only actually implemented to DBCS (double-byte). That is,
a single lead byte would be followed by a second byte, and no other
cases are provided for. But, GB18030 has up to 4 bytes in a single
character. It might still not be completely "clean" though because
GB18030 has a "double double" nature to it. Just like assuming 16-bit
characters period mostly works with surrogate pairs even if you didn't
code full UTF-16 support, DBCS code will see a 4-byte GB18030 character
as two double byte characters. So it gets the len (in characters)
wrong, and might still break up what is supposed to be a single
character. So it really needs some improvement from the historical
DBCS-only code to work properly.

Anyway, if UTF-8 really doesn't work with MBCS functions acceptably
well, and the goal is to allow passage of all characters through the
program, then set the program to use Chinese. GB18030 is =fully=
supported and is just another (albeit strange) encoding for Unicode.

As for what
fprintf (stderr, "unable to open %s", path);
will do, it will have no problem copying the contents of path to the
output stream no matter how it is encoded. The result will be sent to
stderr, which may be autotranslating the local code page to UTF-16 or
UTF-8, but by default just feeds the stream of bytes to the console
window's 8-bit API, which has its own code page setting.

Personally, I have printf'ed UTF-8 encoded text to standard output. It
looks OK if the console is also set to UTF-8.

--John
(please excuse the footer; it's not my idea)



Dmitry Potapov
2009-03-03 21:02:40 UTC
Permalink
Post by John Dlugosz
Now, the default when a program starts is to use the "C" locale. The
locale argument to setlocale can take a form ".code_page", so calling
setlocale (LC_CTYPE, ".65001");
should do the trick. Assuming, that is, that you don't hit macros that
assume that characters are never multibyte. So define the preprocessor
symbol _MBCS when you compile.
If Microsoft fixed the problem with UTF-8 support in the C runtime, it is
really good news, because setlocale did not work until not so long ago:

http://blogs.msdn.com/michkap/archive/2006/03/13/550191.aspx

As to the Win32 API, it has always worked correctly with UTF-8... In fact, the
documentation of the GetOEMCP function goes as far as recommending
to use UTF-8 or UTF-16: "For the most consistent results, applications should
use Unicode, such as UTF-8 or UTF-16, instead of a specific code page."

So it would be great if Git supported UTF-8 on Windows (as an option), but it
is not my itch right now....

Dmitry
John Dlugosz
2009-03-03 21:56:44 UTC
Permalink
===Re:===
If Microsoft fixed the problem with UTF-8 support in the C runtime, it is
really good news, because setlocale did not work until not so long ago:
===end===

They totally replaced it with one written by P.J.Plauger. I'm not sure
when, but I would guess around VC++7.1, which was a "sea change" and
felt more like a different brand than a simple update. That's when
templates started following the standard.

Re:
http://blogs.msdn.com/michkap/archive/2006/03/13/550191.aspx

Interesting. So it sort-of worked, as per my overlong muse as I looked
at the source code, but they started explicitly preventing it because it
doesn't always work for everything.

// verify codepage validity
if (!iCodePage || iCodePage == CP_UTF7 || iCodePage == CP_UTF8 ||
!IsValidCodePage((WORD)iCodePage))
return FALSE;


===Re:===
As to the Win32 API, it has always worked correctly with UTF-8... In fact, the
documentation of the GetOEMCP function goes as far as recommending to use
UTF-8 or UTF-16: "For the most consistent results, applications should
use Unicode, such as UTF-8 or UTF-16, instead of a specific code page."
===end===

I remember a time when it did not. I don't recall if it was NT (as
opposed to consumer windows) or some version of NT beyond 3.5 (starting
in 4?) that it became available. But I had to supply code with the
program because it could not count on it.

===Re:===
So it would be great if Git supported UTF-8 on Windows (as an option),
but it is not my itch right now....
===end===

Someone else mentioned "most people use ASCII file names", and I would
take that to be true only if "most people" == "developers". If you look
at my wife's "explorer" view, it's all Chinese. Files are downloaded
with Asian file names. Most people =in= China are used to seamless
support within Windows. It's only with Chinese MUI on English Windows
that the "ANSI" stuff doesn't match and programs that use 8-bit API
calls suddenly croak as they see "?????" for input.

--John


Robin Rosenberg
2009-03-07 10:38:14 UTC
Permalink
Slightly related: a new Cygwin (not msysgit-related) version with UTF-8 support was announced. Most notably:

- New setlocale implementation allows to specify POSIX locale strings.
You can now use, for instance in bash, `export LC_ALL=en_US.UTF-8'.
The language and territory will be ignored for now, the charset
will be used by multibyte-related functions.

- UTF-8 filenames are supported now.

- Support UTF-8 in console window.

This certainly makes it more feasible to interoperate with *nix repos that have non-ASCII metadata and file names.

-- robin
