Bug 199291 - UDF OSTA CS0 implementation leaks Unicode surrogates to UTF-8
Summary: UDF OSTA CS0 implementation leaks Unicode surrogates to UTF-8
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: UDF (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Jan Kara
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-04 17:59 UTC by Mingye Wang
Modified: 2018-04-22 01:12 UTC (History)
1 user (show)

See Also:
Kernel Version: HEAD/All
Subsystem:
Regression: No
Bisected commit-id:


Attachments
udf: Fix leak of UTF-16 surrogates into encoded string (1.58 KB, patch)
2018-04-13 13:42 UTC, Jan Kara
Details | Diff
Full UTF-16 support for UDF pre-Linux 4.5. (8.36 KB, patch)
2018-04-13 22:14 UTC, Mingye Wang
Details | Diff

Description Mingye Wang 2018-04-04 17:59:18 UTC
As of the current Linux kernel on HEAD, for UDF unicode in fs/udf/unicode.c, no special handling is given for Unicode surrogate characters found in a decoded stream of OSTA CS0, causing:

1. Leak of Unicode surrogates into UTF-8, creating an ill-formed string,
2. Loss of SMP characters in UTF-8 when converting to OSTA CS0.

Although no such handling is mentioned in the UDF specs (as of 2.60), it should be noted that the OSTA "Unicode", embodied in the two sample functions UncompressUnicode and CompressUnicode, is defined with 16-bit code units much like UCS-2. Such an assumption works well on the "Unicode" found on platforms like Microsoft Windows (think NTFS), as they are in fact UTF-16 and treated as such somewhere else later in the pipeline.

For a UTF-8 (or UTF-32) "Unicode" side like Linux's to achieve the same effect, care should be taken to split SMP characters into surrogates when encoding and to merge surrogates into SMP when decoding.

------------

(Speaking of NTFS, the ntfs_ucstonls function looks a bit fishy with its unit-by-unit calls to uni2char too. The entire uni2char interface is a bit (16 bits) too narrow to start with. Question marks are at least better than ill-formed strings I guess?)
Comment 1 Mingye Wang 2018-04-04 18:03:02 UTC
It can be reasonably assumed that Microsoft Windows handles the surrogates properly. FreeBSD converts the strings via a 16-bit "Unicode" representation, and can optionally use a "kiconv" to handle the non-latin1 part as UTF-16BE.
Comment 2 Jan Kara 2018-04-11 07:42:25 UTC
Thanks for the report. Yeah, the intention in the Linux UDF implementation was to handle just characters from Basic Multilingual Plane since that's how I understood the OSTA UDF standard (and that is the UCS-2 encoding you speak about). Now looking at that again, since we treat CS0 strings with CompressionID == 8 as utf8 string, it also makes sense to treat CS0 strings with CompressionID == 16 as utf16 string and not just UCS-2. Also the fact that Windows treat these strings as utf16 weight more towards the decision to threat the string as utf16 for compatibility reasons. I'll work on improving the code.
Comment 3 Mingye Wang 2018-04-11 08:05:29 UTC
Speaking of the compID == 8 mode, it's more similar to a ISO-8859-1 with C0+C1 control codes. You still need to call something UTF8 on that to be complaint.
Comment 4 Jan Kara 2018-04-13 13:42:23 UTC
Created attachment 275353 [details]
udf: Fix leak of UTF-16 surrogates into encoded string

So after looking into this for some time I've decided for a simple fix of treating UTF-16 surrogates as invalid characters. So they won't leak to the resulting decoded string but we won't be able to decode characters from 0x10000 .. 0x10ffff range. But base multilingual plane contains the most of useful characters so I don't think many people care about supplementary planes. In principle we could support them but because Linux NLS framework works with 16-bit unicode it would be a bit of a hassle to implement this (we'd have to have a special utf-16 to utf-8 path, including decoding of utf-16 to 32-bit unicode first). And we might always implement this later if there's a real demand.
Comment 5 Mingye Wang 2018-04-13 14:10:22 UTC
There are 2 specific u16/u8 routines in fs/nls, named utf16s_to_utf8s and the reverse. Joliet has a good 16->8 handling that bypasses the nls framework for utf8, which is the only encoding capable of doing any SMP characters in kernel for now. Vfat likewise does this for 8->16 properly.
Comment 6 Mingye Wang 2018-04-13 15:16:13 UTC
Evidently UDF driver did this almost right before commit 3e7fc2055c931b1c27a9834a753611c879492a34 was introduced -- dedicated routines for u8/nls, etc, etc. In tag-version-speak the commit appears in Linux 4.6.

These changes are required to get pre-4.6 kernels to work properly:

* udf_UTF8toCS0 should have its 16-bit part replaced with utf8s_to_utf16s. The 8/16-bit decision may need to be done before utf8s_to_utf16s, as the 16-bit representation halves the number of usable characters. A "dumb" way to do such a decision is by hopping through all the characters in the UTF-8 string, and check for:
  * all characters are represented with 1 or 2 bytes, i.e. (1st_octet & 0xa0) != 0xa0.
  * no 2-byte characters exceed 0xff; i.e. (1st_octet >> 2) == 0x110000.
* udf_CS0toUTF8 can be broken into two cases: utf16s_to_utf8s for compid == 16. and a normal utf8 encode for compid == 8. You can specialize that utf8 encode part to consider c <= 0xff only, or you can use the nls routine.
Comment 7 Mingye Wang 2018-04-13 22:14:26 UTC
Created attachment 275361 [details]
Full UTF-16 support for UDF pre-Linux 4.5.
Comment 8 Jan Kara 2018-04-16 07:08:32 UTC
Thanks for the patch. But the code prior to 4.5 had other issues (wrong handling of invalid characters, some overflow issues if I remember right) so we cannot really just go back. What you suggest can be done even with the current codebase, no doubts about that, and it isn't even that hard. It's only a question whether it is worth it - currently I know about 0 users of characters outside of BMP as filenames on UDF.
Comment 9 Jan Kara 2018-04-17 12:49:15 UTC
OK, after sleeping to it, I've decide to implement full UTF-16 decoding / encoding while I still have the code and UTF rules fresh in my head. I'll send patches to the list and CC you. If you can do some testing, it would be welcome (I've tested that both one word and two word UTF-16 chars work when encoding / decoding but not much more).
Comment 10 Jan Kara 2018-04-19 20:26:27 UTC
I've pulled patches to my tree. Closing the bug.
Comment 11 Mingye Wang 2018-04-22 01:12:36 UTC
Thanks! I was not quite able to make a UDF image to test for the past days, but the changes look all right.

Note You need to log in before you can comment on or make changes to this bug.