Bug 199361 - NTFS driver fails on UTF-16 SMP characters in file names
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: fs_other
Depends on:
Reported: 2018-04-11 17:48 UTC by Mingye Wang
Modified: 2018-04-13 03:37 UTC

See Also:
Kernel Version: HEAD/All
Tree: Mainline
Regression: No


Description Mingye Wang 2018-04-11 17:48:13 UTC
The kernel NTFS driver shares a problem with UDF (bug 199291): it handles UTF-16 code units one by one and fails on surrogate pairs.
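
To illustrate the difference, here is a minimal user-space sketch in Python, not the driver code. The function names are hypothetical. It contrasts a converter that treats every 16-bit code unit as a complete character with one that first combines a high and low surrogate into a single code point.

```python
import struct

def utf16_units_to_utf8_per_unit(units):
    """Hypothetical model: convert each 16-bit code unit on its own."""
    out = bytearray()
    for u in units:
        if 0xD800 <= u <= 0xDFFF:
            # A surrogate is not a character by itself, so there is no UTF-8 form.
            raise ValueError(f"lone surrogate 0x{u:04X} cannot be converted to UTF-8")
        out += chr(u).encode("utf-8")
    return bytes(out)

def utf16_units_to_utf8_paired(units):
    """Hypothetical model: combine surrogate pairs before converting."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Standard UTF-16 decoding of a supplementary-plane code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            raise ValueError(f"unpaired surrogate 0x{u:04X}")
        else:
            cp = u
            i += 1
        out += chr(cp).encode("utf-8")
    return bytes(out)

# The file name from this report as a list of UTF-16LE code units.
raw = "🐧.txt".encode("utf-16-le")
units = list(struct.unpack(f"<{len(raw) // 2}H", raw))

print([hex(u) for u in units[:2]])        # the two surrogates that encode U+1F427
print(utf16_units_to_utf8_paired(units))  # UTF-8 bytes of the full name
try:
    utf16_units_to_utf8_per_unit(units)
except ValueError as err:
    print("per-unit conversion failed:", err)
```

The per-unit variant fails on the first code unit of the penguin, which matches the kind of conversion error reported below.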

Steps to reproduce:
1. On a Windows box, create a file called 🐧.txt on some NTFS media.
2. Mount it on Linux with the read-only (RO) driver.
3. Run `ls` on the mounted directory.

Expected results:
🐧.txt exists.

Actual results:
🐧.txt is not shown. Running `dmesg | tail` reveals "... contains characters that cannot be converted to utf8. try [...] nls=utf8".
Comment 1 Mingye Wang 2018-04-11 22:01:35 UTC
VFAT has a similar problem where 🐧.txt becomes ??.txt.
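
The two question marks are what you would expect if each half of the surrogate pair is converted separately and replaced on failure. A small Python sketch of that behaviour follows. The function name and the "?" fallback are assumptions made to model the symptom, not the actual VFAT code.

```python
def per_unit_with_fallback(units, fallback="?"):
    """Hypothetical model: convert code units one at a time, substituting
    a fallback character for any unit that cannot be converted alone."""
    out = []
    for u in units:
        if 0xD800 <= u <= 0xDFFF:
            # Each surrogate half fails independently, so one character
            # outside the BMP produces two fallback characters.
            out.append(fallback)
        else:
            out.append(chr(u))
    return "".join(out)

# UTF-16 code units of "🐧.txt": U+1F427 is the pair 0xD83D 0xDC27.
units = [0xD83D, 0xDC27, 0x2E, 0x74, 0x78, 0x74]
print(per_unit_with_fallback(units))  # prints "??.txt"
```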

The HFSplus driver calls uni2char, which is known to accept only a 16-bit wchar_t, so it is likely broken too.

JFS has jfs_strfromUCS_le, whose name arguably excuses it by promising only UCS rather than UTF-16, but by the reasoning applied to UDF "Unicode" it should be fixed too.

Joliet's uni16_to_x8 uses uni2char on Windows "Unicode" (UTF-16).

* * *

I mean, just grep for "unichar" under fs/. You can probably open 10 separate reports from that grep. The NLS interface does not correctly handle SMP characters to start with.
Comment 2 Mingye Wang 2018-04-11 22:03:56 UTC
> grep for "unichar"

