Bug 199361

Summary: NTFS driver fails on UTF-16 SMP characters in file names
Product: File System
Reporter: Mingye Wang (arthur200126)
Component: Other
Assignee: fs_other
Status: NEW ---
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version: HEAD/All
Subsystem:
Regression: No
Bisected commit-id:

Description Mingye Wang 2018-04-11 17:48:13 UTC
The kernel NTFS driver has the same problem as UDF (bug 199291): it converts UTF-16 code units one at a time, so it fails on surrogate pairs.
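For illustration, here is a minimal sketch of what pair-aware conversion could look like. It is not the NTFS driver's code and not a proposed patch; the helper names (`utf16_decode_one`, `utf8_encode_one`, `utf16_name_to_utf8`) are hypothetical, the input is assumed to be host-endian, and mapping unpaired surrogates to U+FFFD is my own choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a kernel API): decode one code point from
 * host-endian UTF-16. Returns the number of code units consumed (1 or 2).
 * The caller must pass len >= 1. An unpaired surrogate decodes to U+FFFD. */
static size_t utf16_decode_one(const uint16_t *s, size_t len, uint32_t *cp)
{
    assert(len >= 1);
    uint16_t u = s[0];

    /* High surrogate followed by a low surrogate: combine the pair. */
    if (u >= 0xD800 && u <= 0xDBFF && len >= 2 &&
        s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        *cp = 0x10000 + (((uint32_t)(u - 0xD800) << 10) |
                         (uint32_t)(s[1] - 0xDC00));
        return 2;
    }
    *cp = (u >= 0xD800 && u <= 0xDFFF) ? 0xFFFD : u;
    return 1;
}

/* Encode one code point as UTF-8; out must have room for 4 bytes. */
static size_t utf8_encode_one(uint32_t cp, unsigned char *out)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Convert a whole name. Returns the number of bytes written,
 * or (size_t)-1 if out is too small. No NUL terminator is added. */
static size_t utf16_name_to_utf8(const uint16_t *in, size_t len,
                                 unsigned char *out, size_t cap)
{
    size_t i = 0, n = 0;

    while (i < len) {
        uint32_t cp;
        unsigned char buf[4];

        i += utf16_decode_one(in + i, len - i, &cp);
        size_t k = utf8_encode_one(cp, buf);
        if (n + k > cap)
            return (size_t)-1;
        for (size_t j = 0; j < k; j++)
            out[n++] = buf[j];
    }
    return n;
}
```

With the name from this report, the units 0xD83D 0xDC27 (U+1F427) followed by ".txt" come out as the four bytes F0 9F 90 A7 plus ".txt", 8 bytes in total.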

Steps to reproduce:
1. On a Windows box, create a file called 🐧.txt on an NTFS volume.
2. Let Linux mount it with the RO driver.
3. Run `ls` on the mounted directory.

Expected results:
🐧.txt exists.

Actual results:
🐧.txt is not shown. Running `dmesg | tail` reveals "... contains characters that cannot be converted to utf8. try [...] nls=utf8".
Comment 1 Mingye Wang 2018-04-11 22:01:35 UTC
VFAT has a similar problem where 🐧.txt becomes ??.txt.
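
To show how a name can end up as ??.txt, here is a rough sketch of a converter that looks at one 16-bit unit at a time. It is an illustration only, not the actual VFAT code path; `unit_to_utf8` and `name_per_unit` are hypothetical names, and substituting '?' for an unconvertible unit is an assumption chosen to match the result described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-unit converter: it sees a single 16-bit unit and has no
 * way to look at its neighbour. Returns bytes written (1 to 3), or -1 when
 * the unit is a surrogate half and means nothing on its own. */
static int unit_to_utf8(uint16_t u, unsigned char *out)
{
    if (u >= 0xD800 && u <= 0xDFFF)
        return -1;
    if (u < 0x80) {
        out[0] = (unsigned char)u;
        return 1;
    }
    if (u < 0x800) {
        out[0] = (unsigned char)(0xC0 | (u >> 6));
        out[1] = (unsigned char)(0x80 | (u & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
    return 3;
}

/* Convert a name unit by unit, writing '?' for every unit that cannot be
 * converted. The output is NUL-terminated and truncated if cap is too
 * small. Returns the length without the terminator. */
static size_t name_per_unit(const uint16_t *in, size_t len,
                            char *out, size_t cap)
{
    size_t n = 0;

    assert(cap > 0);
    for (size_t i = 0; i < len; i++) {
        unsigned char buf[3];
        int r = unit_to_utf8(in[i], buf);

        if (r < 0) {            /* each surrogate half fails separately */
            buf[0] = '?';
            r = 1;
        }
        if (n + (size_t)r + 1 > cap)
            break;              /* keep room for the terminator */
        for (int j = 0; j < r; j++)
            out[n++] = (char)buf[j];
    }
    out[n] = '\0';
    return n;
}
```

Feeding it 0xD83D 0xDC27 '.' 't' 'x' 't' yields "??.txt": the two halves of the pair are rejected separately, one '?' each.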

The HFSplus driver calls uni2char, which is known to accept only a 16-bit wchar_t; it's therefore likely broken too.

JFS has jfs_strfromUCS_le, whose name (UCS rather than UTF-16) seems to clear it of guilt, but following the reasoning applied to UDF "Unicode" it should be fixed too.

Joliet's uni16_to_x8 uses uni2char on Windows "Unicode" (UTF-16).

* * *

I mean, just grep for "unichar" under fs/. You can probably open 10 separate reports from that grep. The NLS interface does not correctly handle SMP characters to start with.
Comment 2 Mingye Wang 2018-04-11 22:03:56 UTC
> grep for "unichar"

*uni2char