Bug 60807

Summary: not all the pages are encoded using utf-8
Product: Documentation
Reporter: Weizhou Pan (cs.wzpan)
Component: man-pages
Assignee: documentation_man-pages (documentation_man-pages)
Status: RESOLVED CODE_FIX
Severity: normal
CC: mtk.manpages, pschiffe
Priority: P1
Hardware: All
OS: Linux
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:
Attachments: print_encoding.sh, convert_to_utf_8.sh

Description Weizhou Pan 2013-08-28 13:38:03 UTC
I found that not all the pages are encoded using utf-8. It may cause problems once we try to parse them.

These files are:

* man3/fflush.3
* man3/toupper.3
* man3/updwtmp.3
* man3/encrypt.3
* man3/lockf.3
* man3/rand.3
* man3/fclose.3
* man3/strtok.3
* man2/close.2
* man2/getdomainname.2
* man2/madvise.2
* man2/umask.2
* man2/sysinfo.2
* man2/getrlimit.2
* man5/utmp.5
* man7/cp1251.7
* man7/iso_8859-2.7
* man7/armscii-8.7
* man7/suffixes.7
* man7/iso_8859-4.7
* man7/iso_8859-8.7
* man7/iso_8859-16.7
* man7/hier.7
* man7/iso_8859-13.7
* man7/koi8-u.7
* man7/environ.7
* man7/iso_8859-15.7
* man7/iso_8859-9.7
* man7/iso_8859-11.7
* man7/iso_8859-14.7
* man7/iso_8859-10.7
* man7/iso_8859-6.7
* man7/iso_8859-1.7
* man7/iso_8859-7.7
* man7/koi8-r.7
* man7/iso_8859-5.7
* man7/iso_8859-3.7
Comment 1 Peter Schiffer 2013-12-05 17:43:50 UTC
Created attachment 117641
print_encoding.sh

Script which can find man pages not in us-ascii.
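
The attached script itself is not reproduced in this report. Purely as an illustration of the idea, here is a minimal Python sketch that classifies a page's raw bytes by content. The function name `guess_encoding` and its three labels are assumptions made for this example; unlike the output shown in comment 3, it does not try to tell iso-8859-1 apart from other 8-bit encodings.

```python
def guess_encoding(data: bytes) -> str:
    """Classify raw bytes by content alone (hypothetical helper)."""
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some 8-bit legacy encoding; the bytes alone cannot say which one.
        return "unknown-8bit"


print(guess_encoding(b"plain text"))  # us-ascii
print(guess_encoding(b"caf\xe9"))     # unknown-8bit
```

To check a file, read it in binary mode (`open(path, "rb")`) and pass the bytes to the function; pages that come back as anything other than us-ascii are the ones of interest here.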
Comment 2 Peter Schiffer 2013-12-05 17:44:39 UTC
Created attachment 117651
convert_to_utf_8.sh

Script which can convert non us-ascii man pages to utf-8.
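
The conversion script is likewise not reproduced here. The core step can be sketched in Python as follows, assuming the source encoding of each page is already known (for example from its first line). The helper name `to_utf8` is hypothetical and is not taken from the attachment.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from source_encoding to UTF-8 (hypothetical helper)."""
    # Strict decoding: raises UnicodeDecodeError for bytes that are
    # invalid in the source encoding.
    text = data.decode(source_encoding)
    return text.encode("utf-8")


latin1_bytes = b"caf\xe9"  # "café" in ISO-8859-1
print(to_utf8(latin1_bytes, "iso-8859-1"))  # b'caf\xc3\xa9'
```

Decoding with the wrong source encoding usually does not fail; it silently yields the wrong characters, which is why the declared encoding matters for the charset pages in man7.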
Comment 3 Peter Schiffer 2013-12-05 17:46:37 UTC
$ ./print_encoding.sh man?/*

   Man Page               Encoding by file   Encoding by first line

 * man2/close.2           iso-8859-1         
 * man2/getdomainname.2   iso-8859-1         
 * man2/getrlimit.2       iso-8859-1         
 * man2/madvise.2         iso-8859-1         
 * man2/mount.2           utf-8              
 * man2/sysinfo.2         iso-8859-1         
 * man2/umask.2           iso-8859-1         
 * man3/encrypt.3         iso-8859-1         
 * man3/fclose.3          iso-8859-1         
 * man3/fflush.3          iso-8859-1         
 * man3/lockf.3           iso-8859-1         
 * man3/rand.3            iso-8859-1         
 * man3/strtok.3          iso-8859-1         
 * man3/toupper.3         iso-8859-1         
 * man3/updwtmp.3         iso-8859-1         
 * man4/st.4              utf-8              
 * man5/utmp.5            iso-8859-1         
 * man7/armscii-8.7       iso-8859-1         ARMSCII-8
 * man7/cp1251.7          unknown-8bit       CP1251
 * man7/environ.7         iso-8859-1         
 * man7/hier.7            iso-8859-1         
 * man7/iso_8859-10.7     iso-8859-1         ISO-8859-10
 * man7/iso_8859-11.7     iso-8859-1         ISO-8859-11
 * man7/iso_8859-13.7     iso-8859-1         ISO-8859-7
 * man7/iso_8859-14.7     iso-8859-1         ISO-8859-14
 * man7/iso_8859-15.7     iso-8859-1         ISO-8859-15
 * man7/iso_8859-16.7     iso-8859-1         ISO-8859-16
 * man7/iso_8859-1.7      iso-8859-1         
 * man7/iso_8859-2.7      iso-8859-1         ISO-8859-2
 * man7/iso_8859-3.7      iso-8859-1         ISO-8859-3
 * man7/iso_8859-4.7      iso-8859-1         ISO-8859-4
 * man7/iso_8859-5.7      iso-8859-1         ISO-8859-5
 * man7/iso_8859-6.7      iso-8859-1         ISO-8859-6
 * man7/iso_8859-7.7      iso-8859-1         ISO-8859-7
 * man7/iso_8859-8.7      iso-8859-1         ISO-8859-8
 * man7/iso_8859-9.7      iso-8859-1         ISO-8859-9
 * man7/koi8-r.7          unknown-8bit       KOI8-R
 * man7/koi8-u.7          unknown-8bit       
 * man7/suffixes.7        iso-8859-1         

$ ./convert_to_utf_8.sh tmp_encoded man?/*
Converting man2/close.2            from iso-8859-1
Converting man2/getdomainname.2    from iso-8859-1
Converting man2/getrlimit.2        from iso-8859-1
Converting man2/madvise.2          from iso-8859-1
Converting man2/mount.2            from utf-8
Converting man2/sysinfo.2          from iso-8859-1
Converting man2/umask.2            from iso-8859-1
Converting man3/encrypt.3          from iso-8859-1
Converting man3/fclose.3           from iso-8859-1
Converting man3/fflush.3           from iso-8859-1
Converting man3/lockf.3            from iso-8859-1
Converting man3/rand.3             from iso-8859-1
Converting man3/strtok.3           from iso-8859-1
Converting man3/toupper.3          from iso-8859-1
Converting man3/updwtmp.3          from iso-8859-1
Converting man4/st.4               from utf-8
Converting man5/utmp.5             from iso-8859-1
Converting man7/armscii-8.7        from armscii-8
Converting man7/cp1251.7           from cp1251
Converting man7/environ.7          from iso-8859-1
Converting man7/hier.7             from iso-8859-1
Converting man7/iso_8859-10.7      from iso_8859-10
Converting man7/iso_8859-11.7      from iso-8859-1
Converting man7/iso_8859-13.7      from iso-8859-1
Converting man7/iso_8859-14.7      from iso_8859-14
Converting man7/iso_8859-15.7      from iso_8859-15
Converting man7/iso_8859-16.7      from iso_8859-16
Converting man7/iso_8859-1.7       from iso_8859-1
Converting man7/iso_8859-2.7       from iso_8859-2
Converting man7/iso_8859-3.7       from iso_8859-3
Converting man7/iso_8859-4.7       from iso_8859-4
Converting man7/iso_8859-5.7       from iso_8859-5
Converting man7/iso_8859-6.7       from iso_8859-6
Converting man7/iso_8859-7.7       from iso_8859-7
Converting man7/iso_8859-8.7       from iso_8859-8
Converting man7/iso_8859-9.7       from iso_8859-9
Converting man7/koi8-r.7           from koi8-r
Converting man7/koi8-u.7           from koi8-u
Converting man7/suffixes.7         from iso-8859-1

$ cd tmp_encoded/

$ ../print_encoding.sh man?/*

   Man Page               Encoding by file   Encoding by first line

 * man2/close.2           utf-8              UTF-8
 * man2/getdomainname.2   utf-8              UTF-8
 * man2/getrlimit.2       utf-8              UTF-8
 * man2/madvise.2         utf-8              UTF-8
 * man2/mount.2           utf-8              UTF-8
 * man2/sysinfo.2         utf-8              UTF-8
 * man2/umask.2           utf-8              UTF-8
 * man3/encrypt.3         utf-8              UTF-8
 * man3/fclose.3          utf-8              UTF-8
 * man3/fflush.3          utf-8              UTF-8
 * man3/lockf.3           utf-8              UTF-8
 * man3/rand.3            utf-8              UTF-8
 * man3/strtok.3          utf-8              UTF-8
 * man3/toupper.3         utf-8              UTF-8
 * man3/updwtmp.3         utf-8              UTF-8
 * man4/st.4              utf-8              UTF-8
 * man5/utmp.5            utf-8              UTF-8
 * man7/armscii-8.7       utf-8              UTF-8
 * man7/cp1251.7          utf-8              UTF-8
 * man7/environ.7         utf-8              UTF-8
 * man7/hier.7            utf-8              UTF-8
 * man7/iso_8859-10.7     utf-8              UTF-8
 * man7/iso_8859-11.7     utf-8              UTF-8
 * man7/iso_8859-13.7     utf-8              UTF-8
 * man7/iso_8859-14.7     utf-8              UTF-8
 * man7/iso_8859-15.7     utf-8              UTF-8
 * man7/iso_8859-16.7     utf-8              UTF-8
 * man7/iso_8859-1.7      utf-8              UTF-8
 * man7/iso_8859-2.7      utf-8              UTF-8
 * man7/iso_8859-3.7      utf-8              UTF-8
 * man7/iso_8859-4.7      utf-8              UTF-8
 * man7/iso_8859-5.7      utf-8              UTF-8
 * man7/iso_8859-6.7      utf-8              UTF-8
 * man7/iso_8859-7.7      utf-8              UTF-8
 * man7/iso_8859-8.7      utf-8              UTF-8
 * man7/iso_8859-9.7      utf-8              UTF-8
 * man7/koi8-r.7          utf-8              UTF-8
 * man7/koi8-u.7          utf-8              UTF-8
 * man7/suffixes.7        utf-8              UTF-8
Comment 4 Michael Kerrisk 2014-02-14 10:22:04 UTC
(In reply to Peter Schiffer from comment #3)

Peter,

Sorry to be slow following up on this. Thanks for the scripts.

As some background, I'll just note that the current encoding markers in the iso_8859* pages were added in response to this 2009 bug report:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=519209

It seems a reasonable idea to convert everything to UTF-8, but I have some concerns/questions.

1. Is the encoding line: 
'\" t -*- coding: UTF-8 -*-
really needed, or does modern groff just work this out?

2. I'm concerned about backward compatibility. For example, what happens if someone installs the man pages on a system with an old groff? As far as I can work out, groff added Unicode input support in v1.20 (2009):
http://lists.gnu.org/archive/html/groff/2009-01/msg00011.html
So perhaps that's long enough ago that we don't need to worry too much about these issues.

Any thoughts?
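
For readers unfamiliar with the encoding line in question 1, the following Python sketch shows one way a tool could read it. It only illustrates the tag format quoted above and makes no claim about how groff or man-db process it. The names `CODING_TAG` and `first_line_coding` and the `.TH EXAMPLE 7` line are made up for this example.

```python
import re
from typing import Optional

# Matches an Emacs-style tag such as "-*- coding: UTF-8 -*-".
CODING_TAG = re.compile(r"-\*-\s*coding:\s*([A-Za-z0-9._-]+)\s*-\*-")


def first_line_coding(page: str) -> Optional[str]:
    """Return the encoding named on the first line, or None (hypothetical helper)."""
    first_line = page.split("\n", 1)[0]
    match = CODING_TAG.search(first_line)
    return match.group(1) if match else None


# First line as quoted above: apostrophe, backslash, double quote, then " t ...".
example = "'\\\" t -*- coding: UTF-8 -*-\n.TH EXAMPLE 7\n"
print(first_line_coding(example))  # UTF-8
```

Only the first line is inspected, so a tag appearing later in the page is ignored.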
Comment 5 Peter Schiffer 2014-02-14 12:47:13 UTC
Hi Michael,

1. It looks like it works without the encoding line, but as Colin said in the email, it's better with it.

2. Colin also answered this well. I'll just add that we have been converting man-pages to UTF-8 since before the first RHEL-6 was released, when Fedora was still using man 1.6 rather than man-db. In general, we convert almost everything that is not UTF-8 and contains special characters; otherwise it usually causes trouble.

peter
Comment 6 Michael Kerrisk 2014-02-16 06:34:47 UTC
For reference, the discussion thread on linux-man@:

http://thread.gmane.org/gmane.linux.man/5069
Subject: Converting man-pages to UTF-8
Date: 2014-02-14 10:43:30 UTC
Comment 7 Michael Kerrisk 2014-02-16 07:44:04 UTC
(In reply to Peter Schiffer from comment #5)

Pete,

I've applied your scripts, with the slight (manual) tweak that the "coding:" line is added only to pages with UTF-8 in the source.
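
As a rough sketch of that rule in Python, reading "UTF-8 in the source" as "the page contains non-ASCII characters": the function name `add_coding_line` is hypothetical, and the default line is copied verbatim from comment 4. Whether that exact line suits every page is an assumption this sketch does not settle, and it does not merge the tag into an existing first-line comment.

```python
# Encoding line as quoted in comment 4: '\" t -*- coding: UTF-8 -*-
CODING_LINE = "'\\\" t -*- coding: UTF-8 -*-"


def add_coding_line(page: str, coding_line: str = CODING_LINE) -> str:
    """Prepend a coding line only to pages with non-ASCII text (hypothetical helper)."""
    if page.isascii():
        return page  # pure ASCII: leave the page untouched
    first_line = page.split("\n", 1)[0]
    if "coding:" in first_line:
        return page  # already tagged on the first line
    return coding_line + "\n" + page


print(add_coding_line("plain page\n") == "plain page\n")  # True
```

Applying the function twice gives the same result as applying it once, so it can be rerun safely over an already converted tree.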

I've also checked your scripts into the scripts/ directory in the man-pages Git repo. Maybe they'll come in useful someday or for someone else. Thanks for doing the legwork on this issue.

Cheers,

Michael
Comment 8 Weizhou Pan 2014-02-18 15:42:32 UTC
Looks awesome. Thanks for your work!