I found that not all of the pages are encoded in UTF-8. This may cause problems once we try to parse them. The affected files are:

* man3/fflush.3
* man3/toupper.3
* man3/updwtmp.3
* man3/encrypt.3
* man3/lockf.3
* man3/rand.3
* man3/fclose.3
* man3/strtok.3
* man2/close.2
* man2/getdomainname.2
* man2/madvise.2
* man2/umask.2
* man2/sysinfo.2
* man2/getrlimit.2
* man5/utmp.5
* man7/cp1251.7
* man7/iso_8859-2.7
* man7/armscii-8.7
* man7/suffixes.7
* man7/iso_8859-4.7
* man7/iso_8859-8.7
* man7/iso_8859-16.7
* man7/hier.7
* man7/iso_8859-13.7
* man7/koi8-u.7
* man7/environ.7
* man7/iso_8859-15.7
* man7/iso_8859-9.7
* man7/iso_8859-11.7
* man7/iso_8859-14.7
* man7/iso_8859-10.7
* man7/iso_8859-6.7
* man7/iso_8859-1.7
* man7/iso_8859-7.7
* man7/koi8-r.7
* man7/iso_8859-5.7
* man7/iso_8859-3.7
Created attachment 117641
print_encoding.sh
Script which can find man pages not in us-ascii.

Created attachment 117651
convert_to_utf_8.sh
Script which can convert non us-ascii man pages to utf-8.
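The attached scripts are not reproduced in this thread, so the following is only a rough Python sketch of the kind of check print_encoding.sh reports: one guess from the file's bytes and one from a coding marker on the first line. The function names and the regular expression are hypothetical. Unlike a real detector, the sketch cannot tell iso-8859-1 apart from other 8-bit encodings; it only separates us-ascii, utf-8 and "unknown-8bit".

```python
import re
from typing import Optional

# Hypothetical pattern for an Emacs-style marker such as:
#   '\" t -*- coding: UTF-8 -*-
CODING_RE = re.compile(rb"-\*-.*?coding:\s*([A-Za-z0-9_.-]+)")


def encoding_by_content(data: bytes) -> str:
    """Classify raw bytes as us-ascii, utf-8, or unknown-8bit."""
    try:
        data.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some other 8-bit encoding; this sketch does not guess which one.
        return "unknown-8bit"


def encoding_by_first_line(data: bytes) -> Optional[str]:
    """Return the encoding named in a first-line coding marker, if any."""
    first_line = data.split(b"\n", 1)[0]
    match = CODING_RE.search(first_line)
    return match.group(1).decode("ascii") if match else None
```

Reading a page with `open(path, "rb").read()` and passing the bytes to both functions gives the two columns of a report like the one in this thread.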
$ ./print_encoding.sh man?/*

  Man Page              Encoding by file  Encoding by first line

* man2/close.2          iso-8859-1
* man2/getdomainname.2  iso-8859-1
* man2/getrlimit.2      iso-8859-1
* man2/madvise.2        iso-8859-1
* man2/mount.2          utf-8
* man2/sysinfo.2        iso-8859-1
* man2/umask.2          iso-8859-1
* man3/encrypt.3        iso-8859-1
* man3/fclose.3         iso-8859-1
* man3/fflush.3         iso-8859-1
* man3/lockf.3          iso-8859-1
* man3/rand.3           iso-8859-1
* man3/strtok.3         iso-8859-1
* man3/toupper.3        iso-8859-1
* man3/updwtmp.3        iso-8859-1
* man4/st.4             utf-8
* man5/utmp.5           iso-8859-1
* man7/armscii-8.7      iso-8859-1        ARMSCII-8
* man7/cp1251.7         unknown-8bit      CP1251
* man7/environ.7        iso-8859-1
* man7/hier.7           iso-8859-1
* man7/iso_8859-10.7    iso-8859-1        ISO-8859-10
* man7/iso_8859-11.7    iso-8859-1        ISO-8859-11
* man7/iso_8859-13.7    iso-8859-1        ISO-8859-7
* man7/iso_8859-14.7    iso-8859-1        ISO-8859-14
* man7/iso_8859-15.7    iso-8859-1        ISO-8859-15
* man7/iso_8859-16.7    iso-8859-1        ISO-8859-16
* man7/iso_8859-1.7     iso-8859-1
* man7/iso_8859-2.7     iso-8859-1        ISO-8859-2
* man7/iso_8859-3.7     iso-8859-1        ISO-8859-3
* man7/iso_8859-4.7     iso-8859-1        ISO-8859-4
* man7/iso_8859-5.7     iso-8859-1        ISO-8859-5
* man7/iso_8859-6.7     iso-8859-1        ISO-8859-6
* man7/iso_8859-7.7     iso-8859-1        ISO-8859-7
* man7/iso_8859-8.7     iso-8859-1        ISO-8859-8
* man7/iso_8859-9.7     iso-8859-1        ISO-8859-9
* man7/koi8-r.7         unknown-8bit      KOI8-R
* man7/koi8-u.7         unknown-8bit
* man7/suffixes.7       iso-8859-1

$ ./convert_to_utf_8.sh tmp_encoded man?/*
Converting man2/close.2 from iso-8859-1
Converting man2/getdomainname.2 from iso-8859-1
Converting man2/getrlimit.2 from iso-8859-1
Converting man2/madvise.2 from iso-8859-1
Converting man2/mount.2 from utf-8
Converting man2/sysinfo.2 from iso-8859-1
Converting man2/umask.2 from iso-8859-1
Converting man3/encrypt.3 from iso-8859-1
Converting man3/fclose.3 from iso-8859-1
Converting man3/fflush.3 from iso-8859-1
Converting man3/lockf.3 from iso-8859-1
Converting man3/rand.3 from iso-8859-1
Converting man3/strtok.3 from iso-8859-1
Converting man3/toupper.3 from iso-8859-1
Converting man3/updwtmp.3 from iso-8859-1
Converting man4/st.4 from utf-8
Converting man5/utmp.5 from iso-8859-1
Converting man7/armscii-8.7 from armscii-8
Converting man7/cp1251.7 from cp1251
Converting man7/environ.7 from iso-8859-1
Converting man7/hier.7 from iso-8859-1
Converting man7/iso_8859-10.7 from iso_8859-10
Converting man7/iso_8859-11.7 from iso-8859-1
Converting man7/iso_8859-13.7 from iso-8859-1
Converting man7/iso_8859-14.7 from iso_8859-14
Converting man7/iso_8859-15.7 from iso_8859-15
Converting man7/iso_8859-16.7 from iso_8859-16
Converting man7/iso_8859-1.7 from iso_8859-1
Converting man7/iso_8859-2.7 from iso_8859-2
Converting man7/iso_8859-3.7 from iso_8859-3
Converting man7/iso_8859-4.7 from iso_8859-4
Converting man7/iso_8859-5.7 from iso_8859-5
Converting man7/iso_8859-6.7 from iso_8859-6
Converting man7/iso_8859-7.7 from iso_8859-7
Converting man7/iso_8859-8.7 from iso_8859-8
Converting man7/iso_8859-9.7 from iso_8859-9
Converting man7/koi8-r.7 from koi8-r
Converting man7/koi8-u.7 from koi8-u
Converting man7/suffixes.7 from iso-8859-1

$ cd tmp_encoded/
$ ../print_encoding.sh man?/*

  Man Page              Encoding by file  Encoding by first line

* man2/close.2          utf-8             UTF-8
* man2/getdomainname.2  utf-8             UTF-8
* man2/getrlimit.2      utf-8             UTF-8
* man2/madvise.2        utf-8             UTF-8
* man2/mount.2          utf-8             UTF-8
* man2/sysinfo.2        utf-8             UTF-8
* man2/umask.2          utf-8             UTF-8
* man3/encrypt.3        utf-8             UTF-8
* man3/fclose.3         utf-8             UTF-8
* man3/fflush.3         utf-8             UTF-8
* man3/lockf.3          utf-8             UTF-8
* man3/rand.3           utf-8             UTF-8
* man3/strtok.3         utf-8             UTF-8
* man3/toupper.3        utf-8             UTF-8
* man3/updwtmp.3        utf-8             UTF-8
* man4/st.4             utf-8             UTF-8
* man5/utmp.5           utf-8             UTF-8
* man7/armscii-8.7      utf-8             UTF-8
* man7/cp1251.7         utf-8             UTF-8
* man7/environ.7        utf-8             UTF-8
* man7/hier.7           utf-8             UTF-8
* man7/iso_8859-10.7    utf-8             UTF-8
* man7/iso_8859-11.7    utf-8             UTF-8
* man7/iso_8859-13.7    utf-8             UTF-8
* man7/iso_8859-14.7    utf-8             UTF-8
* man7/iso_8859-15.7    utf-8             UTF-8
* man7/iso_8859-16.7    utf-8             UTF-8
* man7/iso_8859-1.7     utf-8             UTF-8
* man7/iso_8859-2.7     utf-8             UTF-8
* man7/iso_8859-3.7     utf-8             UTF-8
* man7/iso_8859-4.7     utf-8             UTF-8
* man7/iso_8859-5.7     utf-8             UTF-8
* man7/iso_8859-6.7     utf-8             UTF-8
* man7/iso_8859-7.7     utf-8             UTF-8
* man7/iso_8859-8.7     utf-8             UTF-8
* man7/iso_8859-9.7     utf-8             UTF-8
* man7/koi8-r.7         utf-8             UTF-8
* man7/koi8-u.7         utf-8             UTF-8
* man7/suffixes.7       utf-8             UTF-8
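For readers without the attachment, the conversion step in a session like this one can be sketched in Python as below. This is not the logic of convert_to_utf_8.sh, which is not shown in the thread. It assumes the source encoding is taken from a first-line coding marker when one is present and otherwise falls back to a caller-supplied default (iso-8859-1 here). The function names are hypothetical, and Python's codec set differs from iconv's (for example, it has no ARMSCII-8 codec).

```python
import re

# Hypothetical pattern for a first-line marker like: '\" t -*- coding: KOI8-R -*-
CODING_RE = re.compile(rb"-\*-.*?coding:\s*([A-Za-z0-9_.-]+)")


def pick_source_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Use the first-line marker if present, else the fallback (an assumption)."""
    first_line = data.split(b"\n", 1)[0]
    match = CODING_RE.search(first_line)
    return match.group(1).decode("ascii").lower() if match else fallback


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode page bytes from source_encoding to UTF-8.

    Raises LookupError for an encoding Python does not know, and
    UnicodeDecodeError if the bytes are not valid in that encoding.
    The first-line marker itself is left untouched by this sketch.
    """
    return data.decode(source_encoding).encode("utf-8")
```

A caller would still need to rewrite or add the coding marker afterwards, since the converted bytes keep whatever first line the original page had.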
(In reply to Peter Schiffer from comment #3)

Peter,

Sorry to be slow following up on this. Thanks for the scripts.

As some background, I'll just note that the current encoding markers in the iso_8859* pages were added in response to this 2009 bug report:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=519209

It seems a reasonable idea to convert everything to UTF-8, but I have some concerns/questions.

1. Is the encoding line:

   '\" t -*- coding: UTF-8 -*-

   really needed, or does modern groff just work this out?

2. I'm concerned about backward compatibility issues. As in: what if someone loads the man pages onto a system with an old groff? Now, as far as I can work out, groff added Unicode input support in v1.20, 2009 (http://lists.gnu.org/archive/html/groff/2009-01/msg00011.html). So, perhaps that's long enough ago that we don't need to worry too much about these issues.

Any thoughts?
Hi Michael,

1. It looks like it works without the encoding line, but as Colin said in the email, it's better with it.

2. Colin also answered this well. I'll just add that we have been converting man pages to UTF-8 since before the first RHEL-6 release, when Fedora was still using man 1.6 rather than man-db. In general, we convert almost everything that contains special characters and is not already UTF-8; otherwise it usually causes trouble.

peter
For reference, the discussion thread on linux-man@:
http://thread.gmane.org/gmane.linux.man/5069
Subject: Converting man-pages to UTF-8
Date: 2014-02-14 10:43:30 UTC
(In reply to Peter Schiffer from comment #5)
> Hi Michael,
>
> 1. It looks like it works without the encoding line, but as Colin said in
> the email, it's better with it.
>
> 2. Also, greatly answered by Colin, I'll just add that we are converting
> man-pages to utf-8 since before the first RHEL-6 was released, when Fedora
> wasn't using the man-db but man 1.6. In general, we convert almost
> everything to the utf-8 what is not and has special characters, otherwise
> it's usually troubles..

Pete,

I've applied your scripts, with the slight (manual) tweak that the "coding:" line is added only to pages with UTF-8 in the source. I've also checked your scripts into the scripts/ directory in the man-pages Git repo. Maybe they'll come in useful someday or for someone else.

Thanks for doing the legwork on this issue.

Cheers,

Michael
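The tweak described above was made by hand. Purely as an illustration of the rule, here is a small Python sketch, assuming that "pages with UTF-8 in the source" means pages whose already-converted bytes contain at least one non-ASCII byte. The function names are hypothetical; the marker text is the one quoted earlier in this thread. The sketch does not handle a page whose first line is an existing preprocessor line without a coding marker.

```python
# Marker line as quoted in this thread; the "t" is the tbl preprocessor hint.
CODING_LINE = b"'\\\" t -*- coding: UTF-8 -*-\n"


def needs_coding_line(data: bytes) -> bool:
    """True if the (already UTF-8) page contains any non-ASCII byte."""
    return any(byte > 0x7F for byte in data)


def add_coding_line(data: bytes) -> bytes:
    """Prepend the coding marker only to pages with non-ASCII content."""
    if not needs_coding_line(data):
        return data  # pure ASCII: leave the page unchanged
    first_line = data.split(b"\n", 1)[0]
    if b"coding:" in first_line:
        return data  # a marker is already present
    return CODING_LINE + data
```

Applying `add_coding_line` twice gives the same result as applying it once, so it is safe to rerun over a tree that has been partly processed.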
Looks awesome. Thanks for your work!