Bug 217713 - Encoding issues with --auto-to-cc
Status: RESOLVED CODE_FIX
Alias: None
Product: Tools
Classification: Unclassified
Component: Infra
Hardware: All Linux
Importance: P3 normal
Assignee: Konstantin Ryabitsev
 
Reported: 2023-07-26 19:54 UTC by Bugbot
Modified: 2023-07-26 20:37 UTC
Regression: No

Description Bugbot 2023-07-26 19:54:15 UTC
Duje Mihanović <duje.mihanovic@skole.hr> writes:

I decided to try using b4 to submit a patchset for adding Marvell PXA1908 ARM 
SoC support. Having enrolled an existing branch, I ran `b4 prep -c` and got 
the following error (this is with the -d switch added):

Collecting from: [PATCH v2 06/10] dt-bindings: clock: Add documentation for Marvell PXA1908
Running git --no-pager rev-parse --show-toplevel
Changing dir to /home/duje/code/linux
Running /home/duje/code/linux/scripts/get_maintainer.pl --nogit --nogit-fallback --nogit-chief-penguins --norolestats --nol
Changing back into /home/duje/code/linux/Documentation/process
Traceback (most recent call last):
  File "/usr/bin/b4", line 33, in <module>
    sys.exit(load_entry_point('b4==0.12.3', 'console_scripts', 'b4')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/b4/command.py", line 360, in cmd
    cmdargs.func(cmdargs)
  File "/usr/lib/python3.11/site-packages/b4/command.py", line 76, in cmd_prep
    b4.ez.cmd_prep(cmdargs)
  File "/usr/lib/python3.11/site-packages/b4/ez.py", line 1994, in cmd_prep
    auto_to_cc()
  File "/usr/lib/python3.11/site-packages/b4/ez.py", line 1896, in auto_to_cc
    for tname, pairs in (('To', get_addresses_from_cmd(tocmd, msgbytes)),
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/b4/ez.py", line 948, in get_addresses_from_cmd
    addrs = out.strip().decode()
            ^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 201: invalid start byte

I suspected that using an existing branch instead of creating a new one could
have caused this problem, so I unzipped a 6.5-rc2 tarball into /tmp,
initialized a Git repository there, and used `git am` to apply only patch
06/10, the one that was causing issues in the other repository. This time
`b4 prep -c` collected the addresses without issues. However, once I applied
the whole set with `git am` in the /tmp repository, `b4 prep -c` failed with
the same error on the same patch.

Steps to reproduce:
    - Check out Linux 6.5-rc2
    - Run `b4 prep -F "<20230721210042.21535-1-duje.mihanovic@skole.hr>" -n <any branch name>`
    - Run `b4 prep -c`

--
Regards,
Duje

(via https://msgid.link/1940519.PYKUYFuaPT@radijator)
Comment 1 Bugbot 2023-07-26 19:54:20 UTC
Konstantin Ryabitsev <konstantin@linuxfoundation.org> writes:

On Mon, Jul 24, 2023 at 12:49:41PM +0200, Duje Mihanović wrote:
> Steps to reproduce:
>     - Checkout Linux 6.5-rc2
>     - Run `b4 prep -F "<20230721210042.21535-1-duje.mihanovic@skole.hr>" -n 
> <any branch name>`
>     - Run `b4 prep -c`

Thank you for that -- I can verify that it's happening.

bugbot assign to me

-K

(via https://msgid.link/20230726-hula-wad-c9241b@meerkat)
Comment 2 Bugbot 2023-07-26 20:31:46 UTC
Konstantin Ryabitsev <konstantin@linuxfoundation.org> writes:

On Mon, Jul 24, 2023 at 12:49:41PM +0200, Duje Mihanović wrote:
> I decided to try using b4 to submit a patchset for adding Marvell PXA1908 ARM 
> SoC support. Having enrolled an existing branch, I ran `b4 prep -c` and got 
> the following error (this is with the -d switch added):

So, there's apparently something very interesting about that final ć in your
name that trips up get_maintainer.pl. For example, run the following:

$ ./scripts/get_maintainer.pl -f Documentation/devicetree/bindings/clock/marvell,pxa1908.yaml

You will get back a lone \x87 byte where your name should be:

    "<87>" <duje.mihanovic@skole.hr> (in file)

This is because ć is encoded in UTF-8 as the two bytes 0xC4 0x87, but I have
no idea why get_maintainer.pl trips up and splits that sequence, emitting only
the trailing byte. It seems to want to do that for anything above base
extended ASCII (such as Latin Extended-A).
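
As a minimal illustration of the byte-level picture described above, here is a
Python sketch. It is not code from b4 or get_maintainer.pl; the `damaged`
sample is reconstructed from the output line shown above, and the variable
names are made up for this example.

```python
# UTF-8 encodes ć (U+0107) as a lead byte 0xC4 followed by a
# continuation byte 0x87.
encoded = "ć".encode("utf-8")
assert encoded == b"\xc4\x87"

# Hypothetical sample of the damaged output: the lead byte is gone and
# only the continuation byte 0x87 is left where the name should be.
damaged = b'"\x87" <duje.mihanovic@skole.hr> (in file)'

try:
    damaged.decode()  # strict UTF-8 decoding, as in the traceback
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# A continuation byte cannot begin a character, so the decoder reports
# it as an invalid start byte.
print(strict_error.reason if strict_error else "decoded cleanly")
```

The same two bytes decode fine when they stay together; the failure only
appears once 0x87 arrives without its 0xC4 lead byte.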

I can "fix" this in b4 by forcing it to ignore any Unicode decoding errors
in get_maintainer.pl output, but it's not a real fix for the underlying
problem.

-K

(via https://msgid.link/20230726-gush-slouching-a5cd41@meerkat)
Comment 3 Bugbot 2023-07-26 20:37:14 UTC
Konstantin Ryabitsev writes in commit 034f2fb2ac27c89c1c7ab2af04d26ba63be9ea6c:

ez: ignore invalid unicode returned by get_maintainer

There's a bug in get_maintainer.pl that returns invalid unicode in
certain situations (see bug linked below). We can't fix this in b4, but
at least we can avoid crashing when we encounter this problem.

Reported-by: Duje Mihanović <duje.mihanovic@skole.hr>
Link: https://msgid.link/1940519.PYKUYFuaPT@radijator
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217713
Signed-off-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>

(via https://git.kernel.org/pub/scm/utils/b4/b4.git/commit/?id=034f2fb2ac27)
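
The workaround described in the commit message above could look roughly like
the following Python sketch. This is an assumption about the general
technique (tolerant decoding of subprocess output), not the actual change in
b4: the helper name `decode_tool_output` and the sample bytes are
hypothetical.

```python
def decode_tool_output(out: bytes) -> str:
    """Decode command output, dropping bytes that are not valid UTF-8.

    Hypothetical helper for illustration; not part of b4.
    """
    # errors="ignore" skips undecodable bytes instead of raising
    # UnicodeDecodeError, so a stray 0x87 no longer aborts the run.
    return out.strip().decode("utf-8", errors="ignore")


# Sample modeled on the damaged get_maintainer.pl line from comment 2.
sample = b'"\x87" <duje.mihanovic@skole.hr> (in file)\n'
addrs = decode_tool_output(sample)
print(addrs)
```

With this approach the address itself survives, but the damaged name is
silently lost. Passing `errors="replace"` instead would substitute U+FFFD
and keep the damage visible; either way the underlying get_maintainer.pl
problem remains.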
