Bug 212887
Summary: | Clarify getopt.3 behaviour | ||
---|---|---|---|
Product: | Documentation | Reporter: | James Hunt (jamesodhunt) |
Component: | man-pages | Assignee: | documentation_man-pages (documentation_man-pages) |
Status: | RESOLVED DOCUMENTED | ||
Severity: | normal | CC: | alx |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | Subsystem: | ||
Regression: | No | Bisected commit-id: | |
Attachments: |
getopt.3 test program
[patch] getopt.3: Clarify behaviour [patch v2] getopt.3: Clarify behaviour [patch v3] getopt.3: Clarify behaviour Additional patch to clarify getopt behaviour. |
Description
James Hunt
2021-04-29 20:10:32 UTC
Created attachment 296547 [details]
[patch] getopt.3: Clarify behaviour
Ouch - markdown support in bugzilla seems a bit non-standard ;) Hi James, Thanks for documenting that! Please see some comments below. Thanks, Alex On 4/30/21 10:00 PM, Alejandro Colomar wrote: > From: "James O. D. Hunt" <jamesodhunt@gmail.com> > > Improved the `getopt(3)` man page in the following ways: > > 1) Defined the existing term "legitimate option character". > 2) Added an additional NOTE stressing that arguments are parsed in strict > order and the implications of this when numeric options are utilised. > 3) Added a new WARNINGS section that alerts the reader to the fact they > should: > - Validate all option argument. > - Take care if mixing numeric options and arguments accepting numeric > values. > > Signed-off-by: James O. D. Hunt <jamesodhunt@gmail.com> > Bugzilla: <https://bugzilla.kernel.org/show_bug.cgi?id=212887> > --- > > I'm only forwarding the patch to the list to better discuss it. > > man3/getopt.3 | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 58 insertions(+), 1 deletion(-) > > diff --git a/man3/getopt.3 b/man3/getopt.3 > index 921e747f8..ec3a5640f 100644 > --- a/man3/getopt.3 > +++ b/man3/getopt.3 > @@ -125,7 +125,13 @@ Then \fIoptind\fP is the index in \fIargv\fP of the > first > \fIargv\fP-element that is not an option. > .PP > .I optstring > -is a string containing the legitimate option characters. > +is a string containing the legitimate option characters. A legitimate > +option character is any visible one byte > +.BR ascii (7) > +character (for which > +.BR isgraph (3) > +would return nonzero) that is not the null byte (\(aq\e0\(aq), Is it necessary to mention the null byte ('\0') here? It is already not in the 'visible character' set. I don't think isgraph(3) would return true for it, right? > +dash (\(aq-\(aq) or colon (\(aq:\(aq). So the only visible characters that are invalid are '-' and ':', right? BTW, you should escape the dash: (\(aq\-\(aq) See: $ man 7 man-pages | sed -n '/Generating optimal glyphs/,+19p'; Generating optimal glyphs Where a real minus character is required (e.g., for numbers such as -1, for man page cross references such as utf-8(7), or when writing options that have a leading dash, such as in ls -l), use the following form in the man page source: \- This guideline applies also to code examples. The use of real minus signs serves the following purposes: * To provide better renderings on various targets other than ASCII terminals, notably in PDF and on Uni‐ code/UTF-8‐capable terminals. * To generate glyphs that when copied from rendered pages will produce real minus signs when pasted into a termi‐ nal. > If such a > character is followed by a colon, the option requires an argument, so > .BR getopt () > @@ -402,6 +408,22 @@ routine that rechecks > .B POSIXLY_CORRECT > and checks for GNU extensions in > .IR optstring .) > + $ man 7 man-pages | sed -n '/Formatting conventions (general)/,/^$/p'; Formatting conventions (general) Paragraphs should be separated by suitable markers (usually either .PP or .IP). Do not separate paragraphs using blank lines, as this results in poor rendering in some output formats (such as PostScript and PDF). > +Command-line arguments are parsed in strict order meaning that an option > requiring > +an argument will consume the next argument, regardless of whether that > +argument is the correctly specified option argument or simply the next > option > +(in the scenario the user mis-specifies the command-line). For example, if $ man 7 man-pages | sed -n '/Use semantic newlines/,/^$/p'; Use semantic newlines In the source of a manual page, new sentences should be started on new lines, and long sentences should split into lines at clause breaks (commas, semicolons, colons, and so on). This convention, sometimes known as "semantic new‐ lines", makes it easier to see the effect of patches, which often operate at the level of individual sentences or sen‐ tence clauses. > +.IR optstring > +is specified as "1n:" > +and the user incorrectly specifies the command-line arguments as > +\(aqprog\ -n\ -1\(aq, the > +.IR \-n Use .I instead of .IR here. $ man 7 groff_man | sed -n '/^ *\.IR italic/,+5p'; .IR italic‐text roman‐text ... Set each argument in italics and roman, alternately. This is the first command of the .IR prologue . $ man 7 groff_man | sed -n '/^ *\.I \[text]/,+16p'; .I [text] Set text in italics. If the macro is given no argu‐ ments, the text of the next input line is set in italics. Use italics for file and path names, for environment variables, for enumeration or preprocessor constants in C, for variable (user‐determined) portions of syntax synopses, for the first occurrence only of a technical concept being introduced, for names of works of software (including commands and functions, but excluding names of operating systems or their kernels), and anywhere a parameter requiring re‐ placement by the user is encountered. An exception involves variable text in a context that is already marked up in italics, such as file or path names with variable components; in such cases, follow the > +option will be given the > +.BR optarg The same as with .I. Use .B here. See groff_man(7). > +value \(aq\-1\(aq, and the > +.IR \-1 > +option will be considered to have not been specified. > + > .SH EXAMPLES > .SS getopt() > The following trivial example program uses > @@ -542,6 +564,41 @@ main(int argc, char **argv) > exit(EXIT_SUCCESS); > } > .EE > + > +.SH WARNINGS > +Since > +.BR getopt () > +allows users to provide values to the program, every care should be taken to > +validate every option value specified by the user calling the program. > +.BR getopt () > +itself provides no validation so the programmer should perform boundary > value > +checks on > +.ft I > +every > +.ft Please use: .I every > +argument to minimise the risk of bad input data being accepted by the > program. > +String values should be checked to ensure they are not empty (unless > +permitted), sanitized appropriately and that internal buffers used to > +store the string values returned in \fIoptarg\fP are large enough to hold Please avoid embedding formatted text in normal text. Use a separate line: .I optarg > +pathologically long values. Numeric values should be verified to ensure they > +are within the expected permissible range of values. > + > +Further, since > +.BR getopt () > +can handle numeric options (such as \(aq-1\(aq or \(aq-2 foo\(aq), care > should be > +taken when writing a program that accepts both a numeric flag option and an > option > +accepting a numeric argument. Specifically, the program should sanity check > the numeric > +\fIoptarg\fP value carefully to protect against the case where a user > +mis-specifies the command-line which chould result in a numeric option flag > +being specified as the \fIoptarg\fP value for the numeric option by mistake. > +For example, if > +.IR optstring > +is specified as "1n:" and the \(aqn\(aq option accepts a numeric value, if > the > +command-line is specified accidentally as \(aqprog\ -n\ -1\(aq, care needs > to > +be taken to ensure the program does not try to convert the \(aq-1\(aq passed > +to the \(aqn\(aq option into an unsigned numeric value since that would > result > +in it being set to the largest possible integer value for the type used to > +encode it. > .SH SEE ALSO > .BR getopt (1), > .BR getsubopt (3) > Created attachment 296581 [details]
[patch v2] getopt.3: Clarify behaviour
Thanks for reviewing Alejandro!
I've updated the patch based on your feedback and attached it as version 2.
> So the only visible characters that are invalid are '-' and ':', right?
Yes, because `--` is the end of options marker and `:` is used to denote an option accepting an argument.
Hi James, See some more comments below. Thanks, Alex On 5/1/21 9:41 PM, Alejandro Colomar wrote: > From: "James O. D. Hunt" <jamesodhunt@gmail.com> > > Improved the `getopt(3)` man page in the following ways: > > 1) Defined the existing term "legitimate option character". > 2) Added an additional NOTE stressing that arguments are parsed in strict > order and the implications of this when numeric options are utilised. > 3) Added a new WARNINGS section that alerts the reader to the fact they > should: > - Validate all option argument. > - Take care if mixing numeric options and arguments accepting numeric > values. Could you please separate this into 2 patches? 1 & 2 seem to be very related, but 3 is quite different. > > Signed-off-by: James O. D. Hunt <jamesodhunt@gmail.com> > Bugzilla: <https://bugzilla.kernel.org/show_bug.cgi?id=212887> > --- > > Forward patch v2 from bugzilla to linux-man@. > > man3/getopt.3 | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 61 insertions(+) > > diff --git a/man3/getopt.3 b/man3/getopt.3 > index 921e747f8..810298505 100644 > --- a/man3/getopt.3 > +++ b/man3/getopt.3 > @@ -126,6 +126,11 @@ Then \fIoptind\fP is the index in \fIargv\fP of the > first > .PP > .I optstring > is a string containing the legitimate option characters. > +A legitimate option character is any visible one byte > +.BR ascii (7) > +character (for which > +.BR isgraph (3) > +would return nonzero) that is not dash (\(aq\-\(aq) or colon (\(aq:\(aq). > If such a > character is followed by a colon, the option requires an argument, so > .BR getopt () > @@ -402,6 +407,23 @@ routine that rechecks > .B POSIXLY_CORRECT > and checks for GNU extensions in > .IR optstring .) > +.PP Some more "semantic newline" cuts (//): > +Command-line arguments are parsed in strict order // meaning that an option > +requiring an argument will consume the next argument, // regardless of > whether > +that argument is the correctly specified option argument // or simply the > next > +option // (in the scenario the user mis-specifies the command line). > +For example, if > +.IR optstring > +is specified as "1n:" > +and the user incorrectly specifies the command line arguments as > +\(aqprog\ \-n\ \-1\(aq, the Please replace the quotes by italics (.IR or .I) (and possibly non-breaking spaces). $ man 7 man-pages | sed -n '/Complete commands/,+11p'; Complete commands should, if long, be written as an in‐ dented line on their own, with a blank line before and after the command, for example man 7 man-pages If the command is short, then it can be included inline in the text, in italic format, for example, man 7 man‐ pages. In this case, it may be worth using nonbreaking spaces ("\ ") at suitable places in the command. Command options should be written in italics (e.g., -l). > +.I \-n > +option will be given the > +.B optarg > +value \(aq\-1\(aq, and the Given that in the case above -1 is a C string, double quotes would probably be more appropriate. > +.I \-1 > +option will be considered to have not been specified. > +.PP > .SH EXAMPLES > .SS getopt() > The following trivial example program uses > @@ -542,6 +564,45 @@ main(int argc, char **argv) > exit(EXIT_SUCCESS); > } > .EE > +.PP > +.SH WARNINGS This should probably be a subsection in NOTES. $ man 7 man-pages | sed -n '/^ *Where.*traditional/,/^$/p'; Where a traditional heading would apply, please use it; this kind of consistency can make the information easier to understand. If you must, you can create your own headings if they make things easier to understand (this can be especially useful for pages in Sections 4 and 5). However, before doing this, consider whether you could use the traditional headings, with some subsections (.SS) within those sections. > +Since > +.BR getopt () Some more "semantic newline" cuts (//): > +allows users to provide values to the program, // every care should be taken > to > +validate // every option value specified by the user calling the program. > +.BR getopt () > +itself provides no validation so // the programmer should perform boundary > value > +checks on > +.I every > +argument to minimise the risk of bad input data being accepted by the > program. > +String values should be checked to // ensure they are not empty (unless > +permitted), // sanitized appropriately and // that internal buffers used to > store Review wording (s/that internal/check that internal/?) > +the string values returned in > +.I optarg > +are large enough to hold pathologically long values. I'm not sure if this is extending these notes too much. I see it obvious that any user input, especially strings, should be checked for every corner case in paranoid mode. But I checked scanf(3) and didn't see any NOTES about that. > +Numeric values should be verified to ensure they are within the expected > +permissible range of values. > +.PP > +Further, since > +.BR getopt () > +can handle numeric options (such as \(aq\-1\(aq or \(aq\-2\ foo\(aq), care > should > +be taken when writing a program that accepts both a numeric flag option and > +an option accepting a numeric argument. > +Specifically, the program should sanity check the numeric > +.I optarg > +value carefully to protect against the case where a user mis-specifies the > +command line which chould result in a numeric option flag being specified as > +the > +.I optarg > +value for the numeric option by mistake. > +For example, if > +.IR optstring > +is specified as "1n:" and the \(aqn\(aq option accepts a numeric value, if > the > +command line is specified accidentally as \(aqprog\ \-n\ \-1\(aq, care needs > to > +be taken to ensure the program does not try to convert the \(aq\-1\(aq > passed > +to the \(aqn\(aq option into an unsigned numeric value since that would > result > +in it being set to the largest possible integer value for the type used to > +encode it. I don't think we should warn about this. If the user inputs a wrong command line, he can only expect undefined behavior. For the program, as long as it doesn't have any security problems, it doesn't need to care if the user doesn't provide a valid input. Normal checks should be done. > .SH SEE ALSO > .BR getopt (1), > .BR getsubopt (3) > Created attachment 296587 [details]
[patch v3] getopt.3: Clarify behaviour
Hi Alex, Thanks for reviewing again. I've updated the patch to v3 which now only includes the first two changes. I agree - I was in two minds about the WARNINGS section myself... # Validating input Good programmers should know to do this. But given that `getopt` is a primary interface to a program (which might be running `SUID-root` for example), I wonder if it needs to be "called out" more strongly? There are lots of other syscalls and library calls that might deserve similar treatment of course. In fact, I wonder a new "security` man page might be an interesting project, since that could summarise expectations, security + safety advice to programmers, and refer to man pages like `getopt(3)`. that would also avoid cluttering the main man pages as the details could be provided in the security one. # Numeric options Yes, the user is responsible for a well formed command line. However, I was trying to raise the fact that the programmer needs to be even more careful in the case of the program using getopt to support numeric options which doesn't seem to be that common, but could lead to unexpected trouble if a programmer decides to use that feature of getopt. Hi James, On 5/2/21 5:16 PM, bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=212887 > > --- Comment #8 from James Hunt (jamesodhunt@gmail.com) --- > Hi Alex, > > Thanks for reviewing again. I've updated the patch to v3 which now only > includes the first two changes. Okay, I'll have a look at it. > > I agree - I was in two minds about the WARNINGS section myself... > > # Validating input > > Good programmers should know to do this. But given that `getopt` is a primary > interface to a program (which might be running `SUID-root` for example), I > wonder if it needs to be "called out" more strongly? There are lots of other > syscalls and library calls that might deserve similar treatment of course. In > fact, I wonder a new "security` man page might be an interesting project, > since > that could summarise expectations, security + safety advice to programmers, > and > refer to man pages like `getopt(3)`. that would also avoid cluttering the > main > man pages as the details could be provided in the security one. I like that idea. Could you draft such a page? I know it's asking a very hard work, but maybe start with some basic ideas. Also, security is a very broad thing. Maybe small pages restricted to a specific context are easier to handle. For example, one page dedicated to security in the context of user input in programs, which could treat things like getopt(3), scanf(3), fgets(3), and their families of functions (and maybe some others) If you would like to do this, could you please start a new thread by writing a new email, and CC the mailing list <linux-man@vger.kernel.org> (and any other lists that may be interested, such as probably <libc-alpha@sourceware.org>, please?. > > # Numeric options > > Yes, the user is responsible for a well formed command line. However, I was > trying to raise the fact that the programmer needs to be even more careful in > the case of the program using getopt to support numeric options which doesn't > seem to be that common, but could lead to unexpected trouble if a programmer > decides to use that feature of getopt. > getopt(3) just gets strings (it does something more, but let's simplify). In that sense it's very similar to fgets(3) as an interface (even safer, as it doesn't write to a buffer). It's strtol(3) or whatever function that will convert the input into a number the one that has to take care of the input limits. Also, As long as the user uses the options correctly, numeric options are just another character. They're treated exactly in the same way. And in the case of long options, everything is a string, even if it has numbers in it. The only valid case where a number is treated as a number is in an option argument (in the case that the option has a numeric argument). But in this case, it is still a string at the time of getopt(3). And in this case, the programmer already expects a number, so I don't think any warning applies. It is only when the user inputs options in the wrong order that numbers and options may be mixed and unexpected things may happen, but invalid input resulting in undefined behavior is not an issue. Thanks, Alex Two minor comments to v3: s/^.IR optstring$/.I optstring/ and .PP just before .SH is not necessary (.SH already ensures there is a paragraph separator (a blank line)), and should be removed. But those were trivial changes, so I fixed those, and applied your patch. Thanks, Alex Created attachment 296701 [details]
Additional patch to clarify getopt behaviour.
Thanks Alex. Please ignore the patch I just attached. I'll raise a new bug and reference this one. I'll see if I can get some time at a weekend to look at the new security man page(s) and follow the process you recommended... Follow-on bug is: https://bugzilla.kernel.org/show_bug.cgi?id=213013. |