Re: Patch proposal: make use of regular expressions for the username in pg_hba.conf - Mailing list pgsql-hackers

On Wed, Oct 12, 2022 at 08:17:14AM +0200, Drouvot, Bertrand wrote:
> Indeed, ;-)

So, I have spent the last two days looking at all that, studying the
structure of the patch and the existing HEAD code, and it looks like
that a few things could be consolidated.

First, as of HEAD, AuthToken is only used for elements in a list of
role and database names in hba.conf before filling in each HbaLine,
hence we limit its usage to the initial parsing.  The patch assigns an
optional regex_t to it, then extends the use of AuthToken for single
hostname entries in pg_hba.conf.  Things going first: shouldn't we
combine ident_user and "re" together in the same structure?  Even if
we finish by not using AuthToken to store the computed regex, it seems
to me that we'd better use the same base structure for pg_ident.conf
and pg_hba.conf.  While looking closely at the patch, we would expand
the use of AuthToken outside its original context.  I have also looked
at make_auth_token(), and wondered if it could be possible to have this
routine compile the regexes.  This approach would not stick with
pg_ident.conf though, as we validate the fields in each line when we
put our hands on ident_user and after the base validation of a line
(number of fields, etc.).  So with all that in mind, it feels right to
not use AuthToken at all when building each HbaLine and each
IdentLine, but a new, separate, structure.  We could call that an
AuthItem (string, its compiled regex) perhaps?  It could have its own
make() routine, taking in input an AuthToken and process
pg_regcomp().  Better ideas for this new structure would be welcome,
and the idea is that we'd store the post-parsing state of an
AuthToken to something that has a compiled regex.  We could finish by
using AuthToken at the end and expand its use, but it does not feel
completely right either to have a make() routine but not be able to
compile its regular expression when creating the AuthToken.

The logic in charge of compiling the regular expressions could be
consolidated more.  The patch refactors the logic with
token_regcomp(), uses it for the user names (ident_user in
parse_ident_line() from pg_ident.conf), then extended to the hostnames
(single item) and the role/database names (list possible in these
cases).  This approach looks right to me.  Once we plug in an AuthItem
to IdentLine, token_regcomp could be changed so as it takes an
AuthToken in input, saving directly the compiled regex_t in the input
structure.

At the end, the global structure of the code should, I guess, respect
the following rules:
- The number of places where we check if a string is a regex should be
minimal (aka string beginning by '/').
- Only one code path of hba.c should call pg_regcomp() (the patch does
that), and only one code path should call pg_regexec() (two code paths
of hba.c do that with the patch, as of the need to store matching
expression).  This should minimize the areas where we call
pg_mb2wchar_with_len(), for one.

About this last point, token_regexec() does not include
check_ident_usermap() in its logic, and it seems to me that it should.
The difference is with the expected regmatch_t structures, so
shouldn't token_regexec be extended with two arguments as of an array
of regmatch_t and the number of elements in the array?  This would
save a bit some of the logic around pg_mb2wchar_with_len(), for
example.  To make all that work, token_regexec() should return an int,
coming from pg_regexec, but no specific error strings as we don't want
to spam the logs when checking hosts, roles and databases in
pg_hba.conf.

       /* Check if it has a CIDR suffix and if so isolate it */
-      cidr_slash = strchr(str, '/');
-      if (cidr_slash)
-          *cidr_slash = '\0';
+      if (!is_regexp)
+      {
+          cidr_slash = strchr(str, '/');
+          if (cidr_slash)
+              *cidr_slash = '\0';
+      }
[...]
            /* Get the netmask */
-           if (cidr_slash)
+           if (cidr_slash && !is_regexp)
            {
Some of the code handling regexes for hostnames itches me a bit, like
this one.  Perhaps it would be better to evaluate this interaction
with regular expressions separately.  The database and role names
don't have this need, so their cases are much simpler to think about.

The code could be split to tackle things step-by-step:
- One refactoring patch to introduce token_regcomp() and
token_regexec(), with the introduction of a new structure that
includes the compiled regexes.  (Feel free to counterargue about the
use of AuthToken for this purpose, of course!)
- Plug in the refactored logic for the lists of role names and
database names in pg_hba.conf.
- Handle the case of single host entries in pg_hba.conf.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: hash_xlog_split_allocate_page: failed to acquire cleanup lock
Next
From: Michael Paquier
Date:
Subject: Re: [meson] add missing pg_attribute_aligned for MSVC in meson build