Thread: Fwd: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

Fwd: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Jeremy Schneider

Date:

20 July 2018, 02:17:44

I'd like to bump this old bug that Lloyd filed for more discussion. It
seems serious enough to me that we should at least talk about it.

Anyone with simply the login privilege and the ability to run SQL can
instantly block all new incoming connections to a DB including new
superuser connections.

session 1:
select pg_sleep(9999999999) from pg_stat_activity;

session 2:
vacuum full pg_authid; -or- truncate table pg_authid;

(there are likely other SQL you could run in session 2 as well.)

-------- Forwarded Message --------
Subject: BUG #15182: Canceling authentication due to timeout aka Denial
of Service Attack
Date: Mon, 30 Apr 2018 20:41:11 +0000
From: PG Bug reporting form <noreply@postgresql.org>
Reply-To: lalbin@scharp.org, pgsql-bugs@lists.postgresql.org
To: pgsql-bugs@lists.postgresql.org
CC: lalbin@scharp.org

The following bug has been logged on the website:

Bug reference:      15182
Logged by:          Lloyd Albin
Email address:      lalbin@scharp.org
PostgreSQL version: 10.3
Operating system:   OpenSUSE
Description:
Over the last several weeks our developers caused a Denial of Service Attack
against ourselves by accident. When looking at the log files, I noticed that
we had authentication timeouts during these time periods. In researching the
problem I found this is due to locks being held on shared system catalog
items, aka system catalog items that are shared between all databases on the
same cluster/server. This can be caused by beginning a long running
transaction that queries pg_stat_activity, pg_roles, pg_database, etc and
then another connection that runs either a REINDEX DATABASE, REINDEX SYSTEM,
or VACUUM FULL. This issue is of particular importance to database resellers
who use the same cluster/server for multiple clients, as two clients can
cause this issue to happen inadvertently or a single client can either cause
it to happen maliciously or inadvertently. Note: The large cloud providers
give each of their clients their own cluster/server so this will not affect
across cloud clients but can affect an individual client. The problem is
that traditional hosting companies will have all clients from one or more
web servers share the same PostgreSQL cluster/server. This means that one or
two clients could inadvertently stop all the other clients from being able
to connect to their databases until the first client does either a COMMIT or
ROLLBACK of their transaction which they could hold open for hours, which is
what happened to us internally.

In Connection 1 we need to BEGIN a transaction and then query a shared
system item; pg_authid, pg_database, etc; or a view that depends on a shared
system item; pg_stat_activity, pg_roles, etc. Our developers were accessing
pg_roles.

Connection 1 (Any database, Any User)
BEGIN;
SELECT * FROM pg_stat_activity;

Connection 2 (Any database will do as long as you are the database owner)
REINDEX DATABASE postgres;

Connection 3 (Any Database, Any User)
psql -h sqltest-alt -d sandbox

All future Connection 3's will hang for however long the transaction in
Connection 1 runs. In our case this was hours and denied everybody else the
ability to log into the server until Connection 1 was committed. psql will
just hang for hours, even overnight in my testing, but our apps would get
the "Canceling authentication due to timeout" after 1 minute.

Connection 2 can also do any of these commands to also cause the same
issue:
REINDEX SYSTEM postgres;
VACUUM FULL pg_authid;
vacuumdb -f -h sqltest-alt -d lloyd -U lalbin

Even worse is that the VACUUM FULL pg_authid; can be started by an
unprivileged user and it will wait for the AccessShareLock by connection 1
to be released before returning the error that you don't have permission to
perform this action, so even an unprivileged user can cause this to happen.
The privilege check needs to happen before the waiting for the
AccessExclusiveLock happens.

This bug report has been simplified and shorted drastically. To read the
full information about this issue please see my blog post:
http://lloyd.thealbins.com/Canceling%20authentication%20due%20to%20timeout

Lloyd Albin
Database Administrator
Statistical Center for HIV/AIDS Research and Prevention (SCHARP)
Fred Hutchinson Cancer Research Center


-- 
Jeremy Schneider
Database Engineer
Amazon Web Services

Re: BUG #15182: Canceling authentication due to timeout aka Denial ofService Attack

From

Marko Tiikkaja

Date:

21 July 2018, 00:56:13

On Fri, Jul 20, 2018 at 2:17 AM, Jeremy Schneider <schnjere@amazon.com> wrote:

I'd like to bump this old bug that Lloyd filed for more discussion. It
seems serious enough to me that we should at least talk about it.

Anyone with simply the login privilege and the ability to run SQL can
instantly block all new incoming connections to a DB including new
superuser connections.

So.. don't VACUUM FULL pg_authid without lock_timeout?

I can come up with dozens of ways to achieve the same effect, all of them silly.

Re: BUG #15182: Canceling authentication due to timeout aka Denial ofService Attack

From

Robert Haas

Date:

23 July 2018, 18:29:33

On Thu, Jul 19, 2018 at 7:17 PM, Jeremy Schneider <schnjere@amazon.com> wrote:
> I'd like to bump this old bug that Lloyd filed for more discussion. It
> seems serious enough to me that we should at least talk about it.
>
> Anyone with simply the login privilege and the ability to run SQL can
> instantly block all new incoming connections to a DB including new
> superuser connections.
>
> session 1:
> select pg_sleep(9999999999) from pg_stat_activity;
>
> session 2:
> vacuum full pg_authid; -or- truncate table pg_authid;
>
> (there are likely other SQL you could run in session 2 as well.)

ExecuteTruncate needs to be refactored to use RangeVarGetRelidExtended
with a non-NULL callback rather than heap_openrv, and
expand_vacuum_rel needs to use RangeVarGetRelidExtended with a
callback instead of RangeVarGetRelid.  See
cbe24a6dd8fb224b9585f25b882d5ffdb55a0ba5 as an example of what to do.
I fixed a large number of cases of this problem back around that time,
but then ran out of steam and had to move onto other things before I
got them all.  Patches welcome.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #15182: Canceling authentication due to timeout aka Denial ofService Attack

From

Jeff Janes

Date:

23 July 2018, 23:14:40

On Fri, Jul 20, 2018 at 5:56 PM, Marko Tiikkaja <marko@joh.to> wrote:

On Fri, Jul 20, 2018 at 2:17 AM, Jeremy Schneider <schnjere@amazon.com> wrote:
I'd like to bump this old bug that Lloyd filed for more discussion. It
seems serious enough to me that we should at least talk about it.

Anyone with simply the login privilege and the ability to run SQL can
instantly block all new incoming connections to a DB including new
superuser connections.

So.. don't VACUUM FULL pg_authid without lock_timeout?

That's like saying the solution to a security hole is for no one to attempt to exploit it.

Note that you do not need to have permissions to do the vacuum full. This works merely from the attempt to do so, before the permissions are checked.

Cheers,

Jeff

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

24 July 2018, 07:14:03

On Mon, Jul 23, 2018 at 11:29:33AM -0400, Robert Haas wrote:
> ExecuteTruncate needs to be refactored to use RangeVarGetRelidExtended
> with a non-NULL callback rather than heap_openrv, and
> expand_vacuum_rel needs to use RangeVarGetRelidExtended with a
> callback instead of RangeVarGetRelid.  See
> cbe24a6dd8fb224b9585f25b882d5ffdb55a0ba5 as an example of what to do.
> I fixed a large number of cases of this problem back around that time,
> but then ran out of steam and had to move onto other things before I
> got them all.  Patches welcome.

Thanks for pointing those out, I looked at both code paths recently for
some other work...  The amount of work does not consist just in using
for example RangeVarCallbackOwnsRelation for VACUUM and TRUNCATE.  There
are a couple of reasons behind that:
- While it would make sense, at least to me, to make VACUUM fall into if
allow_system_table_mods is allowed, that's not the case of ANALYZE as I
think that we should be able to call ANALYZE on a system catalog as
well.  So we would basically a new flavor of
RangeVarCallbackOwnsRelation for VACUUM which makes this difference
between vacuum and analyze with an argument in the callback, the options
of VacuumStmt would be nice.  This would not be used by autovacuum
anyway, but adding an assertion and mentioning that in the comments
would not hurt.  There is an argument for just restricting VACUUM FULL
as well and not plain VACUUM, as that's the one hurting here.
- TRUNCATE is closer to a solution, as it has its own flavor of relation
checks with truncate_check_rel.  So the callback would replace
truncate_check_rel but CheckTableNotInUse should be moved out of it.
TRUNCATE already uses allow_system_table_mods for its checks.

Thoughts?
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack

From

Andres Freund

Date:

24 July 2018, 07:17:53

On July 23, 2018 9:14:03 PM PDT, Michael Paquier <michael@paquier.xyz> wrote:
>- While it would make sense, at least to me, to make VACUUM fall into
>if
>allow_system_table_mods is allowed,

I might be mis-parsing this due to typos. Are you actually suggesting vacuum on system tables should depend on that
GUC?If so, why? That's seems like a terrible idea. It's pretty normal to occasionally have to vacuum them? 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

24 July 2018, 07:50:10

On Mon, Jul 23, 2018 at 09:17:53PM -0700, Andres Freund wrote:
> I might be mis-parsing this due to typos. Are you actually suggesting
> vacuum on system tables should depend on that GUC? If so, why? That's
> seems like a terrible idea.  It's pretty normal to occasionally have
> to vacuum them?

Oh, yes, that would be bad.  My mind has slipped here.  I have seen
manual VACUUMs on system catalogs for applications using many temp
tables...  So we would want to have only VACUUM FULL being conditionally
happening?  The question comes then about what to do when a VACUUM FULL
is run without a list of relations because expand_vacuum_rel() is not
actually the only problem.  Would we want to ignore system tables as
well except if allow_system_table_mods is on?  When no relation list is
specified, get_all_vacuum_rels() builds the list of relations which
causes vacuum_rel() to complain on try_relation_open(), so patching
just expand_vacuum_rel() solves only half of the problem for manual
VACUUMs.
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack

From

Andres Freund

Date:

24 July 2018, 07:51:54


On July 23, 2018 9:50:10 PM PDT, Michael Paquier <michael@paquier.xyz> wrote:
>On Mon, Jul 23, 2018 at 09:17:53PM -0700, Andres Freund wrote:
>> I might be mis-parsing this due to typos. Are you actually suggesting
>> vacuum on system tables should depend on that GUC? If so, why? That's
>> seems like a terrible idea.  It's pretty normal to occasionally have
>> to vacuum them?
>
>Oh, yes, that would be bad.  My mind has slipped here.  I have seen
>manual VACUUMs on system catalogs for applications using many temp
>tables...  So we would want to have only VACUUM FULL being
>conditionally
>happening?  The question comes then about what to do when a VACUUM FULL
>is run without a list of relations because expand_vacuum_rel() is not
>actually the only problem.  Would we want to ignore system tables as
>well except if allow_system_table_mods is on?  When no relation list is
>specified, get_all_vacuum_rels() builds the list of relations which
>causes vacuum_rel() to complain on try_relation_open(), so patching
>just expand_vacuum_rel() solves only half of the problem for manual
>VACUUMs.

I think any such restriction is entirely unacceptable. FULL or not.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

24 July 2018, 08:23:02

On Mon, Jul 23, 2018 at 09:51:54PM -0700, Andres Freund wrote:
> On July 23, 2018 9:50:10 PM PDT, Michael Paquier <michael@paquier.xyz> wrote:
>> Oh, yes, that would be bad.  My mind has slipped here.  I have seen
>> manual VACUUMs on system catalogs for applications using many temp
>> tables...  So we would want to have only VACUUM FULL being
>> conditionally
>> happening?  The question comes then about what to do when a VACUUM FULL
>> is run without a list of relations because expand_vacuum_rel() is not
>> actually the only problem.  Would we want to ignore system tables as
>> well except if allow_system_table_mods is on?  When no relation list is
>> specified, get_all_vacuum_rels() builds the list of relations which
>> causes vacuum_rel() to complain on try_relation_open(), so patching
>> just expand_vacuum_rel() solves only half of the problem for manual
>> VACUUMs.
>
> I think any such restriction is entirely unacceptable. FULL or not.

Well, letting any users take an exclusive lock on system catalogs at
will is not acceptable either, so two possible answers would be to fail
or skip such relations.  The first concept applies if a relation list is
given by the user, and the second if no list is given.

Do you have any thoughts on the matter?
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

24 July 2018, 08:25:40

On Tue, Jul 24, 2018 at 02:23:02PM +0900, Michael Paquier wrote:
> Well, letting any users take an exclusive lock on system catalogs at
> will is not acceptable either, so two possible answers would be to fail
> or skip such relations.  The first concept applies if a relation list is
> given by the user, and the second if no list is given.

The first sentence is incorrect.  That's actually "letting any users
attempt to take an exclusive lock which makes others to be stuck as
well".
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Alvaro Herrera

Date:

24 July 2018, 08:34:13

On 2018-Jul-24, Michael Paquier wrote:

> On Mon, Jul 23, 2018 at 11:29:33AM -0400, Robert Haas wrote:
> > ExecuteTruncate needs to be refactored to use RangeVarGetRelidExtended
> > with a non-NULL callback rather than heap_openrv, and
> > expand_vacuum_rel needs to use RangeVarGetRelidExtended with a
> > callback instead of RangeVarGetRelid.  See
> > cbe24a6dd8fb224b9585f25b882d5ffdb55a0ba5 as an example of what to do.
> > I fixed a large number of cases of this problem back around that time,
> > but then ran out of steam and had to move onto other things before I
> > got them all.  Patches welcome.
> 
> Thanks for pointing those out, I looked at both code paths recently for
> some other work...  The amount of work does not consist just in using
> for example RangeVarCallbackOwnsRelation for VACUUM and TRUNCATE.

I don't think we're forced to reuse the existing callbacks -- maybe
write a specific callback for each case, if really needed.  But anyway
like Andres I don't think this is related to allow_system_table_mods at
all; you just need to do the checks in the right order, no?

But I don't see why RangeVarCallbackOwnsTable isn't sufficient.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

24 July 2018, 08:52:35

On Tue, Jul 24, 2018 at 01:34:13AM -0400, Alvaro Herrera wrote:
> But I don't see why RangeVarCallbackOwnsTable isn't sufficient.

The set of relkinds checked by truncate_check_rel and
RangeVarCallbackOwnsTable is different (toast and matviews).  And in the
case of VACUUM, partitioned tables can call RangeVarGetRelidExtended
when one is listed as part of a manual VACUUM command.
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Kyotaro HORIGUCHI

Date:

24 July 2018, 12:27:16

At Tue, 24 Jul 2018 14:23:02 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180724052302.GB4736@paquier.xyz>
> On Mon, Jul 23, 2018 at 09:51:54PM -0700, Andres Freund wrote:
> > On July 23, 2018 9:50:10 PM PDT, Michael Paquier <michael@paquier.xyz> wrote:
> >> Oh, yes, that would be bad.  My mind has slipped here.  I have seen
> >> manual VACUUMs on system catalogs for applications using many temp
> >> tables...  So we would want to have only VACUUM FULL being
> >> conditionally
> >> happening?  The question comes then about what to do when a VACUUM FULL
> >> is run without a list of relations because expand_vacuum_rel() is not
> >> actually the only problem.  Would we want to ignore system tables as
> >> well except if allow_system_table_mods is on?  When no relation list is
> >> specified, get_all_vacuum_rels() builds the list of relations which
> >> causes vacuum_rel() to complain on try_relation_open(), so patching
> >> just expand_vacuum_rel() solves only half of the problem for manual
> >> VACUUMs.
> > 
> > I think any such restriction is entirely unacceptable. FULL or not.
> 
> Well, letting any users attempt to take an exclusive lock which
> makes others to be stuck as well is not acceptable either, so
> two possible answers would be to fail
> or skip such relations.  The first concept applies if a relation list is
> given by the user, and the second if no list is given.
> 
> Do you have any thoughts on the matter?

I'm not sure what is the exact problem here. If it is that a
non-privilege user can stuck on VACUUM FULL on the scenario, we
should just give up VACUUM_FULLing on unprivileged relations
*before* trying to take a lock on it. There's no reason for
allowing VACUUM FULL on a relation on that the same user is not
allowed to run VACUUM. I don't think it's a prbolem that
superuser can be involed in the blocking scenario running VACUUM
FULL.

I may be misunderstanding something because (really ?) it's still
extremely hot in Japan, today.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

26 July 2018, 07:49:07

On Tue, Jul 24, 2018 at 06:27:16PM +0900, Kyotaro HORIGUCHI wrote:
> I may be misunderstanding something because (really ?) it's still
> extremely hot in Japan, today.

It is better since yesterday, in exchange of a typhoon heading straight
to Tokyo :)
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

26 July 2018, 08:24:16

Hi,

So, I have spent the last couple of days trying to figure out a nice
solution for VACUUM, TRUNCATE and REINDEX, and attached is a set of
three patches to address the issues of this thread:
1) 0001 does the work for TRUNCATE, but using RangeVarGetRelidExtended
with a custom callback based on the existing truncate_check_rel().  I
had to split the session-level checks into a separate routine as no
Relation is available, but that finishes by being a neat result without
changing any user-facing behavior.
2) 0002 does the work for VACUUM.  It happens that vacuum_rel() already
does all the skip-related checks we need to know about to decide if a
relation needs to be vacuum'ed or not, so I refactored things as
follows:
2-1) For a VACUUM manual listing relations, issue an ERROR if it cannot
be vacuum'ed. Previously vacuum_rel() would just log a WARNING and call
it a day *after* locking the relation.  But as we need to rely on
RangeVarGetRelidExtended() an ERROR is necessary.  The ERROR happens
only if VACUUM FULL is used.
2-2) When a relation list is not specified in a manual VACUUM command,
then the decision to skip the relation is done in get_all_vacuum_rels()
when building the relation list with the pg_class lookup.  This logs a
DEBUG message when the relation is skipped, which is more information
that what we have now.  The checks need to happen again in vacuum_rel as
the VACUUM work could be spawned across multiple transactions, where a
WARNING is logged.
3) REINDEX is already smart enough to check for ownership of relations
if one is manually listed and reports an ERROR.  However it can cause
the instance to be stuck when doing a database-wide REINDEX on a
database using just the owner of this database.  In this case it seems
to me that we need to make ReindexMultipleTables in terms of ACL
checks, as per 0003.

I quite like the shape of the patches proposed here, and the refactoring
is I think pretty clear.  Each patch can be treated independently as
well.  Comments are welcome.  (Those patches are not indented yet, which
does not matter much at this stage anyway.)

Thanks,
--
Michael

Attachment

Re: BUG #15182: Canceling authentication due to timeout aka Denial ofService Attack

From

"Bossart, Nathan"

Date:

26 July 2018, 18:06:59

I took a look at 0001.

On 7/26/18, 12:24 AM, "Michael Paquier" <michael@paquier.xyz> wrote:
> 1) 0001 does the work for TRUNCATE, but using RangeVarGetRelidExtended
> with a custom callback based on the existing truncate_check_rel().  I
> had to split the session-level checks into a separate routine as no
> Relation is available, but that finishes by being a neat result without
> changing any user-facing behavior.

Splitting the checks like this seems reasonable.  As you pointed out,
it doesn't change the behavior of the session checks, which AFAICT
aren't necessary for the kind of permissions checks we want to add to
the RangeVarGetRelidExtended() call.

-        myrelid = RelationGetRelid(rel);
+        myrelid = RangeVarGetRelidExtended(rv, AccessExclusiveLock,
+                                           false, RangeVarCallbackForTruncate,
+                                           NULL);

Should the flags argument be 0 instead of false?

+    /* Nothing to do if the relation was not found. */
+    if (!OidIsValid(relId))
+        return;
+
+    tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relId));
+    if (!HeapTupleIsValid(tuple))    /* should not happen */
+        elog(ERROR, "cache lookup failed for relation %u", relId);

The first time we use this callback, the relation won't be locked, so
isn't it possible that we won't get a valid tuple here?  I did notice
that callbacks like RangeVarCallbackForRenameRule,
RangeVarCallbackForPolicy, and RangeVarCallbackForRenameTrigger assume
that the relation can be concurrently dropped, but
RangeVarCallbackOwnsRelation does not.  Instead, we assume that the
syscache search will succeed if the given OID is valid.  Is this a
bug, or am I missing something?

Nathan

Re: BUG #15182: Canceling authentication due to timeout aka Denial ofService Attack

From

"Bossart, Nathan"

Date:

26 July 2018, 18:40:07

On 7/26/18, 10:07 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> The first time we use this callback, the relation won't be locked, so
> isn't it possible that we won't get a valid tuple here?  I did notice
> that callbacks like RangeVarCallbackForRenameRule,
> RangeVarCallbackForPolicy, and RangeVarCallbackForRenameTrigger assume
> that the relation can be concurrently dropped, but
> RangeVarCallbackOwnsRelation does not.  Instead, we assume that the
> syscache search will succeed if the given OID is valid.  Is this a
> bug, or am I missing something?

Please pardon the noise.  I see that we don't accept invalidation
messages until later on in RangeVarGetRelidExtended(), at which point
we'll retry and get InvalidOid for concurrently dropped relations.

Nathan

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

26 July 2018, 23:41:27

On Thu, Jul 26, 2018 at 03:06:59PM +0000, Bossart, Nathan wrote:
> I took a look at 0001.

Thanks for the lookup.  0003 is the most simple in the set by the way.

> On 7/26/18, 12:24 AM, "Michael Paquier" <michael@paquier.xyz> wrote:
> -    myrelid = RelationGetRelid(rel);
> +    myrelid = RangeVarGetRelidExtended(rv, AccessExclusiveLock,
> +        false, RangeVarCallbackForTruncate, NULL);
>
> Should the flags argument be 0 instead of false?

Yes, those should be 0.  All patches are missing that.  It does not have
a bad consequence on the patch, still that's incorrect.
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denial ofService Attack

From

"Bossart, Nathan"

Date:

27 July 2018, 02:14:31

On 7/26/18, 3:42 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> Thanks for the lookup.  0003 is the most simple in the set by the way.

Minus a possible documentation update, 0003 seems almost ready, too.

For 0002, would adding vacuum_skip_rel() before and after
try_relation_open() in vacuum_rel() be enough?  That way, we could
avoid emitting an ERROR for only the VACUUM FULL case (and skip it
with a WARNING like everything else).

Nathan

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

27 July 2018, 05:31:23

On Thu, Jul 26, 2018 at 11:14:31PM +0000, Bossart, Nathan wrote:
> On 7/26/18, 3:42 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> Minus a possible documentation update, 0003 seems almost ready, too.

The docs mentioned that shared catalogs are processed, so I did not
bother, but visibly your comment is that we could be more precise about
the ownership in this case?  An attempt is attached.

> For 0002, would adding vacuum_skip_rel() before and after
> try_relation_open() in vacuum_rel() be enough?  That way, we could
> avoid emitting an ERROR for only the VACUUM FULL case (and skip it
> with a WARNING like everything else).

Er, we need a lock on it while looking at its data in vacuum_rel(), no?
So the call to vacuum_skip_rel() needs to happen after
try_relation_open(), once we are sure that the relation is opened.
Having two set of checks is actually better as the operation can involve
multiple operations.

The error messages generated by vacuum_skip_rel are not especially smart
when elevel >= ERROR.  As we need a proper errcode for that case as well
just using a separate error message is less confusing.  I have
implemented my idea in the updated set attached.  Another issue I have
found is that when doing for example a system-wide analyze, we would
finish with spurious warnings, as toast relations need to be discarded
from the first set of relations built.

Anyway, I have done more work on the patches, mainly I have fixed the
calls to RangeVarGetRelidExtended using booleans.  I have added
isolation tests for cases which are cheap, aka those not involving a
system-wide operation.  Running those tests on HEAD, it is easy to see
that TRUNCATE or VACUUM complete after a session doing a catalog lookup
commits its transaction.  VACUUM skips a relation and VACUUM FULL issues
an error.

Regarding those patches, I am pretty happy how things turn out for
TRUNCATE and REINDEX, way less for VACUUM, so getting 0001 and 0003
committed first makes the most sense to me as their logic is rather
straight-forward (well way of speaking ;p ).
--
Michael

Attachment

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Kyotaro HORIGUCHI

Date:

27 July 2018, 06:37:02

Hello.

At Fri, 27 Jul 2018 11:31:23 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180727023123.GE1754@paquier.xyz>
> On Thu, Jul 26, 2018 at 11:14:31PM +0000, Bossart, Nathan wrote:
> > On 7/26/18, 3:42 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> > Minus a possible documentation update, 0003 seems almost ready, too.
> 
> The docs mentioned that shared catalogs are processed, so I did not
> bother, but visibly your comment is that we could be more precise about
> the ownership in this case?  An attempt is attached.
> 
> > For 0002, would adding vacuum_skip_rel() before and after
> > try_relation_open() in vacuum_rel() be enough?  That way, we could
> > avoid emitting an ERROR for only the VACUUM FULL case (and skip it
> > with a WARNING like everything else).
> 
> Er, we need a lock on it while looking at its data in vacuum_rel(), no?
> So the call to vacuum_skip_rel() needs to happen after
> try_relation_open(), once we are sure that the relation is opened.
> Having two set of checks is actually better as the operation can involve
> multiple operations.

I intended the same as Bossart said. It is not reasonable that
VACUUM and VACUUM FULL behaves differently on rejection for the
same reason. The objective of the first (or early) check is
rejecting (non-superuser's) unprivileged VACUUM FULL. Any
possible modifications on a syscache tuple before taking a lock
doesn't seem to harm in the point of view. Anyway the lock
acquired in expand_vacuum_rel is not held until vacuum_rel.

> The error messages generated by vacuum_skip_rel are not especially smart
> when elevel >= ERROR.  As we need a proper errcode for that case as well
> just using a separate error message is less confusing.  I have

Apart from the discussion above, it is uneasy for me that the
messages seem heavily affected by the callers context.

> implemented my idea in the updated set attached.  Another issue I have
> found is that when doing for example a system-wide analyze, we would
> finish with spurious warnings, as toast relations need to be discarded
> from the first set of relations built.

It seems reasonable.

> Anyway, I have done more work on the patches, mainly I have fixed the
> calls to RangeVarGetRelidExtended using booleans.  I have added

I think that it is useful to let callback reutrn bool value that
indicates whether stop or go. This allows WARNING in the
callback.

> isolation tests for cases which are cheap, aka those not involving a
> system-wide operation.  Running those tests on HEAD, it is easy to see
> that TRUNCATE or VACUUM complete after a session doing a catalog lookup
> commits its transaction.  VACUUM skips a relation and VACUUM FULL issues
> an error.
> 
> Regarding those patches, I am pretty happy how things turn out for
> TRUNCATE and REINDEX, way less for VACUUM, so getting 0001 and 0003
> committed first makes the most sense to me as their logic is rather
> straight-forward (well way of speaking ;p ).

About 0001, I'm not sure what the "session-level" means but it
looks fine and works as expected. 0003 looks fine overall. I
think I heard that gender-free wording is appreciated but it
might be only about documentation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

27 July 2018, 07:15:42

On Fri, Jul 27, 2018 at 12:37:02PM +0900, Kyotaro HORIGUCHI wrote:
> I intended the same as Bossart said. It is not reasonable that
> VACUUM and VACUUM FULL behaves differently on rejection for the
> same reason. The objective of the first (or early) check is
> rejecting (non-superuser's) unprivileged VACUUM FULL. Any
> possible modifications on a syscache tuple before taking a lock
> doesn't seem to harm in the point of view. Anyway the lock
> acquired in expand_vacuum_rel is not held until vacuum_rel.

Thinking about it harder, it is possible to pass down a status pointer
that the callback of RangeVarGetRelidExtended could fill if we pass it
down using the argument list.  This would need special handling for the
case where expand_vacuum_rel() is NIL but that would be workable.  If
vacuum_check_rel() does not need to work with elevel >= ERROR, then the
set of log messages remains the same.

> Apart from the discussion above, it is uneasy for me that the
> messages seem heavily affected by the callers context.

Sure, the patch for vacuum is still heavily WIP.  I am not much happy
about that either.

>> Regarding those patches, I am pretty happy how things turn out for
>> TRUNCATE and REINDEX, way less for VACUUM, so getting 0001 and 0003
>> committed first makes the most sense to me as their logic is rather
>> straight-forward (well way of speaking ;p ).
>
> About 0001, I'm not sure what the "session-level" means but it
> looks fine and works as expected. 0003 looks fine overall.

What I mean here is checking the interactions with other backends,
truncate_check_session is the best name I could come up with.
truncate_check_activity was another.  I am not attached to a single
name, if you have an ideaof course please feel free.

> I think I heard that gender-free wording is appreciated but it
> might be only about documentation.

Sure.
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denial ofService Attack

From

"Bossart, Nathan"

Date:

27 July 2018, 17:40:42

On 7/26/18, 11:16 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> On Fri, Jul 27, 2018 at 12:37:02PM +0900, Kyotaro HORIGUCHI wrote:
>> I intended the same as Bossart said. It is not reasonable that
>> VACUUM and VACUUM FULL behaves differently on rejection for the
>> same reason. The objective of the first (or early) check is
>> rejecting (non-superuser's) unprivileged VACUUM FULL. Any
>> possible modifications on a syscache tuple before taking a lock
>> doesn't seem to harm in the point of view. Anyway the lock
>> acquired in expand_vacuum_rel is not held until vacuum_rel.
>
> Thinking about it harder, it is possible to pass down a status pointer
> that the callback of RangeVarGetRelidExtended could fill if we pass it
> down using the argument list.  This would need special handling for the
> case where expand_vacuum_rel() is NIL but that would be workable.  If
> vacuum_check_rel() does not need to work with elevel >= ERROR, then the
> set of log messages remains the same.

Perhaps I am misunderstanding the goal of the patch.  The original
issue reported above is for a lock pile-up that blocks connection
authentication.  Specifically, VACUUM FULL blocks waiting for an
AccessExclusiveLock on pg_authid in vacuum_rel(), even if the user
does not have the right permissions.  IIUC 0002 is adding a callback
to the OID lookup logic in expand_vacuum_rel() that aims to prevent
users from taking an AccessShareLock when they don't have permissions
to VACUUM the relation.  However, ISTM that this AccessShareLock is
not the issue.  If we allowed all users to get the AccessShareLock but
we ensured we called vacuum_skip_rel() prior to waiting for the
AccessExclusiveLock, I think we are good.  Of course, there's a chance
that a concurrent change would cause the role to lose permissions in
between expand_vacuum_rel() and vacuum_rel(), but that seems like a
risk we are taking regardless.

I think I'm essentially suggesting what you have in 0002 but without
the new RangeVarGetRelidExtended() callback.  I've attached a modified
version of 0002 that seems to fix the originally reported issue.  (I
haven't looked into any extra handling needed for ANALYZE or
partitioned tables.)  Running the same checks for all VACUUMs would
keep things simple and provide a more uniform user experience.

> The docs mentioned that shared catalogs are processed, so I did not
> bother, but visibly your comment is that we could be more precise about
> the ownership in this case?  An attempt is attached.

Sorry, I should have been clearer.  But yes, your update is what I was
thinking.

Nathan

Attachment

0002-idea.patch

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

28 July 2018, 03:10:12

On Fri, Jul 27, 2018 at 02:40:42PM +0000, Bossart, Nathan wrote:
> On 7/26/18, 11:16 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> I think I'm essentially suggesting what you have in 0002 but without
> the new RangeVarGetRelidExtended() callback.  I've attached a modified
> version of 0002 that seems to fix the originally reported issue.  (I
> haven't looked into any extra handling needed for ANALYZE or
> partitioned tables.)  Running the same checks for all VACUUMs would
> keep things simple and provide a more uniform user experience.

Okay, let me check that.  Your patch has at least an error in
get_all_vacuum_rels() where toast relations cannot be skipped.

>> The docs mentioned that shared catalogs are processed, so I did not
>> bother, but visibly your comment is that we could be more precise about
>> the ownership in this case?  An attempt is attached.
>
> Sorry, I should have been clearer.  But yes, your update is what I was
> thinking.

No problem.  If there are no objections, I am going to fix the REINDEX
issue first and back-patch.  Its patch is the least invasive of the
set.
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denial ofService Attack

From

"Bossart, Nathan"

Date:

29 July 2018, 19:11:38

On 7/27/18, 7:10 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> No problem.  If there are no objections, I am going to fix the REINDEX
> issue first and back-patch.  Its patch is the least invasive of the
> set.

This seems like a reasonable place to start.  I took a closer look at
0003.

+        /*
+         * We allow the user to reindex a table if he is superuser, the table
+         * owner, or the database owner (but in the latter case, only if it's not
+         * a shared relation).
+         */
+        if (!(pg_class_ownercheck(relid, GetUserId()) ||
+              (pg_database_ownercheck(MyDatabaseId, GetUserId()) &&
+               !classtuple->relisshared)))
+            continue;

This is added to ReindexMultipleTables(), which is used for REINDEX
SCHEMA, REINDEX SYSTEM, and REINDEX DATABASE.  Presently, REINDEX
SCHEMA requires being the owner of the schema, and REINDEX SYSTEM and
REINDEX DATABASE require being the owner of the database.  So, if a
role is an owner of a database or the pg_catalog schema, they are able
to reindex shared catalogs like pg_authid.

This patch changes things so that only superusers or the owner of the
table can reindex shared relations.  This works as expected for
REINDEX SYSTEM and REINDEX DATABASE, but I think there is a problem
for REINDEX SCHEMA.  If the user is not the owner of the database but
is the owner of the schema, they will not be able to use REINDEX
SCHEMA to reindex any tables that they do not own.  To fix this, we
will likely need to use pg_namespace_ownercheck() instead of
pg_database_ownercheck() for the REINDEX_OBJECT_SCHEMA case.

I also noticed that this patch causes shared relations to be skipped
silently.  Perhaps we should emit a WARNING or DEBUG message when this
happens, at least for REINDEXOPT_VERBOSE.

    Reindexing a single index or table requires being the owner of that
    index or table.  Reindexing a database requires being the owner of
    the database (note that the owner can therefore rebuild indexes of
-   tables owned by other users).  Of course, superusers can always
+   tables owned by other users).  Reindexing a shared catalog requires
+   being the owner of that shared catalog.  Of course, superusers can always
    reindex anything.

I noticed that there is no mention that the owner of a schema can do
REINDEX SCHEMA, which seems important to note.  Also, the proposed
wording might seem slightly ambiguous for the REINDEX DATABASE case.
It might be clearer to say something like the following:

        Reindexing a single index or table requires being the owner of
        that index of table.  REINDEX DATABASE and REINDEX SYSTEM
        require being the owner of the database, and REINDEX SCHEMA
        requires being the owner of the schema (note that the user can
        therefore rebuild indexes of tables owned by other users).
        Reindexing a shared catalog requires being the owner of the
        shared catalog, even if the user is the owner of the specified
        database or schema.  Of course, superusers can always reindex
        anything.

Nathan

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

30 July 2018, 03:34:22

On Sun, Jul 29, 2018 at 04:11:38PM +0000, Bossart, Nathan wrote:
> On 7/27/18, 7:10 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> > No problem.  If there are no objections, I am going to fix the REINDEX
> > issue first and back-patch.  Its patch is the least invasive of the
> > set.
>
> This seems like a reasonable place to start.  I took a closer look at
> 0003.

Thanks for looking at it!

> This is added to ReindexMultipleTables(), which is used for REINDEX
> SCHEMA, REINDEX SYSTEM, and REINDEX DATABASE.  Presently, REINDEX
> SCHEMA requires being the owner of the schema, and REINDEX SYSTEM and
> REINDEX DATABASE require being the owner of the database.  So, if a
> role is an owner of a database or the pg_catalog schema, they are able
> to reindex shared catalogs like pg_authid.

Yeah, I was testing that yesterday night and bumped on this case when
trying do a REINDEX SCHEMA pg_class.  The point is that you can simplify
the check and remove pg_database_ownercheck as there is already an ACL
check on the database/system/schema at the top of the routine, so you
are already sure that pg_database_ownercheck() or
pg_namespace_ownercheck would return true.  This shaves a few cycles as
well for each relation checked.

> I also noticed that this patch causes shared relations to be skipped
> silently.  Perhaps we should emit a WARNING or DEBUG message when this
> happens, at least for REINDEXOPT_VERBOSE.

That's intentional.  I thought about that as well, but I am hesitant to
do so as we don't bother mentioning the other relations skipped.
REINDEX VERBOSE also shows up what are the tables processed, so it is
easy to guess what are the tables skipped, still more painful.  And the
documentation changes added cover the gap.

> I noticed that there is no mention that the owner of a schema can do
> REINDEX SCHEMA, which seems important to note.  Also, the proposed
> wording might seem slightly ambiguous for the REINDEX DATABASE case.
> It might be clearer to say something like the following:
>
>         Reindexing a single index or table requires being the owner of
>         that index of table.  REINDEX DATABASE and REINDEX SYSTEM
>         require being the owner of the database, and REINDEX SCHEMA
>         requires being the owner of the schema (note that the user can
>         therefore rebuild indexes of tables owned by other users).
>         Reindexing a shared catalog requires being the owner of the
>         shared catalog, even if the user is the owner of the specified
>         database or schema.  Of course, superusers can always reindex
>         anything.

+1, I have included your suggestions.  The patch attached applies easily
down to 9.5 where REINDEX SCHEMA was added.  For 9.4 and 9.3, there is
no schema case, still the new check is similar.  The commit message is
slightly changed so as there is no mention of REINDEX SCHEMA.  9.3 needs
a slight change compared to 9.4 as well.

So attached are patches for 9.5~master, 9.4 and 9.3 with commit
messages.  Does that look fine to folks of this thread?
--
Michael

Attachment

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Kyotaro HORIGUCHI

Date:

30 July 2018, 11:53:54

At Mon, 30 Jul 2018 09:34:22 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180730003422.GA2878@paquier.xyz>
> On Sun, Jul 29, 2018 at 04:11:38PM +0000, Bossart, Nathan wrote:
> > On 7/27/18, 7:10 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> > This is added to ReindexMultipleTables(), which is used for REINDEX
> > SCHEMA, REINDEX SYSTEM, and REINDEX DATABASE.  Presently, REINDEX
> > SCHEMA requires being the owner of the schema, and REINDEX SYSTEM and
> > REINDEX DATABASE require being the owner of the database.  So, if a
> > role is an owner of a database or the pg_catalog schema, they are able
> > to reindex shared catalogs like pg_authid.
> 
> Yeah, I was testing that yesterday night and bumped on this case when
> trying do a REINDEX SCHEMA pg_class.  The point is that you can simplify
> the check and remove pg_database_ownercheck as there is already an ACL
> check on the database/system/schema at the top of the routine, so you
> are already sure that pg_database_ownercheck() or
> pg_namespace_ownercheck would return true.  This shaves a few cycles as
> well for each relation checked.

There's a case where the database owner differs from catalogs'
owner. ALTER DATABASE OWNER TO can make such configuration.
In this case, the previous code allows REINDEX of a shared
catalog for the database owner who is not the catalog's owner.

Superuser      : alice
Database owner : joe
Catalog owner  : andy
Schema owner   : joe

"REINDEX SCHERMA <the schema>" by joe allows shared catalog to be
reindexed, even he is *not* the owner of the catalog.

The revised patch behaves differently to this case. "REINDEX
SCHERMA <the schema>" by joe *skips* shared catalog since he is
not the owner of the catalog.

# Uh? Alice doesn't involved anywhere..

I feel that just being a database owner doesn't justify to cause
this problem innocently. Catalog owner is also doubious but we
can carefully configure the ownerships to avoid the problem since
only superuser can change it. So I vote +1 for the revised patch.

> > I also noticed that this patch causes shared relations to be skipped
> > silently.  Perhaps we should emit a WARNING or DEBUG message when this
> > happens, at least for REINDEXOPT_VERBOSE.
> 
> That's intentional.  I thought about that as well, but I am hesitant to
> do so as we don't bother mentioning the other relations skipped.
> REINDEX VERBOSE also shows up what are the tables processed, so it is
> easy to guess what are the tables skipped, still more painful.  And the
> documentation changes added cover the gap.

Even if it is written in the "Notes" section, doesn't the
following need some fix? It is the same for the DATBASE item.

| Parameters
...
| SYSTEM
|   Recreate all indexes on system catalogs within the current
|   database. *Indexes on shared system catalogs are included*.
|   Indexes on user tables are not processed. This form
|   of REINDEX cannot be executed inside a transaction block.

> > I noticed that there is no mention that the owner of a schema can do
> > REINDEX SCHEMA, which seems important to note.  Also, the proposed
> > wording might seem slightly ambiguous for the REINDEX DATABASE case.
> > It might be clearer to say something like the following:
> > 
> >         Reindexing a single index or table requires being the owner of
> >         that index of table.  REINDEX DATABASE and REINDEX SYSTEM
> >         require being the owner of the database, and REINDEX SCHEMA
> >         requires being the owner of the schema (note that the user can
> >         therefore rebuild indexes of tables owned by other users).
> >         Reindexing a shared catalog requires being the owner of the
> >         shared catalog, even if the user is the owner of the specified
> >         database or schema.  Of course, superusers can always reindex
> >         anything.
> 
> +1, I have included your suggestions.  The patch attached applies easily
> down to 9.5 where REINDEX SCHEMA was added.  For 9.4 and 9.3, there is
> no schema case, still the new check is similar.  The commit message is
> slightly changed so as there is no mention of REINDEX SCHEMA.  9.3 needs
> a slight change compared to 9.4 as well.
> 
> So attached are patches for 9.5~master, 9.4 and 9.3 with commit
> messages.  Does that look fine to folks of this thread?

This apparently changes the documented behavior and the problem
seems to be a result of a rather malicious/intentional
combination of operatoins (especially named vacuum on a shared
catalog). I vote -0.5 to backpatch unless we categorize this as a
security issue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

30 July 2018, 13:21:31

On Mon, Jul 30, 2018 at 05:53:54PM +0900, Kyotaro HORIGUCHI wrote:
> I feel that just being a database owner doesn't justify to cause
> this problem innocently. Catalog owner is also doubious but we
> can carefully configure the ownerships to avoid the problem since
> only superuser can change it. So I vote +1 for the revised patch.

Thanks for the review.  Yes that sucks that being just a database or a
schema owner allows such a user to take an exclusive lock on shared
catalogs.  Please note that depending on the order of the relations,
authentication may or may not be blocked depending on what kind of locks
the second session takes.

> | Parameters
> ...
> | SYSTEM
> |   Recreate all indexes on system catalogs within the current
> |   database. *Indexes on shared system catalogs are included*.
> |   Indexes on user tables are not processed. This form
> |   of REINDEX cannot be executed inside a transaction block.

This looks correct to me, shared catalogs are included, and the "notes"
section clealy mentions that being an owner of the shared catalog is
required.

> This apparently changes the documented behavior and the problem
> seems to be a result of a rather malicious/intentional
> combination of operatoins (especially named vacuum on a shared
> catalog). I vote -0.5 to backpatch unless we categorize this as a
> security issue.

Ask that to any vendors doing shared hosting of Postgres :)
A backpatch looks like the correct course of events to me.  Anybody here
is free to express his/her concerns of course.
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denial ofService Attack

From

"Bossart, Nathan"

Date:

30 July 2018, 18:42:55

On 7/29/18, 7:35 PM, "Michael Paquier" <michael@paquier.xyz> wrote:
> Yeah, I was testing that yesterday night and bumped on this case when
> trying do a REINDEX SCHEMA pg_class.  The point is that you can simplify
> the check and remove pg_database_ownercheck as there is already an ACL
> check on the database/system/schema at the top of the routine, so you
> are already sure that pg_database_ownercheck() or
> pg_namespace_ownercheck would return true.  This shaves a few cycles as
> well for each relation checked.

Makes sense.

>> I also noticed that this patch causes shared relations to be skipped
>> silently.  Perhaps we should emit a WARNING or DEBUG message when this
>> happens, at least for REINDEXOPT_VERBOSE.
>
> That's intentional.  I thought about that as well, but I am hesitant to
> do so as we don't bother mentioning the other relations skipped.
> REINDEX VERBOSE also shows up what are the tables processed, so it is
> easy to guess what are the tables skipped, still more painful.  And the
> documentation changes added cover the gap.

Okay, that seems reasonable to me, too.

> +1, I have included your suggestions.  The patch attached applies easily
> down to 9.5 where REINDEX SCHEMA was added.  For 9.4 and 9.3, there is
> no schema case, still the new check is similar.  The commit message is
> slightly changed so as there is no mention of REINDEX SCHEMA.  9.3 needs
> a slight change compared to 9.4 as well.

For 9.3 and 9.4, it might be nice to add the "even if the user is the
owner of the specified database" part, too.

> So attached are patches for 9.5~master, 9.4 and 9.3 with commit
> messages.  Does that look fine to folks of this thread?

Looks good to me.  Since REINDEX can be used to block calls to
load_critical_index() from new connections, back-patching seems appropriate.

Nathan

Re: BUG #15182: Canceling authentication due to timeout aka Denial ofService Attack

From

Robert Haas

Date:

31 July 2018, 17:34:57

On Mon, Jul 30, 2018 at 6:21 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Jul 30, 2018 at 05:53:54PM +0900, Kyotaro HORIGUCHI wrote:
>> I feel that just being a database owner doesn't justify to cause
>> this problem innocently. Catalog owner is also doubious but we
>> can carefully configure the ownerships to avoid the problem since
>> only superuser can change it. So I vote +1 for the revised patch.
>
> Thanks for the review.  Yes that sucks that being just a database or a
> schema owner allows such a user to take an exclusive lock on shared
> catalogs.  Please note that depending on the order of the relations,
> authentication may or may not be blocked depending on what kind of locks
> the second session takes.

Just as a general statement, I don't think we should, as part of a
patch for the issue discussed on this thread, make any changes AT ALL
to who has permission to perform which operations.  We *certainly*
should not back-patch such changes, but we really also should not make
them on master unless they are discussed on a separate thread with a
clear subject line and agreed by a clear consensus.

This patch should only be about detecting cases where we lack
permission earlier, before we try to acquire a lock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

31 July 2018, 20:44:52

On Tue, Jul 31, 2018 at 10:34:57AM -0400, Robert Haas wrote:
> Just as a general statement, I don't think we should, as part of a
> patch for the issue discussed on this thread, make any changes AT ALL
> to who has permission to perform which operations.  We *certainly*
> should not back-patch such changes, but we really also should not make
> them on master unless they are discussed on a separate thread with a
> clear subject line and agreed by a clear consensus.

Well, perhaps spawning one thread for each command would make the most
sense then?  I am feeling a bit of confusion here.  There have been
three cases discussed up to now:
1) TRUNCATE, which results in a patch that has no behavioral changes.
2) VACUUM [FULL], where we are also heading toward a patch that has no
behavioral change per the last arguments exchanged, where we make sure
that permission checks are done before acquiring any locks.
3) REINDEX, where a database or a schema owner is able to take an
exclusive lock on a shared catalog with limited permissions.  That
sucks as this block calls to load_critical_index, but I would be ready
to buy to not back-patch such a change if the consensus reached is to
not skip shared catalogs if a database/schema owner has no ownership on
the shared catalog directly.

> This patch should only be about detecting cases where we lack
> permission earlier, before we try to acquire a lock.

Yeah, the REINDEX case is the only one among the three where the set of
relations worked on changes.  Please note that ownership checks happen
in this case for indexes, tables, database and schemas before taking a
lock on them.  Databases/systems and schemas just check for ownership of
respectively the database and the schema.

I'll be happy to create one thread per patch if that helps.  Thinking
about it this would attract more attention to each individual problem
reported.
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denial ofService Attack

From

ahsan hadi

Date:

01 August 2018, 13:55:11

Hi Michael,

Can you rebase the reindex-priv-93.patch, it is getting some hunk failures.

patching file doc/src/sgml/ref/reindex.sgml
Hunk #1 succeeded at 227 (offset 13 lines).
patching file src/backend/commands/indexcmds.c
Hunk #1 FAILED at 1873.
1 out of 1 hunk FAILED -- saving rejects to file src/backend/commands/indexcmds.c.rej

Thanks.

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

02 August 2018, 22:07:13

On Wed, Aug 01, 2018 at 10:55:11AM +0000, Ahsan Hadi wrote:
> Can you rebase the reindex-priv-93.patch, it is getting some hunk failures.

Patch reindex-priv-93.patch is for REL9_3_STABLE, where it applies
cleanly for me.  reindex-priv-94.patch is for REL9_4_STABLE and
reindex-priv-95-master.patch is for 9.5~master.
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

11 August 2018, 12:50:03

Hi Nathan,

On Fri, Jul 27, 2018 at 02:40:42PM +0000, Bossart, Nathan wrote:
> I think I'm essentially suggesting what you have in 0002 but without
> the new RangeVarGetRelidExtended() callback.  I've attached a modified
> version of 0002 that seems to fix the originally reported issue.  (I
> haven't looked into any extra handling needed for ANALYZE or
> partitioned tables.)  Running the same checks for all VACUUMs would
> keep things simple and provide a more uniform user experience.

I have been looking at this patch for VACUUM (perhaps we ought to spawn
a new thread as the cases of REINDEX and TRUNCATE have been addressed),
and I can see a problem with this approach.  If multiple transactions
are used with a list of relations vacuum'ed then simply moving around
the ACL checks would cause a new problem: if the relation vacuum'ed
disappears when processing it with vacuum_rel() after the list of
relations is built then we would get an ERROR instead of a skip because
of pg_class_ownercheck() which is not fail-safe.

Wouldn't it be enough to check just the ACLs in expand_vacuum_rel() and
get_all_vacuum_rels()?  The first is rather similar to what happens for
TRUNCATE, and the second is similar to what has been fixed for VACUUM.
We should also remove the relkind checks out of vacuum_skip_rel() as
those checks are not completely consistent for all those code paths...
--
Michael

Attachment

signature.asc

Re: BUG #15182: Canceling authentication due to timeout aka Denialof Service Attack

From

Michael Paquier

Date:

29 August 2018, 04:34:41

Hi all,

Here is a summary of what has happened since this thread has been
created.  Three problems reported on this thread have been solved and
resulted in different commits for early lock lookups:
- VACUUM FULL, patched on 12~:
https://www.postgresql.org/message-id/20180812222142.GA6097@paquier.xyz
Commit a556549: Improve VACUUM and ANALYZE by avoiding early lock queue
- TRUNCATE, patched on 12~:
https://www.postgresql.org/message-id/20180806165816.GA19883@paquier.xyz
Commit f841ceb: Improve TRUNCATE by avoiding early lock queue
- REINDEX, patched on 11~:
https://www.postgresql.org/message-id/20180805211059.GA2185@paquier.xyz
Commit 661dd23: Restrict access to reindex of shared catalogs for
non-privileged users

Please note that I have been very conservative with the different fixes
as v11 is getting very close to release.  The patch for REINDEX is a
behavior change which will not get further down anyway.  It would still
be nice to get a second lookup at the code and look if there are other
suspicious calls of relation_open or such which could allow
non-privileged users to pile up locks and cause more DOS problems.

Thanks,
--
Michael

Attachment

signature.asc