BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack - Mailing list pgsql-bugs

From: PG Bug reporting form
Subject: BUG #15182: Canceling authentication due to timeout aka Denial of Service Attack
Msg-id: 152512087100.19803.12733865831237526317@wrigleys.postgresql.org
List: pgsql-bugs

The following bug has been logged on the website:

Bug reference:      15182
Logged by:          Lloyd Albin
Email address:      lalbin@scharp.org
PostgreSQL version: 10.3
Operating system:   OpenSUSE
Description:

Over the last several weeks our developers accidentally caused a denial of
service against our own server. Looking at the log files, I noticed
authentication timeouts during these periods. In researching the problem I
found that they are due to locks being held on shared system catalogs, that
is, catalogs that are shared between all databases on the same
cluster/server. The problem occurs when one connection begins a long-running
transaction that queries pg_stat_activity, pg_roles, pg_database, etc., and
another connection then runs REINDEX DATABASE, REINDEX SYSTEM, or VACUUM
FULL.

This issue is of particular importance to database resellers who use the
same cluster/server for multiple clients: two clients can trigger it
inadvertently between them, and a single client can trigger it either
maliciously or inadvertently. Note: the large cloud providers give each
client its own cluster/server, so the issue cannot spread from one cloud
client to another, but it can still affect an individual client. Traditional
hosting companies, however, have all clients from one or more web servers
share the same PostgreSQL cluster/server. This means that one or two clients
could inadvertently stop all other clients from connecting to their
databases until the first client issues a COMMIT or ROLLBACK, and that
transaction could be held open for hours, which is what happened to us
internally.

In Connection 1 we need to BEGIN a transaction and then query either a
shared system catalog (pg_authid, pg_database, etc.) or a view that depends
on one (pg_stat_activity, pg_roles, etc.). Our developers were accessing
pg_roles.

Connection 1 (Any database, Any User)
BEGIN;
SELECT * FROM pg_stat_activity;

Connection 2 (Any database will do as long as you are the database owner)
REINDEX DATABASE postgres;

Connection 3 (Any Database, Any User)
psql -h sqltest-alt -d sandbox

Every new connection attempt like Connection 3 will hang for as long as the
transaction in Connection 1 stays open. In our case this was hours, and it
denied everybody else the ability to log into the server until Connection 1
was committed. psql will just hang for hours, even overnight in my testing,
but our apps would get the "Canceling authentication due to timeout" error
after 1 minute.
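
The blocking chain described above can be inspected from a session that is already connected, since new logins hang while the problem is occurring. The following is a minimal sketch rather than a definitive diagnostic. It assumes the standard pg_locks and pg_class columns in PostgreSQL 10, and it deliberately avoids pg_stat_activity, because that view reads pg_authid and pg_database and could queue behind the same locks:

```sql
-- List lock requests on shared catalogs and their indexes.
-- pg_locks.database is 0 for shared objects; pg_class.relisshared
-- marks relations shared across all databases in the cluster.
SELECT l.pid,
       c.relname,
       l.mode,
       l.granted
FROM pg_locks AS l
JOIN pg_class AS c ON c.oid = l.relation
WHERE l.locktype = 'relation'
  AND l.database = 0
  AND c.relisshared
ORDER BY c.relname, l.granted DESC, l.pid;
```

Rows with granted = true and mode AccessShareLock would correspond to holders such as Connection 1. Rows with granted = false would be the waiters, that is, the maintenance command in Connection 2 and any logins stuck behind it. Run it outside an explicit transaction so that it does not hold locks of its own longer than necessary.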

Connection 2 can also cause the same issue by running any of these
commands:
REINDEX SYSTEM postgres;
VACUUM FULL pg_authid;
vacuumdb -f -h sqltest-alt -d lloyd -U lalbin

Even worse, VACUUM FULL pg_authid; can be started by an unprivileged user.
It waits for the AccessShareLock held by Connection 1 to be released before
returning the error that you don't have permission to perform this action,
so even an unprivileged user can cause this to happen. The privilege check
needs to happen before waiting for the AccessExclusiveLock.
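
For the inadvertent case, one possible stopgap on the Connection 2 side is to bound how long the maintenance command may wait for a lock. This is a minimal sketch, assuming the standard lock_timeout setting (available since PostgreSQL 9.3). The 5s value is purely illustrative:

```sql
-- Abort the statement if any single lock cannot be acquired within
-- 5 seconds, instead of waiting indefinitely behind Connection 1.
-- (The value is an illustrative assumption, not a recommendation.)
SET lock_timeout = '5s';
REINDEX DATABASE postgres;
-- Restore the session default afterwards.
RESET lock_timeout;
```

With the timeout in place the command fails with a lock timeout error and drops out of the lock queue, so waiting logins can proceed. This does nothing against a user who deliberately omits the setting, so it does not replace moving the privilege check ahead of the lock wait.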

This bug report has been simplified and shortened drastically. To read the
full information about this issue, please see my blog post:
http://lloyd.thealbins.com/Canceling%20authentication%20due%20to%20timeout

Lloyd Albin
Database Administrator
Statistical Center for HIV/AIDS Research and Prevention (SCHARP)
Fred Hutchinson Cancer Research Center

