Thread: [PoC] configurable out of disk space elog level

[PoC] configurable out of disk space elog level

From
Maxim Orlov
Date:
Hi!


PROBLEM

Our customer stumble onto the next behaviour of the Postgres cluster: if disk space is exhausted, Postgres continues to
work until WAL can be successfully written. Thus, upon disk space exhaustion, clients will get an “ERROR: could not
extend file “base/XXXXX/XXXXX”: No space left on device” messages and transactions will be aborted. But the cluster
continues to work for a quite some time.

This behaviour of the PostgreSQL, of course, is perfectly legit. Cluster just translate OS error to the user and can do
nothing about it, expecting space may be available later.

On the other hand, users continues to send more data and having more and more transactions to be aborted.

There are next possible ways to diagnose described situation:
 —external monitoring system;
 —log analysis;
 —create/drop table and analyse results.

Each one have advantages and disadvantages. I'm not going to dive deeper here, if you don't mind.

The customer, mentioned above, in this particular case, would be glad to be able to have a mechanism to stop the cluster.
Again, in this concrete case.


PROPOSAL

My proposal is to add a tablespace option in order to be able to configure which behaviour is appropriate for a
particular user. I've decided to call this option “on_no_space” for now. If anyone has a better naming for this feature,
please, report.

So, the idea is to add both GUC and tablespace option “on_no_space”. The tablespace option defines the behaviour of the
cluster for a particular tablespace in “on_no_space” situation. The GUC defines the default value of tablespace option.

Patch is posted as PoC is attached.

Here's what it looks like:
===============================================================================================
== Create 100Mb disk
$ dd if=/dev/zero of=/tmp/foo.img bs=100M count=1
$ mkfs.ext4 /tmp/foo.img
$ mkdir /tmp/foo
$ sudo mount -t ext4 -o loop /tmp/foo.img /tmp/foo
$ sudo chown -R orlov:orlov /tmp/foo
===============================================================================================
== Into psql
postgres=# CREATE TABLESPACE foo LOCATION '/tmp/foo' WITH (on_no_space=fatal);
CREATE TABLESPACE
postgres=# \db+
                                       List of tablespaces
    Name    | Owner | Location | Access privileges |       Options       |  Size   | Description
------------+-------+----------+-------------------+---------------------+---------+-------------
 foo        | orlov | /tmp/foo |                   | {on_no_space=fatal} | 0 bytes |

...

postgres=# CREATE TABLE bar(qux int, quux text) WITH (autovacuum_enabled = false) TABLESPACE foo;
CREATE TABLE
postgres=# INSERT INTO bar(qux, quux) SELECT id, md5(id::text) FROM generate_series(1, 10000000) AS id;
FATAL:  could not extend file "pg_tblspc/16384/PG_16_202211121/5/16385": No space left on device
HINT:  Check free disk space.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
===============================================================================================


CAVEATS

Again, I've posted this patch as a PoC. This is not a complete realization of described functionality. AFAICS, there are
next problems:
 - I have to put get_tablespace_elevel call in RelationGetBufferForTuple in order to tablespace in cache; overwise,
   cache miss in get_tablespace triggers assertion failing in lock.c:887 (Assert("!IsRelationExtensionLockHeld")).
   This assertion was added by commit 15ef6ff4 (see [0] for details).
 - What error should be when mdextend called not to insert a tuple into a heap (WAL applying, for example)?

Maybe, adding just GUC without ability to customize certain tablespaces to define "out of disk space" behaviour is enough?
I would appreciate it if you give your opinions on a subject.

--
Attachment

Re: [PoC] configurable out of disk space elog level

From
Pavel Borisov
Date:
Hi, Maxim!
> My proposal is to add a tablespace option in order to be able to configure which behaviour is appropriate for a
> particular user. I've decided to call this option “on_no_space” for now. If anyone has a better naming for this
feature,
> please, report.
>
> So, the idea is to add both GUC and tablespace option “on_no_space”. The tablespace option defines the behaviour of
the
> cluster for a particular tablespace in “on_no_space” situation. The GUC defines the default value of tablespace
option.

I suppose there can be a kind of attack with this feature i.e.

- If someone already has his own tablespace he can do:
ALTER TABLESPACE my SET on_no_space=fatal; // This needs tablespace
ownership, not superuser permission.
- Then fill up his own db with garbage to fill his tablespace.
- Then all the db cluster will go fatal, even if the other users'
tablespaces are almost free.

If this can be avoided, I think the patch can be useful.

Regards,
Pavel Borisov.



Re: [PoC] configurable out of disk space elog level

From
Andres Freund
Date:
Hi,

On 2022-11-16 15:59:09 +0300, Maxim Orlov wrote:
> Patch is posted as PoC is attached.

> --- a/src/backend/storage/smgr/md.c
> +++ b/src/backend/storage/smgr/md.c
> @@ -40,6 +40,7 @@
>  #include "storage/sync.h"
>  #include "utils/hsearch.h"
>  #include "utils/memutils.h"
> +#include "utils/spccache.h"
>  
>  /*
>   *    The magnetic disk storage manager keeps track of open file
> @@ -479,14 +480,16 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
>  
>      if ((nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos, WAIT_EVENT_DATA_FILE_EXTEND)) != BLCKSZ)
>      {
> +        int elevel = get_tablespace_elevel(reln->smgr_rlocator.locator.spcOid);
> +

You can't do catalog access below the bufmgr.c layer. It could lead to all
kinds of nastiness, including potentially recursing back to md.c. Even leaving
that aside, we can't do catalog accesses in all kinds of environments that
this currently is active in - most importantly it's affecting the startup
process. We don't do catalog accesses in the startup process, and even if we
were to do so, we couldn't unconditionally because the catalog might not even
be consistent at this point (nor is it guaranteed that the wal_level even
allows to access catalogs during recovery).

I'm not convinced by the usecase in the first place, but purely technically I
think it's not viable to make this a tablespace option.

Greetings,

Andres Freund



Re: [PoC] configurable out of disk space elog level

From
Maxim Orlov
Date:
On Wed, 16 Nov 2022 at 20:41, Andres Freund <andres@anarazel.de> wrote:
Hi,
You can't do catalog access below the bufmgr.c layer. It could lead to all
kinds of nastiness, including potentially recursing back to md.c. Even leaving
Yep, this is my biggest concern. It turns out, that the way to make such a feature is to use just GUC for all tablespaces or 
forward elevel "from above".
 
that aside, we can't do catalog accesses in all kinds of environments that
this currently is active in - most importantly it's affecting the startup
process. We don't do catalog accesses in the startup process, and even if we
were to do so, we couldn't unconditionally because the catalog might not even
be consistent at this point (nor is it guaranteed that the wal_level even
allows to access catalogs during recovery).
Yep, that is why I do use in get_tablespace_elevel:
+       /*
+        * Use GUC level only in normal mode.
+        */
+       if (!IsNormalProcessingMode())
+               return ERROR;

Anyway, I appreciate the opinion, thank you!

--
Best regards,
Maxim Orlov.

Re: [PoC] configurable out of disk space elog level

From
Andres Freund
Date:
On 2022-11-17 14:40:02 +0300, Maxim Orlov wrote:
> On Wed, 16 Nov 2022 at 20:41, Andres Freund <andres@anarazel.de> wrote:
> > that aside, we can't do catalog accesses in all kinds of environments that
> > this currently is active in - most importantly it's affecting the startup
> > process. We don't do catalog accesses in the startup process, and even if
> > we
> > were to do so, we couldn't unconditionally because the catalog might not
> > even
> > be consistent at this point (nor is it guaranteed that the wal_level even
> > allows to access catalogs during recovery).
> >
> Yep, that is why I do use in get_tablespace_elevel:
> +       /*
> +        * Use GUC level only in normal mode.
> +        */
> +       if (!IsNormalProcessingMode())
> +               return ERROR;
> 
> Anyway, I appreciate the opinion, thank you!

The startup process is in normal processing mode.



Re: [PoC] configurable out of disk space elog level

From
Robert Haas
Date:
On Wed, Nov 16, 2022 at 7:59 AM Maxim Orlov <orlovmg@gmail.com> wrote:
> The customer, mentioned above, in this particular case, would be glad to be able to have a mechanism to stop the
cluster.
> Again, in this concrete case.
>
> My proposal is to add a tablespace option in order to be able to configure which behaviour is appropriate for a
> particular user. I've decided to call this option “on_no_space” for now. If anyone has a better naming for this
feature,
> please, report.

I don't think this is a good feature to add to PostgreSQL. First, it's
unclear why stopping the cluster is a desirable behavior. It doesn't
stop the user transactions from failing; it just makes them fail in
some other way. Now it is of course perfectly legitimate for a
particular user to want that particular behavior anyway, but there are
a bunch of other things that a user could equally legitimately want to
do, like page the DBA or trigger a failover or kill off sessions that
are using large temporary relations or whatever. And, equally, there
are many other operating system errors to which a user could want the
database system to respond in similar ways. For example, someone might
want any given one of those treatments when an I/O error occurs
writing to the data directory, or a read-only filesystem error, or a
permission denied error.

Having a switch for one particular kind of error (out of many that
could possibly occur) that triggers one particular coping strategy
(out of many that could possibly be used) seems far too specific a
thing to add as a core feature. And even if we had something much more
general, I'm not sure why that should go into the database rather than
being implemented outside it. After all, nothing at all prevents the
user from scanning the database logs for "out of space" errors and
shutting down the database if any are found. While you're at it, you
could make your monitoring script also check the free space on the
relevant partition using statfs() and page somebody if the utilization
goes above 95% or whatever threshold you like, which would probably
avoid service outages much more effectively than $SUBJECT.

I just can't see much real benefit in putting this logic inside the database.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [PoC] configurable out of disk space elog level

From
Maxim Orlov
Date:

I don't think this is a good feature to add to PostgreSQL. First, it's
unclear why stopping the cluster is a desirable behavior. It doesn't
stop the user transactions from failing; it just makes them fail in
some other way. Now it is of course perfectly legitimate for a
particular user to want that particular behavior anyway, but there are
a bunch of other things that a user could equally legitimately want to
do, like page the DBA or trigger a failover or kill off sessions that
are using large temporary relations or whatever. And, equally, there
are many other operating system errors to which a user could want the
database system to respond in similar ways. For example, someone might
want any given one of those treatments when an I/O error occurs
writing to the data directory, or a read-only filesystem error, or a
permission denied error.

Having a switch for one particular kind of error (out of many that
could possibly occur) that triggers one particular coping strategy
(out of many that could possibly be used) seems far too specific a
thing to add as a core feature. And even if we had something much more
general, I'm not sure why that should go into the database rather than
being implemented outside it. After all, nothing at all prevents the
user from scanning the database logs for "out of space" errors and
shutting down the database if any are found.
 
Yes, you are right. Actually, this is one of possible ways to deal with described situation I
mentioned above. And if I would deal with such a task, I would make it via log monitoring.
The question was: "could we be more general here?". Probably, not.
 
While you're at it, you
could make your monitoring script also check the free space on the
relevant partition using statfs() and page somebody if the utilization
goes above 95% or whatever threshold you like, which would probably
avoid service outages much more effectively than $SUBJECT.

I just can't see much real benefit in putting this logic inside the database.

OK, I got it. Thanks for your thoughts!

--
Best regards,
Maxim Orlov.

Re: [PoC] configurable out of disk space elog level

From
Greg Stark
Date:
On Thu, 17 Nov 2022 at 14:56, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Having a switch for one particular kind of error (out of many that
> could possibly occur) that triggers one particular coping strategy
> (out of many that could possibly be used) seems far too specific a
> thing to add as a core feature. And even if we had something much more
> general, I'm not sure why that should go into the database rather than
> being implemented outside it. After all, nothing at all prevents the
> user from scanning the database logs for "out of space" errors and
> shutting down the database if any are found. While you're at it, you
> could make your monitoring script also check the free space on the
> relevant partition using statfs() and page somebody if the utilization
> goes above 95% or whatever threshold you like, which would probably
> avoid service outages much more effectively than $SUBJECT.

I have often thought we report a lot of errors to the user as
transaction errors that database admins are often going to feel they
would rather treat as system-wide errors. Often the error the user
sees seems like a very low level error with no context that they can't
do anythign about. This seems to be an example of that.

I don't really have a good solution for it but I do think most users
would rather deal with these errors at a higher level than individual
queries from individual clients. Out of disk space, hardware errors,
out of memory, etc they would rather handle in one centralized place
as a global condition. You can work around that with
middleware/libraries/monitoring but it's kind of working around the
database giving you the information at the wrong time and place for
your needs.