Re: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database - Mailing list pgsql-bugs

From Kim-Alexander Brodowski
Subject Re: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database
Date
Msg-id c66357f2-7d83-b94e-b5cd-9acfe0d46391@iserv.eu
Whole thread Raw
In response to BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database  (PG Bug reporting form <noreply@postgresql.org>)
List pgsql-bugs
Of course, you only realize something went wrong once you hit submit:

Please ignore the inconsistencies between file and database names, as 
the logs have been put together from multiple terminal sessions. Context 
should make clear what was meant to be there.

On 01.06.21 22:45, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17042
> Logged by:          Kim-Alexander Brodowski
> Email address:      kim.brodowski@iserv.eu
> PostgreSQL version: 13.3
> Operating system:   Debian Bullseye/Buster
> Description:
>
> Dear Readers,
>
> We, IServ GmbH, are a German company providing software primarily for the
> education sector. We deploy PostgreSQL as our primary database backend on
> almost 5,000 machines. Due to an unfortunate bug in our software and under
> rare circumstances, a database schema migration script for DAViCal
> (https://www.davical.org/) was executed twice in parallel.
>
> Under normal circumstances this shouldn't be a problem, as databases like
> PostgreSQL are designed for concurrent reading and writing. However, we have
> noticed corruption on some of our PostgreSQL instances. Particularly our
> backups using pg_dumpall would fail:
>
>    LC_ALL=C pg_dumpall -l davical -U postgres
>    [...]
>    pg_dump: error: query failed: ERROR: cache lookup failed for type 243143
>    pg_dump: error: query was: SELECT proretset, prosrc, probin,
> pg_catalog.pg_get_function_arguments(oid) AS funcargs,
> pg_catalog.pg_get_function_identity_arguments(oid) AS funciargs,
> pg_catalog.pg_get_function_result(oid) AS funcresult,
> array_to_string(protrftypes, ' ') AS protrftypes, prokind, provolatile,
> proisstrict, prosecdef, proleakproof, proconfig, procost, prorows,
> prosupport, proparallel, (SELECT lanname FROM pg_catalog.pg_language WHERE
> oid = prolang) AS lanname FROM pg_catalog.pg_proc WHERE oid =
> '243147'::pg_catalog.oid
>    pg_dumpall: error: pg_dump failed on database "iserv", exiting
>
> Interestingly, dumping only the actual table contents works fine. The
> corruption occurred in the database schema.
>
> Digging through pg_proc looking for the culprit revealed that for varying
> user defined functions, the OID referenced in prorettype could not be found
> in pg_type. Cross-checking with the database schema showed us that the data
> type was indeed still defined in pg_type, but the OID appears to have
> changed.
>
> Reproducing the bug:
>
> My co-worker Martin von Wittich came up with a set of minimal SQL statements
> to trigger the issue:
>
> bug.sql:
>
>    DROP TYPE foo CASCADE;
>    CREATE TYPE foo AS (foo INTEGER);
>    
>    CREATE OR REPLACE FUNCTION foobar() RETURNS foo AS $$
>    BEGIN
>      RETURN ROW(1)::foo;
>    END
>    $$
>    LANGUAGE plpgsql;
>    
>    SELECT foobar();
>
> In order to trigger the issue, we need to run our statements in parallel:
>
>    psql -f test.sql & psql -f test.sql
>
> You might need to run this more than once to trigger the issue. Occasionally
> errors will show up.
>
> To check whether we succeeded, the easiest way is to dump the schema of the
> database in use:
>
>    pg_dumpall -s
>
> We were able to reproduce this issue on PostgreSQL 13.3, 13.2 and 11.12.
>
> There are a couple of accounts of similar issues coming up in the past, such
> as in
> https://www.postgresql-archive.org/problem-with-create-function-and-drop-type-td6162498.html.
> Therefore, we believe we are not the only ones affected by this issue nor is
> this issue particularly new. Nobody appears to have made a connection with
> parallel execution of statements though until now.
>
> Workaround:
> The database schema migration script unconditionally recreates the functions
> in use (CREATE OR REPLACE ...). Unfortunately, this doesn't appear to fix
> the issue. Instead, you have to manually DROP and CREATE all affected
> functions.
>
> Cause:
> We didn't dive deep into PostgreSQL's codebase, but it seems likely, that a
> lack of proper locking might be to blame here. While parallel execution of
> queries like that is certainly uncommon, we believe the PostgreSQL cluster
> should never end up in an inconsistent state.
>
> On behalf of my employer, I'd like to thank the PostgreSQL developers and
> community for their work on this project. We heavily rely on PostgreSQL.
>
> Kind regards,
> Kim-Alexander Brodowski
> IServ GmbH
>



pgsql-bugs by date:

Previous
From: PG Bug reporting form
Date:
Subject: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database
Next
From: Tom Lane
Date:
Subject: Re: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database