BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database
Date
Msg-id 17042-cfa218be6e24a03d@postgresql.org
Whole thread Raw
Responses Re: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database  (Kim-Alexander Brodowski <kim.brodowski@iserv.eu>)
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      17042
Logged by:          Kim-Alexander Brodowski
Email address:      kim.brodowski@iserv.eu
PostgreSQL version: 13.3
Operating system:   Debian Bullseye/Buster
Description:

Dear Readers,

We, IServ GmbH, are a German company providing software primarily for the
education sector. We deploy PostgreSQL as our primary database backend on
almost 5,000 machines. Due to an unfortunate bug in our software and under
rare circumstances, a database schema migration script for DAViCal
(https://www.davical.org/) was executed twice in parallel.

Under normal circumstances this shouldn't be a problem, as databases like
PostgreSQL are designed for concurrent reading and writing. However, we have
noticed corruption on some of our PostgreSQL instances. Particularly our
backups using pg_dumpall would fail:

  LC_ALL=C pg_dumpall -l davical -U postgres
  [...]
  pg_dump: error: query failed: ERROR: cache lookup failed for type 243143
  pg_dump: error: query was: SELECT proretset, prosrc, probin,
pg_catalog.pg_get_function_arguments(oid) AS funcargs,
pg_catalog.pg_get_function_identity_arguments(oid) AS funciargs,
pg_catalog.pg_get_function_result(oid) AS funcresult,
array_to_string(protrftypes, ' ') AS protrftypes, prokind, provolatile,
proisstrict, prosecdef, proleakproof, proconfig, procost, prorows,
prosupport, proparallel, (SELECT lanname FROM pg_catalog.pg_language WHERE
oid = prolang) AS lanname FROM pg_catalog.pg_proc WHERE oid =
'243147'::pg_catalog.oid
  pg_dumpall: error: pg_dump failed on database "iserv", exiting

Interestingly, dumping only the actual table contents works fine. The
corruption occurred in the database schema.

Digging through pg_proc looking for the culprit revealed that for varying
user defined functions, the OID referenced in prorettype could not be found
in pg_type. Cross-checking with the database schema showed us that the data
type was indeed still defined in pg_type, but the OID appears to have
changed.

Reproducing the bug:

My co-worker Martin von Wittich came up with a set of minimal SQL statements
to trigger the issue:

bug.sql:

  DROP TYPE foo CASCADE;
  CREATE TYPE foo AS (foo INTEGER);
  
  CREATE OR REPLACE FUNCTION foobar() RETURNS foo AS $$
  BEGIN
    RETURN ROW(1)::foo;
  END
  $$
  LANGUAGE plpgsql;
  
  SELECT foobar();

In order to trigger the issue, we need to run our statements in parallel:

  psql -f test.sql & psql -f test.sql

You might need to run this more than once to trigger the issue. Occasionally
errors will show up.

To check whether we succeeded, the easiest way is to dump the schema of the
database in use:

  pg_dumpall -s

We were able to reproduce this issue on PostgreSQL 13.3, 13.2 and 11.12.

There are a couple of accounts of similar issues coming up in the past, such
as in
https://www.postgresql-archive.org/problem-with-create-function-and-drop-type-td6162498.html.
Therefore, we believe we are not the only ones affected by this issue nor is
this issue particularly new. Nobody appears to have made a connection with
parallel execution of statements though until now.

Workaround:
The database schema migration script unconditionally recreates the functions
in use (CREATE OR REPLACE ...). Unfortunately, this doesn't appear to fix
the issue. Instead, you have to manually DROP and CREATE all affected
functions.

Cause:
We didn't dive deep into PostgreSQL's codebase, but it seems likely, that a
lack of proper locking might be to blame here. While parallel execution of
queries like that is certainly uncommon, we believe the PostgreSQL cluster
should never end up in an inconsistent state.

On behalf of my employer, I'd like to thank the PostgreSQL developers and
community for their work on this project. We heavily rely on PostgreSQL.

Kind regards,
Kim-Alexander Brodowski
IServ GmbH


pgsql-bugs by date:

Previous
From: "Godfrin, Philippe E"
Date:
Subject: RE: [EXTERNAL] Re: CREATE SERVER makes 2 entries in the dictionary
Next
From: Kim-Alexander Brodowski
Date:
Subject: Re: BUG #17042: Concurrent Modifications of PostgreSQL Function Can Corrupt Database