[BUG] Race in online checksums launcher_exit() - Mailing list pgsql-hackers

From Ayush Tiwari
Subject [BUG] Race in online checksums launcher_exit()
Date
Msg-id CAJTYsWWg6tFrdMhWs5PkwESTNeeUUsMuY17O4UmPPh771c3stA@mail.gmail.com
Whole thread
Responses Re: [BUG] Race in online checksums launcher_exit()
List pgsql-hackers
Hi hackers,

While using the pg_enable_data_checksums() feature, I found a likely bug, a race condition in  datachecksum_state.c's launcher_exit().

When pg_enable_data_checksums() is called twice before the first launcher starts, two bg workers are registered (the code expects this).  The redundant launcher exits early, but it's launcher_exit() callback unconditionally clears the shared launcher_running flag and may call SetDataChecksumsOff() -- even though it never owned the flag.

This allows a third pg_enable_data_checksums() call to launch another launcher concurrently with the first (duplicate work, doubled I/O, spurious warnings).  Worse, if the redundant launcher initialized after the winner transitioned to inprogress-on, its exit handler calls SetDataChecksumsOff(), silently aborting the enable operation.  (I have not triggered the SetDataChecksumsOff part though calling out ad it can be a likely scenario based on timing of workers)

Reproduced by firing three calls in quick succession:

  psql -c "SELECT pg_enable_data_checksums();" &
  psql -c "SELECT pg_enable_data_checksums();" &
  sleep 0.5
  psql -c "SELECT pg_enable_data_checksums();" &

Log shows two launchers processing databases concurrently:

  [2093292] LOG:  enabling data checksums requested
  [2093293] LOG:  already running, exiting
  [2093299] LOG:  enabling data checksums requested     -- third launcher admitted
  [2093292] LOG:  processing database "postgres"
  [2093299] LOG:  processing database "postgres"        -- same DB, concurrently
  [2093299] WARNING:  cannot set data checksums to "on", current state is not "inprogress-on"

I think the process-local launcher_running flag exists for this purpose and is already used for the worker-kill block, but the flag-clear and state-revert blocks do not use it.

The attached patch returns early from launcher_exit() when the local flag is false. Thoughts?

Regards,
Ayush
Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: First draft of PG 19 release notes
Next
From: Paul A Jungwirth
Date:
Subject: Re: FOR PORTION OF does not recompute GENERATED STORED columns that depend on the range column