Hi hackers,
While using the pg_enable_data_checksums() feature, I found a likely bug, a race condition in datachecksum_state.c's launcher_exit().
When pg_enable_data_checksums() is called twice before the first launcher starts, two bg workers are registered (the code expects this). The redundant launcher exits early, but it's launcher_exit() callback unconditionally clears the shared launcher_running flag and may call SetDataChecksumsOff() -- even though it never owned the flag.
This allows a third pg_enable_data_checksums() call to launch another launcher concurrently with the first (duplicate work, doubled I/O, spurious warnings). Worse, if the redundant launcher initialized after the winner transitioned to inprogress-on, its exit handler calls SetDataChecksumsOff(), silently aborting the enable operation. (I have not triggered the SetDataChecksumsOff part though calling out ad it can be a likely scenario based on timing of workers)
Reproduced by firing three calls in quick succession:
psql -c "SELECT pg_enable_data_checksums();" &
psql -c "SELECT pg_enable_data_checksums();" &
sleep 0.5
psql -c "SELECT pg_enable_data_checksums();" &
Log shows two launchers processing databases concurrently:
[2093292] LOG: enabling data checksums requested
[2093293] LOG: already running, exiting
[2093299] LOG: enabling data checksums requested -- third launcher admitted
[2093292] LOG: processing database "postgres"
[2093299] LOG: processing database "postgres" -- same DB, concurrently
[2093299] WARNING: cannot set data checksums to "on", current state is not "inprogress-on"
I think the process-local launcher_running flag exists for this purpose and is already used for the worker-kill block, but the flag-clear and state-revert blocks do not use it.
The attached patch returns early from launcher_exit() when the local flag is false. Thoughts?
Regards,
Ayush