Re: [PATCH] initdb: do not exit after warn_on_mount_point - Mailing list pgsql-hackers

From andrey.arapov@nixaid.com
Subject Re: [PATCH] initdb: do not exit after warn_on_mount_point
Date
Msg-id 21269e094e35a1b32c426408cc0a4408@nixaid.com
Whole thread Raw
In response to Re: [PATCH] initdb: do not exit after warn_on_mount_point  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
September 12, 2022 7:51 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Julien Rouhaud <rjuju123@gmail.com> writes:
>
>> On Sun, Sep 11, 2022 at 06:22:21PM +0000, andrey.arapov@nixaid.com wrote:
>>> Everyone using containerized postgresql image cannot use /var/lib/postgresql
>>> as the mountpoint but has to use /var/lib/postgresql/data instead due to this
>>> issue [4] due to [5].
>>
>> initdb had this behavior for almost 10 years, and for good reasons: it prevents
>> actual problems that were seen in the field.
>
> The long and the short of this is that one person losing their data
> outweighs thousands of people being very mildly inconvenienced by
> having to create an extra level of directory. I understand that you
> don't think so, but you'd change your mind PDQ if it was *your* data
> that got irretrievably corrupted.
>
> We are not going to remove this check.
>
> If anything, the fault I'd find with the existing code is that it's not
> sufficiently thorough about detecting what is a mount point. AFAICS,
> neither the lost+found check nor the extra-dot-files check will trigger
> on modern filesystems such as XFS. I wonder if we could do something
> like comparing the stat(2) st_dev numbers for the proposed data directory
> vs. its parent directory.
>
> regards, tom lane


Thank you for the explanation Tom & Julien!

This is not an issue for the containerized PostgreSQL where the mountpoint is a required dependency without which the
containerwon't even spawn nor start the scripts (such as initdb). 
This also makes it difficult to comprehend the reason initdb exits when it sees PGDATA is a mountpoint it at the first
hand.

That and the "Warn about initdb using mount-points" commit [4] made me thinking that this was a typo.

But I do understand that having it exiting on detecting PGDATA being a mountpoint have saved people from having their
DBcorrupted is a good enough reason for keeping thing as they are now. 


== Summarizing the issue ==

The problem is that PostgreSQL will irrecoverably corrupt when the `PGDATA` mountpoint gets accidentally unmounted
whilethe DB is up. [1] 

The `PGDATA` can't just accidentally get unmounted due to the files being locked by a process there, unless there is a
timewindow between the DB locking/unlocking the data in it of course. 
Or unless someone forcibly unmounts it.

(I'd expect the DB to detect this & immediately terminate with a fatal error)

Or more likely is when `PGDATA` directory gets mounted back again _while_ PostgreSQL is running with a newly
initializedcluster in the empty PGDATA directory (say, due to start-scripts running `initdb` or there was an old
instanceof a previously initialized cluster in the underlying PGDATA directory [with mountpoint being unmounted]), and
thengets closed which in turn causes it writing the wrong data into the correct pg_control file. 


== The solution (partial) ==

The solution to this was built into the `initdb` [4] - which makes sure it fails when it finds the `PGDATA` is a
mountpoint.
This is currently achieved by `initdb` finding the `lost+found` directory in PGDATA, which is not a robust solution,
sincethere are filesystems that do not place that file onto the mountpoint such as XFS, BTRFS, eCryptfs, ..., or one
canerase the `lost+found` directory. 
Hence, this is only a partial solution to the problem and looks to be more of a workaround nature IMHO.


== The recommendation ==

The recommendation is to mount an upper directory instead of PGDATA directly (one above the PGDATA, e.g.
`$PGDATA/../.`)to make sure the DB or a certain* PostgreSQL start script will fail due to a missing `PGDATA` directory
insteadof running the `initdb` to re-init the cluster. 

== The reason ==

This is because, as already mentioned before, running the `initdb` can lead to a situation where one can irrecoverably
corruptthe DB by mounting the original PGDATA backing device back over again and stopping/re-starting the DB which will
irrecoverablyoverwrite the good pg_control file. 
(* - I assume that on some systems the postgresql start-script runs `initdb` when detects PGDATA directory is empty
beforestarting the DB; like in the postgresql docker container [2] (by checking the presence of `$PGDATA/PG_VERSION`).
Andthe default postgresql start script [3] runs the DB directly without issuing `initdb`.) 


== Next steps? ==

I'm wondering whether that's the same issue for mariadb / mysql ...

And whether there could be a better way to handle this issue and protect the PostgreSQL from corrupting the DB in the
eventthe PGDATA gets accidentally unmounted / re-mounted leading the the issue described above. 
Solving this would allow people to actually mount the PGDATA directly, without the need to fix the `initdb`'s
mountpointdetection for filesystems without `lost+found` directory. 


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1247477#c1
[2] https://github.com/docker-library/postgres/blob/74e51d102aede/docker-entrypoint.sh#L317
[3] https://github.com/postgres/postgres/blob/REL_14_5/contrib/start-scripts/linux#L94
[4] https://github.com/postgres/postgres/commit/17f15239325a88581bb4f9cf91d38005f1f52d69


Kind regards,
Andrey Arapov



pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: Introduce wait_for_subscription_sync for TAP tests
Next
From: "kuroda.hayato@fujitsu.com"
Date:
Subject: RE: Perform streaming logical transactions by background workers and parallel apply