Thread: [PATCH] initdb: do not exit after warn_on_mount_point

[PATCH] initdb: do not exit after warn_on_mount_point

From
andrey.arapov@nixaid.com
Date:
Hello,

please find my first patch for PostgreSQL is attached.


Kind regards,
Andrey Arapov
Attachment

Re: [PATCH] initdb: do not exit after warn_on_mount_point

From
Tom Lane
Date:
andrey.arapov@nixaid.com writes:
> please find my first patch for PostgreSQL is attached.

You haven't explained why you think this would be a good
change, or even a safe one.

            regards, tom lane



Re: [PATCH] initdb: do not exit after warn_on_mount_point

From
andrey.arapov@nixaid.com
Date:
Hi Tom,

I've updated the patch by adding the explanation behind and more comments. (please see the attachment)

Have slightly improved the logic so that it does not report an error
"directory \"%s\" exists but is not empty"
when it is only supposed to warn the user about the mountpoint, without exiting.

To me, my patch looks like a typo fix where exit(1) should not be called on the warn_on_mount_point(),
but only warn and continue as more people are mounting the device at `/var/lib/postgresql/data` (PGDATA) in the
containerizedworld (K8s deployments, especially now in the Akash Network I am working for) for making sure their data
persist.

As a workaround, we either have to `rmdir /var/lib/postgresql/data/lost+found` before running `docker-entrypoint.sh
postgres`which in turn calls the `initdb`, or, alternatively we have to pass
`PGDATA=/var/lib/postgresql/data/<something>`while mounting persistent storage over `/var/lib/postgresql/data` path so
thatit won't exit on the very first run. 
To me, this in itself is an odd behavior, which led me to finding this typo (from my point of view) to which I've made
thispatch. 


Please let me know if it makes sense or requires more information / explanation.


Kind regards,
Andrey Arapov


September 10, 2022 5:10 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> andrey.arapov@nixaid.com writes:
>
>> please find my first patch for PostgreSQL is attached.
>
> You haven't explained why you think this would be a good
> change, or even a safe one.
>
> regards, tom lane

Attachment

Re: [PATCH] initdb: do not exit after warn_on_mount_point

From
Julien Rouhaud
Date:
Hi,

On Sat, Sep 10, 2022 at 07:14:59PM +0000, andrey.arapov@nixaid.com wrote:
>
> Have slightly improved the logic so that it does not report an error
> "directory \"%s\" exists but is not empty"
> when it is only supposed to warn the user about the mountpoint, without
> exiting.
>
> To me, my patch looks like a typo fix where exit(1) should not be called on
> the warn_on_mount_point(), but only warn and continue as more people are
> mounting the device at `/var/lib/postgresql/data` (PGDATA) in the
> containerized world (K8s deployments, especially now in the Akash Network I
> am working for) for making sure their data persist.

This definitely isn't a typo, as it's really strongly discouraged to use a
mount point for the data directory.  You can refer to this thread [1] for more
explanations.

[1] https://www.postgresql.org/message-id/flat/CAKoxK%2B6H40imynM5P31bf0DnpN-5f5zeROjcaj6BKVAjxdEA9w%40mail.gmail.com



Re: [PATCH] initdb: do not exit after warn_on_mount_point

From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> On Sat, Sep 10, 2022 at 07:14:59PM +0000, andrey.arapov@nixaid.com wrote:
>> Have slightly improved the logic so that it does not report an error
>> "directory \"%s\" exists but is not empty"
>> when it is only supposed to warn the user about the mountpoint, without
>> exiting.

> This definitely isn't a typo, as it's really strongly discouraged to use a
> mount point for the data directory.

Absolutely.  I think maybe the problem here is that warn_on_mount_point()
is a pretty bad name for that helper function, as this is not "just a
warning".

            regards, tom lane



Re: [PATCH] initdb: do not exit after warn_on_mount_point

From
andrey.arapov@nixaid.com
Date:
September 11, 2022 12:18 PM, "Julien Rouhaud" <rjuju123@gmail.com> wrote:

> Hi,
>
> On Sat, Sep 10, 2022 at 07:14:59PM +0000, andrey.arapov@nixaid.com wrote:
>
>> Have slightly improved the logic so that it does not report an error
>> "directory \"%s\" exists but is not empty"
>> when it is only supposed to warn the user about the mountpoint, without
>> exiting.
>>
>> To me, my patch looks like a typo fix where exit(1) should not be called on
>> the warn_on_mount_point(), but only warn and continue as more people are
>> mounting the device at `/var/lib/postgresql/data` (PGDATA) in the
>> containerized world (K8s deployments, especially now in the Akash Network I
>> am working for) for making sure their data persist.
>
> This definitely isn't a typo, as it's really strongly discouraged to use a
> mount point for the data directory. You can refer to this thread [1] for more
> explanations.
>
> [1]
> https://www.postgresql.org/message-id/flat/CAKoxK+6H40imynM5P31bf0DnpN-5f5zeROjcaj6BKVAjxdEA9w@mail.
> mail.com


I've read the "why not using a mountpoint as PGDATA?" thread [1] as well as Bugzilla "postgresql-setup fails when
/var/lib/pgsql/datais mount point" thead [2] but could not find any good reason why not to mount the PGDATA directly, 
except probably for the NFS mount point, but who does that anyway?
And using NFS for PostgreSQL PGDATA is a way for finding problems starting with poor performance ending up with
corruptedDB. lol 

The only point that hooked my attention was the pg_upgrade, but then I've tried pg_upgrade'ing postgresql:13 to
postgresql:14,everything went without issues. I wrote a step-by-step doc for that purpose [3]. 

That leaves me unconvinced on why would `initdb` quit when detecting PGDATA being a mountpoint.

Everyone using containerized postgresql image cannot use /var/lib/postgresql as the mountpoint but has to use
/var/lib/postgresql/datainstead due to this issue [4] due to [5]. 
Hence, everyone using containerized version of postgresql with the device (say Ceph's RBD) mounted over
/var/lib/postgresql/datadirectory has to either specify: 

- PGDATA=/var/lib/postgresql/data/<some-dir>

OR

make sure to remove $PGDATA/lost+found directory.

Both of these hacks are only for the initdb to fail detect the mountpoint which, on its own, is supposed to be only a
warningwhich is seen from the function name warn_on_mount_point() and its messages [6] look to be of only advisory
natureto me. 

I would understand initdb to exit if it detects the mountpoint being is NFS, otherwise, there is no reason for not
lettingone use the mountpoint over PGDATA as neither it breaks the pg_upgrade functionality as shown in this message. 


[1] https://www.postgresql.org/message-id/flat/CAKoxK+6H40imynM5P31bf0DnpN-5f5zeROjcaj6BKVAjxdEA9w@mail.gmail.com
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1247477
[3] https://gist.github.com/andy108369/aa3bf1707054542f2fa944f6d39aef64
[4] https://github.com/docker-library/postgres/issues/404#issuecomment-773755801
[5]
https://github.com/docker-library/postgres/blob/3c20b7bdb915ecb1648fb468ab53080c58bb1716/Dockerfile-debian.template#L184
[6]
https://github.com/postgres/postgres/blob/7fed801135bae14d63b11ee4a10f6083767046d8/src/bin/initdb/initdb.c#L2616-L2626


Kind regards,
Andrey Arapov



Re: [PATCH] initdb: do not exit after warn_on_mount_point

From
Julien Rouhaud
Date:
On Sun, Sep 11, 2022 at 06:22:21PM +0000, andrey.arapov@nixaid.com wrote:
> September 11, 2022 12:18 PM, "Julien Rouhaud" <rjuju123@gmail.com> wrote:
>
> > Hi,
> >
> > On Sat, Sep 10, 2022 at 07:14:59PM +0000, andrey.arapov@nixaid.com wrote:
> >
> >> Have slightly improved the logic so that it does not report an error
> >> "directory \"%s\" exists but is not empty"
> >> when it is only supposed to warn the user about the mountpoint, without
> >> exiting.
> >>
> >> To me, my patch looks like a typo fix where exit(1) should not be called on
> >> the warn_on_mount_point(), but only warn and continue as more people are
> >> mounting the device at `/var/lib/postgresql/data` (PGDATA) in the
> >> containerized world (K8s deployments, especially now in the Akash Network I
> >> am working for) for making sure their data persist.
> >
> > This definitely isn't a typo, as it's really strongly discouraged to use a
> > mount point for the data directory. You can refer to this thread [1] for more
> > explanations.
> >
> > [1]
> > https://www.postgresql.org/message-id/flat/CAKoxK+6H40imynM5P31bf0DnpN-5f5zeROjcaj6BKVAjxdEA9w@mail.
> > mail.com
>
>
> I've read the "why not using a mountpoint as PGDATA?" thread [1] as well as
> Bugzilla "postgresql-setup fails when /var/lib/pgsql/data is mount point"
> thead [2] but could not find any good reason why not to mount the PGDATA
> directly, except probably for the NFS mount point, but who does that anyway?

What about this part in Tom's original answer:

3. If, some day, the filesystem is accidentally unmounted while the database is
up, it will continue to write into files that are now getting placed in the
mount-point directory on the parent volume.  This usually results in an
unrecoverably messed-up database by the time you realize what's going wrong.
(There are horror stories about such cases in the PG community mailing list
archives, dating from before we installed the don't-use-a-mount-point defenses
in initdb.)

> Everyone using containerized postgresql image cannot use /var/lib/postgresql
> as the mountpoint but has to use /var/lib/postgresql/data instead due to this
> issue [4] due to [5].  Hence, everyone using containerized version of
> postgresql with the device (say Ceph's RBD) mounted over
> /var/lib/postgresql/data directory has to either specify:
>
> - PGDATA=/var/lib/postgresql/data/<some-dir>
>
> OR
>
> make sure to remove $PGDATA/lost+found directory.

initdb had this behavior for almost 10 years, and for good reasons: it prevents
actual problems that were seen in the field.

It's unfortunate that the docker postgres images didn't take that behavior into
account, which was introduced more than a year before any work started on those
images, but if you're not happy with those workarounds it's something that
should be changed in the docker images.

> Both of these hacks are only for the initdb to fail detect the mountpoint
> which, on its own, is supposed to be only a warning which is seen from the
> function name warn_on_mount_point() and its messages [6] look to be of only
> advisory nature to me.

As Tom confirmed at [1], you shouldn't assume anything about the criticality
based on the function name.  If anything, the "warn" part is only saying that
the function itself won't exit(1).  And yes this function is only adding
information on the reason why the given directory can't be used, but it doesn't
change the fact that the given directory can't be used.

[1] https://www.postgresql.org/message-id/813162.1662908002@sss.pgh.pa.us



Re: [PATCH] initdb: do not exit after warn_on_mount_point

From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> On Sun, Sep 11, 2022 at 06:22:21PM +0000, andrey.arapov@nixaid.com wrote:
>> Everyone using containerized postgresql image cannot use /var/lib/postgresql
>> as the mountpoint but has to use /var/lib/postgresql/data instead due to this
>> issue [4] due to [5].

> initdb had this behavior for almost 10 years, and for good reasons: it prevents
> actual problems that were seen in the field.

The long and the short of this is that one person losing their data
outweighs thousands of people being very mildly inconvenienced by
having to create an extra level of directory.  I understand that you
don't think so, but you'd change your mind PDQ if it was *your* data
that got irretrievably corrupted.

We are not going to remove this check.

If anything, the fault I'd find with the existing code is that it's not
sufficiently thorough about detecting what is a mount point.  AFAICS,
neither the lost+found check nor the extra-dot-files check will trigger
on modern filesystems such as XFS.  I wonder if we could do something
like comparing the stat(2) st_dev numbers for the proposed data directory
vs. its parent directory.

            regards, tom lane



Re: [PATCH] initdb: do not exit after warn_on_mount_point

From
andrey.arapov@nixaid.com
Date:
September 12, 2022 7:51 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Julien Rouhaud <rjuju123@gmail.com> writes:
>
>> On Sun, Sep 11, 2022 at 06:22:21PM +0000, andrey.arapov@nixaid.com wrote:
>>> Everyone using containerized postgresql image cannot use /var/lib/postgresql
>>> as the mountpoint but has to use /var/lib/postgresql/data instead due to this
>>> issue [4] due to [5].
>>
>> initdb had this behavior for almost 10 years, and for good reasons: it prevents
>> actual problems that were seen in the field.
>
> The long and the short of this is that one person losing their data
> outweighs thousands of people being very mildly inconvenienced by
> having to create an extra level of directory. I understand that you
> don't think so, but you'd change your mind PDQ if it was *your* data
> that got irretrievably corrupted.
>
> We are not going to remove this check.
>
> If anything, the fault I'd find with the existing code is that it's not
> sufficiently thorough about detecting what is a mount point. AFAICS,
> neither the lost+found check nor the extra-dot-files check will trigger
> on modern filesystems such as XFS. I wonder if we could do something
> like comparing the stat(2) st_dev numbers for the proposed data directory
> vs. its parent directory.
>
> regards, tom lane


Thank you for the explanation Tom & Julien!

This is not an issue for the containerized PostgreSQL where the mountpoint is a required dependency without which the
containerwon't even spawn nor start the scripts (such as initdb). 
This also makes it difficult to comprehend the reason initdb exits when it sees PGDATA is a mountpoint it at the first
hand.

That and the "Warn about initdb using mount-points" commit [4] made me thinking that this was a typo.

But I do understand that having it exiting on detecting PGDATA being a mountpoint have saved people from having their
DBcorrupted is a good enough reason for keeping thing as they are now. 


== Summarizing the issue ==

The problem is that PostgreSQL will irrecoverably corrupt when the `PGDATA` mountpoint gets accidentally unmounted
whilethe DB is up. [1] 

The `PGDATA` can't just accidentally get unmounted due to the files being locked by a process there, unless there is a
timewindow between the DB locking/unlocking the data in it of course. 
Or unless someone forcibly unmounts it.

(I'd expect the DB to detect this & immediately terminate with a fatal error)

Or more likely is when `PGDATA` directory gets mounted back again _while_ PostgreSQL is running with a newly
initializedcluster in the empty PGDATA directory (say, due to start-scripts running `initdb` or there was an old
instanceof a previously initialized cluster in the underlying PGDATA directory [with mountpoint being unmounted]), and
thengets closed which in turn causes it writing the wrong data into the correct pg_control file. 


== The solution (partial) ==

The solution to this was built into the `initdb` [4] - which makes sure it fails when it finds the `PGDATA` is a
mountpoint.
This is currently achieved by `initdb` finding the `lost+found` directory in PGDATA, which is not a robust solution,
sincethere are filesystems that do not place that file onto the mountpoint such as XFS, BTRFS, eCryptfs, ..., or one
canerase the `lost+found` directory. 
Hence, this is only a partial solution to the problem and looks to be more of a workaround nature IMHO.


== The recommendation ==

The recommendation is to mount an upper directory instead of PGDATA directly (one above the PGDATA, e.g.
`$PGDATA/../.`)to make sure the DB or a certain* PostgreSQL start script will fail due to a missing `PGDATA` directory
insteadof running the `initdb` to re-init the cluster. 

== The reason ==

This is because, as already mentioned before, running the `initdb` can lead to a situation where one can irrecoverably
corruptthe DB by mounting the original PGDATA backing device back over again and stopping/re-starting the DB which will
irrecoverablyoverwrite the good pg_control file. 
(* - I assume that on some systems the postgresql start-script runs `initdb` when detects PGDATA directory is empty
beforestarting the DB; like in the postgresql docker container [2] (by checking the presence of `$PGDATA/PG_VERSION`).
Andthe default postgresql start script [3] runs the DB directly without issuing `initdb`.) 


== Next steps? ==

I'm wondering whether that's the same issue for mariadb / mysql ...

And whether there could be a better way to handle this issue and protect the PostgreSQL from corrupting the DB in the
eventthe PGDATA gets accidentally unmounted / re-mounted leading the the issue described above. 
Solving this would allow people to actually mount the PGDATA directly, without the need to fix the `initdb`'s
mountpointdetection for filesystems without `lost+found` directory. 


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1247477#c1
[2] https://github.com/docker-library/postgres/blob/74e51d102aede/docker-entrypoint.sh#L317
[3] https://github.com/postgres/postgres/blob/REL_14_5/contrib/start-scripts/linux#L94
[4] https://github.com/postgres/postgres/commit/17f15239325a88581bb4f9cf91d38005f1f52d69


Kind regards,
Andrey Arapov