Thread: Is it OK to ignore directory open failure in ResetUnloggedRelations?

From

Tom Lane

Date:

05 December 2017, 02:15:08

While working through Michael Paquier's patch to clean up inconsistent
usage of AllocateDir(), I noticed that ResetUnloggedRelations and its
subroutines are not consistent about whether a directory open failure
results in erroring out or just emitting a LOG message and continuing.
ResetUnloggedRelations itself throws a hard error if it fails to open
pg_tblspc, but all the rest of reinit.c thinks a LOG message is
sufficient.

My first thought was to change ResetUnloggedRelations to match the
rest, but on reflection I'm less sure about that.  What we've got
at the moment is that a possibly-transient directory open failure
can result in failure to reset an unlogged relation to empty,
which to me amounts to data corruption.  If the contents of the
unlogged relation are inconsistent, which is plenty likely after
a crash, we could end up crashing later because of that; and in
any case the user would not see what they expect in the tables.

So now I'm thinking we should do the reverse and change these functions
to give a hard error on AllocateDir failure.  That would result in
startup-process failure if we are unable to scan the database, which is
not great, but there's certainly something badly wrong if we can't.

Thoughts?

            regards, tom lane
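
The two behaviours contrasted in the message above can be illustrated with a small standalone sketch. It uses plain POSIX opendir()/readdir() rather than PostgreSQL's AllocateDir()/ReadDir(), and the names scan_dir, ON_FAIL_LOG_AND_CONTINUE and ON_FAIL_HARD_ERROR are hypothetical, chosen only for this illustration; it is not the code in reinit.c. The point it tries to show is that under the "log and continue" policy a directory that could not be opened looks the same to the caller as an empty one, so any per-entry work is silently skipped.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical policy names for this sketch; not PostgreSQL identifiers. */
typedef enum
{
    ON_FAIL_LOG_AND_CONTINUE,   /* report the failure, then carry on */
    ON_FAIL_HARD_ERROR          /* report the failure and make the caller stop */
} open_fail_policy;

/*
 * Count the entries of a directory, skipping "." and "..".
 *
 * On success, returns the number of entries (>= 0).
 * If the directory cannot be opened:
 *   - ON_FAIL_LOG_AND_CONTINUE returns 0, which the caller cannot tell
 *     apart from a directory that really is empty;
 *   - ON_FAIL_HARD_ERROR returns -1 so the caller can abort.
 */
int
scan_dir(const char *path, open_fail_policy policy)
{
    DIR        *dir;
    struct dirent *de;
    int         nentries = 0;

    assert(path != NULL);

    dir = opendir(path);
    if (dir == NULL)
    {
        fprintf(stderr, "%s: could not open directory \"%s\": %s\n",
                policy == ON_FAIL_HARD_ERROR ? "ERROR" : "LOG",
                path, strerror(errno));
        return policy == ON_FAIL_HARD_ERROR ? -1 : 0;
    }

    while ((de = readdir(dir)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        nentries++;             /* a real caller would process the entry here */
    }

    closedir(dir);
    return nentries;
}
```

A caller using ON_FAIL_HARD_ERROR would treat a negative return value as fatal and stop, which corresponds to the startup-process failure described above; a caller using ON_FAIL_LOG_AND_CONTINUE proceeds as though there had been nothing to reset.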

Re: Is it OK to ignore directory open failure in ResetUnloggedRelations?

From

David Steele

Date:

05 December 2017, 10:37:09

Hi Tom,

On 12/4/17 3:15 PM, Tom Lane wrote:
> While working through Michael Paquier's patch to clean up inconsistent
> usage of AllocateDir(), I noticed that ResetUnloggedRelations and its
> subroutines are not consistent about whether a directory open failure
> results in erroring out or just emitting a LOG message and continuing.
> ResetUnloggedRelations itself throws a hard error if it fails to open
> pg_tblspc, but all the rest of reinit.c thinks a LOG message is
> sufficient.

By a strange coincidence I spent a while today reading through this code...

> My first thought was to change ResetUnloggedRelations to match the
> rest, but on reflection I'm less sure about that.  What we've got
> at the moment is that a possibly-transient directory open failure
> can result in failure to reset an unlogged relation to empty,
> which to me amounts to data corruption.  

I'm wondering how this transient directory open failure is going to
happen without a bunch of other things going wrong, but I agree that if
it happens then corruption would be the likely result.

> If the contents of the
> unlogged relation are inconsistent, which is plenty likely after
> a crash, we could end up crashing later because of that; and in
> any case the user would not see what they expect in the tables.

Agreed.

> So now I'm thinking we should do the reverse and change these functions
> to give a hard error on AllocateDir failure.  That would result in
> startup-process failure if we are unable to scan the database, which is
> not great, but there's certainly something badly wrong if we can't.

+1.  If a tablespace or database directory cannot be opened then I don't
think it makes any sense to continue.

Regards,
-- 
-David
david@pgmasters.net

Re: Is it OK to ignore directory open failure in ResetUnloggedRelations?

From

Justin Pryzby

Date:

05 December 2017, 11:01:27

On Mon, Dec 04, 2017 at 03:15:08PM -0500, Tom Lane wrote:
> While working through Michael Paquier's patch to clean up inconsistent
> usage of AllocateDir(), I noticed that ResetUnloggedRelations and its
> subroutines are not consistent about whether a directory open failure
> results in erroring out or just emitting a LOG message and continuing.
> ResetUnloggedRelations itself throws a hard error if it fails to open
> pg_tblspc, but all the rest of reinit.c thinks a LOG message is
> sufficient.
...
> So now I'm thinking we should do the reverse and change these functions
> to give a hard error on AllocateDir failure.  That would result in
> startup-process failure if we are unable to scan the database, which is
> not great, but there's certainly something badly wrong if we can't.

I can offer a data point unrelated to unlogged relations.

Sometimes, following a reboot, if there's a tablespace on ZFS, and if a ZPOOL
backing device is missing/renamed (especially under qemu), postgres (if it was
shut down cleanly) will happily start even though a tablespace is missing (the
backing device cannot be found: ZFS wants it to be exported and imported
before it scans all devices for a matching UUID).

That has been surprising to me in the past and led me to believe that
"services are up" following a reboot only to notice a bunch of ERRORs in the
logs a handful of minutes later.

Maybe that counts for a tangential +1.

Justin