Thread: potential stuck lock in SaveSlotToPath()

potential stuck lock in SaveSlotToPath()

From
Peter Eisentraut
Date:
When SaveSlotToPath() is called with elevel=LOG, the early exits don't 
release the slot's io_in_progress_lock.  Fix attached.

This could result in a walsender being stuck on the lock forever.  A 
possible way to get into this situation is if the offending code paths 
are triggered in a low disk space situation.  (This is how it was found; 
maybe there are other ways.)

Pavan Deolasee and Craig Ringer worked on this issue.  I'm forwarding it 
on their behalf.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: potential stuck lock in SaveSlotToPath()

From
Andres Freund
Date:
Hi,

On 2020-03-18 16:46:23 +0100, Peter Eisentraut wrote:
> When SaveSlotToPath() is called with elevel=LOG, the early exits don't
> release the slot's io_in_progress_lock.  Fix attached.

I'm a bit confused as to why we we ever call it with elevel = LOG
(i.e. why we have the elevel parameter at all). That seems to have been
there from the start, so it's either me or Robert that's to blame. But I
can't immediately see a reason for it?

Greetings,

Andres Freund



Re: potential stuck lock in SaveSlotToPath()

From
Alvaro Herrera
Date:
On 2020-Mar-18, Andres Freund wrote:

> Hi,
> 
> On 2020-03-18 16:46:23 +0100, Peter Eisentraut wrote:
> > When SaveSlotToPath() is called with elevel=LOG, the early exits don't
> > release the slot's io_in_progress_lock.  Fix attached.
> 
> I'm a bit confused as to why we we ever call it with elevel = LOG
> (i.e. why we have the elevel parameter at all). That seems to have been
> there from the start, so it's either me or Robert that's to blame. But I
> can't immediately see a reason for it?

I guess you didn't want failure to save a slot be a reason to abort a
checkpoint.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: potential stuck lock in SaveSlotToPath()

From
Thomas Munro
Date:
On Thu, Mar 19, 2020 at 4:46 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> [patch]

+         * releaseing even in that case.

Typo.



Re: potential stuck lock in SaveSlotToPath()

From
Andres Freund
Date:
Hi,

On 2020-03-18 16:54:19 -0300, Alvaro Herrera wrote:
> On 2020-Mar-18, Andres Freund wrote:
> > On 2020-03-18 16:46:23 +0100, Peter Eisentraut wrote:
> > > When SaveSlotToPath() is called with elevel=LOG, the early exits don't
> > > release the slot's io_in_progress_lock.  Fix attached.
> > 
> > I'm a bit confused as to why we we ever call it with elevel = LOG
> > (i.e. why we have the elevel parameter at all). That seems to have been
> > there from the start, so it's either me or Robert that's to blame. But I
> > can't immediately see a reason for it?
> 
> I guess you didn't want failure to save a slot be a reason to abort a
> checkpoint.

I don't see a valid reason for that though - if anything it's dangerous,
because we're not persistently saving the slot. It should fail the
checkpoint imo. Robert, do you have an idea?

Greetings,

Andres Freund



Re: potential stuck lock in SaveSlotToPath()

From
Robert Haas
Date:
On Wed, Mar 18, 2020 at 4:25 PM Andres Freund <andres@anarazel.de> wrote:
> I don't see a valid reason for that though - if anything it's dangerous,
> because we're not persistently saving the slot. It should fail the
> checkpoint imo. Robert, do you have an idea?

Well, the comment atop SaveSlotToPath says:

 * This needn't actually be part of a checkpoint, but it's a convenient
 * location.

And I agree with that.

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: potential stuck lock in SaveSlotToPath()

From
Peter Eisentraut
Date:
On 2020-03-19 16:38, Robert Haas wrote:
> Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
> right for the early-exit cases either.

There appear to be appropriate pgstat_report_wait_end() calls.  What are 
you seeing?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: potential stuck lock in SaveSlotToPath()

From
Robert Haas
Date:
On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 2020-03-19 16:38, Robert Haas wrote:
> > Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
> > right for the early-exit cases either.
>
> There appear to be appropriate pgstat_report_wait_end() calls.  What are
> you seeing?

Oh, you're right. I think I got confused because the rename() and
close() don't have that, but those don't have a wait event set either.
Sorry for the noise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: potential stuck lock in SaveSlotToPath()

From
Peter Eisentraut
Date:
On 2020-03-20 16:38, Robert Haas wrote:
> On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> On 2020-03-19 16:38, Robert Haas wrote:
>>> Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
>>> right for the early-exit cases either.
>>
>> There appear to be appropriate pgstat_report_wait_end() calls.  What are
>> you seeing?
> 
> Oh, you're right. I think I got confused because the rename() and
> close() don't have that, but those don't have a wait event set either.
> Sorry for the noise.

Any concerns about applying and backpatching the patch I posted?

The talk about reorganizing this code doesn't seem very concrete at the 
moment and would probably not be backpatch material anyway.


-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: potential stuck lock in SaveSlotToPath()

From
Alvaro Herrera
Date:
On 2020-Mar-25, Peter Eisentraut wrote:

> On 2020-03-20 16:38, Robert Haas wrote:
> > On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
> > <peter.eisentraut@2ndquadrant.com> wrote:
> > > On 2020-03-19 16:38, Robert Haas wrote:
> > > > Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
> > > > right for the early-exit cases either.
> > > 
> > > There appear to be appropriate pgstat_report_wait_end() calls.  What are
> > > you seeing?
> > 
> > Oh, you're right. I think I got confused because the rename() and
> > close() don't have that, but those don't have a wait event set either.
> > Sorry for the noise.
> 
> Any concerns about applying and backpatching the patch I posted?

It looks a straight bug fix to me, I agree it should be back-patched.

> The talk about reorganizing this code doesn't seem very concrete at the
> moment and would probably not be backpatch material anyway.

Agreed on both counts.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: potential stuck lock in SaveSlotToPath()

From
Robert Haas
Date:
On Wed, Mar 25, 2020 at 6:13 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Any concerns about applying and backpatching the patch I posted?

Not from me.

> The talk about reorganizing this code doesn't seem very concrete at the
> moment and would probably not be backpatch material anyway.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: potential stuck lock in SaveSlotToPath()

From
Peter Eisentraut
Date:
On 2020-03-25 17:56, Robert Haas wrote:
> On Wed, Mar 25, 2020 at 6:13 AM Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> Any concerns about applying and backpatching the patch I posted?
> 
> Not from me.
> 
>> The talk about reorganizing this code doesn't seem very concrete at the
>> moment and would probably not be backpatch material anyway.
> 
> +1.

committed and backpatched

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: potential stuck lock in SaveSlotToPath()

From
Michael Paquier
Date:
On Thu, Mar 26, 2020 at 02:16:05PM +0100, Peter Eisentraut wrote:
> committed and backpatched

The patch committed does that in three places:
    /* rename to permanent file, fsync file and directory */
    if (rename(tmppath, path) != 0)
    {
+       LWLockRelease(&slot->io_in_progress_lock);
    ereport(elevel,
                (errcode_for_file_access(),
                 errmsg("could not rename file \"%s\" to \"%s\": %m",

But why do you assume that LWLockRelease() never changes errno?  It
seems to me that you should save errno before calling LWLockRelease(),
and then restore it back before using %m in the log message, no?  See
for example the case where trace_lwlocks is set.
--
Michael

Attachment

Re: potential stuck lock in SaveSlotToPath()

From
Peter Eisentraut
Date:
On 2020-03-27 08:48, Michael Paquier wrote:
> On Thu, Mar 26, 2020 at 02:16:05PM +0100, Peter Eisentraut wrote:
>> committed and backpatched
> 
> The patch committed does that in three places:
>      /* rename to permanent file, fsync file and directory */
>      if (rename(tmppath, path) != 0)
>      {
> +       LWLockRelease(&slot->io_in_progress_lock);
>     ereport(elevel,
>                  (errcode_for_file_access(),
>                   errmsg("could not rename file \"%s\" to \"%s\": %m",
> 
> But why do you assume that LWLockRelease() never changes errno?  It
> seems to me that you should save errno before calling LWLockRelease(),
> and then restore it back before using %m in the log message, no?  See
> for example the case where trace_lwlocks is set.

Good catch.  How about the attached patch?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: potential stuck lock in SaveSlotToPath()

From
Michael Paquier
Date:
On Wed, Apr 01, 2020 at 04:26:25PM +0200, Peter Eisentraut wrote:
> Good catch.  How about the attached patch?

WFM.  Another trick would be to call LWLockRelease() after generating
the log, but I find your patch more consistent with the surroundings.
--
Michael

Attachment

Re: potential stuck lock in SaveSlotToPath()

From
Peter Eisentraut
Date:
On 2020-04-02 08:21, Michael Paquier wrote:
> On Wed, Apr 01, 2020 at 04:26:25PM +0200, Peter Eisentraut wrote:
>> Good catch.  How about the attached patch?
> 
> WFM.  Another trick would be to call LWLockRelease() after generating
> the log, but I find your patch more consistent with the surroundings.

done

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services