RE: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: How can end users know the cause of LR slot sync delays?
Date
Msg-id OSCPR01MB14966F3E00A6E34625CE35AD2F5F5A@OSCPR01MB14966.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: How can end users know the cause of LR slot sync delays?  (Shlok Kyal <shlok.kyal.oss@gmail.com>)
List pgsql-hackers
Dear Shlok,

> > 01.
> > ```
> > +       /* Update the slot sync reason */
> > +       SpinLockAcquire(&slot->mutex);
> > +       if (slot->slot_sync_skip_reason != skip_reason)
> > +               slot->slot_sync_skip_reason = skip_reason;
> > +       SpinLockRelease(&slot->mutex);
> > ```
> >
> > Per my understanding, the spinlock is acquired when an attribute in shared
> > memory is being updated. Can you check the other parts and follow the rule?
> >
> > 02.
> > ```
> > +                       SpinLockAcquire(&slot->mutex);
> > +                       synced = slot->data.synced;
> > +                       SpinLockRelease(&slot->mutex);
> > ```
> >
> > Same as 1.
> >
> I checked and found the following comment:
>  * - Individual fields are protected by mutex where only the backend owning
>  * the slot is authorized to update the fields from its own slot.  The
>  * backend owning the slot does not need to take this lock when reading its
>  * own fields, while concurrent backends not owning this slot should take the
>  * lock when reading this slot's data.
> 
> So for the above two cases we are updating the
> 'slot->slot_sync_skip_reason' and reading 'slot->data.synced' and this
> can happen before the slot sync worker acquires the slot or owns the
> slot.
> Also in the same code at a later stage we are again checking the
> synced flag and we do that while holding a spin lock. Based on these
> observations I think we should take Spinlock in both cases.

Hmm, regarding update_slot_sync_skip_stats(), the replication slot has already been
acquired, except in the synchronize_one_slot() case.
Can we avoid acquiring the spinlock where possible by adding an argument?
Or would that just introduce additional complexity?

> > 09.
> > ```
> > +my $connstr_1 = $primary->connstr;
> > ```
> >
> > Since this is an only connection string in the test, suffix _1 is not needed.
> >
> Fixed

As with that comment, can you replace "standby1" with "standby"?

> 
> > 10.
> > ```
> > +# Simulate standby connection failure by modifying pg_hba.conf
> > +unlink($primary->data_dir . '/pg_hba.conf');
> > +$primary->append_conf('pg_hba.conf',
> > +       qq{local   all             all             trust}
> > +);
> > ```
> >
> > What if the system does not have a Unix domain socket? I'm afraid all
> > connections could be blocked in this case.
> >
> I have used an injection point to simulate this scenario instead of
> changing the contents of pg_hba.conf files.

Can you clarify why you used the injection point?
I'm not sure the injection point is beneficial here. I feel injection points are
appropriate when handling timing issues, race conditions, etc., but the walreceiver
may not have a strong reason to stop at exactly that point.

Regarding the content of pg_hba.conf, I felt the lines below might be enough:
```
local   all             all                                     trust 
host    all             all             127.0.0.1/32            trust 
```

Also, here are comments for v5.

```
+      <para>
+       Reason of the last slot synchronization skip.
+      </para></entry>
```

The possible values must be clarified. This was posted in [1] but seems to have been missed.

```
+       /* Update the slot sync reason */
```

It would be better to clarify that this updates the *skip* reason.


```
-       ReplicationSlot *slot;
+       ReplicationSlot *slot = NULL;
```

There is no need to initialize it to NULL.

```
+#include "utils/injection_point.h"
...
+                               INJECTION_POINT("walreceiver", NULL);
```

As I noted above, I have a concern about adding the injection point. I want to hear
others' opinions as well.

```
+                       else
+                       {
+                               /* Update the slot sync stats */
+                               Assert(!found_consistent_snapshot ||
+                                          *found_consistent_snapshot);
+                               update_slot_sync_skip_stats(slot, SS_SKIP_NONE);
+                       }
```

Your patch may have another issue: if both confirmed_flush_lsn values are the same
but we do not yet have a consistent snapshot, we would hit the assertion failure.
(Again, I am not sure this can really happen.)
Can we turn that condition into a separate if branch? In that case, we must clarify
why it is OK to pass when found_consistent_snapshot == NULL.

```
+# Attach injection point to simulate wait
+$standby_psql->query_safe(
+       q(select injection_points_attach('slot-sync-skip','wait')));
```

I have been considering whether we can remove the injection point here.
I think it is used because the slot being synchronized is still a temporary one;
such slots would be cleaned up by ReplicationSlotCleanup().
Can we reproduce the skip event for a permanent slot? I cannot come up with a way,
but if it is possible, there is no need to introduce the injection point.

[1]:
https://www.postgresql.org/message-id/OSCPR01MB14966A618A8C61EC3DEE486A4F517A%40OSCPR01MB14966.jpnprd01.prod.outlook.com

Best regards,
Hayato Kuroda
FUJITSU LIMITED

