Re: Replication slot stats misgivings - Mailing list pgsql-hackers

From vignesh C
Subject Re: Replication slot stats misgivings
Date
Msg-id CALDaNm1hnu4GixD+fn+92A3NKGRV4n_iBk8xWzKbAt=gNatQDA@mail.gmail.com
Whole thread Raw
In response to Re: Replication slot stats misgivings  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Replication slot stats misgivings
List pgsql-hackers
On Fri, Apr 2, 2021 at 9:29 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Apr 2, 2021 at 1:55 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Thu, Apr 1, 2021 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Apr 1, 2021 at 3:43 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > On Wed, Mar 31, 2021 at 11:32 AM vignesh C <vignesh21@gmail.com> wrote:
> > > > >
> > > > > On Tue, Mar 30, 2021 at 11:00 AM Andres Freund <andres@anarazel.de> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On 2021-03-30 10:13:29 +0530, vignesh C wrote:
> > > > > > > On Tue, Mar 30, 2021 at 6:28 AM Andres Freund <andres@anarazel.de> wrote:
> > > > > > > > Any chance you could write a tap test exercising a few of these cases?
> > > > > > >
> > > > > > > I can try to write a patch for this if nobody objects.
> > > > > >
> > > > > > Cool!
> > > > > >
> > > > >
> > > > > Attached a patch which has the test for the first scenario.
> > > > >
> > > > > > > > E.g. things like:
> > > > > > > >
> > > > > > > > - create a few slots, drop one of them, shut down, start up, verify
> > > > > > > >   stats are still sane
> > > > > > > > - create a few slots, shut down, manually remove a slot, lower
> > > > > > > >   max_replication_slots, start up
> > > > > > >
> > > > > > > Here by "manually remove a slot", do you mean to remove the slot
> > > > > > > manually from the pg_replslot folder?
> > > > > >
> > > > > > Yep - thereby allowing max_replication_slots after the shutdown/start to
> > > > > > be lower than the number of slots-stats objects.
> > > > >
> > > > > I have not included the 2nd test in the patch as the test fails with
> > > > > following warnings and also displays the statistics of the removed
> > > > > slot:
> > > > > WARNING:  problem in alloc set Statistics snapshot: detected write
> > > > > past chunk end in block 0x55d038b8e410, chunk 0x55d038b8e438
> > > > > WARNING:  problem in alloc set Statistics snapshot: detected write
> > > > > past chunk end in block 0x55d038b8e410, chunk 0x55d038b8e438
> > > > >
> > > > > This happens because the statistics file has an additional slot
> > > > > present even though the replication slot was removed.  I felt this
> > > > > issue should be fixed. I will try to fix this issue and send the
> > > > > second test along with the fix.
> > > >
> > > > I felt from the statistics collector process, there is no way in which
> > > > we can identify if the replication slot is present or not because the
> > > > statistic collector process does not have access to shared memory.
> > > > Anything that the statistic collector process does independently by
> > > > traversing and removing the statistics of the replication slot
> > > > exceeding the max_replication_slot has its drawback of removing some
> > > > valid replication slot's statistics data.
> > > > Any thoughts on how we can identify the replication slot which has been dropped?
> > > > Can someone point me to the shared stats patch link with which message
> > > > loss can be avoided. I wanted to see a scenario where something like
> > > > the slot is dropped but the statistics are not updated because of an
> > > > immediate shutdown or server going down abruptly can occur or not with
> > > > the shared stats patch.
> > > >
> > >
> > > I don't think it is easy to simulate a scenario where the 'drop'
> > > message is dropped and I think that is why the test contains the step
> > > to manually remove the slot. At this stage, you can probably provide a
> > > test patch and a code-fix patch where it just drops the extra slots
> > > from the stats file. That will allow us to test it with a shared
> > > memory stats patch on which Andres and Horiguchi-San are working. If
> > > we still continue to pursue with current approach then as Andres
> > > suggested we might send additional information from
> > > RestoreSlotFromDisk to keep it in sync.
> >
> > Thanks for your comments, Attached patch has the fix for the same.
> > Also attached a couple of more patches which addresses the comments
> > which Andres had listed i.e changing char to NameData type and also to
> > display the unspilled/unstreamed transaction information in the
> > replication statistics.
> > Thoughts?
>
> Thank you for the patches!
>
> I've looked at those patches and here are some comments on 0001, 0002,
> and 0003 patch:

Thanks for the comments.

> 0001 patch:
>
> -       values[0] = PointerGetDatum(cstring_to_text(s->slotname));
> +       values[0] = PointerGetDatum(cstring_to_text(s->slotname.data));
>
> We can use NameGetDatum() instead.

I felt we will not be able to use NameGetDatum because this function
will not have access to the value throughout the loop and NameGetDatum
must ensure the pointed-to value has adequate lifetime.

> ---
> 0002 patch:
>
> The patch uses logical replication to test replication slots
> statistics but I think it's necessarily necessary. It would be more
> simple to use logical decoding. Maybe we can add TAP tests to
> contrib/test_decoding.
>

I will try to change it to test_decoding if feasible and post in the
next version.

> ---
> 0003 patch:
>
>  void
>  pgstat_report_replslot(const char *slotname, int spilltxns, int spillcount,
> -                      int spillbytes, int streamtxns, int
> streamcount, int streambytes)
> +                      int spillbytes, int streamtxns, int streamcount,
> +                      int streambytes, int totaltxns, int totalbytes)
>  {
>
> As Andreas pointed out, we should use a struct of stats updates rather
> than adding more arguments to pgstat_report_replslot().
>

Modified as suggested.

> ---
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +        <structfield>total_bytes</structfield><type>bigint</type>
> +       </para>
> +       <para>
> +        Amount of decoded in-progress transaction data replicated to
> the decoding
> +        output plugin while decoding changes from WAL for this slot.
> This and other
> +        counters for this slot can be used to gauge the network I/O
> which occurred
> +        during logical decoding and allow tuning
> <literal>logical_decoding_work_mem</literal>.
> +       </para>
> +      </entry>
> +     </row>
>
> As I mentioned in another reply, I think users should not gauge the
> network I/O which occurred during logical decoding using by those
> counters since the actual amount of network I/O is affected by table
> filtering and row filtering discussed on another thread[1]. Also,
> since this is total bytes I'm not sure how users can use this value to
> tune logical_decoding_work_mem. I agree to track both the total bytes
> and the total number of transactions passed to the decoding plugin but
> I think the description needs to be updated. How about the following
> description for example?
>
> Amount of decoded transaction data sent to the decoding output plugin
> while decoding changes from WAL for this slot. This and total_txn for
> this slot can be used to gauge the total amount of data during logical
> decoding.
>

Modified as suggested.

> ---
> I think we can merge 0001 and 0003 patches.

I have merged them.
Attached V2 patch which has the fixes for the same.
Thoughts?

Regards,
Vignesh

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: SP-GiST confusion: indexed column's type vs. index column type
Next
From: vignesh C
Date:
Subject: Re: Replication slot stats misgivings