Re: Skipping logical replication transactions on subscriber side - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Skipping logical replication transactions on subscriber side
Date
Msg-id CAD21AoBT+gx=2V-dKTdrqUeqrnm=6fdJ6r7KFoKNVgLrqDMGsg@mail.gmail.com
Whole thread Raw
In response to Re: Skipping logical replication transactions on subscriber side  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Skipping logical replication transactions on subscriber side  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Fri, Sep 24, 2021 at 8:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 21, 2021 at 10:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I've attached the updated version patches. Please review them.
> >
>
> Review comments for v14-0001-Add-pg_stat_subscription_errors-statistics-view
> ==============================================================
> 1.
> <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>command</structfield> <type>text</type>
> +      </para>
> +      <para>
> +        Name of command being applied when the error occurred.  This
> +        field is always NULL if the error is reported by the
> +        <literal>tablesync</literal> worker.
> +      </para></entry>
> +     </row>
> ..
> ..
> +     <row>
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>xid</structfield> <type>xid</type>
> +      </para>
> +      <para>
> +        Transaction ID of the publisher node being applied when the error
> +        occurred.  This field is always NULL if the error is reported
> +        by the <literal>tablesync</literal> worker.
> +      </para></entry>
>
> Shouldn't we display command and transaction id even for table sync
> worker if it occurs during sync phase (syncing with apply worker
> position)

Right. I'll fix it.

>
> 2.
> + /*
> + * The number of not-ready relations can be high for example right
> + * after creating a subscription, so we load the list of
> + * SubscriptionRelState into the hash table for faster lookups.
> + */
>
> I am not sure this optimization of converting to not-ready relations
> list to hash table is worth it. Are we expecting thousands of
> relations per subscription? I think that will be a rare case even if
> it is there.

Yeah, it seems overkill. I'll use the simple list. If this becomes a
problem, we can add such optimization later.

>
> 3.
> +static void
> +pgstat_recv_subscription_purge(PgStat_MsgSubscriptionPurge *msg, int len)
> +{
> + if (subscriptionHash == NULL)
> + return;
> +
> + for (int i = 0; i < msg->m_nentries; i++)
> + {
> + PgStat_StatSubEntry *subent;
> +
> + subent = pgstat_get_subscription_entry(msg->m_subids[i], false);
> +
> + /*
> + * Nothing to do if the subscription entry is not found.  This could
> + * happen when the subscription is dropped and the message for
> + * dropping subscription entry arrived before the message for
> + * reporting the error.
> + */
> + if (subent == NULL)
>
> Is the above comment true even during the purge? I can think of this
> during normal processing but not during the purge.

Right, the comment is not true during the purge. Since subent could be
NULL if concurrent autovacuum workers do pgstat_vacuum_stat() I'll
change the comment.

>
> 4.
> +typedef struct PgStat_MsgSubscriptionErr
> +{
> + PgStat_MsgHdr m_hdr;
> +
> + /*
> + * m_subid and m_subrelid are used to determine the subscription and the
> + * reporter of this error.  m_subrelid is InvalidOid if reported by the
> + * apply worker, otherwise by the table sync worker.  In table sync worker
> + * case, m_subrelid must be the same as m_relid.
> + */
> + Oid m_subid;
> + Oid m_subrelid;
> +
> + /* Error information */
> + Oid m_relid;
>
> Is m_subrelid is used only to distinguish the type of worker? I think
> it could be InvalidOid during the syncing phase in the table sync
> worker.

Right. I'll fix it.

>
> 5.
> +/*
> + * Subscription error statistics kept in the stats collector, representing
> + * an error that occurred during application of logical replication or
>
> The part of the message " ... application of logical replication ..."
> sounds a little unclear. Shall we instead write: " ... application of
> logical message ..."?

Will fix.

>
> 6.
> +typedef struct PgStat_StatSubEntry
> +{
> + Oid subid; /* hash table key */
> +
> + /*
> + * Statistics of errors that occurred during logical replication.  While
> + * having the hash table for table sync errors we have a separate
> + * statistics value for apply error (apply_error), because we can avoid
> + * building a nested hash table for table sync errors in the case where
> + * there is no table sync error, which is the common case in practice.
> + *
>
> The above comment is not clear to me. Why do you need to have a
> separate hash table for table sync errors? And what makes it avoid
> building nested hash table?

In the previous patch, a subscription stats entry
(PgStat_StatSubEntry) had one hash table that had error entries of
both apply and table sync. Since a subscription can have one apply
worker and multiple table sync workers it makes sense to me to have
the subscription entry have a hash table for them. The reason why we
have one error entry for an apply error and a hash table for table
sync errors is that there is the common case where an apply error
happens whereas any table sync error doesn’t. With this optimization,
if the subscription has only apply error, since we can store it into
aply_error field, we can avoid building a hash table for sync errors.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/



pgsql-hackers by date:

Previous
From: Ranier Vilela
Date:
Subject: Re: Empty string in lexeme for tsvector
Next
From: Masahiko Sawada
Date:
Subject: Re: Skipping logical replication transactions on subscriber side