Thread: Logical replication wal sender timestamp bug

Logical replication wal sender timestamp bug

From
Jeff Janes
Date:
While monitoring pg_stat_subscription, I noticed that last_msg_send_time was usually NULL, which doesn't make sense.  Why would we have a message, but not know when it was sent?

Looking in src/backend/replication/walsender.c, there is this:

    /* output previously gathered data in a CopyData packet */
    pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);

    /*
     * Fill the send timestamp last, so that it is taken as late as possible.
     * This is somewhat ugly, but the protocol is set as it's already used for
     * several releases by streaming physical replication.
     */
    resetStringInfo(&tmpbuf);
    now = GetCurrentTimestamp();
    pq_sendint64(&tmpbuf, now);
    memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)],
           tmpbuf.data, sizeof(int64));

Filling out the timestamp after the message has already been sent is taking "as late as possible" a little too far.  It results in every message having a zero timestamp, other than keep-alives which go through a different path.

Re-ordering the code blocks as in the attached seems to fix it.  But I have to wonder, if this has been broken from the start and no one ever noticed, is this even valuable information to relay in the first place?  We could just take the column out of the view, and not bother calling  GetCurrentTimestamp() for each message.

Cheers,

Jeff
Attachment

Re: Logical replication wal sender timestamp bug

From
Michael Paquier
Date:
On Sat, Nov 02, 2019 at 09:54:54PM -0400, Jeff Janes wrote:
> While monitoring pg_stat_subscription, I noticed that last_msg_send_time
> was usually NULL, which doesn't make sense.  Why would we have a message,
> but not know when it was sent?

So...  The timestamp is received and stored in LogicalRepApplyLoop()
with send_time when a 'w' message is received in the subscription
cluster.  And it gets computed with a two-phase process:
- WalSndPrepareWrite() reserves the space in the message for the
timestamp.
- WalSndWriteData() computes the timestamp in the reserved space once
the write message is computed and ready to go.

> Filling out the timestamp after the message has already been sent is taking
> "as late as possible" a little too far.  It results in every message having
> a zero timestamp, other than keep-alives which go through a different path.

It seems to me that you are right: the timestamp is computed too
late.

> Re-ordering the code blocks as in the attached seems to fix it.  But I have
> to wonder, if this has been broken from the start and no one ever noticed,
> is this even valuable information to relay in the first place?  We could
> just take the column out of the view, and not bother calling
> GetCurrentTimestamp() for each message.

I think that there are use cases for such monitoring capabilities,
see for example 7fee252.  So I would rather keep it.
--
Michael

Attachment

Re: Logical replication wal sender timestamp bug

From
Michael Paquier
Date:
On Tue, Nov 05, 2019 at 01:19:37PM +0900, Michael Paquier wrote:
> On Sat, Nov 02, 2019 at 09:54:54PM -0400, Jeff Janes wrote:
>> Filling out the timestamp after the message has already been sent is taking
>> "as late as possible" a little too far.  It results in every message having
>> a zero timestamp, other than keep-alives which go through a different path.
>
> It seems to me that you are right: the timestamp is computed too
> late.

It is easy enough to reproduce the problem by setting for example
logical replication between two nodes and pgbench to produce some
load and then monitor pg_stat_subscription periodically.  However, it
is a problem since logical decoding has been introduced (5a991ef) so
committed your fix down to 9.4.
--
Michael

Attachment

Re: Logical replication wal sender timestamp bug

From
Jeff Janes
Date:


On Wed, Nov 6, 2019 at 2:15 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Nov 05, 2019 at 01:19:37PM +0900, Michael Paquier wrote:
> On Sat, Nov 02, 2019 at 09:54:54PM -0400, Jeff Janes wrote:
>> Filling out the timestamp after the message has already been sent is taking
>> "as late as possible" a little too far.  It results in every message having
>> a zero timestamp, other than keep-alives which go through a different path.
>
> It seems to me that you are right: the timestamp is computed too
> late.

It is easy enough to reproduce the problem by setting for example
logical replication between two nodes and pgbench to produce some
load and then monitor pg_stat_subscription periodically.  However, it
is a problem since logical decoding has been introduced (5a991ef) so
committed your fix down to 9.4.

Thanks.  This column looks much more reasonable now.

Cheers,

Jeff