Thread: BUG #15888: Bogus "idle in transaction" state for logical decoding client after creating a slot

The following bug has been logged on the website:

Bug reference:      15888
Logged by:          Marko Tiikkaja
Email address:      marko@joh.to
PostgreSQL version: 9.6.14
Operating system:   OS X 10.12.6
Description:

Run:

  pg_recvlogical -S foo -f /dev/null -d dbname --create-slot --start

and pg_stat_activity will show something like this until that session
disconnects:

-[ RECORD 1

]----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
datid            | 9672230
datname          | dbname
pid              | 81996
usesysid         | 10
usename          | marko
application_name | pg_recvlogical
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2019-07-03 10:02:00.459737+02
xact_start       |
query_start      |
state_change     | 2019-07-03 10:02:00.726645+02
wait_event_type  |
wait_event       |
state            | idle in transaction
backend_xid      |
backend_xmin     |
query            |

which is quite obviously bogus.


On Wed, Jul 03, 2019 at 08:03:24AM +0000, PG Bug reporting form wrote:
> Run:
>
>   pg_recvlogical -S foo -f /dev/null -d dbname --create-slot --start
>
> and pg_stat_activity will show something like this until that session
> disconnects:
>
> which is quite obviously bogus.

I think that's the transaction which is started when exporting the
snapshot at logical slot creation.  Or in short:
#2  0x000055b5f8e5720f in StartTransactionCommand () at xact.c:2699
#3  0x000055b5f90fab4e in SnapBuildExportSnapshot
(builder=0x55b5fa3427b0) at snapbuild.c:575
#4  0x000055b5f90fe5df in CreateReplicationSlot (cmd=0x55b5fa2914e8)
at walsender.c:843

SnapBuildExportSnapshot() needs to keep a transaction context opened
to keep the wanted xmin around.  In PG >= 10, this problem gets
"solved" by using NOEXPORT_SNAPSHOT in the CREATE_REPLICATION_SLOT
command.  One idea would be to enforce a commit in this case from
CreateReplicationSlot/streamutil.c?  We know that pg_recvlogical is
not going to use this snapshot anyway...
--
Michael

Attachment
Hi,

On 2019-07-04 13:35:12 +0900, Michael Paquier wrote:
> On Wed, Jul 03, 2019 at 08:03:24AM +0000, PG Bug reporting form wrote:
> > Run:
> > 
> >   pg_recvlogical -S foo -f /dev/null -d dbname --create-slot --start
> > 
> > and pg_stat_activity will show something like this until that session
> > disconnects:
> > 
> > which is quite obviously bogus.
> 
> I think that's the transaction which is started when exporting the
> snapshot at logical slot creation.  Or in short:
> #2  0x000055b5f8e5720f in StartTransactionCommand () at xact.c:2699
> #3  0x000055b5f90fab4e in SnapBuildExportSnapshot
> (builder=0x55b5fa3427b0) at snapbuild.c:575
> #4  0x000055b5f90fe5df in CreateReplicationSlot (cmd=0x55b5fa2914e8)
> at walsender.c:843
> 
> SnapBuildExportSnapshot() needs to keep a transaction context opened
> to keep the wanted xmin around.  In PG >= 10, this problem gets
> "solved" by using NOEXPORT_SNAPSHOT in the CREATE_REPLICATION_SLOT
> command.  One idea would be to enforce a commit in this case from
> CreateReplicationSlot/streamutil.c?  We know that pg_recvlogical is
> not going to use this snapshot anyway...

The transaction ought to be finished by exec_replication_command()
calling SnapBuildClearExportedSnapshot(), which then does
AbortCurrentTransaction(). So something's wrong if that's not working
anymore.

I just tested this on a slightly older v12 checkout, and the general
mechanism works:

S1(replication):
postgres[13464][1]=# CREATE_REPLICATION_SLOT foo LOGICAL test_decoding;

S2(plain):
postgres[13317][1]=# SELECT state, wait_event_type, wait_event, backend_xid, backend_xmin FROM pg_stat_activity WHERE
pid= 13464;
 
┌─────────────────────┬─────────────────┬────────────┬─────────────┬──────────────┐
│        state        │ wait_event_type │ wait_event │ backend_xid │ backend_xmin │
├─────────────────────┼─────────────────┼────────────┼─────────────┼──────────────┤
│ idle in transaction │ Client          │ ClientRead │      (null) │   1494339463 │
└─────────────────────┴─────────────────┴────────────┴─────────────┴──────────────┘
(1 row)

S1(replication):
postgres[13464][1]=# IDENTIFY_SYSTEM;
┌─────────────────────┬──────────┬──────────────┬──────────┐
│      systemid       │ timeline │   xlogpos    │  dbname  │
├─────────────────────┼──────────┼──────────────┼──────────┤
│ 6704433050907079638 │        1 │ 3AC/1A25B700 │ postgres │
└─────────────────────┴──────────┴──────────────┴──────────┘

S2(plain):
postgres[13317][1]=# SELECT state, wait_event_type, wait_event, backend_xid, backend_xmin FROM pg_stat_activity WHERE
pid= 13464;
 
┌───────┬─────────────────┬────────────┬─────────────┬──────────────┐
│ state │ wait_event_type │ wait_event │ backend_xid │ backend_xmin │
├───────┼─────────────────┼────────────┼─────────────┼──────────────┤
│ idle  │ Client          │ ClientRead │      (null) │       (null) │
└───────┴─────────────────┴────────────┴─────────────┴──────────────┘
(1 row)


I *can* reproduce with 9.6.  It's worthwhile to note that the
transaction *is* actually aborted, via the path above.  As far as I can
tell the problem is "solely" that we never reach a set_ps_display() that
resets the 'idle in transaction' bit, because START_REPLICATION "takes
over" the connection once started, and we don't go through
PostgresMain() again.

The problem doesn't exist in later versions, because there
exec_replication_command() does an explicit
    /* Report to pgstat that this process is now idle */
    pgstat_report_activity(STATE_IDLE, NULL);
That's due to

commit d02974e32e028fc078d8f5eca1d6a4516efb0aa6
Author: Magnus Hagander <magnus@hagander.net>
Date:   2017-12-29 16:19:51 +0100

    Properly set base backup backends to active in pg_stat_activity

which is post 9.6.  Not sure I really feel a huge appetite for whacking
this around in the back branches.

Greetings,

Andres Freund



On 2019-Jul-09, Andres Freund wrote:

> The problem doesn't exist in later versions, because there
> exec_replication_command() does an explicit
>     /* Report to pgstat that this process is now idle */
>     pgstat_report_activity(STATE_IDLE, NULL);
> That's due to
> 
> commit d02974e32e028fc078d8f5eca1d6a4516efb0aa6
> Author: Magnus Hagander <magnus@hagander.net>
> Date:   2017-12-29 16:19:51 +0100
> 
>     Properly set base backup backends to active in pg_stat_activity
> 
> which is post 9.6.  Not sure I really feel a huge appetite for whacking
> this around in the back branches.

Hmm ... so what that commit did is precisely to fix this bug.
Magnus thought at the time he was fixing a pg10 bug,
https://postgr.es/m/CABUevEwX4g8y=gmgfPzxFKS7gqjSYNR949Xc96OQm=YXJmh_Og@mail.gmail.com
but apparently now we see that the bug was older than that.  Maybe it's
okay to backpatch further?

(Booh to missing "Discussion:" tags in commit messages)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Tue, Jul 09, 2019 at 05:46:30PM -0400, Alvaro Herrera wrote:
> Hmm ... so what that commit did is precisely to fix this bug.
> Magnus thought at the time he was fixing a pg10 bug,
> https://postgr.es/m/CABUevEwX4g8y=gmgfPzxFKS7gqjSYNR949Xc96OQm=YXJmh_Og@mail.gmail.com
> but apparently now we see that the bug was older than that.  Maybe it's
> okay to backpatch further?

Yes, I would not be against a back-patch in this case.  There is a
minor conflict because pre-9.6 WAL senders cannot run SQL queries but
that looks simple enough to solve.

> (Booh to missing "Discussion:" tags in commit messages)

Booh++
--
Michael

Attachment
Hi,

On 2019-07-10 13:51:47 +0900, Michael Paquier wrote:
> On Tue, Jul 09, 2019 at 05:46:30PM -0400, Alvaro Herrera wrote:
> > Hmm ... so what that commit did is precisely to fix this bug.
> > Magnus thought at the time he was fixing a pg10 bug,
> > https://postgr.es/m/CABUevEwX4g8y=gmgfPzxFKS7gqjSYNR949Xc96OQm=YXJmh_Og@mail.gmail.com
> > but apparently now we see that the bug was older than that.  Maybe it's
> > okay to backpatch further?
> 
> Yes, I would not be against a back-patch in this case.  There is a
> minor conflict because pre-9.6 WAL senders cannot run SQL queries but
> that looks simple enough to solve.

I'm not worried about backpatching that in isolation - but I'm worried
that just backpatching that indidividual commit isn't going to yield
particularly satisfactory behaviour. The code around this isn't super
robust, and pg10+ are noticably different due to the changes needed to
allow to execute normal queries over a replication connection.

Greetings,

Andres Freund



On Tue, Jul 9, 2019 at 11:46 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Jul-09, Andres Freund wrote:

> The problem doesn't exist in later versions, because there
> exec_replication_command() does an explicit
>       /* Report to pgstat that this process is now idle */
>       pgstat_report_activity(STATE_IDLE, NULL);
> That's due to
>
> commit d02974e32e028fc078d8f5eca1d6a4516efb0aa6
> Author: Magnus Hagander <magnus@hagander.net>
> Date:   2017-12-29 16:19:51 +0100
>
>     Properly set base backup backends to active in pg_stat_activity
>
> which is post 9.6.  Not sure I really feel a huge appetite for whacking
> this around in the back branches.

Hmm ... so what that commit did is precisely to fix this bug.
Magnus thought at the time he was fixing a pg10 bug,
https://postgr.es/m/CABUevEwX4g8y=gmgfPzxFKS7gqjSYNR949Xc96OQm=YXJmh_Og@mail.gmail.com
but apparently now we see that the bug was older than that.  Maybe it's
okay to backpatch further?

I don't recall any particular reason around the backpatching other than as you say, probably not realizing it. At the time we didn't have logical replication yet (it landed later), so the actual uses of the logical decoding were nowhere near as frequent...



(Booh to missing "Discussion:" tags in commit messages)


If only we could agree on a template/standard for those :P

//Magnus

On Wed, Jul 10, 2019 at 05:24:44PM +0200, Magnus Hagander wrote:
> On Tue, Jul 9, 2019 at 11:46 PM Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
> (Booh to missing "Discussion:" tags in commit messages)
>>
>>
> If only we could agree on a template/standard for those :P

This would make an excellent topic for a developper meeting with most
of the main committers around the table.
--
Michael

Attachment


On Thu, Jul 11, 2019 at 1:58 AM Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Jul 10, 2019 at 05:24:44PM +0200, Magnus Hagander wrote:
> On Tue, Jul 9, 2019 at 11:46 PM Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
> (Booh to missing "Discussion:" tags in commit messages)
>>
>>
> If only we could agree on a template/standard for those :P

This would make an excellent topic for a developper meeting with most
of the main committers around the table.

We've tried that once, and I think even twice, and failed miserably at any form of agreement. But we could certainly try again :) 

--
On Thu, Jul 11, 2019 at 01:21:39PM +0200, Magnus Hagander wrote:
> We've tried that once, and I think even twice, and failed miserably at any
> form of agreement. But we could certainly try again :)

Or we could just drop a thread about that..
--
Michael

Attachment
Magnus Hagander <magnus@hagander.net> writes:
> We've tried that once, and I think even twice, and failed miserably at any
> form of agreement. But we could certainly try again :)

I thought we *did* have an agreement, to wit using

Discussion: https://postgr.es/m/<message-id>

to link to relevant mail thread(s).  Some people use more tags
but that seems inessential to me.

            regards, tom lane



On Thu, Jul 11, 2019 at 09:44:07AM -0400, Tom Lane wrote:
> I thought we *did* have an agreement, to wit using
>
> Discussion: https://postgr.es/m/<message-id>
>
> to link to relevant mail thread(s).  Some people use more tags
> but that seems inessential to me.

Hehe.  I actually was thinking about advocating for having more of
them in the commit logs.  I'll just start a new thread about what I
had in mind.  Perhaps that will lead us nowhere, but let's see.
--
Michael

Attachment