Thread: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

Hi,

Pgjdbc test suite identified a SIGSEGV in the recent HEAD builds of PostgreSQL, Ubuntu 14.04.5 LTS

The crash is consistent, and it reproduces 100% of the cases so far.

The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became bad by 19 May 14:00 UTC,
so the regression was introduced somewhere in-between.

Does that ring any bells?

In case you wonder:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  XLogSendPhysical () at walsender.c:2762
2762 if (!WALRead(xlogreader,
(gdb) #0  XLogSendPhysical () at walsender.c:2762
        SendRqstPtr = 133473640
        startptr = 133473240
        endptr = 133473640
        nbytes = 400
        segno = 1
        errinfo = {wre_errno = 988942240, wre_off = 2, wre_req = -1,
          wre_read = -1, wre_seg = {ws_file = 4714224,
            ws_segno = 140729887364688, ws_tli = 0}}
        __func__ = "XLogSendPhysical"
#1  0x000000000087fa8a in WalSndLoop (send_data=0x88009d <XLogSendPhysical>)
    at walsender.c:2300
        __func__ = "WalSndLoop"
#2  0x000000000087d65a in StartReplication (cmd=0x299bda8) at walsender.c:750
        buf = {data = 0x0, len = 3, maxlen = 1024, cursor = 87}
        FlushPtr = 133473640
        __func__ = "StartReplication"
#3  0x000000000087eddc in exec_replication_command (
    cmd_string=0x2916e68 "START_REPLICATION 0/7F4A3D8") at walsender.c:1643
        cmd = 0x299bda8
        parse_rc = 0
        cmd_node = 0x299bda8
        cmd_context = 0x299bc60
        old_context = 0x2916d50
        qc = {commandTag = 988942640, nprocessed = 140729887363520}
        __func__ = "exec_replication_command"


Vladimir
On Thu, May 28, 2020 at 09:07:04AM +0300, Vladimir Sitnikov wrote:
> The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became
> bad by 19 May 14:00 UTC,
> so the regression was introduced somewhere in-between.
>
> Does that ring any bells?

It does, thanks!  This would map with 1d374302 or 850196b6 that
reworked this area of the code, so it seems like we are not quite done
with this work yet.  Do you still see the problem as of 55ca50d
(today's latest HEAD)?

Also, just wondering..  If I use more or less the same commands as
your travis job I should be able to reproduce the problem with a fresh
JDBC repository, right?  Or do you a sub-portion of your regression
tests to run that easily?
--
Michael

Attachment
At Thu, 28 May 2020 16:22:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Thu, May 28, 2020 at 09:07:04AM +0300, Vladimir Sitnikov wrote:
> > The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became
> > bad by 19 May 14:00 UTC,
> > so the regression was introduced somewhere in-between.
> > 
> > Does that ring any bells?
> 
> It does, thanks!  This would map with 1d374302 or 850196b6 that
> reworked this area of the code, so it seems like we are not quite done
> with this work yet.  Do you still see the problem as of 55ca50d
> (today's latest HEAD)?

I think that's not the case.  I think I cause this crash with the
HEAD.  I'll post a fix soon.

> Also, just wondering..  If I use more or less the same commands as
> your travis job I should be able to reproduce the problem with a fresh
> JDBC repository, right?  Or do you a sub-portion of your regression
> tests to run that easily?

Pgjdbc seems initiating physical replication on a logical replication
session.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Thu, May 28, 2020 at 04:32:22PM +0900, Kyotaro Horiguchi wrote:
> I think that's not the case.  I think I cause this crash with the
> HEAD.  I'll post a fix soon.
>
> Pgjdbc seems initiating physical replication on a logical replication
> session.

Let me see...  Indeed:
$ psql "replication=database dbname=postgres"
=# START_REPLICATION SLOT test_slot PHYSICAL 0/0;
[one <boom> later]

(gdb) bt
#0  XLogSendPhysical () at walsender.c:2762
#1  0x000055d5f7803318 in WalSndLoop (send_data=0x55d5f78039c7
<XLogSendPhysical>) at walsender.c:2300
#2  0x000055d5f7800d70 in StartReplication (cmd=0x55d5f919bc60) at
walsender.c:750
#3  0x000055d5f78025ad in exec_replication_command
(cmd_string=0x55d5f9119a80 "START_REPLICATION SLOT test_slot PHYSICAL
0/0;") at walsender.c:1643
#4  0x000055d5f786a7ea in PostgresMain (argc=1, argv=0x55d5f91472c8,
dbname=0x55d5f9147210 "ioltas", username=0x55d5f91471f0 "ioltas") at
postgres.c:4311
#5  0x000055d5f77b4e9c in BackendRun (port=0x55d5f913dcd0) at
postmaster.c:4523
#6  0x000055d5f77b4606 in BackendStartup (port=0x55d5f913dcd0) at
postmaster.c:4215
#7  0x000055d5f77b08ad in ServerLoop () at postmaster.c:1727
#8  0x000055d5f77b00fc in PostmasterMain (argc=3, argv=0x55d5f9113260)
at postmaster.c:1400
#9  0x000055d5f76b3736 in main (argc=3, argv=0x55d5f9113260) at
main.c:210

No need for the JDBC test suite then.
--
Michael

Attachment
At Thu, 28 May 2020 09:07:04 +0300, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote in 
> Pgjdbc test suite identified a SIGSEGV in the recent HEAD builds of
> PostgreSQL, Ubuntu 14.04.5 LTS
> 
> Here's a call stack:
> https://travis-ci.org/github/pgjdbc/pgjdbc/jobs/691794110#L7484
> The crash is consistent, and it reproduces 100% of the cases so far.
> 
> The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became
> bad by 19 May 14:00 UTC,
> so the regression was introduced somewhere in-between.
> 
> Does that ring any bells?

Thanks for the report.  It is surely a bug since the server crashes,
on the other hand Pgjdbc seems doing bad, too.

It seems to me that that crash means Pgjdbc is initiating a logical
replication connection to start physical replication.

> In case you wonder:
> 
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  XLogSendPhysical () at walsender.c:2762
> 2762 if (!WALRead(xlogreader,
> (gdb) #0  XLogSendPhysical () at walsender.c:2762
>         SendRqstPtr = 133473640
>         startptr = 133473240
>         endptr = 133473640
>         nbytes = 400
>         segno = 1
>         errinfo = {wre_errno = 988942240, wre_off = 2, wre_req = -1,
>           wre_read = -1, wre_seg = {ws_file = 4714224,
>             ws_segno = 140729887364688, ws_tli = 0}}
>         __func__ = "XLogSendPhysical"

I see the probably the same symptom by the following steps with the
current HEAD.

psql 'host=/tmp replication=database'
=# START_REPLICATION 0/1;
<serer crashes>

Physical replication is not assumed to be started on a logical
replication connection. The attached would fix that.  The patch adds
two tests.  One for this case and another for the reverse.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 1f94c98f7459ca8a4942246325815a3e0a91caa4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 28 May 2020 15:55:30 +0900
Subject: [PATCH v1] Fix crash when starting physical replication on logical
 connection

It is an illegal operation to try starting physical replication on a
logical replication session.  We should properly warn the client
instead of crashing.
---
 src/backend/replication/walsender.c         |  5 +++++
 src/test/recovery/t/001_stream_rep.pl       | 14 +++++++++++---
 src/test/recovery/t/006_logical_decoding.pl | 10 +++++++++-
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 86847cbb54..7b79c75311 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -589,6 +589,11 @@ StartReplication(StartReplicationCmd *cmd)
     StringInfoData buf;
     XLogRecPtr    FlushPtr;
 
+    if (am_db_walsender)
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("cannot initiate physical replication on a logical replication connection")));
+
     if (ThisTimeLineID == 0)
         ereport(ERROR,
                 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 0c316c1808..0b69b7d8d1 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 35;
+use Test::More tests => 36;
 
 # Initialize master node
 my $node_master = get_new_node('master');
@@ -18,6 +18,14 @@ my $backup_name = 'my_backup';
 # Take backup
 $node_master->backup($backup_name);
 
+# Check if logical-rep session properly refuses to start physical-rep
+my ($ret, $stdout, $stderr) =
+  $node_master->psql('template1',
+                     qq[START_REPLICATION PHYSICAL 0/1],
+                     replication=>'database');
+ok($stderr =~ /ERROR:  cannot initiate physical replication on a logical replication connection/,
+   "check if physical replication is rejected on logical-rep session");
+
 # Create streaming standby linking to master
 my $node_standby_1 = get_new_node('standby_1');
 $node_standby_1->init_from_backup($node_master, $backup_name,
@@ -94,7 +102,7 @@ sub test_target_session_attrs
 
     # The client used for the connection does not matter, only the backend
     # point does.
-    my ($ret, $stdout, $stderr) =
+    ($ret, $stdout, $stderr) =
       $node1->psql('postgres', 'SHOW port;',
         extra_params => [ '-d', $connstr ]);
     is( $status == $ret && $stdout eq $target_node->port,
@@ -136,7 +144,7 @@ my $connstr_rep    = "$connstr_common replication=1";
 my $connstr_db     = "$connstr_common replication=database dbname=postgres";
 
 # Test SHOW ALL
-my ($ret, $stdout, $stderr) = $node_master->psql(
+($ret, $stdout, $stderr) = $node_master->psql(
     'postgres', 'SHOW ALL;',
     on_error_die => 1,
     extra_params => [ '-d', $connstr_rep ]);
diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl
index ee05535b1c..eefde8c3f1 100644
--- a/src/test/recovery/t/006_logical_decoding.pl
+++ b/src/test/recovery/t/006_logical_decoding.pl
@@ -7,7 +7,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 13;
+use Test::More tests => 14;
 use Config;
 
 # Initialize master node
@@ -36,6 +36,14 @@ ok( $stderr =~
       m/replication slot "test_slot" was not created in this database/,
     "Logical decoding correctly fails to start");
 
+# Check if physical-rep session properly refuses to start logical-decoding
+($result, $stdout, $stderr) =
+  $node_master->psql('template1',
+                     qq[START_REPLICATION SLOT s1 LOGICAL 0/1],
+                     replication=>'true');
+ok($stderr =~ /ERROR:  logical decoding requires a database connection/,
+   "check if logical decoding is refused on physical-rep connection");
+
 $node_master->safe_psql('postgres',
     qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,10) s;]
 );
-- 
2.18.2


Kyotaro>It seems to me that that crash means Pgjdbc is initiating a logical
Kyotaro>replication connection to start physical replication.

Well, it used to work previously, so it might be a breaking change from the client/application point of view.

Vladimir

Hello, Vladimir.

At Thu, 28 May 2020 11:57:23 +0300, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote in 
> Kyotaro>It seems to me that that crash means Pgjdbc is initiating a logical
> Kyotaro>replication connection to start physical replication.
> 
> Well, it used to work previously, so it might be a breaking change from the
> client/application point of view.

Mmm. It is not the proper way to use physical replication and it's
totally accidental that that worked (or even it might be a bug). The
documentation is saying as the follows, as more-or-less the same for
all versions since 9.4.

https://www.postgresql.org/docs/13/protocol-replication.html

> To initiate streaming replication, the frontend sends the replication
> parameter in the startup message. A Boolean value of true (or on, yes,
> 1) tells the backend to go into physical replication walsender mode,
> wherein a small set of replication commands, shown below, can be
> issued instead of SQL statements.
> 
> Passing database as the value for the replication parameter instructs
> the backend to go into logical replication walsender mode, connecting
> to the database specified in the dbname parameter. In logical
> replication walsender mode, the replication commands shown below as
> well as normal SQL commands can be issued.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





On Thu, 28 May 2020 at 05:11, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Hello, Vladimir.

At Thu, 28 May 2020 11:57:23 +0300, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote in
> Kyotaro>It seems to me that that crash means Pgjdbc is initiating a logical
> Kyotaro>replication connection to start physical replication.
>
> Well, it used to work previously, so it might be a breaking change from the
> client/application point of view.

Mmm. It is not the proper way to use physical replication and it's
totally accidental that that worked (or even it might be a bug). The
documentation is saying as the follows, as more-or-less the same for
all versions since 9.4.

https://www.postgresql.org/docs/13/protocol-replication.html

> To initiate streaming replication, the frontend sends the replication
> parameter in the startup message. A Boolean value of true (or on, yes,
> 1) tells the backend to go into physical replication walsender mode,
> wherein a small set of replication commands, shown below, can be
> issued instead of SQL statements.
>
> Passing database as the value for the replication parameter instructs
> the backend to go into logical replication walsender mode, connecting
> to the database specified in the dbname parameter. In logical
> replication walsender mode, the replication commands shown below as
> well as normal SQL commands can be issued.

regards.

While the documentation does indeed say that there is quite a bit of additional confusion added by:

and 
START_REPLICATION [ SLOT slot_name ] [ PHYSICAL ] XXX/XXX [ TIMELINE tli ]

If we already have a physical replication slot according to the startup message why do we need to specify it in the START REPLICATION message ?

Dave
At Thu, 28 May 2020 09:08:19 -0400, Dave Cramer <davecramer@postgres.rocks> wrote in 
> On Thu, 28 May 2020 at 05:11, Kyotaro Horiguchi <horikyota.ntt@gmail.com>
> wrote:
> > Mmm. It is not the proper way to use physical replication and it's
> > totally accidental that that worked (or even it might be a bug). The
> > documentation is saying as the follows, as more-or-less the same for
> > all versions since 9.4.
> >
> > https://www.postgresql.org/docs/13/protocol-replication.html
...
> >
> While the documentation does indeed say that there is quite a bit of
> additional confusion added by:
> 
> and
> START_REPLICATION [ SLOT *slot_name* ] [ PHYSICAL ] *XXX/XXX* [ TIMELINE
> *tli* ]
> 
> If we already have a physical replication slot according to the startup
> message why do we need to specify it in the START REPLICATION message ?

I don't know, but physical replication has worked that way since
before the replication slots was introduced so we haven't needed to do
so.  Physical replication slots are not assumed as more than
memorandum for the oldest required WAL segment (and oldest xmin).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Thu, May 28, 2020 at 06:11:39PM +0900, Kyotaro Horiguchi wrote:
> Mmm. It is not the proper way to use physical replication and it's
> totally accidental that that worked (or even it might be a bug). The
> documentation is saying as the follows, as more-or-less the same for
> all versions since 9.4.
>
> https://www.postgresql.org/docs/13/protocol-replication.html

+       if (am_db_walsender)
+               ereport(ERROR,
+                       (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                errmsg("cannot initiate physical
replication on a logical replication connection")));

I don't agree with this change.  The only restriction that we have in
place now in walsender.c regarding MyDatabaseId not being set is to
prevent the execution of SQL commands.  Note that it is possible to
start physical replication even if MyDatabaseId is set in a
replication connection, so you could break cases that have been valid
until now.

I think that we actually should be much more careful with the
initialization of the WAL reader used in the context of a WAL sender
before calling WALRead() and attempting to read a new WAL page.
--
Michael

Attachment
At Fri, 29 May 2020 16:21:38 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Thu, May 28, 2020 at 06:11:39PM +0900, Kyotaro Horiguchi wrote:
> > Mmm. It is not the proper way to use physical replication and it's
> > totally accidental that that worked (or even it might be a bug). The
> > documentation is saying as the follows, as more-or-less the same for
> > all versions since 9.4.
> > 
> > https://www.postgresql.org/docs/13/protocol-replication.html
> 
> +       if (am_db_walsender)
> +               ereport(ERROR,
> +                       (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                                errmsg("cannot initiate physical
> replication on a logical replication connection")));
> 
> I don't agree with this change.  The only restriction that we have in
> place now in walsender.c regarding MyDatabaseId not being set is to
> prevent the execution of SQL commands.  Note that it is possible to
> start physical replication even if MyDatabaseId is set in a
> replication connection, so you could break cases that have been valid
> until now.

It donesn't check MyDatabase, but whether the connection parameter
"repliation" is "true" or "database".  The documentation is telling
that "replication" should be "true" for a connection that is to be
used for physical replication, and "replication" should literally be
"database" for a connection that is for logical replication.  We need
to revise the documentation if we are going to allow physical
replication on a conection with "replication = database".

> I think that we actually should be much more careful with the
> initialization of the WAL reader used in the context of a WAL sender
> before calling WALRead() and attempting to read a new WAL page.

I agree that the initialization can be improved, but the current code
is no problem if we don't allow to run both logical and physical
replication on a single session.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Fri, 29 May 2020 at 17:57, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Fri, 29 May 2020 16:21:38 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> > On Thu, May 28, 2020 at 06:11:39PM +0900, Kyotaro Horiguchi wrote:
> > > Mmm. It is not the proper way to use physical replication and it's
> > > totally accidental that that worked (or even it might be a bug). The
> > > documentation is saying as the follows, as more-or-less the same for
> > > all versions since 9.4.
> > >
> > > https://www.postgresql.org/docs/13/protocol-replication.html
> >
> > +       if (am_db_walsender)
> > +               ereport(ERROR,
> > +                       (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > +                                errmsg("cannot initiate physical
> > replication on a logical replication connection")));
> >
> > I don't agree with this change.  The only restriction that we have in
> > place now in walsender.c regarding MyDatabaseId not being set is to
> > prevent the execution of SQL commands.  Note that it is possible to
> > start physical replication even if MyDatabaseId is set in a
> > replication connection, so you could break cases that have been valid
> > until now.
>
> It donesn't check MyDatabase, but whether the connection parameter
> "repliation" is "true" or "database".  The documentation is telling
> that "replication" should be "true" for a connection that is to be
> used for physical replication, and "replication" should literally be
> "database" for a connection that is for logical replication.  We need
> to revise the documentation if we are going to allow physical
> replication on a conection with "replication = database".
>

Yes. Conversely, if we start logical replication in a physical
replication connection (i.g. replication=true) we got an error before
staring replication:

ERROR:  logical decoding requires a database connection

I think we can prevent that SEGV in a similar way.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:
> Yes. Conversely, if we start logical replication in a physical
> replication connection (i.g. replication=true) we got an error before
> staring replication:
>
> ERROR:  logical decoding requires a database connection
>
> I think we can prevent that SEGV in a similar way.

Still unconvinced as this restriction stands for logical decoding
requiring a database connection but it is not necessarily true now as
physical replication has less restrictions than a logical one.

Looking at the code, I think that there is some confusion with the
fake WAL reader used as base reference in InitWalSender() where we
assume that it could only be used in the context of a non-database WAL
sender.  However, this initialization happens when the WAL sender
connection is initialized, and what I think this misses is that we
should try to initialize a WAL reader when actually going through a
START_REPLICATION command.

I can note as well that StartLogicalReplication() moves in this sense
by setting xlogreader to be the one from logical_decoding_ctx once the
decoding context has been created.

This results in the attached.  The extra test from upthread to check
that logical decoding is not allowed in a non-database WAL sender is a
good idea, so I have kept it.
--
Michael

Attachment

On 2020/06/02 13:24, Michael Paquier wrote:
> On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:
>> Yes. Conversely, if we start logical replication in a physical
>> replication connection (i.g. replication=true) we got an error before
>> staring replication:
>>
>> ERROR:  logical decoding requires a database connection
>>
>> I think we can prevent that SEGV in a similar way.
> 
> Still unconvinced as this restriction stands for logical decoding
> requiring a database connection but it is not necessarily true now as
> physical replication has less restrictions than a logical one.

Could you tell me what the benefit for supporting physical replication on
logical rep connection is? If it's only for "undocumented"
backward-compatibility, IMO it's better to reject such "tricky" set up.
But if there are some use cases for that, I'm ok to support that.

> Looking at the code, I think that there is some confusion with the
> fake WAL reader used as base reference in InitWalSender() where we
> assume that it could only be used in the context of a non-database WAL
> sender.  However, this initialization happens when the WAL sender
> connection is initialized, and what I think this misses is that we
> should try to initialize a WAL reader when actually going through a
> START_REPLICATION command.
> 
> I can note as well that StartLogicalReplication() moves in this sense
> by setting xlogreader to be the one from logical_decoding_ctx once the
> decoding context has been created.
> 
> This results in the attached.  The extra test from upthread to check
> that logical decoding is not allowed in a non-database WAL sender is a
> good idea, so I have kept it.

Yes. Also we should add the test to check if physical replication can work
fine even on logical rep connection?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



At Tue, 2 Jun 2020 13:24:56 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:
> > Yes. Conversely, if we start logical replication in a physical
> > replication connection (i.g. replication=true) we got an error before
> > staring replication:
> > 
> > ERROR:  logical decoding requires a database connection
> > 
> > I think we can prevent that SEGV in a similar way.
> 
> Still unconvinced as this restriction stands for logical decoding
> requiring a database connection but it is not necessarily true now as
> physical replication has less restrictions than a logical one.

If we deliberately allow physical replication on a
database-replication connection, we should revise the documentation
that way. On the other hand physical replication has wider access to a
database cluster than logical replication.  Thus allowing to start
physical replication on a logical replication connection could
introduce a problem related to privileges.  So I think it might be
better that physical and logical replication have separate pg_hba
lines.

Once we explicitly allow physical replication on a logical replication
connection in documentation, it would be far harder to change the
behavior than now.

If we are sure that that cannot be a problem, I don't object the
change in documented behavior.

> Looking at the code, I think that there is some confusion with the
> fake WAL reader used as base reference in InitWalSender() where we
> assume that it could only be used in the context of a non-database WAL
> sender.  However, this initialization happens when the WAL sender
> connection is initialized, and what I think this misses is that we 
> should try to initialize a WAL reader when actually going through a
> START_REPLICATION command.

At first fake_xlogreader was really a fake one that only provides
callback routines, but it should have been changed to a real
xlogreader at the time it began to store segment information. In that
sense moving to real xlogreader makes sense to me separately from
whether we allow physicalrep on logicalrep connections.

> I can note as well that StartLogicalReplication() moves in this sense
> by setting xlogreader to be the one from logical_decoding_ctx once the
> decoding context has been created.
> 
> This results in the attached.  The extra test from upthread to check
> that logical decoding is not allowed in a non-database WAL sender is a
> good idea, so I have kept it.

+        ereport(ERROR,
+                (errcode(ERRCODE_OUT_OF_MEMORY),
+                 errmsg("out of memory")));

The same error message is accompanied by a DETAILS in some other
places. Don't we need one for this?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote:
> On 2020/06/02 13:24, Michael Paquier wrote:
>> Still unconvinced as this restriction stands for logical decoding
>> requiring a database connection but it is not necessarily true now as
>> physical replication has less restrictions than a logical one.
>
> Could you tell me what the benefit for supporting physical replication on
> logical rep connection is? If it's only for "undocumented"
> backward-compatibility, IMO it's better to reject such "tricky" set up.
> But if there are some use cases for that, I'm ok to support that.

Well, I don't really think that we can just break a behavior that
exists since 9.4 as you could break applications relying on the
existing behavior, and that's also the point of Vladimir upthread.

On top of it, the issue is actually unrelated to if we want to
restrict things more or not when starting replication in a WAL sender
because the xlogreader creation just needs to happen when starting
replication.  Now we have a static "fake" one created when a WAL
sender process starts, something that it would not need in most cases
like answering to a BASE_BACKUP command for example.

>> I can note as well that StartLogicalReplication() moves in this sense
>> by setting xlogreader to be the one from logical_decoding_ctx once the
>> decoding context has been created.
>>
>> This results in the attached.  The extra test from upthread to check
>> that logical decoding is not allowed in a non-database WAL sender is a
>> good idea, so I have kept it.
>
> Yes. Also we should add the test to check if physical replication can work
> fine even on logical rep connection?

I found confusing the use of psql to confirm that it actually works,
because we'd just return a protocol-level error in this case with psql
bumping on COPY_BOTH and it is not reliable to do just an error
message match.  Note as well that GetConnection() discards
automatically the database name for pg_basebackup and pg_receivewal as
well as libpqrcv_connect() for standbys so we cannot use that.
Perhaps using psql is better than nothing, but that makes me
uncomfortable.
--
Michael

Attachment



On Wed, 3 Jun 2020 at 01:19, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote:
> On 2020/06/02 13:24, Michael Paquier wrote:
>> Still unconvinced as this restriction stands for logical decoding
>> requiring a database connection but it is not necessarily true now as
>> physical replication has less restrictions than a logical one.
>
> Could you tell me what the benefit for supporting physical replication on
> logical rep connection is? If it's only for "undocumented"
> backward-compatibility, IMO it's better to reject such "tricky" set up.
> But if there are some use cases for that, I'm ok to support that.

Well, I don't really think that we can just break a behavior that
exists since 9.4 as you could break applications relying on the
existing behavior, and that's also the point of Vladimir upthread.

I don't see this is a valid reason to keep doing something. If it is broken then fix it.
Clients can deal with the change. 


On 2020/06/03 20:33, Dave Cramer wrote:
> 
> 
> 
> On Wed, 3 Jun 2020 at 01:19, Michael Paquier <michael@paquier.xyz <mailto:michael@paquier.xyz>> wrote:
> 
>     On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote:
>      > On 2020/06/02 13:24, Michael Paquier wrote:
>      >> Still unconvinced as this restriction stands for logical decoding
>      >> requiring a database connection but it is not necessarily true now as
>      >> physical replication has less restrictions than a logical one.
>      >
>      > Could you tell me what the benefit for supporting physical replication on
>      > logical rep connection is? If it's only for "undocumented"
>      > backward-compatibility, IMO it's better to reject such "tricky" set up.
>      > But if there are some use cases for that, I'm ok to support that.
> 
>     Well, I don't really think that we can just break a behavior that
>     exists since 9.4 as you could break applications relying on the
>     existing behavior, and that's also the point of Vladimir upthread.

For the back branches, I agree with you. Even if it's undocumented behavior,
basically we should not get rid of it from the back branches unless there is
very special reason.

For v13, if it has no functional merit, I don't think it's so bad to get rid of
that undocumented (and maybe not-fully tested) behavior. If there are
applications depending it, I think that they can be updated.

> I don't see this is a valid reason to keep doing something. If it is broken then fix it.
> Clients can deal with the change.

+1

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



On 2020-Jun-02, Michael Paquier wrote:

> I can note as well that StartLogicalReplication() moves in this sense
> by setting xlogreader to be the one from logical_decoding_ctx once the
> decoding context has been created.
> 
> This results in the attached.  The extra test from upthread to check
> that logical decoding is not allowed in a non-database WAL sender is a
> good idea, so I have kept it.

I don't particularly disagree with your proposed patch -- in fact, it
seems to make things cleaner.  It is a little wasteful, but I don't
really mind that.  It's just some memory, and it's not a significant
amount.

That said, I would *also* apply Kyotaro's proposed patch to prohibit a
physical standby running with a logical slot, if only because that
reduces the number of combinations that we need to test and keep our
collective heads straight about.  Just reject the weird case and have
one type of slot for each type of replication.  I didn't even think this
was at all possible.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hi,

On 2020-06-02 14:23:50 +0900, Fujii Masao wrote:
> On 2020/06/02 13:24, Michael Paquier wrote:
> > On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:
> > > Yes. Conversely, if we start logical replication in a physical
> > > replication connection (i.g. replication=true) we got an error before
> > > staring replication:
> > > 
> > > ERROR:  logical decoding requires a database connection
> > > 
> > > I think we can prevent that SEGV in a similar way.
> > 
> > Still unconvinced as this restriction stands for logical decoding
> > requiring a database connection but it is not necessarily true now as
> > physical replication has less restrictions than a logical one.
> 
> Could you tell me what the benefit for supporting physical replication on
> logical rep connection is? If it's only for "undocumented"
> backward-compatibility, IMO it's better to reject such "tricky" set up.
> But if there are some use cases for that, I'm ok to support that.

I don't think we should prohibit this. For one, it'd probably break some
clients, without a meaningful need.

But I think it's also actually quite useful to be able to access
catalogs before streaming data. You e.g. can look up configuration of
the primary before streaming WAL. With a second connection that's
actually harder to do reliably in some cases, because you need to be
sure that you actually reached the right server (consider a pooler,
automatic failover etc).

Greetings,

Andres Freund



On 2020-Jun-03, Andres Freund wrote:

> I don't think we should prohibit this. For one, it'd probably break some
> clients, without a meaningful need.

There *is* a need, namely to keep complexity down.  This is quite
convoluted, it's got a lot of historical baggage because of the way it
was implemented, and it's very difficult to understand.  The greatest
motive I see is to make this easier to understand, so that it is easier
to modify and improve in the future.

> But I think it's also actually quite useful to be able to access
> catalogs before streaming data. You e.g. can look up configuration of
> the primary before streaming WAL. With a second connection that's
> actually harder to do reliably in some cases, because you need to be
> sure that you actually reached the right server (consider a pooler,
> automatic failover etc).

I don't think having a physical replication connection access catalog
data directly is a great idea.  We already have gadgets like
IDENTIFY_SYSTEM for physical replication that can do that, and if you
need particular settings you can use SHOW (commit d1ecd539477).  If
there was a strong need for even more than that, we can add something to
the grammar.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hi,

On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote:
> On 2020-Jun-03, Andres Freund wrote:
> > I don't think we should prohibit this. For one, it'd probably break some
> > clients, without a meaningful need.
> 
> There *is* a need, namely to keep complexity down.  This is quite
> convoluted, it's got a lot of historical baggage because of the way it
> was implemented, and it's very difficult to understand.  The greatest
> motive I see is to make this easier to understand, so that it is easier
> to modify and improve in the future.

That seems like a possibly convincing argument for not introducing the
capability, but doesn't seem strong enough to remove it. Especially not
if it was just broken as part of effectively a refactoring, as far as I
understand?


> > But I think it's also actually quite useful to be able to access
> > catalogs before streaming data. You e.g. can look up configuration of
> > the primary before streaming WAL. With a second connection that's
> > actually harder to do reliably in some cases, because you need to be
> > sure that you actually reached the right server (consider a pooler,
> > automatic failover etc).
> 
> I don't think having a physical replication connection access catalog
> data directly is a great idea.  We already have gadgets like
> IDENTIFY_SYSTEM for physical replication that can do that, and if you
> need particular settings you can use SHOW (commit d1ecd539477).  If
> there was a strong need for even more than that, we can add something to
> the grammar.

Those special case things are a bad idea, and we shouldn't introduce
more. It's unrealistic that we can ever make that support everything,
and since we already have to support the database connected thing, I
don't see the point.

Greetings,

Andres Freund



On Wed, Jun 03, 2020 at 06:33:11PM -0700, Andres Freund wrote:
> On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote:
>> There *is* a need, namely to keep complexity down.  This is quite
>> convoluted, it's got a lot of historical baggage because of the way it
>> was implemented, and it's very difficult to understand.  The greatest
>> motive I see is to make this easier to understand, so that it is easier
>> to modify and improve in the future.
>
> That seems like a possibly convincing argument for not introducing the
> capability, but doesn't seem strong enough to remove it. Especially not
> if it was just broken as part of effectively a refactoring, as far as I
> understand?

Are there any objections in fixing the issue first then?  As far as I
can see there is no objection to this part, like here:
https://www.postgresql.org/message-id/20200603214448.GA901@alvherre.pgsql

>> I don't think having a physical replication connection access catalog
>> data directly is a great idea.  We already have gadgets like
>> IDENTIFY_SYSTEM for physical replication that can do that, and if you
>> need particular settings you can use SHOW (commit d1ecd539477).  If
>> there was a strong need for even more than that, we can add something to
>> the grammar.
>
> Those special case things are a bad idea, and we shouldn't introduce
> more. It's unrealistic that we can ever make that support everything,
> and since we already have to support the database connected thing, I
> don't see the point.

Let's continue discussing this part as well.
--
Michael

Attachment
On 2020-Jun-03, Andres Freund wrote:

> On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote:
> > On 2020-Jun-03, Andres Freund wrote:
> > > I don't think we should prohibit this. For one, it'd probably break some
> > > clients, without a meaningful need.
> > 
> > There *is* a need, namely to keep complexity down.  This is quite
> > convoluted, it's got a lot of historical baggage because of the way it
> > was implemented, and it's very difficult to understand.  The greatest
> > motive I see is to make this easier to understand, so that it is easier
> > to modify and improve in the future.
> 
> That seems like a possibly convincing argument for not introducing the
> capability, but doesn't seem strong enough to remove it.

This "capability" has never been introduced.  The fact that it's there
is just an accident.  In fact, it's not a capability, since the feature
(physical replication) is invoked differently -- namely, using a
physical replication connection.  JDBC uses a logical replication
connection for it only because they never realized that they were
supposed to do differently, because we failed to throw the correct
error message in the first place.

> > I don't think having a physical replication connection access catalog
> > data directly is a great idea.  We already have gadgets like
> > IDENTIFY_SYSTEM for physical replication that can do that, and if you
> > need particular settings you can use SHOW (commit d1ecd539477).  If
> > there was a strong need for even more than that, we can add something to
> > the grammar.
> 
> Those special case things are a bad idea, and we shouldn't introduce
> more.

What special case things?  The replication connection has never been
supposed to run SQL.  That's why we have SHOW in the replication
grammar.

> It's unrealistic that we can ever make that support everything,
> and since we already have to support the database connected thing, I
> don't see the point.

A logical replication connection is not supposed to be used for physical
replication.  That's just going to make more bugs appear.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On 2020-Jun-04, Michael Paquier wrote:

> On Wed, Jun 03, 2020 at 06:33:11PM -0700, Andres Freund wrote:

> >> I don't think having a physical replication connection access catalog
> >> data directly is a great idea.  We already have gadgets like
> >> IDENTIFY_SYSTEM for physical replication that can do that, and if you
> >> need particular settings you can use SHOW (commit d1ecd539477).  If
> >> there was a strong need for even more than that, we can add something to
> >> the grammar.
> > 
> > Those special case things are a bad idea, and we shouldn't introduce
> > more. It's unrealistic that we can ever make that support everything,
> > and since we already have to support the database connected thing, I
> > don't see the point.
> 
> Let's continue discussing this part as well.

A logical replication connection cannot run SQL anyway, can it?  it's
limited to the replication grammar.  So it's not like you can run
arbitrary queries to access catalog data.  So even if we do need to
access the catalogs, we'd have to add stuff to the replication grammar
in order to support that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hi,

On 2020-06-04 16:44:53 -0400, Alvaro Herrera wrote:
> A logical replication connection cannot run SQL anyway, can it?

You can:

andres@awork3:~/src/postgresql$ psql 'replication=database'

postgres[52656][1]=# IDENTIFY_SYSTEM;
┌─────────────────────┬──────────┬────────────┬──────────┐
│      systemid       │ timeline │  xlogpos   │  dbname  │
├─────────────────────┼──────────┼────────────┼──────────┤
│ 6821634567571961151 │        1 │ 1/D256EC40 │ postgres │
└─────────────────────┴──────────┴────────────┴──────────┘
(1 row)

postgres[52656][1]=# SELECT 1;
┌──────────┐
│ ?column? │
├──────────┤
│        1 │
└──────────┘
(1 row)


I am very much not in love with the way that was implemented, but it's
there, and it's used as far as I know (cf tablesync.c).

Greetings,

Andres Freund



On 2020-Jun-04, Andres Freund wrote:

> postgres[52656][1]=# SELECT 1;
> ┌──────────┐
> │ ?column? │
> ├──────────┤
> │        1 │
> └──────────┘
> (1 row)
> 
> 
> I am very much not in love with the way that was implemented, but it's
> there, and it's used as far as I know (cf tablesync.c).

Ouch ... so they made IDENT in the replication grammar be a trigger to
enter the regular grammar.  Crazy.  No way to put those worms back in
the tin now, I guess.

It is still my opinion that we should prohibit a logical replication
connection from being used to do physical replication.  Horiguchi-san,
Sawada-san and Masao-san are all of the same opinion.  Dave Cramer (of
the JDBC team) is not opposed to the change -- he says they're just
using it because they didn't realize they should be doing differently.

Both Michael P. and you are saying we shouldn't break it because it
works today, but there isn't a real use-case for it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2020-Jun-04, Andres Freund wrote:

> postgres[52656][1]=# SELECT 1;
> ┌──────────┐
> │ ?column? │
> ├──────────┤
> │        1 │
> └──────────┘
> (1 row)
>
>
> I am very much not in love with the way that was implemented, but it's
> there, and it's used as far as I know (cf tablesync.c).

Ouch ... so they made IDENT in the replication grammar be a trigger to
enter the regular grammar.  Crazy.  No way to put those worms back in
the tin now, I guess.

Is that documented ? 

It is still my opinion that we should prohibit a logical replication
connection from being used to do physical replication.  Horiguchi-san,
Sawada-san and Masao-san are all of the same opinion.  Dave Cramer (of
the JDBC team) is not opposed to the change -- he says they're just
using it because they didn't realize they should be doing differently.

I think my exact words were 

"I don't see this is a valid reason to keep doing something. If it is broken then fix it.
Clients can deal with the change."

in response to:

Well, I don't really think that we can just break a behavior that
exists since 9.4 as you could break applications relying on the
existing behavior, and that's also the point of Vladimir upthread.

Which is different than not being opposed to the change. I don't see this as broken,
and it's quite possible that some of our users are using it. It certainly needs to be documented
 
Dave
On 2020-Jun-05, Dave Cramer wrote:

> On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:

> > Ouch ... so they made IDENT in the replication grammar be a trigger to
> > enter the regular grammar.  Crazy.  No way to put those worms back in
> > the tin now, I guess.
> 
> Is that documented ?

I don't think it is.

> > It is still my opinion that we should prohibit a logical replication
> > connection from being used to do physical replication.  Horiguchi-san,
> > Sawada-san and Masao-san are all of the same opinion.  Dave Cramer (of
> > the JDBC team) is not opposed to the change -- he says they're just
> > using it because they didn't realize they should be doing differently.
> 
> I think my exact words were
> 
> "I don't see this is a valid reason to keep doing something. If it is
> broken then fix it.
> Clients can deal with the change."
> 
> in response to:
> 
> > Well, I don't really think that we can just break a behavior that
> > exists since 9.4 as you could break applications relying on the
> > existing behavior, and that's also the point of Vladimir upthread.
> 
> Which is different than not being opposed to the change. I don't see this
> as broken, and it's quite possible that some of our users are using
> it.

Apologies for misinterpreting.

> It certainly needs to be documented

I'd rather not.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Thu, Jun 04, 2020 at 11:07:29AM +0900, Michael Paquier wrote:
> Are there any objections in fixing the issue first then?  As far as I
> can see there is no objection to this part, like here:
> https://www.postgresql.org/message-id/20200603214448.GA901@alvherre.pgsql

Hearing nothing, I have applied this part and fixed the crash to take
care of the open item.
--
Michael

Attachment
Hi,

On 6/5/20 11:51 AM, Alvaro Herrera wrote:
> On 2020-Jun-05, Dave Cramer wrote:
>
>> On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera <alvherre@2ndquadrant.com>
>> wrote:
>
>>> Ouch ... so they made IDENT in the replication grammar be a trigger to
>>> enter the regular grammar.  Crazy.  No way to put those worms back in
>>> the tin now, I guess.
>>
>> Is that documented ?
>
> I don't think it is.
>
>>> It is still my opinion that we should prohibit a logical replication
>>> connection from being used to do physical replication.  Horiguchi-san,
>>> Sawada-san and Masao-san are all of the same opinion.  Dave Cramer (of
>>> the JDBC team) is not opposed to the change -- he says they're just
>>> using it because they didn't realize they should be doing differently.
>>
>> I think my exact words were
>>
>> "I don't see this is a valid reason to keep doing something. If it is
>> broken then fix it.
>> Clients can deal with the change."
>>
>> in response to:
>>
>>> Well, I don't really think that we can just break a behavior that
>>> exists since 9.4 as you could break applications relying on the
>>> existing behavior, and that's also the point of Vladimir upthread.
>>
>> Which is different than not being opposed to the change. I don't see this
>> as broken, and it's quite possible that some of our users are using
>> it.
>
> Apologies for misinterpreting.
>
>> It certainly needs to be documented
>
> I'd rather not.

The PG13 RMT had a discussion about this thread, and while the initial
crash has been fixed, we decided to re-open the Open Item around whether
we should allow physical replication to be initiated in a logical
replication session.

We anticipate a resolution for PG13, whether it is explicitly
disallowing physical replication from occurring on a logical replication
slot, maintaining the status quo, or something else such that there is
consensus on the approach.

Thanks,

Jonathan


Attachment
Hi,

On 2020-06-21 13:45:36 -0400, Jonathan S. Katz wrote:
> The PG13 RMT had a discussion about this thread, and while the initial
> crash has been fixed, we decided to re-open the Open Item around whether
> we should allow physical replication to be initiated in a logical
> replication session.

Since this is a long-time issue, this doesn't quite seem like an issue
for the RMT?


> We anticipate a resolution for PG13, whether it is explicitly
> disallowing physical replication from occurring on a logical replication
> slot, maintaining the status quo, or something else such that there is
> consensus on the approach.

s/logical replication slot/logical replication connection/?


I still maintain that adding restrictions here is a bad idea. Even
disregarding the discussion of running normal queries interspersed, it's
useful to be able to both request WAL and receive logical changes over
the same connection. E.g. for creating a logical replica by first doing
a physical base backup (vastly faster), or fetching WAL for decoding
large transactions onto a standby.

And I just don't see any reasons to disallow it. There's basically no
reduction in complexity by doing so.

Greetings,

Andres Freund



On Sun, Jun 21, 2020 at 01:02:34PM -0700, Andres Freund wrote:
> I still maintain that adding restrictions here is a bad idea. Even
> disregarding the discussion of running normal queries interspersed, it's
> useful to be able to both request WAL and receive logical changes over
> the same connection. E.g. for creating a logical replica by first doing
> a physical base backup (vastly faster), or fetching WAL for decoding
> large transactions onto a standby.
>
> And I just don't see any reasons to disallow it. There's basically no
> reduction in complexity by doing so.

Yeah, I still stand by the same opinion here to do nothing.  I suspect
that we have good chances to annoy people and some cases we are
overlooking here, that used to work.
--
Michael

Attachment
At Tue, 23 Jun 2020 10:51:40 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Sun, Jun 21, 2020 at 01:02:34PM -0700, Andres Freund wrote:
> > I still maintain that adding restrictions here is a bad idea. Even
> > disregarding the discussion of running normal queries interspersed, it's
> > useful to be able to both request WAL and receive logical changes over
> > the same connection. E.g. for creating a logical replica by first doing
> > a physical base backup (vastly faster), or fetching WAL for decoding
> > large transactions onto a standby.
> > 
> > And I just don't see any reasons to disallow it. There's basically no
> > reduction in complexity by doing so.
> 
> Yeah, I still stand by the same opinion here to do nothing.  I suspect
> that we have good chances to annoy people and some cases we are
> overlooking here, that used to work.

In logical replication, a replication role is intended to be
accessible only to the GRANTed databases.  On the other hand the same
role can create a dead copy of the whole cluster, including
non-granted databases.  It seems like a sieve missing a mesh screen.

I agree that that doesn't harm as far as roles are strictly managed so
I don't insist so strongly on inhibiting the behavior. However, the
documentation at least needs amendment.

https://www.postgresql.org/docs/13/protocol-replication.html

====
To initiate streaming replication, the frontend sends the replication
parameter in the startup message. A Boolean value of true (or on, yes,
1) tells the backend to go into physical replication walsender mode,
wherein a small set of replication commands, shown below, can be
issued instead of SQL statements.

Passing database as the value for the replication parameter instructs
the backend to go into logical replication walsender mode, connecting
to the database specified in the dbname parameter. In logical
replication walsender mode, the replication commands shown below as
well as normal SQL commands can be issued.
====

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




On 2020/06/24 11:56, Kyotaro Horiguchi wrote:
> At Tue, 23 Jun 2020 10:51:40 +0900, Michael Paquier <michael@paquier.xyz> wrote in
>> On Sun, Jun 21, 2020 at 01:02:34PM -0700, Andres Freund wrote:
>>> I still maintain that adding restrictions here is a bad idea. Even
>>> disregarding the discussion of running normal queries interspersed, it's
>>> useful to be able to both request WAL and receive logical changes over
>>> the same connection. E.g. for creating a logical replica by first doing
>>> a physical base backup (vastly faster), or fetching WAL for decoding
>>> large transactions onto a standby.
>>>
>>> And I just don't see any reasons to disallow it. There's basically no
>>> reduction in complexity by doing so.
>>
>> Yeah, I still stand by the same opinion here to do nothing.  I suspect
>> that we have good chances to annoy people and some cases we are
>> overlooking here, that used to work.
> 
> In logical replication, a replication role is intended to be
> accessible only to the GRANTed databases.  On the other hand the same
> role can create a dead copy of the whole cluster, including
> non-granted databases.  It seems like a sieve missing a mesh screen.

Personally I'd like to disallow physical replication commands
when I explicitly reject physical replication connection
(i.e., set "host replication user x.x.x.x/x reject") in pg_hba.conf,
whether on physical or logical replication connection.


> I agree that that doesn't harm as far as roles are strictly managed so
> I don't insist so strongly on inhibiting the behavior. However, the
> documentation at least needs amendment.

+1

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



On 2020-Jun-24, Kyotaro Horiguchi wrote:

> In logical replication, a replication role is intended to be
> accessible only to the GRANTed databases.  On the other hand the same
> role can create a dead copy of the whole cluster, including
> non-granted databases.

In other words -- essentially, if you grant replication access to a role
only to a specific database, they can steal the whole cluster.

I don't see what's so great about that, but apparently people like it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Greetings,

* Alvaro Herrera (alvherre@2ndquadrant.com) wrote:
> On 2020-Jun-24, Kyotaro Horiguchi wrote:
>
> > In logical replication, a replication role is intended to be
> > accessible only to the GRANTed databases.  On the other hand the same
> > role can create a dead copy of the whole cluster, including
> > non-granted databases.
>
> In other words -- essentially, if you grant replication access to a role
> only to a specific database, they can steal the whole cluster.
>
> I don't see what's so great about that, but apparently people like it.

Sure, people who aren't in charge of security I'm sure like the ease of
use.

Doesn't mean it makes sense or that we should be supporting that.  What
we should have is a way to allow administrators to configure a system
for exactly what they want to allow, and it doesn't seem like we're
doing that today and therefore we should fix it.  This isn't the only
area we have that issue in.

Thanks,

Stephen

Attachment
On 2020-Jun-24, Stephen Frost wrote:

> Doesn't mean it makes sense or that we should be supporting that.  What
> we should have is a way to allow administrators to configure a system
> for exactly what they want to allow, and it doesn't seem like we're
> doing that today and therefore we should fix it.  This isn't the only
> area we have that issue in.

The way to do that, for the case under discussion, is to reject using a
logical replication connection for physical replication commands.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Wed, Jun 24, 2020 at 1:06 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2020-Jun-24, Stephen Frost wrote:
> > Doesn't mean it makes sense or that we should be supporting that.  What
> > we should have is a way to allow administrators to configure a system
> > for exactly what they want to allow, and it doesn't seem like we're
> > doing that today and therefore we should fix it.  This isn't the only
> > area we have that issue in.
>
> The way to do that, for the case under discussion, is to reject using a
> logical replication connection for physical replication commands.

Reading over this discussion, I see basically three arguments:

1. Andres argues that being able to execute physical replication
commands from the same connection as SQL queries is useful, and that
people may be relying on it, and that we shouldn't break it without
need.

2. Fujii Masao argues that the current situation makes it impossible
to write a pg_hba.conf rule that disallows all physical replication
connections, because people could get around it by using a logical
replication connection for physical replication.

3. Various people argue that it's only accidental that physical
replication on a replication=database connection ever worked at all,
and therefore we ought to block it.

I find argument #1 most convincing, #2 less convincing, and #3 least
convincing. In my view, the problem with argument #3 is that just
because some feature combination was unintentional doesn't mean it's
unuseful or unused. As for #2, suppose someone were to propose a
design for logical replication that allowed it to take place without a
database connection, so that it could be done with just a regular
replication connection. Such a feature would create the same problem
Fujii Masao mentions here, but it seems inconceivable that we would
for that reason reject it; we make decisions about features based on
their usefulness, not their knock-on effects on pg_hba.conf rules. We
can always add new kinds of access control restrictions if they are
needed; that is a better approach than removing features so that the
existing pg_hba.conf facilities can be used to accomplish some
particular goal. So really I think this turns on #1: is it plausible
that people are using this feature, however inadvertent it may be, and
is it potentially useful? I don't see that anybody's made an argument
against either of those things. Unless someone can do so, I think we
shouldn't disable this.

That having been said, I think that the fact that you can execute SQL
queries in replication=database connections is horrifying. I really
hate that feature. I think it's a bad design, and a bad
implementation, and a recipe for tons of bugs. But, blocking physical
replication commands on such connections isn't going to solve any of
that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 2020-Jun-24, Robert Haas wrote:

> So really I think this turns on #1: is it plausible
> that people are using this feature, however inadvertent it may be, and
> is it potentially useful? I don't see that anybody's made an argument
> against either of those things. Unless someone can do so, I think we
> shouldn't disable this.

People (specifically the jdbc driver) *are* using this feature in this
way, but they didn't realize they were doing it.  It was an accident and
they didn't notice.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





On Wed, 24 Jun 2020 at 15:41, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2020-Jun-24, Robert Haas wrote:

> So really I think this turns on #1: is it plausible
> that people are using this feature, however inadvertent it may be, and
> is it potentially useful? I don't see that anybody's made an argument
> against either of those things. Unless someone can do so, I think we
> shouldn't disable this.

People (specifically the jdbc driver) *are* using this feature in this
way, but they didn't realize they were doing it.  It was an accident and
they didn't notice.


Not sure we are using it as much as we accidently did it that way. It would be trivial to fix.

That said I think we should fix the security hole this opens and leave the functionality.

Dave
Hi,

On 2020-06-24 15:41:14 -0400, Alvaro Herrera wrote:
> On 2020-Jun-24, Robert Haas wrote:
> 
> > So really I think this turns on #1: is it plausible
> > that people are using this feature, however inadvertent it may be, and
> > is it potentially useful? I don't see that anybody's made an argument
> > against either of those things. Unless someone can do so, I think we
> > shouldn't disable this.
> 
> People (specifically the jdbc driver) *are* using this feature in this
> way, but they didn't realize they were doing it.  It was an accident and
> they didn't notice.

As I said before, I've utilized being able to do both over a single
connection (among others to initialize a logical replica using a base
backup). And I've seen at least one other codebase (developed without my
input) doing so. I really don't understand how you just dismiss this
without any sort of actual argument.  Yes, those uses can be fixed to
reconnect with a different replication parameter, but that's code that
needs to be adjusted and it requires adjustments to pg_hba.conf etc.

And obviously you'd lock out older versions of jdbc, and possibly other
drivers.

Obviously we should allow more granular permissions here, I don't think
anybody is arguing against that.

Greetings,

Andres Freund



On Wed, Jun 24, 2020 at 3:41 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> People (specifically the jdbc driver) *are* using this feature in this
> way, but they didn't realize they were doing it.  It was an accident and
> they didn't notice.

But you don't know that that's true of everyone using this feature,
and even if it were, so what? Breaking a feature that someone didn't
know they were using is just as much of a break as breaking a feature
someone DID know they were using.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 2020-Jun-24, Andres Freund wrote:

> As I said before, I've utilized being able to do both over a single
> connection (among others to initialize a logical replica using a base
> backup). And I've seen at least one other codebase (developed without my
> input) doing so. I really don't understand how you just dismiss this
> without any sort of actual argument.  Yes, those uses can be fixed to
> reconnect with a different replication parameter, but that's code that
> needs to be adjusted and it requires adjustments to pg_hba.conf etc.
> 
> And obviously you'd lock out older versions of jdbc, and possibly other
> drivers.

Well, I had understood that you were talking from a hypothetical
position, not that you were already using the thing that way.  After
these arguments, I agree to leave things alone, and nobody else seems to
be arguing in that direction, so I'll mark the open item as closed.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services