Thread: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
From
Vladimir Sitnikov
Date:
Hi,
Pgjdbc test suite identified a SIGSEGV in the recent HEAD builds of PostgreSQL, Ubuntu 14.04.5 LTS
Here's a call stack: https://travis-ci.org/github/pgjdbc/pgjdbc/jobs/691794110#L7484
The crash is consistent, and it reproduces 100% of the cases so far.
The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became bad by 19 May 14:00 UTC,
so the regression was introduced somewhere in-between.
Does that ring any bells?
In case you wonder:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 XLogSendPhysical () at walsender.c:2762
2762 if (!WALRead(xlogreader,
(gdb) #0 XLogSendPhysical () at walsender.c:2762
SendRqstPtr = 133473640
startptr = 133473240
endptr = 133473640
nbytes = 400
segno = 1
errinfo = {wre_errno = 988942240, wre_off = 2, wre_req = -1,
wre_read = -1, wre_seg = {ws_file = 4714224,
ws_segno = 140729887364688, ws_tli = 0}}
__func__ = "XLogSendPhysical"
#1 0x000000000087fa8a in WalSndLoop (send_data=0x88009d <XLogSendPhysical>)
at walsender.c:2300
__func__ = "WalSndLoop"
#2 0x000000000087d65a in StartReplication (cmd=0x299bda8) at walsender.c:750
buf = {data = 0x0, len = 3, maxlen = 1024, cursor = 87}
FlushPtr = 133473640
__func__ = "StartReplication"
#3 0x000000000087eddc in exec_replication_command (
cmd_string=0x2916e68 "START_REPLICATION 0/7F4A3D8") at walsender.c:1643
cmd = 0x299bda8
parse_rc = 0
cmd_node = 0x299bda8
cmd_context = 0x299bc60
old_context = 0x2916d50
qc = {commandTag = 988942640, nprocessed = 140729887363520}
__func__ = "exec_replication_command"
#0 XLogSendPhysical () at walsender.c:2762
2762 if (!WALRead(xlogreader,
(gdb) #0 XLogSendPhysical () at walsender.c:2762
SendRqstPtr = 133473640
startptr = 133473240
endptr = 133473640
nbytes = 400
segno = 1
errinfo = {wre_errno = 988942240, wre_off = 2, wre_req = -1,
wre_read = -1, wre_seg = {ws_file = 4714224,
ws_segno = 140729887364688, ws_tli = 0}}
__func__ = "XLogSendPhysical"
#1 0x000000000087fa8a in WalSndLoop (send_data=0x88009d <XLogSendPhysical>)
at walsender.c:2300
__func__ = "WalSndLoop"
#2 0x000000000087d65a in StartReplication (cmd=0x299bda8) at walsender.c:750
buf = {data = 0x0, len = 3, maxlen = 1024, cursor = 87}
FlushPtr = 133473640
__func__ = "StartReplication"
#3 0x000000000087eddc in exec_replication_command (
cmd_string=0x2916e68 "START_REPLICATION 0/7F4A3D8") at walsender.c:1643
cmd = 0x299bda8
parse_rc = 0
cmd_node = 0x299bda8
cmd_context = 0x299bc60
old_context = 0x2916d50
qc = {commandTag = 988942640, nprocessed = 140729887363520}
__func__ = "exec_replication_command"
Vladimir
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Michael Paquier
Date:
On Thu, May 28, 2020 at 09:07:04AM +0300, Vladimir Sitnikov wrote: > The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became > bad by 19 May 14:00 UTC, > so the regression was introduced somewhere in-between. > > Does that ring any bells? It does, thanks! This would map with 1d374302 or 850196b6 that reworked this area of the code, so it seems like we are not quite done with this work yet. Do you still see the problem as of 55ca50d (today's latest HEAD)? Also, just wondering.. If I use more or less the same commands as your travis job I should be able to reproduce the problem with a fresh JDBC repository, right? Or do you a sub-portion of your regression tests to run that easily? -- Michael
Attachment
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical() at walsender.c:2762
From
Kyotaro Horiguchi
Date:
At Thu, 28 May 2020 16:22:33 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Thu, May 28, 2020 at 09:07:04AM +0300, Vladimir Sitnikov wrote: > > The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became > > bad by 19 May 14:00 UTC, > > so the regression was introduced somewhere in-between. > > > > Does that ring any bells? > > It does, thanks! This would map with 1d374302 or 850196b6 that > reworked this area of the code, so it seems like we are not quite done > with this work yet. Do you still see the problem as of 55ca50d > (today's latest HEAD)? I think that's not the case. I think I cause this crash with the HEAD. I'll post a fix soon. > Also, just wondering.. If I use more or less the same commands as > your travis job I should be able to reproduce the problem with a fresh > JDBC repository, right? Or do you a sub-portion of your regression > tests to run that easily? Pgjdbc seems initiating physical replication on a logical replication session. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Michael Paquier
Date:
On Thu, May 28, 2020 at 04:32:22PM +0900, Kyotaro Horiguchi wrote: > I think that's not the case. I think I cause this crash with the > HEAD. I'll post a fix soon. > > Pgjdbc seems initiating physical replication on a logical replication > session. Let me see... Indeed: $ psql "replication=database dbname=postgres" =# START_REPLICATION SLOT test_slot PHYSICAL 0/0; [one <boom> later] (gdb) bt #0 XLogSendPhysical () at walsender.c:2762 #1 0x000055d5f7803318 in WalSndLoop (send_data=0x55d5f78039c7 <XLogSendPhysical>) at walsender.c:2300 #2 0x000055d5f7800d70 in StartReplication (cmd=0x55d5f919bc60) at walsender.c:750 #3 0x000055d5f78025ad in exec_replication_command (cmd_string=0x55d5f9119a80 "START_REPLICATION SLOT test_slot PHYSICAL 0/0;") at walsender.c:1643 #4 0x000055d5f786a7ea in PostgresMain (argc=1, argv=0x55d5f91472c8, dbname=0x55d5f9147210 "ioltas", username=0x55d5f91471f0 "ioltas") at postgres.c:4311 #5 0x000055d5f77b4e9c in BackendRun (port=0x55d5f913dcd0) at postmaster.c:4523 #6 0x000055d5f77b4606 in BackendStartup (port=0x55d5f913dcd0) at postmaster.c:4215 #7 0x000055d5f77b08ad in ServerLoop () at postmaster.c:1727 #8 0x000055d5f77b00fc in PostmasterMain (argc=3, argv=0x55d5f9113260) at postmaster.c:1400 #9 0x000055d5f76b3736 in main (argc=3, argv=0x55d5f9113260) at main.c:210 No need for the JDBC test suite then. -- Michael
Attachment
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical() at walsender.c:2762
From
Kyotaro Horiguchi
Date:
At Thu, 28 May 2020 09:07:04 +0300, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote in > Pgjdbc test suite identified a SIGSEGV in the recent HEAD builds of > PostgreSQL, Ubuntu 14.04.5 LTS > > Here's a call stack: > https://travis-ci.org/github/pgjdbc/pgjdbc/jobs/691794110#L7484 > The crash is consistent, and it reproduces 100% of the cases so far. > > The CI history shows that HEAD was good at 11 May 13:27 UTC, and it became > bad by 19 May 14:00 UTC, > so the regression was introduced somewhere in-between. > > Does that ring any bells? Thanks for the report. It is surely a bug since the server crashes, on the other hand Pgjdbc seems doing bad, too. It seems to me that that crash means Pgjdbc is initiating a logical replication connection to start physical replication. > In case you wonder: > > Program terminated with signal SIGSEGV, Segmentation fault. > #0 XLogSendPhysical () at walsender.c:2762 > 2762 if (!WALRead(xlogreader, > (gdb) #0 XLogSendPhysical () at walsender.c:2762 > SendRqstPtr = 133473640 > startptr = 133473240 > endptr = 133473640 > nbytes = 400 > segno = 1 > errinfo = {wre_errno = 988942240, wre_off = 2, wre_req = -1, > wre_read = -1, wre_seg = {ws_file = 4714224, > ws_segno = 140729887364688, ws_tli = 0}} > __func__ = "XLogSendPhysical" I see the probably the same symptom by the following steps with the current HEAD. psql 'host=/tmp replication=database' =# START_REPLICATION 0/1; <serer crashes> Physical replication is not assumed to be started on a logical replication connection. The attached would fix that. The patch adds two tests. One for this case and another for the reverse. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 1f94c98f7459ca8a4942246325815a3e0a91caa4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Thu, 28 May 2020 15:55:30 +0900 Subject: [PATCH v1] Fix crash when starting physical replication on logical connection It is an illegal operation to try starting physical replication on a logical replication session. We should properly warn the client instead of crashing. --- src/backend/replication/walsender.c | 5 +++++ src/test/recovery/t/001_stream_rep.pl | 14 +++++++++++--- src/test/recovery/t/006_logical_decoding.pl | 10 +++++++++- 3 files changed, 25 insertions(+), 4 deletions(-) diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c index 86847cbb54..7b79c75311 100644 --- a/src/backend/replication/walsender.c +++ b/src/backend/replication/walsender.c @@ -589,6 +589,11 @@ StartReplication(StartReplicationCmd *cmd) StringInfoData buf; XLogRecPtr FlushPtr; + if (am_db_walsender) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot initiate physical replication on a logical replication connection"))); + if (ThisTimeLineID == 0) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl index 0c316c1808..0b69b7d8d1 100644 --- a/src/test/recovery/t/001_stream_rep.pl +++ b/src/test/recovery/t/001_stream_rep.pl @@ -3,7 +3,7 @@ use strict; use warnings; use PostgresNode; use TestLib; -use Test::More tests => 35; +use Test::More tests => 36; # Initialize master node my $node_master = get_new_node('master'); @@ -18,6 +18,14 @@ my $backup_name = 'my_backup'; # Take backup $node_master->backup($backup_name); +# Check if logical-rep session properly refuses to start physical-rep +my ($ret, $stdout, $stderr) = + $node_master->psql('template1', + qq[START_REPLICATION PHYSICAL 0/1], + replication=>'database'); +ok($stderr =~ /ERROR: cannot initiate physical replication on a logical replication connection/, + "check if physical replication is rejected on logical-rep session"); + # Create streaming standby linking to master my $node_standby_1 = get_new_node('standby_1'); $node_standby_1->init_from_backup($node_master, $backup_name, @@ -94,7 +102,7 @@ sub test_target_session_attrs # The client used for the connection does not matter, only the backend # point does. - my ($ret, $stdout, $stderr) = + ($ret, $stdout, $stderr) = $node1->psql('postgres', 'SHOW port;', extra_params => [ '-d', $connstr ]); is( $status == $ret && $stdout eq $target_node->port, @@ -136,7 +144,7 @@ my $connstr_rep = "$connstr_common replication=1"; my $connstr_db = "$connstr_common replication=database dbname=postgres"; # Test SHOW ALL -my ($ret, $stdout, $stderr) = $node_master->psql( +($ret, $stdout, $stderr) = $node_master->psql( 'postgres', 'SHOW ALL;', on_error_die => 1, extra_params => [ '-d', $connstr_rep ]); diff --git a/src/test/recovery/t/006_logical_decoding.pl b/src/test/recovery/t/006_logical_decoding.pl index ee05535b1c..eefde8c3f1 100644 --- a/src/test/recovery/t/006_logical_decoding.pl +++ b/src/test/recovery/t/006_logical_decoding.pl @@ -7,7 +7,7 @@ use strict; use warnings; use PostgresNode; use TestLib; -use Test::More tests => 13; +use Test::More tests => 14; use Config; # Initialize master node @@ -36,6 +36,14 @@ ok( $stderr =~ m/replication slot "test_slot" was not created in this database/, "Logical decoding correctly fails to start"); +# Check if physical-rep session properly refuses to start logical-decoding +($result, $stdout, $stderr) = + $node_master->psql('template1', + qq[START_REPLICATION SLOT s1 LOGICAL 0/1], + replication=>'true'); +ok($stderr =~ /ERROR: logical decoding requires a database connection/, + "check if logical decoding is refused on physical-rep connection"); + $node_master->safe_psql('postgres', qq[INSERT INTO decoding_test(x,y) SELECT s, s::text FROM generate_series(1,10) s;] ); -- 2.18.2
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Vladimir Sitnikov
Date:
Kyotaro>It seems to me that that crash means Pgjdbc is initiating a logical
Kyotaro>replication connection to start physical replication.Well, it used to work previously, so it might be a breaking change from the client/application point of view.
Vladimir
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical() at walsender.c:2762
From
Kyotaro Horiguchi
Date:
Hello, Vladimir. At Thu, 28 May 2020 11:57:23 +0300, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote in > Kyotaro>It seems to me that that crash means Pgjdbc is initiating a logical > Kyotaro>replication connection to start physical replication. > > Well, it used to work previously, so it might be a breaking change from the > client/application point of view. Mmm. It is not the proper way to use physical replication and it's totally accidental that that worked (or even it might be a bug). The documentation is saying as the follows, as more-or-less the same for all versions since 9.4. https://www.postgresql.org/docs/13/protocol-replication.html > To initiate streaming replication, the frontend sends the replication > parameter in the startup message. A Boolean value of true (or on, yes, > 1) tells the backend to go into physical replication walsender mode, > wherein a small set of replication commands, shown below, can be > issued instead of SQL statements. > > Passing database as the value for the replication parameter instructs > the backend to go into logical replication walsender mode, connecting > to the database specified in the dbname parameter. In logical > replication walsender mode, the replication commands shown below as > well as normal SQL commands can be issued. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Dave Cramer
Date:
On Thu, 28 May 2020 at 05:11, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
Hello, Vladimir.
At Thu, 28 May 2020 11:57:23 +0300, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote in
> Kyotaro>It seems to me that that crash means Pgjdbc is initiating a logical
> Kyotaro>replication connection to start physical replication.
>
> Well, it used to work previously, so it might be a breaking change from the
> client/application point of view.
Mmm. It is not the proper way to use physical replication and it's
totally accidental that that worked (or even it might be a bug). The
documentation is saying as the follows, as more-or-less the same for
all versions since 9.4.
https://www.postgresql.org/docs/13/protocol-replication.html
> To initiate streaming replication, the frontend sends the replication
> parameter in the startup message. A Boolean value of true (or on, yes,
> 1) tells the backend to go into physical replication walsender mode,
> wherein a small set of replication commands, shown below, can be
> issued instead of SQL statements.
>
> Passing database as the value for the replication parameter instructs
> the backend to go into logical replication walsender mode, connecting
> to the database specified in the dbname parameter. In logical
> replication walsender mode, the replication commands shown below as
> well as normal SQL commands can be issued.
regards.
While the documentation does indeed say that there is quite a bit of additional confusion added by:
and
START_REPLICATION
[ SLOT
slot_name
] [ PHYSICAL
] XXX/XXX
[ TIMELINE
tli
]If we already have a physical replication slot according to the startup message why do we need to specify it in the START REPLICATION message ?
Dave
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical() at walsender.c:2762
From
Kyotaro Horiguchi
Date:
At Thu, 28 May 2020 09:08:19 -0400, Dave Cramer <davecramer@postgres.rocks> wrote in > On Thu, 28 May 2020 at 05:11, Kyotaro Horiguchi <horikyota.ntt@gmail.com> > wrote: > > Mmm. It is not the proper way to use physical replication and it's > > totally accidental that that worked (or even it might be a bug). The > > documentation is saying as the follows, as more-or-less the same for > > all versions since 9.4. > > > > https://www.postgresql.org/docs/13/protocol-replication.html ... > > > While the documentation does indeed say that there is quite a bit of > additional confusion added by: > > and > START_REPLICATION [ SLOT *slot_name* ] [ PHYSICAL ] *XXX/XXX* [ TIMELINE > *tli* ] > > If we already have a physical replication slot according to the startup > message why do we need to specify it in the START REPLICATION message ? I don't know, but physical replication has worked that way since before the replication slots was introduced so we haven't needed to do so. Physical replication slots are not assumed as more than memorandum for the oldest required WAL segment (and oldest xmin). regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Michael Paquier
Date:
On Thu, May 28, 2020 at 06:11:39PM +0900, Kyotaro Horiguchi wrote: > Mmm. It is not the proper way to use physical replication and it's > totally accidental that that worked (or even it might be a bug). The > documentation is saying as the follows, as more-or-less the same for > all versions since 9.4. > > https://www.postgresql.org/docs/13/protocol-replication.html + if (am_db_walsender) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot initiate physical replication on a logical replication connection"))); I don't agree with this change. The only restriction that we have in place now in walsender.c regarding MyDatabaseId not being set is to prevent the execution of SQL commands. Note that it is possible to start physical replication even if MyDatabaseId is set in a replication connection, so you could break cases that have been valid until now. I think that we actually should be much more careful with the initialization of the WAL reader used in the context of a WAL sender before calling WALRead() and attempting to read a new WAL page. -- Michael
Attachment
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical() at walsender.c:2762
From
Kyotaro Horiguchi
Date:
At Fri, 29 May 2020 16:21:38 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Thu, May 28, 2020 at 06:11:39PM +0900, Kyotaro Horiguchi wrote: > > Mmm. It is not the proper way to use physical replication and it's > > totally accidental that that worked (or even it might be a bug). The > > documentation is saying as the follows, as more-or-less the same for > > all versions since 9.4. > > > > https://www.postgresql.org/docs/13/protocol-replication.html > > + if (am_db_walsender) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot initiate physical > replication on a logical replication connection"))); > > I don't agree with this change. The only restriction that we have in > place now in walsender.c regarding MyDatabaseId not being set is to > prevent the execution of SQL commands. Note that it is possible to > start physical replication even if MyDatabaseId is set in a > replication connection, so you could break cases that have been valid > until now. It donesn't check MyDatabase, but whether the connection parameter "repliation" is "true" or "database". The documentation is telling that "replication" should be "true" for a connection that is to be used for physical replication, and "replication" should literally be "database" for a connection that is for logical replication. We need to revise the documentation if we are going to allow physical replication on a conection with "replication = database". > I think that we actually should be much more careful with the > initialization of the WAL reader used in the context of a WAL sender > before calling WALRead() and attempting to read a new WAL page. I agree that the initialization can be improved, but the current code is no problem if we don't allow to run both logical and physical replication on a single session. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Masahiko Sawada
Date:
On Fri, 29 May 2020 at 17:57, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Fri, 29 May 2020 16:21:38 +0900, Michael Paquier <michael@paquier.xyz> wrote in > > On Thu, May 28, 2020 at 06:11:39PM +0900, Kyotaro Horiguchi wrote: > > > Mmm. It is not the proper way to use physical replication and it's > > > totally accidental that that worked (or even it might be a bug). The > > > documentation is saying as the follows, as more-or-less the same for > > > all versions since 9.4. > > > > > > https://www.postgresql.org/docs/13/protocol-replication.html > > > > + if (am_db_walsender) > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("cannot initiate physical > > replication on a logical replication connection"))); > > > > I don't agree with this change. The only restriction that we have in > > place now in walsender.c regarding MyDatabaseId not being set is to > > prevent the execution of SQL commands. Note that it is possible to > > start physical replication even if MyDatabaseId is set in a > > replication connection, so you could break cases that have been valid > > until now. > > It donesn't check MyDatabase, but whether the connection parameter > "repliation" is "true" or "database". The documentation is telling > that "replication" should be "true" for a connection that is to be > used for physical replication, and "replication" should literally be > "database" for a connection that is for logical replication. We need > to revise the documentation if we are going to allow physical > replication on a conection with "replication = database". > Yes. Conversely, if we start logical replication in a physical replication connection (i.g. replication=true) we got an error before staring replication: ERROR: logical decoding requires a database connection I think we can prevent that SEGV in a similar way. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Michael Paquier
Date:
On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote: > Yes. Conversely, if we start logical replication in a physical > replication connection (i.g. replication=true) we got an error before > staring replication: > > ERROR: logical decoding requires a database connection > > I think we can prevent that SEGV in a similar way. Still unconvinced as this restriction stands for logical decoding requiring a database connection but it is not necessarily true now as physical replication has less restrictions than a logical one. Looking at the code, I think that there is some confusion with the fake WAL reader used as base reference in InitWalSender() where we assume that it could only be used in the context of a non-database WAL sender. However, this initialization happens when the WAL sender connection is initialized, and what I think this misses is that we should try to initialize a WAL reader when actually going through a START_REPLICATION command. I can note as well that StartLogicalReplication() moves in this sense by setting xlogreader to be the one from logical_decoding_ctx once the decoding context has been created. This results in the attached. The extra test from upthread to check that logical decoding is not allowed in a non-database WAL sender is a good idea, so I have kept it. -- Michael
Attachment
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Fujii Masao
Date:
On 2020/06/02 13:24, Michael Paquier wrote: > On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote: >> Yes. Conversely, if we start logical replication in a physical >> replication connection (i.g. replication=true) we got an error before >> staring replication: >> >> ERROR: logical decoding requires a database connection >> >> I think we can prevent that SEGV in a similar way. > > Still unconvinced as this restriction stands for logical decoding > requiring a database connection but it is not necessarily true now as > physical replication has less restrictions than a logical one. Could you tell me what the benefit for supporting physical replication on logical rep connection is? If it's only for "undocumented" backward-compatibility, IMO it's better to reject such "tricky" set up. But if there are some use cases for that, I'm ok to support that. > Looking at the code, I think that there is some confusion with the > fake WAL reader used as base reference in InitWalSender() where we > assume that it could only be used in the context of a non-database WAL > sender. However, this initialization happens when the WAL sender > connection is initialized, and what I think this misses is that we > should try to initialize a WAL reader when actually going through a > START_REPLICATION command. > > I can note as well that StartLogicalReplication() moves in this sense > by setting xlogreader to be the one from logical_decoding_ctx once the > decoding context has been created. > > This results in the attached. The extra test from upthread to check > that logical decoding is not allowed in a non-database WAL sender is a > good idea, so I have kept it. Yes. Also we should add the test to check if physical replication can work fine even on logical rep connection? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical() at walsender.c:2762
From
Kyotaro Horiguchi
Date:
At Tue, 2 Jun 2020 13:24:56 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote: > > Yes. Conversely, if we start logical replication in a physical > > replication connection (i.g. replication=true) we got an error before > > staring replication: > > > > ERROR: logical decoding requires a database connection > > > > I think we can prevent that SEGV in a similar way. > > Still unconvinced as this restriction stands for logical decoding > requiring a database connection but it is not necessarily true now as > physical replication has less restrictions than a logical one. If we deliberately allow physical replication on a database-replication connection, we should revise the documentation that way. On the other hand physical replication has wider access to a database cluster than logical replication. Thus allowing to start physical replication on a logical replication connection could introduce a problem related to privileges. So I think it might be better that physical and logical replication have separate pg_hba lines. Once we explicitly allow physical replication on a logical replication connection in documentation, it would be far harder to change the behavior than now. If we are sure that that cannot be a problem, I don't object the change in documented behavior. > Looking at the code, I think that there is some confusion with the > fake WAL reader used as base reference in InitWalSender() where we > assume that it could only be used in the context of a non-database WAL > sender. However, this initialization happens when the WAL sender > connection is initialized, and what I think this misses is that we > should try to initialize a WAL reader when actually going through a > START_REPLICATION command. At first fake_xlogreader was really a fake one that only provides callback routines, but it should have been changed to a real xlogreader at the time it began to store segment information. In that sense moving to real xlogreader makes sense to me separately from whether we allow physicalrep on logicalrep connections. > I can note as well that StartLogicalReplication() moves in this sense > by setting xlogreader to be the one from logical_decoding_ctx once the > decoding context has been created. > > This results in the attached. The extra test from upthread to check > that logical decoding is not allowed in a non-database WAL sender is a > good idea, so I have kept it. + ereport(ERROR, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("out of memory"))); The same error message is accompanied by a DETAILS in some other places. Don't we need one for this? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Michael Paquier
Date:
On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote: > On 2020/06/02 13:24, Michael Paquier wrote: >> Still unconvinced as this restriction stands for logical decoding >> requiring a database connection but it is not necessarily true now as >> physical replication has less restrictions than a logical one. > > Could you tell me what the benefit for supporting physical replication on > logical rep connection is? If it's only for "undocumented" > backward-compatibility, IMO it's better to reject such "tricky" set up. > But if there are some use cases for that, I'm ok to support that. Well, I don't really think that we can just break a behavior that exists since 9.4 as you could break applications relying on the existing behavior, and that's also the point of Vladimir upthread. On top of it, the issue is actually unrelated to if we want to restrict things more or not when starting replication in a WAL sender because the xlogreader creation just needs to happen when starting replication. Now we have a static "fake" one created when a WAL sender process starts, something that it would not need in most cases like answering to a BASE_BACKUP command for example. >> I can note as well that StartLogicalReplication() moves in this sense >> by setting xlogreader to be the one from logical_decoding_ctx once the >> decoding context has been created. >> >> This results in the attached. The extra test from upthread to check >> that logical decoding is not allowed in a non-database WAL sender is a >> good idea, so I have kept it. > > Yes. Also we should add the test to check if physical replication can work > fine even on logical rep connection? I found confusing the use of psql to confirm that it actually works, because we'd just return a protocol-level error in this case with psql bumping on COPY_BOTH and it is not reliable to do just an error message match. Note as well that GetConnection() discards automatically the database name for pg_basebackup and pg_receivewal as well as libpqrcv_connect() for standbys so we cannot use that. Perhaps using psql is better than nothing, but that makes me uncomfortable. -- Michael
Attachment
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Dave Cramer
Date:
On Wed, 3 Jun 2020 at 01:19, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote:
> On 2020/06/02 13:24, Michael Paquier wrote:
>> Still unconvinced as this restriction stands for logical decoding
>> requiring a database connection but it is not necessarily true now as
>> physical replication has less restrictions than a logical one.
>
> Could you tell me what the benefit for supporting physical replication on
> logical rep connection is? If it's only for "undocumented"
> backward-compatibility, IMO it's better to reject such "tricky" set up.
> But if there are some use cases for that, I'm ok to support that.
Well, I don't really think that we can just break a behavior that
exists since 9.4 as you could break applications relying on the
existing behavior, and that's also the point of Vladimir upthread.
I don't see this is a valid reason to keep doing something. If it is broken then fix it.
Clients can deal with the change.
Dave Cramer
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Fujii Masao
Date:
On 2020/06/03 20:33, Dave Cramer wrote: > > > > On Wed, 3 Jun 2020 at 01:19, Michael Paquier <michael@paquier.xyz <mailto:michael@paquier.xyz>> wrote: > > On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote: > > On 2020/06/02 13:24, Michael Paquier wrote: > >> Still unconvinced as this restriction stands for logical decoding > >> requiring a database connection but it is not necessarily true now as > >> physical replication has less restrictions than a logical one. > > > > Could you tell me what the benefit for supporting physical replication on > > logical rep connection is? If it's only for "undocumented" > > backward-compatibility, IMO it's better to reject such "tricky" set up. > > But if there are some use cases for that, I'm ok to support that. > > Well, I don't really think that we can just break a behavior that > exists since 9.4 as you could break applications relying on the > existing behavior, and that's also the point of Vladimir upthread. For the back branches, I agree with you. Even if it's undocumented behavior, basically we should not get rid of it from the back branches unless there is very special reason. For v13, if it has no functional merit, I don't think it's so bad to get rid of that undocumented (and maybe not-fully tested) behavior. If there are applications depending it, I think that they can be updated. > I don't see this is a valid reason to keep doing something. If it is broken then fix it. > Clients can deal with the change. +1 Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Alvaro Herrera
Date:
On 2020-Jun-02, Michael Paquier wrote: > I can note as well that StartLogicalReplication() moves in this sense > by setting xlogreader to be the one from logical_decoding_ctx once the > decoding context has been created. > > This results in the attached. The extra test from upthread to check > that logical decoding is not allowed in a non-database WAL sender is a > good idea, so I have kept it. I don't particularly disagree with your proposed patch -- in fact, it seems to make things cleaner. It is a little wasteful, but I don't really mind that. It's just some memory, and it's not a significant amount. That said, I would *also* apply Kyotaro's proposed patch to prohibit a physical standby running with a logical slot, if only because that reduces the number of combinations that we need to test and keep our collective heads straight about. Just reject the weird case and have one type of slot for each type of replication. I didn't even think this was at all possible. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Andres Freund
Date:
Hi, On 2020-06-02 14:23:50 +0900, Fujii Masao wrote: > On 2020/06/02 13:24, Michael Paquier wrote: > > On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote: > > > Yes. Conversely, if we start logical replication in a physical > > > replication connection (i.g. replication=true) we got an error before > > > staring replication: > > > > > > ERROR: logical decoding requires a database connection > > > > > > I think we can prevent that SEGV in a similar way. > > > > Still unconvinced as this restriction stands for logical decoding > > requiring a database connection but it is not necessarily true now as > > physical replication has less restrictions than a logical one. > > Could you tell me what the benefit for supporting physical replication on > logical rep connection is? If it's only for "undocumented" > backward-compatibility, IMO it's better to reject such "tricky" set up. > But if there are some use cases for that, I'm ok to support that. I don't think we should prohibit this. For one, it'd probably break some clients, without a meaningful need. But I think it's also actually quite useful to be able to access catalogs before streaming data. You e.g. can look up configuration of the primary before streaming WAL. With a second connection that's actually harder to do reliably in some cases, because you need to be sure that you actually reached the right server (consider a pooler, automatic failover etc). Greetings, Andres Freund
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Alvaro Herrera
Date:
On 2020-Jun-03, Andres Freund wrote: > I don't think we should prohibit this. For one, it'd probably break some > clients, without a meaningful need. There *is* a need, namely to keep complexity down. This is quite convoluted, it's got a lot of historical baggage because of the way it was implemented, and it's very difficult to understand. The greatest motive I see is to make this easier to understand, so that it is easier to modify and improve in the future. > But I think it's also actually quite useful to be able to access > catalogs before streaming data. You e.g. can look up configuration of > the primary before streaming WAL. With a second connection that's > actually harder to do reliably in some cases, because you need to be > sure that you actually reached the right server (consider a pooler, > automatic failover etc). I don't think having a physical replication connection access catalog data directly is a great idea. We already have gadgets like IDENTIFY_SYSTEM for physical replication that can do that, and if you need particular settings you can use SHOW (commit d1ecd539477). If there was a strong need for even more than that, we can add something to the grammar. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Andres Freund
Date:
Hi, On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote: > On 2020-Jun-03, Andres Freund wrote: > > I don't think we should prohibit this. For one, it'd probably break some > > clients, without a meaningful need. > > There *is* a need, namely to keep complexity down. This is quite > convoluted, it's got a lot of historical baggage because of the way it > was implemented, and it's very difficult to understand. The greatest > motive I see is to make this easier to understand, so that it is easier > to modify and improve in the future. That seems like a possibly convincing argument for not introducing the capability, but doesn't seem strong enough to remove it. Especially not if it was just broken as part of effectively a refactoring, as far as I understand? > > But I think it's also actually quite useful to be able to access > > catalogs before streaming data. You e.g. can look up configuration of > > the primary before streaming WAL. With a second connection that's > > actually harder to do reliably in some cases, because you need to be > > sure that you actually reached the right server (consider a pooler, > > automatic failover etc). > > I don't think having a physical replication connection access catalog > data directly is a great idea. We already have gadgets like > IDENTIFY_SYSTEM for physical replication that can do that, and if you > need particular settings you can use SHOW (commit d1ecd539477). If > there was a strong need for even more than that, we can add something to > the grammar. Those special case things are a bad idea, and we shouldn't introduce more. It's unrealistic that we can ever make that support everything, and since we already have to support the database connected thing, I don't see the point. Greetings, Andres Freund
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Michael Paquier
Date:
On Wed, Jun 03, 2020 at 06:33:11PM -0700, Andres Freund wrote: > On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote: >> There *is* a need, namely to keep complexity down. This is quite >> convoluted, it's got a lot of historical baggage because of the way it >> was implemented, and it's very difficult to understand. The greatest >> motive I see is to make this easier to understand, so that it is easier >> to modify and improve in the future. > > That seems like a possibly convincing argument for not introducing the > capability, but doesn't seem strong enough to remove it. Especially not > if it was just broken as part of effectively a refactoring, as far as I > understand? Are there any objections in fixing the issue first then? As far as I can see there is no objection to this part, like here: https://www.postgresql.org/message-id/20200603214448.GA901@alvherre.pgsql >> I don't think having a physical replication connection access catalog >> data directly is a great idea. We already have gadgets like >> IDENTIFY_SYSTEM for physical replication that can do that, and if you >> need particular settings you can use SHOW (commit d1ecd539477). If >> there was a strong need for even more than that, we can add something to >> the grammar. > > Those special case things are a bad idea, and we shouldn't introduce > more. It's unrealistic that we can ever make that support everything, > and since we already have to support the database connected thing, I > don't see the point. Let's continue discussing this part as well. -- Michael
Attachment
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Alvaro Herrera
Date:
On 2020-Jun-03, Andres Freund wrote: > On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote: > > On 2020-Jun-03, Andres Freund wrote: > > > I don't think we should prohibit this. For one, it'd probably break some > > > clients, without a meaningful need. > > > > There *is* a need, namely to keep complexity down. This is quite > > convoluted, it's got a lot of historical baggage because of the way it > > was implemented, and it's very difficult to understand. The greatest > > motive I see is to make this easier to understand, so that it is easier > > to modify and improve in the future. > > That seems like a possibly convincing argument for not introducing the > capability, but doesn't seem strong enough to remove it. This "capability" has never been introduced. The fact that it's there is just an accident. In fact, it's not a capability, since the feature (physical replication) is invoked differently -- namely, using a physical replication connection. JDBC uses a logical replication connection for it only because they never realized that they were supposed to do differently, because we failed to throw the correct error message in the first place. > > I don't think having a physical replication connection access catalog > > data directly is a great idea. We already have gadgets like > > IDENTIFY_SYSTEM for physical replication that can do that, and if you > > need particular settings you can use SHOW (commit d1ecd539477). If > > there was a strong need for even more than that, we can add something to > > the grammar. > > Those special case things are a bad idea, and we shouldn't introduce > more. What special case things? The replication connection has never been supposed to run SQL. That's why we have SHOW in the replication grammar. > It's unrealistic that we can ever make that support everything, > and since we already have to support the database connected thing, I > don't see the point. A logical replication connection is not supposed to be used for physical replication. That's just going to make more bugs appear. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Alvaro Herrera
Date:
On 2020-Jun-04, Michael Paquier wrote: > On Wed, Jun 03, 2020 at 06:33:11PM -0700, Andres Freund wrote: > >> I don't think having a physical replication connection access catalog > >> data directly is a great idea. We already have gadgets like > >> IDENTIFY_SYSTEM for physical replication that can do that, and if you > >> need particular settings you can use SHOW (commit d1ecd539477). If > >> there was a strong need for even more than that, we can add something to > >> the grammar. > > > > Those special case things are a bad idea, and we shouldn't introduce > > more. It's unrealistic that we can ever make that support everything, > > and since we already have to support the database connected thing, I > > don't see the point. > > Let's continue discussing this part as well. A logical replication connection cannot run SQL anyway, can it? it's limited to the replication grammar. So it's not like you can run arbitrary queries to access catalog data. So even if we do need to access the catalogs, we'd have to add stuff to the replication grammar in order to support that. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Andres Freund
Date:
Hi, On 2020-06-04 16:44:53 -0400, Alvaro Herrera wrote: > A logical replication connection cannot run SQL anyway, can it? You can: andres@awork3:~/src/postgresql$ psql 'replication=database' postgres[52656][1]=# IDENTIFY_SYSTEM; ┌─────────────────────┬──────────┬────────────┬──────────┐ │ systemid │ timeline │ xlogpos │ dbname │ ├─────────────────────┼──────────┼────────────┼──────────┤ │ 6821634567571961151 │ 1 │ 1/D256EC40 │ postgres │ └─────────────────────┴──────────┴────────────┴──────────┘ (1 row) postgres[52656][1]=# SELECT 1; ┌──────────┐ │ ?column? │ ├──────────┤ │ 1 │ └──────────┘ (1 row) I am very much not in love with the way that was implemented, but it's there, and it's used as far as I know (cf tablesync.c). Greetings, Andres Freund
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Alvaro Herrera
Date:
On 2020-Jun-04, Andres Freund wrote: > postgres[52656][1]=# SELECT 1; > ┌──────────┐ > │ ?column? │ > ├──────────┤ > │ 1 │ > └──────────┘ > (1 row) > > > I am very much not in love with the way that was implemented, but it's > there, and it's used as far as I know (cf tablesync.c). Ouch ... so they made IDENT in the replication grammar be a trigger to enter the regular grammar. Crazy. No way to put those worms back in the tin now, I guess. It is still my opinion that we should prohibit a logical replication connection from being used to do physical replication. Horiguchi-san, Sawada-san and Masao-san are all of the same opinion. Dave Cramer (of the JDBC team) is not opposed to the change -- he says they're just using it because they didn't realize they should be doing differently. Both Michael P. and you are saying we shouldn't break it because it works today, but there isn't a real use-case for it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Dave Cramer
Date:
On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2020-Jun-04, Andres Freund wrote:
> postgres[52656][1]=# SELECT 1;
> ┌──────────┐
> │ ?column? │
> ├──────────┤
> │ 1 │
> └──────────┘
> (1 row)
>
>
> I am very much not in love with the way that was implemented, but it's
> there, and it's used as far as I know (cf tablesync.c).
Ouch ... so they made IDENT in the replication grammar be a trigger to
enter the regular grammar. Crazy. No way to put those worms back in
the tin now, I guess.
Is that documented ?
It is still my opinion that we should prohibit a logical replication
connection from being used to do physical replication. Horiguchi-san,
Sawada-san and Masao-san are all of the same opinion. Dave Cramer (of
the JDBC team) is not opposed to the change -- he says they're just
using it because they didn't realize they should be doing differently.
I think my exact words were
"I don't see this is a valid reason to keep doing something. If it is broken then fix it.
Clients can deal with the change."
in response to:
Well, I don't really think that we can just break a behavior that
exists since 9.4 as you could break applications relying on the
existing behavior, and that's also the point of Vladimir upthread.
Which is different than not being opposed to the change. I don't see this as broken,
and it's quite possible that some of our users are using it. It certainly needs to be documented
Dave
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Alvaro Herrera
Date:
On 2020-Jun-05, Dave Cramer wrote: > On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera <alvherre@2ndquadrant.com> > wrote: > > Ouch ... so they made IDENT in the replication grammar be a trigger to > > enter the regular grammar. Crazy. No way to put those worms back in > > the tin now, I guess. > > Is that documented ? I don't think it is. > > It is still my opinion that we should prohibit a logical replication > > connection from being used to do physical replication. Horiguchi-san, > > Sawada-san and Masao-san are all of the same opinion. Dave Cramer (of > > the JDBC team) is not opposed to the change -- he says they're just > > using it because they didn't realize they should be doing differently. > > I think my exact words were > > "I don't see this is a valid reason to keep doing something. If it is > broken then fix it. > Clients can deal with the change." > > in response to: > > > Well, I don't really think that we can just break a behavior that > > exists since 9.4 as you could break applications relying on the > > existing behavior, and that's also the point of Vladimir upthread. > > Which is different than not being opposed to the change. I don't see this > as broken, and it's quite possible that some of our users are using > it. Apologies for misinterpreting. > It certainly needs to be documented I'd rather not. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Michael Paquier
Date:
On Thu, Jun 04, 2020 at 11:07:29AM +0900, Michael Paquier wrote: > Are there any objections in fixing the issue first then? As far as I > can see there is no objection to this part, like here: > https://www.postgresql.org/message-id/20200603214448.GA901@alvherre.pgsql Hearing nothing, I have applied this part and fixed the crash to take care of the open item. -- Michael
Attachment
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
"Jonathan S. Katz"
Date:
Hi, On 6/5/20 11:51 AM, Alvaro Herrera wrote: > On 2020-Jun-05, Dave Cramer wrote: > >> On Thu, 4 Jun 2020 at 19:46, Alvaro Herrera <alvherre@2ndquadrant.com> >> wrote: > >>> Ouch ... so they made IDENT in the replication grammar be a trigger to >>> enter the regular grammar. Crazy. No way to put those worms back in >>> the tin now, I guess. >> >> Is that documented ? > > I don't think it is. > >>> It is still my opinion that we should prohibit a logical replication >>> connection from being used to do physical replication. Horiguchi-san, >>> Sawada-san and Masao-san are all of the same opinion. Dave Cramer (of >>> the JDBC team) is not opposed to the change -- he says they're just >>> using it because they didn't realize they should be doing differently. >> >> I think my exact words were >> >> "I don't see this is a valid reason to keep doing something. If it is >> broken then fix it. >> Clients can deal with the change." >> >> in response to: >> >>> Well, I don't really think that we can just break a behavior that >>> exists since 9.4 as you could break applications relying on the >>> existing behavior, and that's also the point of Vladimir upthread. >> >> Which is different than not being opposed to the change. I don't see this >> as broken, and it's quite possible that some of our users are using >> it. > > Apologies for misinterpreting. > >> It certainly needs to be documented > > I'd rather not. The PG13 RMT had a discussion about this thread, and while the initial crash has been fixed, we decided to re-open the Open Item around whether we should allow physical replication to be initiated in a logical replication session. We anticipate a resolution for PG13, whether it is explicitly disallowing physical replication from occurring on a logical replication slot, maintaining the status quo, or something else such that there is consensus on the approach. Thanks, Jonathan
Attachment
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Andres Freund
Date:
Hi, On 2020-06-21 13:45:36 -0400, Jonathan S. Katz wrote: > The PG13 RMT had a discussion about this thread, and while the initial > crash has been fixed, we decided to re-open the Open Item around whether > we should allow physical replication to be initiated in a logical > replication session. Since this is a long-time issue, this doesn't quite seem like an issue for the RMT? > We anticipate a resolution for PG13, whether it is explicitly > disallowing physical replication from occurring on a logical replication > slot, maintaining the status quo, or something else such that there is > consensus on the approach. s/logical replication slot/logical replication connection/? I still maintain that adding restrictions here is a bad idea. Even disregarding the discussion of running normal queries interspersed, it's useful to be able to both request WAL and receive logical changes over the same connection. E.g. for creating a logical replica by first doing a physical base backup (vastly faster), or fetching WAL for decoding large transactions onto a standby. And I just don't see any reasons to disallow it. There's basically no reduction in complexity by doing so. Greetings, Andres Freund
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Michael Paquier
Date:
On Sun, Jun 21, 2020 at 01:02:34PM -0700, Andres Freund wrote: > I still maintain that adding restrictions here is a bad idea. Even > disregarding the discussion of running normal queries interspersed, it's > useful to be able to both request WAL and receive logical changes over > the same connection. E.g. for creating a logical replica by first doing > a physical base backup (vastly faster), or fetching WAL for decoding > large transactions onto a standby. > > And I just don't see any reasons to disallow it. There's basically no > reduction in complexity by doing so. Yeah, I still stand by the same opinion here to do nothing. I suspect that we have good chances to annoy people and some cases we are overlooking here, that used to work. -- Michael
Attachment
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical() at walsender.c:2762
From
Kyotaro Horiguchi
Date:
At Tue, 23 Jun 2020 10:51:40 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Sun, Jun 21, 2020 at 01:02:34PM -0700, Andres Freund wrote: > > I still maintain that adding restrictions here is a bad idea. Even > > disregarding the discussion of running normal queries interspersed, it's > > useful to be able to both request WAL and receive logical changes over > > the same connection. E.g. for creating a logical replica by first doing > > a physical base backup (vastly faster), or fetching WAL for decoding > > large transactions onto a standby. > > > > And I just don't see any reasons to disallow it. There's basically no > > reduction in complexity by doing so. > > Yeah, I still stand by the same opinion here to do nothing. I suspect > that we have good chances to annoy people and some cases we are > overlooking here, that used to work. In logical replication, a replication role is intended to be accessible only to the GRANTed databases. On the other hand the same role can create a dead copy of the whole cluster, including non-granted databases. It seems like a sieve missing a mesh screen. I agree that that doesn't harm as far as roles are strictly managed so I don't insist so strongly on inhibiting the behavior. However, the documentation at least needs amendment. https://www.postgresql.org/docs/13/protocol-replication.html ==== To initiate streaming replication, the frontend sends the replication parameter in the startup message. A Boolean value of true (or on, yes, 1) tells the backend to go into physical replication walsender mode, wherein a small set of replication commands, shown below, can be issued instead of SQL statements. Passing database as the value for the replication parameter instructs the backend to go into logical replication walsender mode, connecting to the database specified in the dbname parameter. In logical replication walsender mode, the replication commands shown below as well as normal SQL commands can be issued. ==== regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Fujii Masao
Date:
On 2020/06/24 11:56, Kyotaro Horiguchi wrote: > At Tue, 23 Jun 2020 10:51:40 +0900, Michael Paquier <michael@paquier.xyz> wrote in >> On Sun, Jun 21, 2020 at 01:02:34PM -0700, Andres Freund wrote: >>> I still maintain that adding restrictions here is a bad idea. Even >>> disregarding the discussion of running normal queries interspersed, it's >>> useful to be able to both request WAL and receive logical changes over >>> the same connection. E.g. for creating a logical replica by first doing >>> a physical base backup (vastly faster), or fetching WAL for decoding >>> large transactions onto a standby. >>> >>> And I just don't see any reasons to disallow it. There's basically no >>> reduction in complexity by doing so. >> >> Yeah, I still stand by the same opinion here to do nothing. I suspect >> that we have good chances to annoy people and some cases we are >> overlooking here, that used to work. > > In logical replication, a replication role is intended to be > accessible only to the GRANTed databases. On the other hand the same > role can create a dead copy of the whole cluster, including > non-granted databases. It seems like a sieve missing a mesh screen. Personally I'd like to disallow physical replication commands when I explicitly reject physical replication connection (i.e., set "host replication user x.x.x.x/x reject") in pg_hba.conf, whether on physical or logical replication connection. > I agree that that doesn't harm as far as roles are strictly managed so > I don't insist so strongly on inhibiting the behavior. However, the > documentation at least needs amendment. +1 Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Alvaro Herrera
Date:
On 2020-Jun-24, Kyotaro Horiguchi wrote: > In logical replication, a replication role is intended to be > accessible only to the GRANTed databases. On the other hand the same > role can create a dead copy of the whole cluster, including > non-granted databases. In other words -- essentially, if you grant replication access to a role only to a specific database, they can steal the whole cluster. I don't see what's so great about that, but apparently people like it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Stephen Frost
Date:
Greetings, * Alvaro Herrera (alvherre@2ndquadrant.com) wrote: > On 2020-Jun-24, Kyotaro Horiguchi wrote: > > > In logical replication, a replication role is intended to be > > accessible only to the GRANTed databases. On the other hand the same > > role can create a dead copy of the whole cluster, including > > non-granted databases. > > In other words -- essentially, if you grant replication access to a role > only to a specific database, they can steal the whole cluster. > > I don't see what's so great about that, but apparently people like it. Sure, people who aren't in charge of security I'm sure like the ease of use. Doesn't mean it makes sense or that we should be supporting that. What we should have is a way to allow administrators to configure a system for exactly what they want to allow, and it doesn't seem like we're doing that today and therefore we should fix it. This isn't the only area we have that issue in. Thanks, Stephen
Attachment
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Alvaro Herrera
Date:
On 2020-Jun-24, Stephen Frost wrote: > Doesn't mean it makes sense or that we should be supporting that. What > we should have is a way to allow administrators to configure a system > for exactly what they want to allow, and it doesn't seem like we're > doing that today and therefore we should fix it. This isn't the only > area we have that issue in. The way to do that, for the case under discussion, is to reject using a logical replication connection for physical replication commands. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Robert Haas
Date:
On Wed, Jun 24, 2020 at 1:06 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2020-Jun-24, Stephen Frost wrote: > > Doesn't mean it makes sense or that we should be supporting that. What > > we should have is a way to allow administrators to configure a system > > for exactly what they want to allow, and it doesn't seem like we're > > doing that today and therefore we should fix it. This isn't the only > > area we have that issue in. > > The way to do that, for the case under discussion, is to reject using a > logical replication connection for physical replication commands. Reading over this discussion, I see basically three arguments: 1. Andres argues that being able to execute physical replication commands from the same connection as SQL queries is useful, and that people may be relying on it, and that we shouldn't break it without need. 2. Fujii Masao argues that the current situation makes it impossible to write a pg_hba.conf rule that disallows all physical replication connections, because people could get around it by using a logical replication connection for physical replication. 3. Various people argue that it's only accidental that physical replication on a replication=database connection ever worked at all, and therefore we ought to block it. I find argument #1 most convincing, #2 less convincing, and #3 least convincing. In my view, the problem with argument #3 is that just because some feature combination was unintentional doesn't mean it's unuseful or unused. As for #2, suppose someone were to propose a design for logical replication that allowed it to take place without a database connection, so that it could be done with just a regular replication connection. Such a feature would create the same problem Fujii Masao mentions here, but it seems inconceivable that we would for that reason reject it; we make decisions about features based on their usefulness, not their knock-on effects on pg_hba.conf rules. We can always add new kinds of access control restrictions if they are needed; that is a better approach than removing features so that the existing pg_hba.conf facilities can be used to accomplish some particular goal. So really I think this turns on #1: is it plausible that people are using this feature, however inadvertent it may be, and is it potentially useful? I don't see that anybody's made an argument against either of those things. Unless someone can do so, I think we shouldn't disable this. That having been said, I think that the fact that you can execute SQL queries in replication=database connections is horrifying. I really hate that feature. I think it's a bad design, and a bad implementation, and a recipe for tons of bugs. But, blocking physical replication commands on such connections isn't going to solve any of that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Alvaro Herrera
Date:
On 2020-Jun-24, Robert Haas wrote: > So really I think this turns on #1: is it plausible > that people are using this feature, however inadvertent it may be, and > is it potentially useful? I don't see that anybody's made an argument > against either of those things. Unless someone can do so, I think we > shouldn't disable this. People (specifically the jdbc driver) *are* using this feature in this way, but they didn't realize they were doing it. It was an accident and they didn't notice. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Dave Cramer
Date:
On Wed, 24 Jun 2020 at 15:41, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2020-Jun-24, Robert Haas wrote:
> So really I think this turns on #1: is it plausible
> that people are using this feature, however inadvertent it may be, and
> is it potentially useful? I don't see that anybody's made an argument
> against either of those things. Unless someone can do so, I think we
> shouldn't disable this.
People (specifically the jdbc driver) *are* using this feature in this
way, but they didn't realize they were doing it. It was an accident and
they didn't notice.
Not sure we are using it as much as we accidently did it that way. It would be trivial to fix.
That said I think we should fix the security hole this opens and leave the functionality.
Dave
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Andres Freund
Date:
Hi, On 2020-06-24 15:41:14 -0400, Alvaro Herrera wrote: > On 2020-Jun-24, Robert Haas wrote: > > > So really I think this turns on #1: is it plausible > > that people are using this feature, however inadvertent it may be, and > > is it potentially useful? I don't see that anybody's made an argument > > against either of those things. Unless someone can do so, I think we > > shouldn't disable this. > > People (specifically the jdbc driver) *are* using this feature in this > way, but they didn't realize they were doing it. It was an accident and > they didn't notice. As I said before, I've utilized being able to do both over a single connection (among others to initialize a logical replica using a base backup). And I've seen at least one other codebase (developed without my input) doing so. I really don't understand how you just dismiss this without any sort of actual argument. Yes, those uses can be fixed to reconnect with a different replication parameter, but that's code that needs to be adjusted and it requires adjustments to pg_hba.conf etc. And obviously you'd lock out older versions of jdbc, and possibly other drivers. Obviously we should allow more granular permissions here, I don't think anybody is arguing against that. Greetings, Andres Freund
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical ()at walsender.c:2762
From
Robert Haas
Date:
On Wed, Jun 24, 2020 at 3:41 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > People (specifically the jdbc driver) *are* using this feature in this > way, but they didn't realize they were doing it. It was an accident and > they didn't notice. But you don't know that that's true of everyone using this feature, and even if it were, so what? Breaking a feature that someone didn't know they were using is just as much of a break as breaking a feature someone DID know they were using. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
From
Alvaro Herrera
Date:
On 2020-Jun-24, Andres Freund wrote: > As I said before, I've utilized being able to do both over a single > connection (among others to initialize a logical replica using a base > backup). And I've seen at least one other codebase (developed without my > input) doing so. I really don't understand how you just dismiss this > without any sort of actual argument. Yes, those uses can be fixed to > reconnect with a different replication parameter, but that's code that > needs to be adjusted and it requires adjustments to pg_hba.conf etc. > > And obviously you'd lock out older versions of jdbc, and possibly other > drivers. Well, I had understood that you were talking from a hypothetical position, not that you were already using the thing that way. After these arguments, I agree to leave things alone, and nobody else seems to be arguing in that direction, so I'll mark the open item as closed. Thanks, -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services