Thread: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15367
Logged by:          Jeremy Evans
Email address:      code@jeremyevans.net
PostgreSQL version: 10.5
Operating system:   OpenBSD 6.3 amd64
Description:

One of the postgres processes in our production environment crashed with the
following backtrace:

(gdb) bt
#0  thrkill () at -:3
#1  0x000004a14ca47fbe in _libc_abort () at
/usr/src/lib/libc/stdlib/abort.c:51
#2  0x000004a14c9fd029 in wrterror (d=Variable "d" is not available.
) at /usr/src/lib/libc/stdlib/malloc.c:288
#3  0x000004a14c9fd34f in ofree (argpool=Variable "argpool" is not
available.
) at /usr/src/lib/libc/stdlib/malloc.c:1298
#4  0x000004a14c9fd109 in free (ptr=0x4a115aec398) at
/usr/src/lib/libc/stdlib/malloc.c:1416
#5  0x000004a09a7301e8 in pg_fe_scram_free () from
/usr/local/lib/libpq.so.6.10
#6  0x000004a09a73561f in pqPacketSend () from
/usr/local/lib/libpq.so.6.10
#7  0x000004a09a731507 in PQfinish () from /usr/local/lib/libpq.so.6.10
#8  0x000004a074dc5549 in GetConnection () from
/usr/local/lib/postgresql/postgres_fdw.so
#9  0x000004a074dbc02a in postgresBeginForeignScan () from
/usr/local/lib/postgresql/postgres_fdw.so
#10 0x0000049e62eab412 in ExecInitForeignScan () from
/usr/local/bin/postgres
#11 0x0000049e62e89ace in ExecInitNode () from /usr/local/bin/postgres
#12 0x0000049e62e9aae8 in ExecInitHashJoin () from /usr/local/bin/postgres
#13 0x0000049e62e89b3e in ExecInitNode () from /usr/local/bin/postgres
#14 0x0000049e62ea7a06 in ExecInitSort () from /usr/local/bin/postgres
#15 0x0000049e62e89aee in ExecInitNode () from /usr/local/bin/postgres
#16 0x0000049e62e85379 in standard_ExecutorStart () from
/usr/local/bin/postgres
#17 0x0000049e62fc4ffb in PortalStart () from /usr/local/bin/postgres
#18 0x0000049e62fc43d5 in exec_simple_query () from
/usr/local/bin/postgres
#19 0x0000049e62fc250b in PostgresMain () from /usr/local/bin/postgres
#20 0x0000049e62f473c5 in PostmasterMain () from /usr/local/bin/postgres
#21 0x0000049e62ec986b in main () from /usr/local/bin/postgres

The PostgreSQL logfile showed:

postgres(64978) in free(): bogus pointer (double free?) 0x4a115aec398
2018-09-06 12:01:52.202 PDT [45953] LOG:  server process (PID 64978) was
terminated by signal 6: Abort trap

We are using foreign tables to other databases in the same cluster, and all
databases are set to require SCRAM authentication.  We've been using SCRAM
authentication for about 2 months, and foreign tables for a few years
without problems.

PostgreSQL version: PostgreSQL 10.5 on x86_64-unknown-openbsd6.3, compiled
by OpenBSD clang version 5.0.1 (tags/RELEASE_501/final) (based on LLVM
5.0.1), 64-bit

PostgreSQL was installed via OpenBSD ports, using the 6.3-stable branch.

PostgreSQL configuration file changes:

 DateStyle                  | ISO, MDY           | configuration file
 default_text_search_config | pg_catalog.english | configuration file
 dynamic_shared_memory_type | posix              | configuration file
 lc_messages                | C                  | configuration file
 lc_monetary                | C                  | configuration file
 lc_numeric                 | C                  | configuration file
 lc_time                    | C                  | configuration file
 listen_addresses           | *                  | configuration file
 log_timezone               | US/Pacific         | configuration file
 max_connections            | 100                | configuration file
 max_stack_depth            | 2MB                | environment variable
 password_encryption        | scram-sha-256      | configuration file
 shared_buffers             | 128MB              | configuration file
 ssl                        | on                 | configuration file
 TimeZone                   | US/Pacific         | configuration file

Operating system: OpenBSD 6.3 amd64, with OpenBSD syspatches up to 016:
OpenBSD bethpage.bsa.ca.gov 6.3 GENERIC.MP#8 amd64

As the backtrace shows the error was during the closing of a connection,
this appears to be unrelated to any specific query.  The backtrace also
shows withis is due to the use of foreign tables, and thus likely
independent of what client program was used to connect.  In case it does
matter, it was using libpq via the ruby-pg library, and no connection pool
or load balancer was being used.

Nothing unusual in the PostgreSQL logs other than the the crash and the
typical crash recovery.

Hardware is a HP DL380p Gen8 server with 256GB of ECC memory and 2 800GB
SSDs in hardware RAID-1, with 2 Xeon E5-2670 CPUs. Due to the crash, this is
unlikely to be disk/RAID related.  General memory corruption is also
unlikely as the memory is ECC.

If necessary I can build a debug version of PostgreSQL and try using that in
production so I can get a better backtrace if it crashes again. However,
considering that the crash is rare in my environment, it's unlikely I will
be able to produce a better backtrace for the error quickly.


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:
> One of the postgres processes in our production environment crashed with the
> following backtrace:

> (gdb) bt
> #0  thrkill () at -:3
> #1  0x000004a14ca47fbe in _libc_abort () at
> /usr/src/lib/libc/stdlib/abort.c:51
> #2  0x000004a14c9fd029 in wrterror (d=Variable "d" is not available.
> ) at /usr/src/lib/libc/stdlib/malloc.c:288
> #3  0x000004a14c9fd34f in ofree (argpool=Variable "argpool" is not
> available.
> ) at /usr/src/lib/libc/stdlib/malloc.c:1298
> #4  0x000004a14c9fd109 in free (ptr=0x4a115aec398) at
> /usr/src/lib/libc/stdlib/malloc.c:1416
> #5  0x000004a09a7301e8 in pg_fe_scram_free () from
> /usr/local/lib/libpq.so.6.10
> #6  0x000004a09a73561f in pqPacketSend () from
> /usr/local/lib/libpq.so.6.10
> #7  0x000004a09a731507 in PQfinish () from /usr/local/lib/libpq.so.6.10
> #8  0x000004a074dc5549 in GetConnection () from
> /usr/local/lib/postgresql/postgres_fdw.so

Hmm.  Unfortunately, I don't think I believe this backtrace.  For one
thing, there's no direct pathway from pqPacketSend to pg_fe_scram_free.
Also, if we ignore that and figure that somehow pg_fe_scram_free is
doing a duplicate free, it's still really hard to see how that would
happen.

Given that you didn't have debug symbols installed, it's not very
surprising for the trace to be misleading.  Probably, what we're seeing
here is the nearest exported functions' names rather than where control
really went.

> If necessary I can build a debug version of PostgreSQL and try using that in
> production so I can get a better backtrace if it crashes again.

Please.  I don't doubt that there's an issue here, but we're going to
need a higher-quality fix on where it is before we can do much.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Michael Paquier
Date:
On Thu, Sep 06, 2018 at 08:35:39PM +0000, PG Bug reporting form wrote:
> If necessary I can build a debug version of PostgreSQL and try using that in
> production so I can get a better backtrace if it crashes again. However,
> considering that the crash is rare in my environment, it's unlikely I will
> be able to produce a better backtrace for the error quickly.

That would be nice.  From what I can see this would be a race condition,
which is not obvious by looking at the code.  Testing with a two-node
deployment where the first node has a foreign table which connects to a
second node, using SCRAM authentication, holding the physical table,
then doing many foreign scans across many clients don't show any
problem.  Did libpq complain at some point in the session where the
crash happened about any error?
--
Michael

Attachment

Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Jeremy Evans
Date:
On 09/06 02:58, Michael Paquier wrote:
> On Thu, Sep 06, 2018 at 08:35:39PM +0000, PG Bug reporting form wrote:
> > If necessary I can build a debug version of PostgreSQL and try using that in
> > production so I can get a better backtrace if it crashes again. However,
> > considering that the crash is rare in my environment, it's unlikely I will
> > be able to produce a better backtrace for the error quickly.
> 
> That would be nice.  From what I can see this would be a race condition,
> which is not obvious by looking at the code.  Testing with a two-node
> deployment where the first node has a foreign table which connects to a
> second node, using SCRAM authentication, holding the physical table,
> then doing many foreign scans across many clients don't show any
> problem.  Did libpq complain at some point in the session where the
> crash happened about any error?

The PostgreSQL logfile only shows:

postgres(64978) in free(): bogus pointer (double free?) 0x4a115aec398
2018-09-06 12:01:52.202 PDT [45953] LOG:  server process (PID 64978) was terminated by signal 6: Abort trap
2018-09-06 12:01:52.202 PDT [45953] DETAIL:  Failed process was running: ...
2018-09-06 12:01:52.202 PDT [45953] LOG:  terminating any other active server processes

If there is another place I should look, please let me know.  The log
files of the client process don't show anything during the crash,
probably because the client libpq connection was just dropped when the
server process crashed.  After the crash, other client libpq connections 
show the following, which is probably expected:

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because
anotherserver process exited abnormally and possibly corrupted shared memory.
 
HINT:  In a moment you should be able to reconnect to the database and repeat your command.

I'll try to install a version with debug symbols on September 14, and
if it crashes again I'll respond with a more complete and accurate
backtrace.

Thanks,
Jeremy


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Jeremy Evans
Date:
On 09/06 03:28, Jeremy Evans wrote:
> On 09/06 02:58, Michael Paquier wrote:
> > On Thu, Sep 06, 2018 at 08:35:39PM +0000, PG Bug reporting form wrote:
> > > If necessary I can build a debug version of PostgreSQL and try using that in
> > > production so I can get a better backtrace if it crashes again. However,
> > > considering that the crash is rare in my environment, it's unlikely I will
> > > be able to produce a better backtrace for the error quickly.
> > 
> > That would be nice.  From what I can see this would be a race condition,
> > which is not obvious by looking at the code.  Testing with a two-node
> > deployment where the first node has a foreign table which connects to a
> > second node, using SCRAM authentication, holding the physical table,
> > then doing many foreign scans across many clients don't show any
> > problem.  Did libpq complain at some point in the session where the
> > crash happened about any error?
> 
> The PostgreSQL logfile only shows:
> 
> postgres(64978) in free(): bogus pointer (double free?) 0x4a115aec398
> 2018-09-06 12:01:52.202 PDT [45953] LOG:  server process (PID 64978) was terminated by signal 6: Abort trap
> 2018-09-06 12:01:52.202 PDT [45953] DETAIL:  Failed process was running: ...
> 2018-09-06 12:01:52.202 PDT [45953] LOG:  terminating any other active server processes
> 
> If there is another place I should look, please let me know.  The log
> files of the client process don't show anything during the crash,
> probably because the client libpq connection was just dropped when the
> server process crashed.  After the crash, other client libpq connections 
> show the following, which is probably expected:
> 
> WARNING:  terminating connection because of crash of another server process
> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because
anotherserver process exited abnormally and possibly corrupted shared memory.
 
> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
> 
> I'll try to install a version with debug symbols on September 14, and
> if it crashes again I'll respond with a more complete and accurate
> backtrace.

We experienced an almost identical crash this morning.  The query was
different, but the backtrace was almost the same, and the query was
using a foreign table with SCRAM authentication, just like the one
yesterday.

One thing that was similar between the two crashes is that shortly
before both crashes, we were testing database changes on different
databases in the same cluster, different from both the postgres
process that crashed (the client of the foreign table scan) and the
postgres process executing the foreign table scan.

The database changes mostly consisted of the following statement types:

DROP TABLE
DROP SCHEMA
DROP FUNCTION
CREATE FUNCTION
CREATE SCHEMA
CREATE TABLE
CREATE INDEX
CREATE TRIGGER
INSERT
GRANT

We only recently started testing these database changes in this cluster
yesterday.  Based on the timing, I'm guessing this issue only occurs
when system table changes are being made.

I hadn't yet had time to install debug symbols on the production server,
but since I think I have a better idea on how to recreate this issue, I
will try recreating this on a test cluster with debug symbols.

Thanks,
Jeremy


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Jeremy Evans
Date:
On 09/07 09:51, Jeremy Evans wrote:
> I hadn't yet had time to install debug symbols on the production server,
> but since I think I have a better idea on how to recreate this issue, I
> will try recreating this on a test cluster with debug symbols.

Thankfully, I was able to recreate this on a test server with debug
symbols, by using \watch in psql to run the foreign query while running
the database changes on a separate database in a loop.

Here's the backtrace:

#1  0x000017c254721a8e in _libc_abort () at /usr/src/lib/libc/stdlib/abort.c:51
#2  0x000017c2547303b9 in wrterror (d=Variable "d" is not available.) at /usr/src/lib/libc/stdlib/malloc.c:288
#3  0x000017c2547306df in ofree (argpool=Variable "argpool" is not available.) at
/usr/src/lib/libc/stdlib/malloc.c:1298
#4  0x000017c254730499 in free (ptr=0x17c1ee153398) at /usr/src/lib/libc/stdlib/malloc.c:1416
#5  0x000017c20a6b8b00 in pg_fe_scram_free (opaq=0x17c223712000) at fe-auth-scram.c:127
#6  0x000017c20a6b9d20 in pqDropConnection (conn=0x17c1dac61800, flushInput=1 '\001') at fe-connect.c:479
#7  0x000017c20a6bf95e in closePGconn (conn=0x17c1dac61800) at fe-connect.c:3710
#8  0x000017c20a6ba230 in PQfinish (conn=0x17c1dac61800) at fe-connect.c:3731
#9  0x000017c2c276a838 in disconnect_pg_server (entry=0x17c25b118658) at connection.c:312
#10 0x000017c2c2769dba in GetConnection (user=0x17c2c2285c50, will_prep_stmt=0 '\0') at connection.c:169
#11 0x000017c2c275c597 in postgresBeginForeignScan (node=0x17c2d8b18028, eflags=16) at postgres_fdw.c:1325
#12 0x000017bfd91aaa24 in ExecInitForeignScan (node=0x17c26c166408, estate=0x17c2d8b17038, eflags=16) at
nodeForeignscan.c:231
#13 0x000017bfd9177f9a in ExecInitNode (node=0x17c26c166408, estate=0x17c2d8b17038, eflags=16) at execProcnode.c:277
#14 0x000017bfd9190f84 in ExecInitHashJoin (node=0x17c26c16a050, estate=0x17c2d8b17038, eflags=16) at
nodeHashjoin.c:433
#15 0x000017bfd917800a in ExecInitNode (node=0x17c26c16a050, estate=0x17c2d8b17038, eflags=16) at execProcnode.c:300
#16 0x000017bfd91a4ae0 in ExecInitSort (node=0x17c26c168c50, estate=0x17c2d8b17038, eflags=16) at nodeSort.c:207
#17 0x000017bfd9178042 in ExecInitNode (node=0x17c26c168c50, estate=0x17c2d8b17038, eflags=16) at execProcnode.c:313
#18 0x000017bfd917114b in InitPlan (queryDesc=0x17c1e8fce038, eflags=16) at execMain.c:1045
#19 0x000017bfd9170920 in standard_ExecutorStart (queryDesc=0x17c1e8fce038, eflags=16) at execMain.c:264
#20 0x000017bfd91706ee in ExecutorStart (queryDesc=0x17c1e8fce038, eflags=0) at execMain.c:152
#21 0x000017bfd935a604 in PortalStart (portal=0x17c1dac5f038, params=0x0, eflags=0, snapshot=0x0) at pquery.c:520
#22 0x000017bfd935682d in exec_simple_query ( query_string=0x17c1dbf45038 "SELECT "...) at postgres.c:1060
#23 0x000017bfd9355d13 in PostgresMain (argc=1, argv=0x17c2460a1180, dbname=0x17c2528b2ff8 "bsawww",
username=0x17c2528b2fc8"recommendations_public") at postgres.c:4088
 
#24 0x000017bfd92adb29 in BackendRun (port=0x17c2bf214800) at postmaster.c:4405
#25 0x000017bfd92aceb3 in BackendStartup (port=0x17c2bf214800) at postmaster.c:4077
#26 0x000017bfd92abf3a in ServerLoop () at postmaster.c:1755
#27 0x000017bfd92a9b46 in PostmasterMain (argc=3, argv=0x7f7ffffc1c58) at postmaster.c:1363
#28 0x000017bfd91d6f57 in main (argc=3, argv=0x7f7ffffc1c58) at main.c:228

Hopefully this helps.  If you need more information, please let me know.

Thanks,
Jeremy


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Michael Paquier
Date:
On Fri, Sep 07, 2018 at 10:55:18AM -0700, Jeremy Evans wrote:
> Here's the backtrace:
>
> #5  0x000017c20a6b8b00 in pg_fe_scram_free (opaq=0x17c223712000) at fe-auth-scram.c:127
> #6  0x000017c20a6b9d20 in pqDropConnection (conn=0x17c1dac61800, flushInput=1 '\001') at fe-connect.c:479
> #7  0x000017c20a6bf95e in closePGconn (conn=0x17c1dac61800) at fe-connect.c:3710
>
> Hopefully this helps.  If you need more information, please let me know.

Thanks.  This fails when freeing the password field in fe_scram_state.
I have a question: does your password use non-ASCII characters which
could make SASLprep to be run on the password string?  The result string
from pg_saslprep allocates a new string for any results returned by
reading the code, I am wondering if we could be missing something..
--
Michael

Attachment

Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Jeremy Evans
Date:
On 09/07 11:15, Michael Paquier wrote:
> On Fri, Sep 07, 2018 at 10:55:18AM -0700, Jeremy Evans wrote:
> > Here's the backtrace:
> > 
> > #5  0x000017c20a6b8b00 in pg_fe_scram_free (opaq=0x17c223712000) at fe-auth-scram.c:127
> > #6  0x000017c20a6b9d20 in pqDropConnection (conn=0x17c1dac61800, flushInput=1 '\001') at fe-connect.c:479
> > #7  0x000017c20a6bf95e in closePGconn (conn=0x17c1dac61800) at fe-connect.c:3710
> >
> > Hopefully this helps.  If you need more information, please let me know.
> 
> Thanks.  This fails when freeing the password field in fe_scram_state.
> I have a question: does your password use non-ASCII characters which
> could make SASLprep to be run on the password string?  The result string
> from pg_saslprep allocates a new string for any results returned by
> reading the code, I am wondering if we could be missing something..

No, the user mapping password is comprised of hex-ascii characters: [0-9a-f]{16}

In case it helps, here's the content of opaq from gdb (with the specific
strings cleared).

print *((fe_scram_state *) opaq)
{
  state = FE_SCRAM_FINISHED,
  username = 0x17c267526380 "...",
  password = 0x17c1ee153398 <Address 0x17c1ee153398 out of bounds>,
  SaltedPassword = 0x17c223712018 "...",
  client_nonce = 0x17c267529f60 "...",
  client_first_message_bare = 0x17c223711da0 "...",
  client_final_message_without_proof = 0x17c271ca19c0 "...",
  server_first_message = 0x17c1ea48a400 "...",
  salt = 0x17c2675261c0 "...",
  saltlen = 16,
  iterations = 4096,
  nonce = 0x17c2179e9880 "...",
  server_final_message = 0x17c271ca1a40 "...",
  ServerSignature = 0x17c223712078 "..."
}

Thanks,
Jeremy


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> Thanks.  This fails when freeing the password field in fe_scram_state.

I had been studying that when the report first arrived.  I'm not really
happy with the coding in pg_saslprep(): it leaves garbage in its output
parameter in some code paths, and I think it leaks memory in others.
But I couldn't see a path that would lead to the observed failure.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Michael Paquier
Date:
On Fri, Sep 07, 2018 at 04:18:55PM -0400, Tom Lane wrote:
> I had been studying that when the report first arrived.  I'm not really
> happy with the coding in pg_saslprep(): it leaves garbage in its output
> parameter in some code paths, and I think it leaks memory in others.

Agreed.  Do you want to give it a go?  I can see some of those code
paths?  Likely you spotted more.

> But I couldn't see a path that would lead to the observed failure.

Agreed.
--
Michael

Attachment

Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> On Fri, Sep 07, 2018 at 04:18:55PM -0400, Tom Lane wrote:
>> I had been studying that when the report first arrived.  I'm not really
>> happy with the coding in pg_saslprep(): it leaves garbage in its output
>> parameter in some code paths, and I think it leaks memory in others.

> Agreed.  Do you want to give it a go?  I can see some of those code
> paths?  Likely you spotted more.

Looking closer, there are no leaks, though I think we could use a comment
about that.  And it still makes me itch that pg_saslprep (mostly) doesn't
worry about clearing *output on failure while only three of its four
callers bother to initialize their variables to null.  That's a recipe
for future bugs.  So I propose the attached cleanup.

We're still no closer to an explanation of Jeremy's failure, though
I'm now pretty sure that pg_saslprep itself isn't the issue.

            regards, tom lane

diff --git a/src/backend/libpq/auth-scram.c b/src/backend/libpq/auth-scram.c
index 8fbad53..e997c94 100644
--- a/src/backend/libpq/auth-scram.c
+++ b/src/backend/libpq/auth-scram.c
@@ -453,7 +453,7 @@ pg_be_scram_exchange(void *opaq, char *input, int inputlen,
 char *
 pg_be_scram_build_verifier(const char *password)
 {
-    char       *prep_password = NULL;
+    char       *prep_password;
     pg_saslprep_rc rc;
     char        saltbuf[SCRAM_DEFAULT_SALT_LEN];
     char       *result;
@@ -499,7 +499,7 @@ scram_verify_plain_password(const char *username, const char *password,
     uint8        stored_key[SCRAM_KEY_LEN];
     uint8        server_key[SCRAM_KEY_LEN];
     uint8        computed_key[SCRAM_KEY_LEN];
-    char       *prep_password = NULL;
+    char       *prep_password;
     pg_saslprep_rc rc;

     if (!parse_scram_verifier(verifier, &iterations, &encoded_salt,
diff --git a/src/common/saslprep.c b/src/common/saslprep.c
index 2710215..4cf574f 100644
--- a/src/common/saslprep.c
+++ b/src/common/saslprep.c
@@ -1081,6 +1081,9 @@ pg_saslprep(const char *input, char **output)
     unsigned char *p;
     pg_wchar   *wp;

+    /* Ensure we return *output as NULL on failure */
+    *output = NULL;
+
     /* Check that the password isn't stupendously long */
     if (strlen(input) > MAX_PASSWORD_LENGTH)
     {
@@ -1112,10 +1115,7 @@ pg_saslprep(const char *input, char **output)
      */
     input_size = pg_utf8_string_len(input);
     if (input_size < 0)
-    {
-        *output = NULL;
         return SASLPREP_INVALID_UTF8;
-    }

     input_chars = ALLOC((input_size + 1) * sizeof(pg_wchar));
     if (!input_chars)
@@ -1246,6 +1246,11 @@ pg_saslprep(const char *input, char **output)
     result = ALLOC(result_size + 1);
     if (!result)
         goto oom;
+
+    /*
+     * There are no error exits below here, so the error exit paths don't need
+     * to worry about possibly freeing "result".
+     */
     p = (unsigned char *) result;
     for (wp = output_chars; *wp; wp++)
     {
diff --git a/src/interfaces/libpq/fe-auth-scram.c b/src/interfaces/libpq/fe-auth-scram.c
index 1d9c937..81531f3 100644
--- a/src/interfaces/libpq/fe-auth-scram.c
+++ b/src/interfaces/libpq/fe-auth-scram.c
@@ -747,7 +747,7 @@ verify_server_signature(fe_scram_state *state)
 char *
 pg_fe_scram_build_verifier(const char *password)
 {
-    char       *prep_password = NULL;
+    char       *prep_password;
     pg_saslprep_rc rc;
     char        saltbuf[SCRAM_DEFAULT_SALT_LEN];
     char       *result;

Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
I wrote:
> We're still no closer to an explanation of Jeremy's failure, though
> I'm now pretty sure that pg_saslprep itself isn't the issue.

I had an idea about that --- it's probably all wet, but the code as
written seems bulletproof enough that I'm forced to postulate something
very strange is happening.

Observe that there are two copies of pg_saslprep() in play: there is one
in the backend, which is compiled to allocate its result with palloc,
and there is one in libpq, which is compiled to allocate its result
with malloc.  Could it be that somehow, when libpq is loaded into the
backend address space as it is here, libpq winds up calling the backend's
copy of pg_saslprep rather than its own?  That would work just fine,
until libpq tried to free the returned string using free(), and then
we'd get exactly the reported error.

The main weakness in this theory is that it suggests that Jeremy's
postgres_fdw connections ought to be falling over more easily than
they are.  However, I think that postgres_fdw will never explicitly
close a PGconn unless it's forced to; during a normal backend session
exit, the process just dies without going through PQfinish, so that
the problem would be masked.  The only way to get to the PQfinish
call shown in the backtrace is for pgfdw_inval_callback to mark the
connection invalid, which'd require either an update on the relevant
foreign server object, an update on the user mapping in use, or a
SI cache reset.  That explains how heavy DDL activity in an apparently
unrelated database can trigger the problem: it eventually results in
a SI message queue overrun and ensuing cache reset.  In the absence
of SI cache resets, maybe indeed the problem is rare even if the
pg_saslprep result is misallocated every time.

Not sure about a good way to test this theory.  Need more caffeine.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom> Could it be that somehow, when libpq is loaded into the backend
 Tom> address space as it is here, libpq winds up calling the backend's
 Tom> copy of pg_saslprep rather than its own?

Yes, that is a thing that happens with ELF-style shared libs when you
dynamically load shared libraries that weren't linked with -Bsymbolic.

Is this potentially also a problem for libpqwalreceiver, which also
loads libpq?

-- 
Andrew (irc:RhodiumToad)


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>  Tom> Could it be that somehow, when libpq is loaded into the backend
>  Tom> address space as it is here, libpq winds up calling the backend's
>  Tom> copy of pg_saslprep rather than its own?

> Yes, that is a thing that happens with ELF-style shared libs when you
> dynamically load shared libraries that weren't linked with -Bsymbolic.

If true, that's potentially pretty disastrous, because *every* src/port
or src/common function that's used by libpq would be similarly subject
to replacement by its backend version, and every behavioral difference
between the frontend and backend coding would be a source of trouble.

However, the info I can find about this suggests that it's only relevant
for exported functions, not ones that are hidden by means of a
version-script.  Since libpq doesn't export pg_saslprep, this shouldn't
be happening?

But wait, this problem is being reported on OpenBSD.  I wonder whether
their linker behaves anything like Linux's ... hm, it looks like it
*is* the same linker.

I tried setting up a scram-auth-based postgres_fdw connection on
my RHEL6 box and forcing it through the questionable code path
by issuing a dummy ALTER USER MAPPING command from another session.
It worked fine, so the problem doesn't seem to be present on Linux.

Repeating the same thing on a nearby OpenBSD 6.0 image ... kaboom!
It fails exactly as Jeremy describes.  Furthermore, I can confirm
that the attached patch fixes it.

I wonder which other platforms we have that behave like this?
And how come we've not seen symptoms before?  It's hard to believe
that src/common/saslprep.c is the only thing libpq imports
that has FRONTEND-vs-not-FRONTEND behavior differences.

            regards, tom lane

diff --git a/src/Makefile.shlib b/src/Makefile.shlib
index 95b82a6..1555be3 100644
--- a/src/Makefile.shlib
+++ b/src/Makefile.shlib
@@ -149,6 +149,11 @@ ifeq ($(PORTNAME), openbsd)
     ifdef soname
       LINK.shared    += -Wl,-x,-soname,$(soname)
     endif
+    BUILD.exports    = ( echo '{ global:'; $(AWK) '/^[^\#]/ {printf "%s;\n",$$1}' $<; echo ' local: *; };' ) >$@
+    exports_file    = $(SHLIB_EXPORTS:%.txt=%.list)
+    ifneq (,$(exports_file))
+      LINK.shared    += -Wl,--version-script=$(exports_file)
+    endif
     SHLIB_LINK        += -lc
   else
     LINK.shared        = $(LD) -x -Bshareable -Bforcearchive

Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
I wrote:
> I tried setting up a scram-auth-based postgres_fdw connection on
> my RHEL6 box and forcing it through the questionable code path
> by issuing a dummy ALTER USER MAPPING command from another session.
> It worked fine, so the problem doesn't seem to be present on Linux.

Oh, I meant to show the test I used, for the benefit of anyone who
wants to test on other platforms.  Do this (with adjustments to
taste) to set up:

set password_encryption = "scram-sha-256" ;
create user u1 password 'foobar';
create extension postgres_fdw;
create server loopback foreign data wrapper postgres_fdw
  options (host 'localhost', dbname 'postgres');
create user mapping for u1 server loopback
 options (user 'u1', password 'foobar');
grant usage on foreign server loopback to u1;
\c - u1
create table remote (f1 int);
insert into remote values(1),(2);
create foreign table localt (f1 int) server loopback
  options (table_name 'remote');

At this point you should be able to do

select * from localt;

as u1; if not, adjust pg_hba.conf to require a password on localhost
connections.  Once the select works, *leave that session open*, and
in another session do any no-op ALTER on the user mapping, such as

alter user mapping for u1 server loopback options (set user 'u1');

Now, in the first session, repeat

select * from localt;

It'll dump core with a backtrace similar to what Jeremy showed,
if the bug is present.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Michael Paquier
Date:
On Sat, Sep 08, 2018 at 07:56:23AM -0400, Tom Lane wrote:
> Looking closer, there are no leaks, though I think we could use a comment
> about that.  And it still makes me itch that pg_saslprep (mostly) doesn't
> worry about clearing *output on failure while only three of its four
> callers bother to initialize their variables to null.  That's a recipe
> for future bugs.  So I propose the attached cleanup.

Okay, that looks acceptable to me.  Thanks for looking at that.
--
Michael

Attachment

Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Michael Paquier
Date:
On Sat, Sep 08, 2018 at 03:02:00PM -0400, Tom Lane wrote:
> I tried setting up a scram-auth-based postgres_fdw connection on
> my RHEL6 box and forcing it through the questionable code path
> by issuing a dummy ALTER USER MAPPING command from another session.
> It worked fine, so the problem doesn't seem to be present on Linux.
>
> Repeating the same thing on a nearby OpenBSD 6.0 image ... kaboom!
> It fails exactly as Jeremy describes.  Furthermore, I can confirm
> that the attached patch fixes it.

That's not good!  I have tried the report on Linux only, with pgbench -C
doing many foreign scans through postgres_fdw which connects to the
remote using SCRAM authentication, and this was stable.  Anything newer
than ec136d19 is impacted.  I am surprised this has not been seen
earlier to be honest.

> I wonder which other platforms we have that behave like this?
> And how come we've not seen symptoms before?  It's hard to believe
> that src/common/saslprep.c is the only thing libpq imports
> that has FRONTEND-vs-not-FRONTEND behavior differences.

libpq is quite particular in the fact that it compiles directly with
stuff in src/common, src/port, etc.  I think that the same problem
exists for pg_md5_encrypt().  So I like the solution proposed.  I guess
you have made sure that postgres_fdw becomes stable this way?  Have you
tried to stress it more with pgbench -C doing only foreign scans and
dozens of clients running in parallel?
--
Michael

Attachment

Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom> However, the info I can find about this suggests that it's only
 Tom> relevant for exported functions, not ones that are hidden by means
 Tom> of a version-script. Since libpq doesn't export pg_saslprep, this
 Tom> shouldn't be happening?

 Tom> But wait, this problem is being reported on OpenBSD. I wonder
 Tom> whether their linker behaves anything like Linux's ... hm, it
 Tom> looks like it *is* the same linker.

But the makefile is not doing the same thing. SHLIB_EXPORTS looks like
it's being ignored on every platform except linux, macos and windows in
Makefile.shlib.

-- 
Andrew (irc:RhodiumToad)


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
I wrote:
> I wonder which other platforms we have that behave like this?
> And how come we've not seen symptoms before?  It's hard to believe
> that src/common/saslprep.c is the only thing libpq imports
> that has FRONTEND-vs-not-FRONTEND behavior differences.

After groveling through all the FRONTEND conditionals in those
directories, it seems like indeed this is the first really obvious problem
that would ensue from libpq invoking the backend version of some common
function.  There are some minor pecadilloes:

* "could not determine encoding" warning in chklocale.c might get printed
via ereport, not fprintf(stderr) as expected; but likely nobody would even
realize that was the wrong thing, and it's a very unlikely message anyway.

* getaddrinfo.c might use non-reentrant gethostbyname, but in the backend
environment that's hardly a problem due to lack of threading.

* thread.c might use non-reentrant strerror, getpwuid, gethostbyname,
but ditto.

* saslprep.c might ereport about password-too-long, which would be bad
(leading to leakage of a partially-built PGconn) but it's probably not
a case anybody would exercise in the field.

* scram_build_verifier might palloc instead of malloc, as might
unicode_norm.c.  In some code paths I think that'd be OK because
the caller would also be backend code and would pfree not free,
but I think there are other paths where this is also a crash risk.
Still, these also are issues that are new with SCRAM support.

So the bottom line is that this theory does fit the observed facts.

Hence we need to make sure that version-script filtering happens on every
ELF-based platform, not just the most popular ones.  I see that the
FreeBSD and NetBSD stanzas in Makefile.shlib also have "ifdef ELF_SYSTEM"
portions, so probably we need to put the version-script stuff into those
too.  Anybody have thoughts about other possibly-affected platforms?

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
>> Is this potentially also a problem for libpqwalreceiver, which also
>> loads libpq?

The proposed fix would make libpq OK in that context, but I'm now
wondering if libpqwalreceiver itself needs a version-script.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>  Tom> But wait, this problem is being reported on OpenBSD. I wonder
>  Tom> whether their linker behaves anything like Linux's ... hm, it
>  Tom> looks like it *is* the same linker.

> But the makefile is not doing the same thing. SHLIB_EXPORTS looks like
> it's being ignored on every platform except linux, macos and windows in
> Makefile.shlib.

Exactly.  See proposed patch.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> libpq is quite particular in the fact that it compiles directly with
> stuff in src/common, src/port, etc.  I think that the same problem
> exists for pg_md5_encrypt().  So I like the solution proposed.  I guess
> you have made sure that postgres_fdw becomes stable this way?

It still passes its regression test with the patch, on my OpenBSD image.
See also the reproducer procedure I posted.

The bigger issue we need to worry about is whether applying version-script
symbol filtering would cause issues for third-party code that might be
relying on being able to access libpq "internal" functions on *BSD.
I don't have much sympathy for that and wouldn't hesitate to apply the
filtering in HEAD and v11, but maybe in v10 we ought not do that.

We could possibly band-aid around the problem in v10 by changing the SCRAM
support functions so that they always malloc, never palloc, whether in
frontend or not.  That might lead to memory leaks in the backend case,
but I'm not sure how much we care about memory leaks during backend
authentication failures --- the process would exit shortly anyway.
I've not investigated this idea any further than thinking of it, though,
and I'm not sure it's worth the trouble to avoid symbol filtering.

I'm inclined to add a test in src/test/authentication that exercises
this problem, in hopes of finding out from the buildfarm whether we
have any other platforms with the issue.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom> Hence we need to make sure that version-script filtering happens
 Tom> on every ELF-based platform, not just the most popular ones. I see
 Tom> that the FreeBSD and NetBSD stanzas in Makefile.shlib also have
 Tom> "ifdef ELF_SYSTEM" portions, so probably we need to put the
 Tom> version-script stuff into those too.

I tested adding your patch into the FreeBSD stanza and it worked for me,
FWIW. That was on a system which still has GNU ld, but I also tried a
test program using ld.lld and the same option and version script file
works.

-- 
Andrew (irc:RhodiumToad)


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>  Tom> Hence we need to make sure that version-script filtering happens
>  Tom> on every ELF-based platform, not just the most popular ones. I see
>  Tom> that the FreeBSD and NetBSD stanzas in Makefile.shlib also have
>  Tom> "ifdef ELF_SYSTEM" portions, so probably we need to put the
>  Tom> version-script stuff into those too.

> I tested adding your patch into the FreeBSD stanza and it worked for me,
> FWIW. That was on a system which still has GNU ld, but I also tried a
> test program using ld.lld and the same option and version script file
> works.

Hm.  So I can reproduce the crash on NetBSD as well as OpenBSD, using
the test method I showed before ... but *not* on FreeBSD 11.0 x86_64,
even though that's certainly an ELF system.  So now I'm confused again.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
I wrote:
> Hm.  So I can reproduce the crash on NetBSD as well as OpenBSD, using
> the test method I showed before ... but *not* on FreeBSD 11.0 x86_64,
> even though that's certainly an ELF system.  So now I'm confused again.

Further debugging shows conclusively that libpq *is* calling the backend
copy of pg_saslprep, and that what it passes to free() came from palloc()
in the case at hand.  Apparently, the default malloc library in this
version of FreeBSD is not as good at noticing bad free() arguments as
the other platforms I tested.  Probably, the process's malloc freelist
is now corrupt and that would eventually lead to a crash, but I did not
try to prove it.

Adding the proposed patch to the FreeBSD stanza of Makefile.shlib
does cause libpq to start calling the desired version of pg_saslprep.

On the other front of developing a test case to detect this problem,
I did not have much luck with mechanizing this specific test: it
requires some functionality we have in the TAP tests, like setting
up an installation with SCRAM password authentication enabled, as
well as other functionality we have in the isolation tester, like
doing things in two different sessions concurrently.  But we don't
really need to test this *exact* scenario; we just need something
that will behave differently if libpq links to backend versions of
any of the problematic functions.  I'm a bit tempted to add
something like this to saslprep.c:

bool
saslprep_is_frontend()
{
#ifdef FRONTEND
    return true;
#else
    return false;
#endif
}

and then have libpq test to make sure this function returns true.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
I wrote:
> On the other front of developing a test case to detect this problem,
> I did not have much luck with mechanizing this specific test: it
> requires some functionality we have in the TAP tests, like setting
> up an installation with SCRAM password authentication enabled, as
> well as other functionality we have in the isolation tester, like
> doing things in two different sessions concurrently.  But we don't
> really need to test this *exact* scenario; we just need something
> that will behave differently if libpq links to backend versions of
> any of the problematic functions.  I'm a bit tempted to add
> something like this to saslprep.c:

> bool
> saslprep_is_frontend()
> {
> #ifdef FRONTEND
>     return true;
> #else
>     return false;
> #endif
> }

On reflection, saslprep.c is not the right place for this: if some
other shlib needs a similar test, it might be forced to import SASL
support it has no need for.  So I put the test function into its
own new .c file, which increases the footprint of the patch a bit
but seems much less ad-hoc.

I verified that this patch causes a failure in the postgres_fdw
regression test on affected platforms, even my FreeBSD box where
the original bug doesn't cause an easily-detectable failure.

I propose this for HEAD only; the main point is just to detect
which platforms have the issue, and HEAD seems like enough for that.

            regards, tom lane

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index cdd71a9..578af2e 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -24,6 +24,7 @@
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
+#include "common/link-canary.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -503,6 +504,13 @@ BootstrapModeMain(void)
     Assert(IsBootstrapProcessingMode());

     /*
+     * To ensure that src/common/link-canary.c is linked into the backend, we
+     * must call it from somewhere.  Here is as good as anywhere.
+     */
+    if (pg_link_canary_is_frontend())
+        elog(ERROR, "backend is incorrectly linked to frontend functions");
+
+    /*
      * Do backend-like initialization for bootstrap mode
      */
     InitProcess();
diff --git a/src/common/Makefile b/src/common/Makefile
index 1fc2c66..6871747 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -41,7 +41,8 @@ override CPPFLAGS := -DFRONTEND $(CPPFLAGS)
 LIBS += $(PTHREAD_LIBS)

 OBJS_COMMON = base64.o config_info.o controldata_utils.o exec.o file_perm.o \
-    ip.o keywords.o md5.o pg_lzcompress.o pgfnames.o psprintf.o relpath.o \
+    ip.o keywords.o link-canary.o md5.o pg_lzcompress.o \
+    pgfnames.o psprintf.o relpath.o \
     rmtree.o saslprep.o scram-common.o string.o unicode_norm.o \
     username.o wait_error.o

diff --git a/src/common/link-canary.c b/src/common/link-canary.c
new file mode 100644
index 0000000..5b4e562
--- /dev/null
+++ b/src/common/link-canary.c
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ * link-canary.c
+ *      Detect whether src/common functions came from frontend or backend.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *      src/common/link-canary.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "c.h"
+
+#include "common/link-canary.h"
+
+/*
+ * This function just reports whether this file was compiled for frontend
+ * or backend environment.  We need this because in some systems, mainly
+ * ELF-based platforms, it is possible for a shlib (such as libpq) loaded
+ * into the backend to call a backend function named XYZ in preference to
+ * the shlib's own function XYZ.  That's bad if the two functions don't
+ * act identically.  This exact situation comes up for many functions in
+ * src/common and src/port, where the same function names exist in both
+ * libpq and the backend but they don't act quite identically.  To verify
+ * that appropriate measures have been taken to prevent incorrect symbol
+ * resolution, libpq should test that this function returns true.
+ */
+bool
+pg_link_canary_is_frontend(void)
+{
+#ifdef FRONTEND
+    return true;
+#else
+    return false;
+#endif
+}
diff --git a/src/include/common/link-canary.h b/src/include/common/link-canary.h
new file mode 100644
index 0000000..917faae
--- /dev/null
+++ b/src/include/common/link-canary.h
@@ -0,0 +1,17 @@
+/*-------------------------------------------------------------------------
+ *
+ * link-canary.h
+ *      Detect whether src/common functions came from frontend or backend.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * src/include/common/link-canary.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LINK_CANARY_H
+#define LINK_CANARY_H
+
+extern bool pg_link_canary_is_frontend(void);
+
+#endif                            /* LINK_CANARY_H */
diff --git a/src/interfaces/libpq/.gitignore b/src/interfaces/libpq/.gitignore
index 5c232ae..ce1576e 100644
--- a/src/interfaces/libpq/.gitignore
+++ b/src/interfaces/libpq/.gitignore
@@ -25,6 +25,7 @@
 /ip.c
 /md5.c
 /base64.c
+/link-canary.c
 /scram-common.c
 /sha2.c
 /sha2_openssl.c
diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile
index abe0a50..ec0122a 100644
--- a/src/interfaces/libpq/Makefile
+++ b/src/interfaces/libpq/Makefile
@@ -49,7 +49,7 @@ endif
 # src/backend/utils/mb
 OBJS += encnames.o wchar.o
 # src/common
-OBJS += base64.o ip.o md5.o scram-common.o saslprep.o unicode_norm.o
+OBJS += base64.o ip.o link-canary.o md5.o scram-common.o saslprep.o unicode_norm.o

 ifeq ($(with_openssl),yes)
 OBJS += fe-secure-openssl.o fe-secure-common.o sha2_openssl.o
@@ -106,7 +106,7 @@ backend_src = $(top_srcdir)/src/backend
 chklocale.c crypt.c erand48.c getaddrinfo.c getpeereid.c inet_aton.c inet_net_ntop.c noblock.c open.c system.c
pgsleep.cpg_strong_random.c pgstrcasecmp.c pqsignal.c snprintf.c strerror.c strlcpy.c strnlen.c thread.c win32error.c
win32setlocale.c:% : $(top_srcdir)/src/port/% 
     rm -f $@ && $(LN_S) $< .

-ip.c md5.c base64.c scram-common.c sha2.c sha2_openssl.c saslprep.c unicode_norm.c: % : $(top_srcdir)/src/common/%
+ip.c md5.c base64.c link-canary.c scram-common.c sha2.c sha2_openssl.c saslprep.c unicode_norm.c: % :
$(top_srcdir)/src/common/%
     rm -f $@ && $(LN_S) $< .

 encnames.c wchar.c: % : $(backend_src)/utils/mb/%
@@ -156,7 +156,7 @@ clean distclean: clean-lib
     rm -f pg_config_paths.h
 # Remove files we (may have) symlinked in from src/port and other places
     rm -f chklocale.c crypt.c erand48.c getaddrinfo.c getpeereid.c inet_aton.c inet_net_ntop.c noblock.c open.c
system.cpgsleep.c pg_strong_random.c pgstrcasecmp.c pqsignal.c snprintf.c strerror.c strlcpy.c strnlen.c thread.c
win32error.cwin32setlocale.c 
-    rm -f ip.c md5.c base64.c scram-common.c sha2.c sha2_openssl.c saslprep.c unicode_norm.c
+    rm -f ip.c md5.c base64.c link-canary.c scram-common.c sha2.c sha2_openssl.c saslprep.c unicode_norm.c
     rm -f encnames.c wchar.c

 maintainer-clean: distclean maintainer-clean-lib
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index db5aacd..42cdb97 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -71,6 +71,7 @@ static int ldapServiceLookup(const char *purl, PQconninfoOption *options,
 #endif

 #include "common/ip.h"
+#include "common/link-canary.h"
 #include "common/scram-common.h"
 #include "mb/pg_wchar.h"
 #include "port/pg_bswap.h"
@@ -1748,6 +1749,19 @@ connectDBStart(PGconn *conn)
     if (!conn->options_valid)
         goto connect_errReturn;

+    /*
+     * Check for bad linking to backend-internal versions of src/common
+     * functions (see comments in link-canary.c for the reason we need this).
+     * Nobody but developers should see this message, so we don't bother
+     * translating it.
+     */
+    if (!pg_link_canary_is_frontend())
+    {
+        printfPQExpBuffer(&conn->errorMessage,
+                          "libpq is incorrectly linked to backend functions\n");
+        goto connect_errReturn;
+    }
+
     /* Ensure our buffers are empty */
     conn->inStart = conn->inCursor = conn->inEnd = 0;
     conn->outCount = 0;
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 14d2a3c..9ce03ce 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -116,7 +116,8 @@ sub mkvcbuild

     our @pgcommonallfiles = qw(
       base64.c config_info.c controldata_utils.c exec.c file_perm.c ip.c
-      keywords.c md5.c pg_lzcompress.c pgfnames.c psprintf.c relpath.c rmtree.c
+      keywords.c link-canary.c md5.c
+      pg_lzcompress.c pgfnames.c psprintf.c relpath.c rmtree.c
       saslprep.c scram-common.c string.c unicode_norm.c username.c
       wait_error.c);


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Is this potentially also a problem for libpqwalreceiver, which also
> loads libpq?

As far as I can see, only libpq and the ECPG libraries build their
own copies of any src/port or src/common files, and all of them already
have version scripts.  If libpqwalreceiver tried to incorporate any
such files and compile them under FRONTEND rules, it'd need a version
script as well.  But I think that code is probably content to operate
under backend rules, so it'll just use the functions from the core
backend and be happy.  Hence, no clear need for a version script there.

I think the patches I committed today are sufficient to close this bug,
unless further testing reveals that the problem occurs on additional
platforms.  The buildfarm results so far don't suggest that.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom> I think the patches I committed today are sufficient to close this
 Tom> bug, unless further testing reveals that the problem occurs on
 Tom> additional platforms. The buildfarm results so far don't suggest
 Tom> that.

It seems to have failed on gaur (HP-UX 10.20 gcc 3.4.6 HPPA) too? or do
we not care about that?

No reports yet from any of the solaris derivatives (which will certainly
break) or from AIX (about which I wouldn't care to guess).

-- 
Andrew (irc:RhodiumToad)


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>  Tom> I think the patches I committed today are sufficient to close this
>  Tom> bug, unless further testing reveals that the problem occurs on
>  Tom> additional platforms. The buildfarm results so far don't suggest
>  Tom> that.

> It seems to have failed on gaur (HP-UX 10.20 gcc 3.4.6 HPPA) too?

Yeah, that came in right after I sent this :-(

> or do we not care about that?

Well, I care, but if I can't fix it then we'll just have to drop
this animal.  I've not studied the docs yet to see if there's a
solution.

> No reports yet from any of the solaris derivatives (which will certainly
> break) or from AIX (about which I wouldn't care to guess).

I'm waiting for those too.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
I wrote:
> Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
>> It seems to have failed on gaur (HP-UX 10.20 gcc 3.4.6 HPPA) too?

> Yeah, that came in right after I sent this :-(

>> or do we not care about that?

> Well, I care, but if I can't fix it then we'll just have to drop
> this animal.  I've not studied the docs yet to see if there's a
> solution.

Ugh ... anole just reported in with the same issue, which means it
exists on HP-UX versions that people might still care about in
practice.

I still haven't looked for a fix (real life has intruded more than
usual right at the moment).

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
I wrote:
> Ugh ... anole just reported in with the same issue, which means it
> exists on HP-UX versions that people might still care about in
> practice.
> I still haven't looked for a fix (real life has intruded more than
> usual right at the moment).

AFAICT, this platform's "ld" doesn't have any simple equivalent of
a version script, but it does have "-B symbolic", which appears to
fix the problem in a quick test.  I'm running a more thorough test
now.

The fix I'm able to test looks like

 ifeq ($(PORTNAME), hpux)
   ifdef SO_MAJOR_VERSION
     shlib            = lib$(NAME)$(DLSUFFIX).$(SO_MAJOR_VERSION)
   endif
   ifeq ($(with_gnu_ld), yes)
     LINK.shared        = $(CC) -shared
     ifdef soname
       LINK.shared    += -Wl,-h -Wl,$(soname)
     endif
   else
-    LINK.shared        = $(LD) -b
+    LINK.shared        = $(LD) -b -B symbolic
     ifdef soname
       LINK.shared    += +h $(soname)
     endif

I'm not sure what to do in the with_gnu_ld subsection: building with
GNU ld might or might not have the problem, and if it does, the best
fix might or might not be "-B symbolic".  But none of the HPUX animals
in the buildfarm are using GNU ld, so I suspect that that subsection is
dead code.  I'm inclined to just add a comment saying that that code
has not been tested recently.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
I wrote:
> AFAICT, this platform's "ld" doesn't have any simple equivalent of
> a version script, but it does have "-B symbolic", which appears to
> fix the problem in a quick test.

BTW, I discovered while rummaging through the git history that we
used to use -Bsymbolic on most platforms --- but only for ODBC,
which had exactly the problem we're fighting now of an intended-to-
be-local reference being resolved outside the library, but in a place
where the malfunction was very obvious.  When we removed ODBC from the
core distro, the Makefile support for -Bsymbolic eventually went away
too.  But back around 7.2 we had this:

$ grep -r shlib_symbolic /home/postgres/REL7_2/pgsql
/home/postgres/REL7_2/pgsql/src/interfaces/odbc/GNUmakefile: 54: LINK.shared += $(shlib_symbolic)
/home/postgres/REL7_2/pgsql/src/makefiles/Makefile.bsdi: 19: shlib_symbolic = -Wl,-Bsymbolic
/home/postgres/REL7_2/pgsql/src/makefiles/Makefile.freebsd: 6: shlib_symbolic = -Wl,-Bsymbolic -lc
/home/postgres/REL7_2/pgsql/src/makefiles/Makefile.hpux: 33: shlib_symbolic = -Bsymbolic
/home/postgres/REL7_2/pgsql/src/makefiles/Makefile.irix5: 4: shlib_symbolic = -Wl,-B,symbolic
/home/postgres/REL7_2/pgsql/src/makefiles/Makefile.linux: 4: shlib_symbolic = -Wl,-Bsymbolic
/home/postgres/REL7_2/pgsql/src/makefiles/Makefile.netbsd: 6: shlib_symbolic = -Wl,-Bsymbolic -lc
/home/postgres/REL7_2/pgsql/src/makefiles/Makefile.openbsd: 6: shlib_symbolic = -Wl,-Bsymbolic
/home/postgres/REL7_2/pgsql/src/makefiles/Makefile.sco: 3: shlib_symbolic = -Wl,-Bsymbolic
/home/postgres/REL7_2/pgsql/src/makefiles/Makefile.solaris: 11: shlib_symbolic = -Wl,-Bsymbolic
/home/postgres/REL7_2/pgsql/src/makefiles/Makefile.unixware: 6: shlib_symbolic = -Wl,-Bsymbolic

This suggests that we're gonna need -Bsymbolic on Solaris too,
which agrees with your remark upthread.  Unfortunately, with
all the Solaris buildfarm members being hors de C99 right now,
there's no easy way to test.  I'm inclined to go fix the problem
that's blocking castoroides, at least.

            regards, tom lane


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom> This suggests that we're gonna need -Bsymbolic on Solaris too,
 Tom> which agrees with your remark upthread. Unfortunately, with all
 Tom> the Solaris buildfarm members being hors de C99 right now, there's
 Tom> no easy way to test.

damselfly and rover_firefly are running derivatives of OpenSolaris, and
they failed as expected.

Using a version script should be possible on Solaris and derivatives (I
believe it was Solaris that introduced the concept and they were copied
by GNU ld, not the reverse) but I don't know what the linker options
are.

-- 
Andrew (irc:RhodiumToad)


Re: BUG #15367: Crash in pg_fe_scram_free when using foreign tables

From
Tom Lane
Date:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>  Tom> This suggests that we're gonna need -Bsymbolic on Solaris too,
>  Tom> which agrees with your remark upthread. Unfortunately, with all
>  Tom> the Solaris buildfarm members being hors de C99 right now, there's
>  Tom> no easy way to test.

> damselfly and rover_firefly are running derivatives of OpenSolaris, and
> they failed as expected.

Ah, so I see.

> Using a version script should be possible on Solaris and derivatives (I
> believe it was Solaris that introduced the concept and they were copied
> by GNU ld, not the reverse) but I don't know what the linker options
> are.

Me neither.  If somebody using one of those platforms wants to offer
a better solution, that's fine, but for today let's go with -Bsymbolic.
I pushed a patch accordingly.

            regards, tom lane