Thread: BUG #15367: Crash in pg_fe_scram_free when using foreign tables
The following bug has been logged on the website: Bug reference: 15367 Logged by: Jeremy Evans Email address: code@jeremyevans.net PostgreSQL version: 10.5 Operating system: OpenBSD 6.3 amd64 Description: One of the postgres processes in our production environment crashed with the following backtrace: (gdb) bt #0 thrkill () at -:3 #1 0x000004a14ca47fbe in _libc_abort () at /usr/src/lib/libc/stdlib/abort.c:51 #2 0x000004a14c9fd029 in wrterror (d=Variable "d" is not available. ) at /usr/src/lib/libc/stdlib/malloc.c:288 #3 0x000004a14c9fd34f in ofree (argpool=Variable "argpool" is not available. ) at /usr/src/lib/libc/stdlib/malloc.c:1298 #4 0x000004a14c9fd109 in free (ptr=0x4a115aec398) at /usr/src/lib/libc/stdlib/malloc.c:1416 #5 0x000004a09a7301e8 in pg_fe_scram_free () from /usr/local/lib/libpq.so.6.10 #6 0x000004a09a73561f in pqPacketSend () from /usr/local/lib/libpq.so.6.10 #7 0x000004a09a731507 in PQfinish () from /usr/local/lib/libpq.so.6.10 #8 0x000004a074dc5549 in GetConnection () from /usr/local/lib/postgresql/postgres_fdw.so #9 0x000004a074dbc02a in postgresBeginForeignScan () from /usr/local/lib/postgresql/postgres_fdw.so #10 0x0000049e62eab412 in ExecInitForeignScan () from /usr/local/bin/postgres #11 0x0000049e62e89ace in ExecInitNode () from /usr/local/bin/postgres #12 0x0000049e62e9aae8 in ExecInitHashJoin () from /usr/local/bin/postgres #13 0x0000049e62e89b3e in ExecInitNode () from /usr/local/bin/postgres #14 0x0000049e62ea7a06 in ExecInitSort () from /usr/local/bin/postgres #15 0x0000049e62e89aee in ExecInitNode () from /usr/local/bin/postgres #16 0x0000049e62e85379 in standard_ExecutorStart () from /usr/local/bin/postgres #17 0x0000049e62fc4ffb in PortalStart () from /usr/local/bin/postgres #18 0x0000049e62fc43d5 in exec_simple_query () from /usr/local/bin/postgres #19 0x0000049e62fc250b in PostgresMain () from /usr/local/bin/postgres #20 0x0000049e62f473c5 in PostmasterMain () from /usr/local/bin/postgres #21 0x0000049e62ec986b in main () from /usr/local/bin/postgres The PostgreSQL logfile showed: postgres(64978) in free(): bogus pointer (double free?) 0x4a115aec398 2018-09-06 12:01:52.202 PDT [45953] LOG: server process (PID 64978) was terminated by signal 6: Abort trap We are using foreign tables to other databases in the same cluster, and all databases are set to require SCRAM authentication. We've been using SCRAM authentication for about 2 months, and foreign tables for a few years without problems. PostgreSQL version: PostgreSQL 10.5 on x86_64-unknown-openbsd6.3, compiled by OpenBSD clang version 5.0.1 (tags/RELEASE_501/final) (based on LLVM 5.0.1), 64-bit PostgreSQL was installed via OpenBSD ports, using the 6.3-stable branch. PostgreSQL configuration file changes: DateStyle | ISO, MDY | configuration file default_text_search_config | pg_catalog.english | configuration file dynamic_shared_memory_type | posix | configuration file lc_messages | C | configuration file lc_monetary | C | configuration file lc_numeric | C | configuration file lc_time | C | configuration file listen_addresses | * | configuration file log_timezone | US/Pacific | configuration file max_connections | 100 | configuration file max_stack_depth | 2MB | environment variable password_encryption | scram-sha-256 | configuration file shared_buffers | 128MB | configuration file ssl | on | configuration file TimeZone | US/Pacific | configuration file Operating system: OpenBSD 6.3 amd64, with OpenBSD syspatches up to 016: OpenBSD bethpage.bsa.ca.gov 6.3 GENERIC.MP#8 amd64 As the backtrace shows the error was during the closing of a connection, this appears to be unrelated to any specific query. The backtrace also shows withis is due to the use of foreign tables, and thus likely independent of what client program was used to connect. In case it does matter, it was using libpq via the ruby-pg library, and no connection pool or load balancer was being used. Nothing unusual in the PostgreSQL logs other than the the crash and the typical crash recovery. Hardware is a HP DL380p Gen8 server with 256GB of ECC memory and 2 800GB SSDs in hardware RAID-1, with 2 Xeon E5-2670 CPUs. Due to the crash, this is unlikely to be disk/RAID related. General memory corruption is also unlikely as the memory is ECC. If necessary I can build a debug version of PostgreSQL and try using that in production so I can get a better backtrace if it crashes again. However, considering that the crash is rare in my environment, it's unlikely I will be able to produce a better backtrace for the error quickly.
=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes: > One of the postgres processes in our production environment crashed with the > following backtrace: > (gdb) bt > #0 thrkill () at -:3 > #1 0x000004a14ca47fbe in _libc_abort () at > /usr/src/lib/libc/stdlib/abort.c:51 > #2 0x000004a14c9fd029 in wrterror (d=Variable "d" is not available. > ) at /usr/src/lib/libc/stdlib/malloc.c:288 > #3 0x000004a14c9fd34f in ofree (argpool=Variable "argpool" is not > available. > ) at /usr/src/lib/libc/stdlib/malloc.c:1298 > #4 0x000004a14c9fd109 in free (ptr=0x4a115aec398) at > /usr/src/lib/libc/stdlib/malloc.c:1416 > #5 0x000004a09a7301e8 in pg_fe_scram_free () from > /usr/local/lib/libpq.so.6.10 > #6 0x000004a09a73561f in pqPacketSend () from > /usr/local/lib/libpq.so.6.10 > #7 0x000004a09a731507 in PQfinish () from /usr/local/lib/libpq.so.6.10 > #8 0x000004a074dc5549 in GetConnection () from > /usr/local/lib/postgresql/postgres_fdw.so Hmm. Unfortunately, I don't think I believe this backtrace. For one thing, there's no direct pathway from pqPacketSend to pg_fe_scram_free. Also, if we ignore that and figure that somehow pg_fe_scram_free is doing a duplicate free, it's still really hard to see how that would happen. Given that you didn't have debug symbols installed, it's not very surprising for the trace to be misleading. Probably, what we're seeing here is the nearest exported functions' names rather than where control really went. > If necessary I can build a debug version of PostgreSQL and try using that in > production so I can get a better backtrace if it crashes again. Please. I don't doubt that there's an issue here, but we're going to need a higher-quality fix on where it is before we can do much. regards, tom lane
On Thu, Sep 06, 2018 at 08:35:39PM +0000, PG Bug reporting form wrote: > If necessary I can build a debug version of PostgreSQL and try using that in > production so I can get a better backtrace if it crashes again. However, > considering that the crash is rare in my environment, it's unlikely I will > be able to produce a better backtrace for the error quickly. That would be nice. From what I can see this would be a race condition, which is not obvious by looking at the code. Testing with a two-node deployment where the first node has a foreign table which connects to a second node, using SCRAM authentication, holding the physical table, then doing many foreign scans across many clients don't show any problem. Did libpq complain at some point in the session where the crash happened about any error? -- Michael
Attachment
On 09/06 02:58, Michael Paquier wrote: > On Thu, Sep 06, 2018 at 08:35:39PM +0000, PG Bug reporting form wrote: > > If necessary I can build a debug version of PostgreSQL and try using that in > > production so I can get a better backtrace if it crashes again. However, > > considering that the crash is rare in my environment, it's unlikely I will > > be able to produce a better backtrace for the error quickly. > > That would be nice. From what I can see this would be a race condition, > which is not obvious by looking at the code. Testing with a two-node > deployment where the first node has a foreign table which connects to a > second node, using SCRAM authentication, holding the physical table, > then doing many foreign scans across many clients don't show any > problem. Did libpq complain at some point in the session where the > crash happened about any error? The PostgreSQL logfile only shows: postgres(64978) in free(): bogus pointer (double free?) 0x4a115aec398 2018-09-06 12:01:52.202 PDT [45953] LOG: server process (PID 64978) was terminated by signal 6: Abort trap 2018-09-06 12:01:52.202 PDT [45953] DETAIL: Failed process was running: ... 2018-09-06 12:01:52.202 PDT [45953] LOG: terminating any other active server processes If there is another place I should look, please let me know. The log files of the client process don't show anything during the crash, probably because the client libpq connection was just dropped when the server process crashed. After the crash, other client libpq connections show the following, which is probably expected: WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because anotherserver process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. I'll try to install a version with debug symbols on September 14, and if it crashes again I'll respond with a more complete and accurate backtrace. Thanks, Jeremy
On 09/06 03:28, Jeremy Evans wrote: > On 09/06 02:58, Michael Paquier wrote: > > On Thu, Sep 06, 2018 at 08:35:39PM +0000, PG Bug reporting form wrote: > > > If necessary I can build a debug version of PostgreSQL and try using that in > > > production so I can get a better backtrace if it crashes again. However, > > > considering that the crash is rare in my environment, it's unlikely I will > > > be able to produce a better backtrace for the error quickly. > > > > That would be nice. From what I can see this would be a race condition, > > which is not obvious by looking at the code. Testing with a two-node > > deployment where the first node has a foreign table which connects to a > > second node, using SCRAM authentication, holding the physical table, > > then doing many foreign scans across many clients don't show any > > problem. Did libpq complain at some point in the session where the > > crash happened about any error? > > The PostgreSQL logfile only shows: > > postgres(64978) in free(): bogus pointer (double free?) 0x4a115aec398 > 2018-09-06 12:01:52.202 PDT [45953] LOG: server process (PID 64978) was terminated by signal 6: Abort trap > 2018-09-06 12:01:52.202 PDT [45953] DETAIL: Failed process was running: ... > 2018-09-06 12:01:52.202 PDT [45953] LOG: terminating any other active server processes > > If there is another place I should look, please let me know. The log > files of the client process don't show anything during the crash, > probably because the client libpq connection was just dropped when the > server process crashed. After the crash, other client libpq connections > show the following, which is probably expected: > > WARNING: terminating connection because of crash of another server process > DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because anotherserver process exited abnormally and possibly corrupted shared memory. > HINT: In a moment you should be able to reconnect to the database and repeat your command. > > I'll try to install a version with debug symbols on September 14, and > if it crashes again I'll respond with a more complete and accurate > backtrace. We experienced an almost identical crash this morning. The query was different, but the backtrace was almost the same, and the query was using a foreign table with SCRAM authentication, just like the one yesterday. One thing that was similar between the two crashes is that shortly before both crashes, we were testing database changes on different databases in the same cluster, different from both the postgres process that crashed (the client of the foreign table scan) and the postgres process executing the foreign table scan. The database changes mostly consisted of the following statement types: DROP TABLE DROP SCHEMA DROP FUNCTION CREATE FUNCTION CREATE SCHEMA CREATE TABLE CREATE INDEX CREATE TRIGGER INSERT GRANT We only recently started testing these database changes in this cluster yesterday. Based on the timing, I'm guessing this issue only occurs when system table changes are being made. I hadn't yet had time to install debug symbols on the production server, but since I think I have a better idea on how to recreate this issue, I will try recreating this on a test cluster with debug symbols. Thanks, Jeremy
On 09/07 09:51, Jeremy Evans wrote: > I hadn't yet had time to install debug symbols on the production server, > but since I think I have a better idea on how to recreate this issue, I > will try recreating this on a test cluster with debug symbols. Thankfully, I was able to recreate this on a test server with debug symbols, by using \watch in psql to run the foreign query while running the database changes on a separate database in a loop. Here's the backtrace: #1 0x000017c254721a8e in _libc_abort () at /usr/src/lib/libc/stdlib/abort.c:51 #2 0x000017c2547303b9 in wrterror (d=Variable "d" is not available.) at /usr/src/lib/libc/stdlib/malloc.c:288 #3 0x000017c2547306df in ofree (argpool=Variable "argpool" is not available.) at /usr/src/lib/libc/stdlib/malloc.c:1298 #4 0x000017c254730499 in free (ptr=0x17c1ee153398) at /usr/src/lib/libc/stdlib/malloc.c:1416 #5 0x000017c20a6b8b00 in pg_fe_scram_free (opaq=0x17c223712000) at fe-auth-scram.c:127 #6 0x000017c20a6b9d20 in pqDropConnection (conn=0x17c1dac61800, flushInput=1 '\001') at fe-connect.c:479 #7 0x000017c20a6bf95e in closePGconn (conn=0x17c1dac61800) at fe-connect.c:3710 #8 0x000017c20a6ba230 in PQfinish (conn=0x17c1dac61800) at fe-connect.c:3731 #9 0x000017c2c276a838 in disconnect_pg_server (entry=0x17c25b118658) at connection.c:312 #10 0x000017c2c2769dba in GetConnection (user=0x17c2c2285c50, will_prep_stmt=0 '\0') at connection.c:169 #11 0x000017c2c275c597 in postgresBeginForeignScan (node=0x17c2d8b18028, eflags=16) at postgres_fdw.c:1325 #12 0x000017bfd91aaa24 in ExecInitForeignScan (node=0x17c26c166408, estate=0x17c2d8b17038, eflags=16) at nodeForeignscan.c:231 #13 0x000017bfd9177f9a in ExecInitNode (node=0x17c26c166408, estate=0x17c2d8b17038, eflags=16) at execProcnode.c:277 #14 0x000017bfd9190f84 in ExecInitHashJoin (node=0x17c26c16a050, estate=0x17c2d8b17038, eflags=16) at nodeHashjoin.c:433 #15 0x000017bfd917800a in ExecInitNode (node=0x17c26c16a050, estate=0x17c2d8b17038, eflags=16) at execProcnode.c:300 #16 0x000017bfd91a4ae0 in ExecInitSort (node=0x17c26c168c50, estate=0x17c2d8b17038, eflags=16) at nodeSort.c:207 #17 0x000017bfd9178042 in ExecInitNode (node=0x17c26c168c50, estate=0x17c2d8b17038, eflags=16) at execProcnode.c:313 #18 0x000017bfd917114b in InitPlan (queryDesc=0x17c1e8fce038, eflags=16) at execMain.c:1045 #19 0x000017bfd9170920 in standard_ExecutorStart (queryDesc=0x17c1e8fce038, eflags=16) at execMain.c:264 #20 0x000017bfd91706ee in ExecutorStart (queryDesc=0x17c1e8fce038, eflags=0) at execMain.c:152 #21 0x000017bfd935a604 in PortalStart (portal=0x17c1dac5f038, params=0x0, eflags=0, snapshot=0x0) at pquery.c:520 #22 0x000017bfd935682d in exec_simple_query ( query_string=0x17c1dbf45038 "SELECT "...) at postgres.c:1060 #23 0x000017bfd9355d13 in PostgresMain (argc=1, argv=0x17c2460a1180, dbname=0x17c2528b2ff8 "bsawww", username=0x17c2528b2fc8"recommendations_public") at postgres.c:4088 #24 0x000017bfd92adb29 in BackendRun (port=0x17c2bf214800) at postmaster.c:4405 #25 0x000017bfd92aceb3 in BackendStartup (port=0x17c2bf214800) at postmaster.c:4077 #26 0x000017bfd92abf3a in ServerLoop () at postmaster.c:1755 #27 0x000017bfd92a9b46 in PostmasterMain (argc=3, argv=0x7f7ffffc1c58) at postmaster.c:1363 #28 0x000017bfd91d6f57 in main (argc=3, argv=0x7f7ffffc1c58) at main.c:228 Hopefully this helps. If you need more information, please let me know. Thanks, Jeremy
On Fri, Sep 07, 2018 at 10:55:18AM -0700, Jeremy Evans wrote: > Here's the backtrace: > > #5 0x000017c20a6b8b00 in pg_fe_scram_free (opaq=0x17c223712000) at fe-auth-scram.c:127 > #6 0x000017c20a6b9d20 in pqDropConnection (conn=0x17c1dac61800, flushInput=1 '\001') at fe-connect.c:479 > #7 0x000017c20a6bf95e in closePGconn (conn=0x17c1dac61800) at fe-connect.c:3710 > > Hopefully this helps. If you need more information, please let me know. Thanks. This fails when freeing the password field in fe_scram_state. I have a question: does your password use non-ASCII characters which could make SASLprep to be run on the password string? The result string from pg_saslprep allocates a new string for any results returned by reading the code, I am wondering if we could be missing something.. -- Michael
Attachment
On 09/07 11:15, Michael Paquier wrote: > On Fri, Sep 07, 2018 at 10:55:18AM -0700, Jeremy Evans wrote: > > Here's the backtrace: > > > > #5 0x000017c20a6b8b00 in pg_fe_scram_free (opaq=0x17c223712000) at fe-auth-scram.c:127 > > #6 0x000017c20a6b9d20 in pqDropConnection (conn=0x17c1dac61800, flushInput=1 '\001') at fe-connect.c:479 > > #7 0x000017c20a6bf95e in closePGconn (conn=0x17c1dac61800) at fe-connect.c:3710 > > > > Hopefully this helps. If you need more information, please let me know. > > Thanks. This fails when freeing the password field in fe_scram_state. > I have a question: does your password use non-ASCII characters which > could make SASLprep to be run on the password string? The result string > from pg_saslprep allocates a new string for any results returned by > reading the code, I am wondering if we could be missing something.. No, the user mapping password is comprised of hex-ascii characters: [0-9a-f]{16} In case it helps, here's the content of opaq from gdb (with the specific strings cleared). print *((fe_scram_state *) opaq) { state = FE_SCRAM_FINISHED, username = 0x17c267526380 "...", password = 0x17c1ee153398 <Address 0x17c1ee153398 out of bounds>, SaltedPassword = 0x17c223712018 "...", client_nonce = 0x17c267529f60 "...", client_first_message_bare = 0x17c223711da0 "...", client_final_message_without_proof = 0x17c271ca19c0 "...", server_first_message = 0x17c1ea48a400 "...", salt = 0x17c2675261c0 "...", saltlen = 16, iterations = 4096, nonce = 0x17c2179e9880 "...", server_final_message = 0x17c271ca1a40 "...", ServerSignature = 0x17c223712078 "..." } Thanks, Jeremy
Michael Paquier <michael@paquier.xyz> writes: > Thanks. This fails when freeing the password field in fe_scram_state. I had been studying that when the report first arrived. I'm not really happy with the coding in pg_saslprep(): it leaves garbage in its output parameter in some code paths, and I think it leaks memory in others. But I couldn't see a path that would lead to the observed failure. regards, tom lane
On Fri, Sep 07, 2018 at 04:18:55PM -0400, Tom Lane wrote: > I had been studying that when the report first arrived. I'm not really > happy with the coding in pg_saslprep(): it leaves garbage in its output > parameter in some code paths, and I think it leaks memory in others. Agreed. Do you want to give it a go? I can see some of those code paths? Likely you spotted more. > But I couldn't see a path that would lead to the observed failure. Agreed. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > On Fri, Sep 07, 2018 at 04:18:55PM -0400, Tom Lane wrote: >> I had been studying that when the report first arrived. I'm not really >> happy with the coding in pg_saslprep(): it leaves garbage in its output >> parameter in some code paths, and I think it leaks memory in others. > Agreed. Do you want to give it a go? I can see some of those code > paths? Likely you spotted more. Looking closer, there are no leaks, though I think we could use a comment about that. And it still makes me itch that pg_saslprep (mostly) doesn't worry about clearing *output on failure while only three of its four callers bother to initialize their variables to null. That's a recipe for future bugs. So I propose the attached cleanup. We're still no closer to an explanation of Jeremy's failure, though I'm now pretty sure that pg_saslprep itself isn't the issue. regards, tom lane diff --git a/src/backend/libpq/auth-scram.c b/src/backend/libpq/auth-scram.c index 8fbad53..e997c94 100644 --- a/src/backend/libpq/auth-scram.c +++ b/src/backend/libpq/auth-scram.c @@ -453,7 +453,7 @@ pg_be_scram_exchange(void *opaq, char *input, int inputlen, char * pg_be_scram_build_verifier(const char *password) { - char *prep_password = NULL; + char *prep_password; pg_saslprep_rc rc; char saltbuf[SCRAM_DEFAULT_SALT_LEN]; char *result; @@ -499,7 +499,7 @@ scram_verify_plain_password(const char *username, const char *password, uint8 stored_key[SCRAM_KEY_LEN]; uint8 server_key[SCRAM_KEY_LEN]; uint8 computed_key[SCRAM_KEY_LEN]; - char *prep_password = NULL; + char *prep_password; pg_saslprep_rc rc; if (!parse_scram_verifier(verifier, &iterations, &encoded_salt, diff --git a/src/common/saslprep.c b/src/common/saslprep.c index 2710215..4cf574f 100644 --- a/src/common/saslprep.c +++ b/src/common/saslprep.c @@ -1081,6 +1081,9 @@ pg_saslprep(const char *input, char **output) unsigned char *p; pg_wchar *wp; + /* Ensure we return *output as NULL on failure */ + *output = NULL; + /* Check that the password isn't stupendously long */ if (strlen(input) > MAX_PASSWORD_LENGTH) { @@ -1112,10 +1115,7 @@ pg_saslprep(const char *input, char **output) */ input_size = pg_utf8_string_len(input); if (input_size < 0) - { - *output = NULL; return SASLPREP_INVALID_UTF8; - } input_chars = ALLOC((input_size + 1) * sizeof(pg_wchar)); if (!input_chars) @@ -1246,6 +1246,11 @@ pg_saslprep(const char *input, char **output) result = ALLOC(result_size + 1); if (!result) goto oom; + + /* + * There are no error exits below here, so the error exit paths don't need + * to worry about possibly freeing "result". + */ p = (unsigned char *) result; for (wp = output_chars; *wp; wp++) { diff --git a/src/interfaces/libpq/fe-auth-scram.c b/src/interfaces/libpq/fe-auth-scram.c index 1d9c937..81531f3 100644 --- a/src/interfaces/libpq/fe-auth-scram.c +++ b/src/interfaces/libpq/fe-auth-scram.c @@ -747,7 +747,7 @@ verify_server_signature(fe_scram_state *state) char * pg_fe_scram_build_verifier(const char *password) { - char *prep_password = NULL; + char *prep_password; pg_saslprep_rc rc; char saltbuf[SCRAM_DEFAULT_SALT_LEN]; char *result;
I wrote: > We're still no closer to an explanation of Jeremy's failure, though > I'm now pretty sure that pg_saslprep itself isn't the issue. I had an idea about that --- it's probably all wet, but the code as written seems bulletproof enough that I'm forced to postulate something very strange is happening. Observe that there are two copies of pg_saslprep() in play: there is one in the backend, which is compiled to allocate its result with palloc, and there is one in libpq, which is compiled to allocate its result with malloc. Could it be that somehow, when libpq is loaded into the backend address space as it is here, libpq winds up calling the backend's copy of pg_saslprep rather than its own? That would work just fine, until libpq tried to free the returned string using free(), and then we'd get exactly the reported error. The main weakness in this theory is that it suggests that Jeremy's postgres_fdw connections ought to be falling over more easily than they are. However, I think that postgres_fdw will never explicitly close a PGconn unless it's forced to; during a normal backend session exit, the process just dies without going through PQfinish, so that the problem would be masked. The only way to get to the PQfinish call shown in the backtrace is for pgfdw_inval_callback to mark the connection invalid, which'd require either an update on the relevant foreign server object, an update on the user mapping in use, or a SI cache reset. That explains how heavy DDL activity in an apparently unrelated database can trigger the problem: it eventually results in a SI message queue overrun and ensuing cache reset. In the absence of SI cache resets, maybe indeed the problem is rare even if the pg_saslprep result is misallocated every time. Not sure about a good way to test this theory. Need more caffeine. regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> Could it be that somehow, when libpq is loaded into the backend Tom> address space as it is here, libpq winds up calling the backend's Tom> copy of pg_saslprep rather than its own? Yes, that is a thing that happens with ELF-style shared libs when you dynamically load shared libraries that weren't linked with -Bsymbolic. Is this potentially also a problem for libpqwalreceiver, which also loads libpq? -- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: > Tom> Could it be that somehow, when libpq is loaded into the backend > Tom> address space as it is here, libpq winds up calling the backend's > Tom> copy of pg_saslprep rather than its own? > Yes, that is a thing that happens with ELF-style shared libs when you > dynamically load shared libraries that weren't linked with -Bsymbolic. If true, that's potentially pretty disastrous, because *every* src/port or src/common function that's used by libpq would be similarly subject to replacement by its backend version, and every behavioral difference between the frontend and backend coding would be a source of trouble. However, the info I can find about this suggests that it's only relevant for exported functions, not ones that are hidden by means of a version-script. Since libpq doesn't export pg_saslprep, this shouldn't be happening? But wait, this problem is being reported on OpenBSD. I wonder whether their linker behaves anything like Linux's ... hm, it looks like it *is* the same linker. I tried setting up a scram-auth-based postgres_fdw connection on my RHEL6 box and forcing it through the questionable code path by issuing a dummy ALTER USER MAPPING command from another session. It worked fine, so the problem doesn't seem to be present on Linux. Repeating the same thing on a nearby OpenBSD 6.0 image ... kaboom! It fails exactly as Jeremy describes. Furthermore, I can confirm that the attached patch fixes it. I wonder which other platforms we have that behave like this? And how come we've not seen symptoms before? It's hard to believe that src/common/saslprep.c is the only thing libpq imports that has FRONTEND-vs-not-FRONTEND behavior differences. regards, tom lane diff --git a/src/Makefile.shlib b/src/Makefile.shlib index 95b82a6..1555be3 100644 --- a/src/Makefile.shlib +++ b/src/Makefile.shlib @@ -149,6 +149,11 @@ ifeq ($(PORTNAME), openbsd) ifdef soname LINK.shared += -Wl,-x,-soname,$(soname) endif + BUILD.exports = ( echo '{ global:'; $(AWK) '/^[^\#]/ {printf "%s;\n",$$1}' $<; echo ' local: *; };' ) >$@ + exports_file = $(SHLIB_EXPORTS:%.txt=%.list) + ifneq (,$(exports_file)) + LINK.shared += -Wl,--version-script=$(exports_file) + endif SHLIB_LINK += -lc else LINK.shared = $(LD) -x -Bshareable -Bforcearchive
I wrote: > I tried setting up a scram-auth-based postgres_fdw connection on > my RHEL6 box and forcing it through the questionable code path > by issuing a dummy ALTER USER MAPPING command from another session. > It worked fine, so the problem doesn't seem to be present on Linux. Oh, I meant to show the test I used, for the benefit of anyone who wants to test on other platforms. Do this (with adjustments to taste) to set up: set password_encryption = "scram-sha-256" ; create user u1 password 'foobar'; create extension postgres_fdw; create server loopback foreign data wrapper postgres_fdw options (host 'localhost', dbname 'postgres'); create user mapping for u1 server loopback options (user 'u1', password 'foobar'); grant usage on foreign server loopback to u1; \c - u1 create table remote (f1 int); insert into remote values(1),(2); create foreign table localt (f1 int) server loopback options (table_name 'remote'); At this point you should be able to do select * from localt; as u1; if not, adjust pg_hba.conf to require a password on localhost connections. Once the select works, *leave that session open*, and in another session do any no-op ALTER on the user mapping, such as alter user mapping for u1 server loopback options (set user 'u1'); Now, in the first session, repeat select * from localt; It'll dump core with a backtrace similar to what Jeremy showed, if the bug is present. regards, tom lane
On Sat, Sep 08, 2018 at 07:56:23AM -0400, Tom Lane wrote: > Looking closer, there are no leaks, though I think we could use a comment > about that. And it still makes me itch that pg_saslprep (mostly) doesn't > worry about clearing *output on failure while only three of its four > callers bother to initialize their variables to null. That's a recipe > for future bugs. So I propose the attached cleanup. Okay, that looks acceptable to me. Thanks for looking at that. -- Michael
Attachment
On Sat, Sep 08, 2018 at 03:02:00PM -0400, Tom Lane wrote: > I tried setting up a scram-auth-based postgres_fdw connection on > my RHEL6 box and forcing it through the questionable code path > by issuing a dummy ALTER USER MAPPING command from another session. > It worked fine, so the problem doesn't seem to be present on Linux. > > Repeating the same thing on a nearby OpenBSD 6.0 image ... kaboom! > It fails exactly as Jeremy describes. Furthermore, I can confirm > that the attached patch fixes it. That's not good! I have tried the report on Linux only, with pgbench -C doing many foreign scans through postgres_fdw which connects to the remote using SCRAM authentication, and this was stable. Anything newer than ec136d19 is impacted. I am surprised this has not been seen earlier to be honest. > I wonder which other platforms we have that behave like this? > And how come we've not seen symptoms before? It's hard to believe > that src/common/saslprep.c is the only thing libpq imports > that has FRONTEND-vs-not-FRONTEND behavior differences. libpq is quite particular in the fact that it compiles directly with stuff in src/common, src/port, etc. I think that the same problem exists for pg_md5_encrypt(). So I like the solution proposed. I guess you have made sure that postgres_fdw becomes stable this way? Have you tried to stress it more with pgbench -C doing only foreign scans and dozens of clients running in parallel? -- Michael
Attachment
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> However, the info I can find about this suggests that it's only Tom> relevant for exported functions, not ones that are hidden by means Tom> of a version-script. Since libpq doesn't export pg_saslprep, this Tom> shouldn't be happening? Tom> But wait, this problem is being reported on OpenBSD. I wonder Tom> whether their linker behaves anything like Linux's ... hm, it Tom> looks like it *is* the same linker. But the makefile is not doing the same thing. SHLIB_EXPORTS looks like it's being ignored on every platform except linux, macos and windows in Makefile.shlib. -- Andrew (irc:RhodiumToad)
I wrote: > I wonder which other platforms we have that behave like this? > And how come we've not seen symptoms before? It's hard to believe > that src/common/saslprep.c is the only thing libpq imports > that has FRONTEND-vs-not-FRONTEND behavior differences. After groveling through all the FRONTEND conditionals in those directories, it seems like indeed this is the first really obvious problem that would ensue from libpq invoking the backend version of some common function. There are some minor pecadilloes: * "could not determine encoding" warning in chklocale.c might get printed via ereport, not fprintf(stderr) as expected; but likely nobody would even realize that was the wrong thing, and it's a very unlikely message anyway. * getaddrinfo.c might use non-reentrant gethostbyname, but in the backend environment that's hardly a problem due to lack of threading. * thread.c might use non-reentrant strerror, getpwuid, gethostbyname, but ditto. * saslprep.c might ereport about password-too-long, which would be bad (leading to leakage of a partially-built PGconn) but it's probably not a case anybody would exercise in the field. * scram_build_verifier might palloc instead of malloc, as might unicode_norm.c. In some code paths I think that'd be OK because the caller would also be backend code and would pfree not free, but I think there are other paths where this is also a crash risk. Still, these also are issues that are new with SCRAM support. So the bottom line is that this theory does fit the observed facts. Hence we need to make sure that version-script filtering happens on every ELF-based platform, not just the most popular ones. I see that the FreeBSD and NetBSD stanzas in Makefile.shlib also have "ifdef ELF_SYSTEM" portions, so probably we need to put the version-script stuff into those too. Anybody have thoughts about other possibly-affected platforms? Andrew Gierth <andrew@tao11.riddles.org.uk> writes: >> Is this potentially also a problem for libpqwalreceiver, which also >> loads libpq? The proposed fix would make libpq OK in that context, but I'm now wondering if libpqwalreceiver itself needs a version-script. regards, tom lane
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: > Tom> But wait, this problem is being reported on OpenBSD. I wonder > Tom> whether their linker behaves anything like Linux's ... hm, it > Tom> looks like it *is* the same linker. > But the makefile is not doing the same thing. SHLIB_EXPORTS looks like > it's being ignored on every platform except linux, macos and windows in > Makefile.shlib. Exactly. See proposed patch. regards, tom lane
Michael Paquier <michael@paquier.xyz> writes: > libpq is quite particular in the fact that it compiles directly with > stuff in src/common, src/port, etc. I think that the same problem > exists for pg_md5_encrypt(). So I like the solution proposed. I guess > you have made sure that postgres_fdw becomes stable this way? It still passes its regression test with the patch, on my OpenBSD image. See also the reproducer procedure I posted. The bigger issue we need to worry about is whether applying version-script symbol filtering would cause issues for third-party code that might be relying on being able to access libpq "internal" functions on *BSD. I don't have much sympathy for that and wouldn't hesitate to apply the filtering in HEAD and v11, but maybe in v10 we ought not do that. We could possibly band-aid around the problem in v10 by changing the SCRAM support functions so that they always malloc, never palloc, whether in frontend or not. That might lead to memory leaks in the backend case, but I'm not sure how much we care about memory leaks during backend authentication failures --- the process would exit shortly anyway. I've not investigated this idea any further than thinking of it, though, and I'm not sure it's worth the trouble to avoid symbol filtering. I'm inclined to add a test in src/test/authentication that exercises this problem, in hopes of finding out from the buildfarm whether we have any other platforms with the issue. regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> Hence we need to make sure that version-script filtering happens Tom> on every ELF-based platform, not just the most popular ones. I see Tom> that the FreeBSD and NetBSD stanzas in Makefile.shlib also have Tom> "ifdef ELF_SYSTEM" portions, so probably we need to put the Tom> version-script stuff into those too. I tested adding your patch into the FreeBSD stanza and it worked for me, FWIW. That was on a system which still has GNU ld, but I also tried a test program using ld.lld and the same option and version script file works. -- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: > Tom> Hence we need to make sure that version-script filtering happens > Tom> on every ELF-based platform, not just the most popular ones. I see > Tom> that the FreeBSD and NetBSD stanzas in Makefile.shlib also have > Tom> "ifdef ELF_SYSTEM" portions, so probably we need to put the > Tom> version-script stuff into those too. > I tested adding your patch into the FreeBSD stanza and it worked for me, > FWIW. That was on a system which still has GNU ld, but I also tried a > test program using ld.lld and the same option and version script file > works. Hm. So I can reproduce the crash on NetBSD as well as OpenBSD, using the test method I showed before ... but *not* on FreeBSD 11.0 x86_64, even though that's certainly an ELF system. So now I'm confused again. regards, tom lane
I wrote: > Hm. So I can reproduce the crash on NetBSD as well as OpenBSD, using > the test method I showed before ... but *not* on FreeBSD 11.0 x86_64, > even though that's certainly an ELF system. So now I'm confused again. Further debugging shows conclusively that libpq *is* calling the backend copy of pg_saslprep, and that what it passes to free() came from palloc() in the case at hand. Apparently, the default malloc library in this version of FreeBSD is not as good at noticing bad free() arguments as the other platforms I tested. Probably, the process's malloc freelist is now corrupt and that would eventually lead to a crash, but I did not try to prove it. Adding the proposed patch to the FreeBSD stanza of Makefile.shlib does cause libpq to start calling the desired version of pg_saslprep. On the other front of developing a test case to detect this problem, I did not have much luck with mechanizing this specific test: it requires some functionality we have in the TAP tests, like setting up an installation with SCRAM password authentication enabled, as well as other functionality we have in the isolation tester, like doing things in two different sessions concurrently. But we don't really need to test this *exact* scenario; we just need something that will behave differently if libpq links to backend versions of any of the problematic functions. I'm a bit tempted to add something like this to saslprep.c: bool saslprep_is_frontend() { #ifdef FRONTEND return true; #else return false; #endif } and then have libpq test to make sure this function returns true. regards, tom lane
I wrote: > On the other front of developing a test case to detect this problem, > I did not have much luck with mechanizing this specific test: it > requires some functionality we have in the TAP tests, like setting > up an installation with SCRAM password authentication enabled, as > well as other functionality we have in the isolation tester, like > doing things in two different sessions concurrently. But we don't > really need to test this *exact* scenario; we just need something > that will behave differently if libpq links to backend versions of > any of the problematic functions. I'm a bit tempted to add > something like this to saslprep.c: > bool > saslprep_is_frontend() > { > #ifdef FRONTEND > return true; > #else > return false; > #endif > } On reflection, saslprep.c is not the right place for this: if some other shlib needs a similar test, it might be forced to import SASL support it has no need for. So I put the test function into its own new .c file, which increases the footprint of the patch a bit but seems much less ad-hoc. I verified that this patch causes a failure in the postgres_fdw regression test on affected platforms, even my FreeBSD box where the original bug doesn't cause an easily-detectable failure. I propose this for HEAD only; the main point is just to detect which platforms have the issue, and HEAD seems like enough for that. regards, tom lane diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c index cdd71a9..578af2e 100644 --- a/src/backend/bootstrap/bootstrap.c +++ b/src/backend/bootstrap/bootstrap.c @@ -24,6 +24,7 @@ #include "catalog/index.h" #include "catalog/pg_collation.h" #include "catalog/pg_type.h" +#include "common/link-canary.h" #include "libpq/pqsignal.h" #include "miscadmin.h" #include "nodes/makefuncs.h" @@ -503,6 +504,13 @@ BootstrapModeMain(void) Assert(IsBootstrapProcessingMode()); /* + * To ensure that src/common/link-canary.c is linked into the backend, we + * must call it from somewhere. Here is as good as anywhere. + */ + if (pg_link_canary_is_frontend()) + elog(ERROR, "backend is incorrectly linked to frontend functions"); + + /* * Do backend-like initialization for bootstrap mode */ InitProcess(); diff --git a/src/common/Makefile b/src/common/Makefile index 1fc2c66..6871747 100644 --- a/src/common/Makefile +++ b/src/common/Makefile @@ -41,7 +41,8 @@ override CPPFLAGS := -DFRONTEND $(CPPFLAGS) LIBS += $(PTHREAD_LIBS) OBJS_COMMON = base64.o config_info.o controldata_utils.o exec.o file_perm.o \ - ip.o keywords.o md5.o pg_lzcompress.o pgfnames.o psprintf.o relpath.o \ + ip.o keywords.o link-canary.o md5.o pg_lzcompress.o \ + pgfnames.o psprintf.o relpath.o \ rmtree.o saslprep.o scram-common.o string.o unicode_norm.o \ username.o wait_error.o diff --git a/src/common/link-canary.c b/src/common/link-canary.c new file mode 100644 index 0000000..5b4e562 --- /dev/null +++ b/src/common/link-canary.c @@ -0,0 +1,36 @@ +/*------------------------------------------------------------------------- + * link-canary.c + * Detect whether src/common functions came from frontend or backend. + * + * Copyright (c) 2018, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/common/link-canary.c + * + *------------------------------------------------------------------------- + */ +#include "c.h" + +#include "common/link-canary.h" + +/* + * This function just reports whether this file was compiled for frontend + * or backend environment. We need this because in some systems, mainly + * ELF-based platforms, it is possible for a shlib (such as libpq) loaded + * into the backend to call a backend function named XYZ in preference to + * the shlib's own function XYZ. That's bad if the two functions don't + * act identically. This exact situation comes up for many functions in + * src/common and src/port, where the same function names exist in both + * libpq and the backend but they don't act quite identically. To verify + * that appropriate measures have been taken to prevent incorrect symbol + * resolution, libpq should test that this function returns true. + */ +bool +pg_link_canary_is_frontend(void) +{ +#ifdef FRONTEND + return true; +#else + return false; +#endif +} diff --git a/src/include/common/link-canary.h b/src/include/common/link-canary.h new file mode 100644 index 0000000..917faae --- /dev/null +++ b/src/include/common/link-canary.h @@ -0,0 +1,17 @@ +/*------------------------------------------------------------------------- + * + * link-canary.h + * Detect whether src/common functions came from frontend or backend. + * + * Copyright (c) 2018, PostgreSQL Global Development Group + * + * src/include/common/link-canary.h + * + *------------------------------------------------------------------------- + */ +#ifndef LINK_CANARY_H +#define LINK_CANARY_H + +extern bool pg_link_canary_is_frontend(void); + +#endif /* LINK_CANARY_H */ diff --git a/src/interfaces/libpq/.gitignore b/src/interfaces/libpq/.gitignore index 5c232ae..ce1576e 100644 --- a/src/interfaces/libpq/.gitignore +++ b/src/interfaces/libpq/.gitignore @@ -25,6 +25,7 @@ /ip.c /md5.c /base64.c +/link-canary.c /scram-common.c /sha2.c /sha2_openssl.c diff --git a/src/interfaces/libpq/Makefile b/src/interfaces/libpq/Makefile index abe0a50..ec0122a 100644 --- a/src/interfaces/libpq/Makefile +++ b/src/interfaces/libpq/Makefile @@ -49,7 +49,7 @@ endif # src/backend/utils/mb OBJS += encnames.o wchar.o # src/common -OBJS += base64.o ip.o md5.o scram-common.o saslprep.o unicode_norm.o +OBJS += base64.o ip.o link-canary.o md5.o scram-common.o saslprep.o unicode_norm.o ifeq ($(with_openssl),yes) OBJS += fe-secure-openssl.o fe-secure-common.o sha2_openssl.o @@ -106,7 +106,7 @@ backend_src = $(top_srcdir)/src/backend chklocale.c crypt.c erand48.c getaddrinfo.c getpeereid.c inet_aton.c inet_net_ntop.c noblock.c open.c system.c pgsleep.cpg_strong_random.c pgstrcasecmp.c pqsignal.c snprintf.c strerror.c strlcpy.c strnlen.c thread.c win32error.c win32setlocale.c:% : $(top_srcdir)/src/port/% rm -f $@ && $(LN_S) $< . -ip.c md5.c base64.c scram-common.c sha2.c sha2_openssl.c saslprep.c unicode_norm.c: % : $(top_srcdir)/src/common/% +ip.c md5.c base64.c link-canary.c scram-common.c sha2.c sha2_openssl.c saslprep.c unicode_norm.c: % : $(top_srcdir)/src/common/% rm -f $@ && $(LN_S) $< . encnames.c wchar.c: % : $(backend_src)/utils/mb/% @@ -156,7 +156,7 @@ clean distclean: clean-lib rm -f pg_config_paths.h # Remove files we (may have) symlinked in from src/port and other places rm -f chklocale.c crypt.c erand48.c getaddrinfo.c getpeereid.c inet_aton.c inet_net_ntop.c noblock.c open.c system.cpgsleep.c pg_strong_random.c pgstrcasecmp.c pqsignal.c snprintf.c strerror.c strlcpy.c strnlen.c thread.c win32error.cwin32setlocale.c - rm -f ip.c md5.c base64.c scram-common.c sha2.c sha2_openssl.c saslprep.c unicode_norm.c + rm -f ip.c md5.c base64.c link-canary.c scram-common.c sha2.c sha2_openssl.c saslprep.c unicode_norm.c rm -f encnames.c wchar.c maintainer-clean: distclean maintainer-clean-lib diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index db5aacd..42cdb97 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -71,6 +71,7 @@ static int ldapServiceLookup(const char *purl, PQconninfoOption *options, #endif #include "common/ip.h" +#include "common/link-canary.h" #include "common/scram-common.h" #include "mb/pg_wchar.h" #include "port/pg_bswap.h" @@ -1748,6 +1749,19 @@ connectDBStart(PGconn *conn) if (!conn->options_valid) goto connect_errReturn; + /* + * Check for bad linking to backend-internal versions of src/common + * functions (see comments in link-canary.c for the reason we need this). + * Nobody but developers should see this message, so we don't bother + * translating it. + */ + if (!pg_link_canary_is_frontend()) + { + printfPQExpBuffer(&conn->errorMessage, + "libpq is incorrectly linked to backend functions\n"); + goto connect_errReturn; + } + /* Ensure our buffers are empty */ conn->inStart = conn->inCursor = conn->inEnd = 0; conn->outCount = 0; diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm index 14d2a3c..9ce03ce 100644 --- a/src/tools/msvc/Mkvcbuild.pm +++ b/src/tools/msvc/Mkvcbuild.pm @@ -116,7 +116,8 @@ sub mkvcbuild our @pgcommonallfiles = qw( base64.c config_info.c controldata_utils.c exec.c file_perm.c ip.c - keywords.c md5.c pg_lzcompress.c pgfnames.c psprintf.c relpath.c rmtree.c + keywords.c link-canary.c md5.c + pg_lzcompress.c pgfnames.c psprintf.c relpath.c rmtree.c saslprep.c scram-common.c string.c unicode_norm.c username.c wait_error.c);
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > Is this potentially also a problem for libpqwalreceiver, which also > loads libpq? As far as I can see, only libpq and the ECPG libraries build their own copies of any src/port or src/common files, and all of them already have version scripts. If libpqwalreceiver tried to incorporate any such files and compile them under FRONTEND rules, it'd need a version script as well. But I think that code is probably content to operate under backend rules, so it'll just use the functions from the core backend and be happy. Hence, no clear need for a version script there. I think the patches I committed today are sufficient to close this bug, unless further testing reveals that the problem occurs on additional platforms. The buildfarm results so far don't suggest that. regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> I think the patches I committed today are sufficient to close this Tom> bug, unless further testing reveals that the problem occurs on Tom> additional platforms. The buildfarm results so far don't suggest Tom> that. It seems to have failed on gaur (HP-UX 10.20 gcc 3.4.6 HPPA) too? or do we not care about that? No reports yet from any of the solaris derivatives (which will certainly break) or from AIX (about which I wouldn't care to guess). -- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: > Tom> I think the patches I committed today are sufficient to close this > Tom> bug, unless further testing reveals that the problem occurs on > Tom> additional platforms. The buildfarm results so far don't suggest > Tom> that. > It seems to have failed on gaur (HP-UX 10.20 gcc 3.4.6 HPPA) too? Yeah, that came in right after I sent this :-( > or do we not care about that? Well, I care, but if I can't fix it then we'll just have to drop this animal. I've not studied the docs yet to see if there's a solution. > No reports yet from any of the solaris derivatives (which will certainly > break) or from AIX (about which I wouldn't care to guess). I'm waiting for those too. regards, tom lane
I wrote: > Andrew Gierth <andrew@tao11.riddles.org.uk> writes: >> It seems to have failed on gaur (HP-UX 10.20 gcc 3.4.6 HPPA) too? > Yeah, that came in right after I sent this :-( >> or do we not care about that? > Well, I care, but if I can't fix it then we'll just have to drop > this animal. I've not studied the docs yet to see if there's a > solution. Ugh ... anole just reported in with the same issue, which means it exists on HP-UX versions that people might still care about in practice. I still haven't looked for a fix (real life has intruded more than usual right at the moment). regards, tom lane
I wrote: > Ugh ... anole just reported in with the same issue, which means it > exists on HP-UX versions that people might still care about in > practice. > I still haven't looked for a fix (real life has intruded more than > usual right at the moment). AFAICT, this platform's "ld" doesn't have any simple equivalent of a version script, but it does have "-B symbolic", which appears to fix the problem in a quick test. I'm running a more thorough test now. The fix I'm able to test looks like ifeq ($(PORTNAME), hpux) ifdef SO_MAJOR_VERSION shlib = lib$(NAME)$(DLSUFFIX).$(SO_MAJOR_VERSION) endif ifeq ($(with_gnu_ld), yes) LINK.shared = $(CC) -shared ifdef soname LINK.shared += -Wl,-h -Wl,$(soname) endif else - LINK.shared = $(LD) -b + LINK.shared = $(LD) -b -B symbolic ifdef soname LINK.shared += +h $(soname) endif I'm not sure what to do in the with_gnu_ld subsection: building with GNU ld might or might not have the problem, and if it does, the best fix might or might not be "-B symbolic". But none of the HPUX animals in the buildfarm are using GNU ld, so I suspect that that subsection is dead code. I'm inclined to just add a comment saying that that code has not been tested recently. regards, tom lane
I wrote: > AFAICT, this platform's "ld" doesn't have any simple equivalent of > a version script, but it does have "-B symbolic", which appears to > fix the problem in a quick test. BTW, I discovered while rummaging through the git history that we used to use -Bsymbolic on most platforms --- but only for ODBC, which had exactly the problem we're fighting now of an intended-to- be-local reference being resolved outside the library, but in a place where the malfunction was very obvious. When we removed ODBC from the core distro, the Makefile support for -Bsymbolic eventually went away too. But back around 7.2 we had this: $ grep -r shlib_symbolic /home/postgres/REL7_2/pgsql /home/postgres/REL7_2/pgsql/src/interfaces/odbc/GNUmakefile: 54: LINK.shared += $(shlib_symbolic) /home/postgres/REL7_2/pgsql/src/makefiles/Makefile.bsdi: 19: shlib_symbolic = -Wl,-Bsymbolic /home/postgres/REL7_2/pgsql/src/makefiles/Makefile.freebsd: 6: shlib_symbolic = -Wl,-Bsymbolic -lc /home/postgres/REL7_2/pgsql/src/makefiles/Makefile.hpux: 33: shlib_symbolic = -Bsymbolic /home/postgres/REL7_2/pgsql/src/makefiles/Makefile.irix5: 4: shlib_symbolic = -Wl,-B,symbolic /home/postgres/REL7_2/pgsql/src/makefiles/Makefile.linux: 4: shlib_symbolic = -Wl,-Bsymbolic /home/postgres/REL7_2/pgsql/src/makefiles/Makefile.netbsd: 6: shlib_symbolic = -Wl,-Bsymbolic -lc /home/postgres/REL7_2/pgsql/src/makefiles/Makefile.openbsd: 6: shlib_symbolic = -Wl,-Bsymbolic /home/postgres/REL7_2/pgsql/src/makefiles/Makefile.sco: 3: shlib_symbolic = -Wl,-Bsymbolic /home/postgres/REL7_2/pgsql/src/makefiles/Makefile.solaris: 11: shlib_symbolic = -Wl,-Bsymbolic /home/postgres/REL7_2/pgsql/src/makefiles/Makefile.unixware: 6: shlib_symbolic = -Wl,-Bsymbolic This suggests that we're gonna need -Bsymbolic on Solaris too, which agrees with your remark upthread. Unfortunately, with all the Solaris buildfarm members being hors de C99 right now, there's no easy way to test. I'm inclined to go fix the problem that's blocking castoroides, at least. regards, tom lane
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> This suggests that we're gonna need -Bsymbolic on Solaris too, Tom> which agrees with your remark upthread. Unfortunately, with all Tom> the Solaris buildfarm members being hors de C99 right now, there's Tom> no easy way to test. damselfly and rover_firefly are running derivatives of OpenSolaris, and they failed as expected. Using a version script should be possible on Solaris and derivatives (I believe it was Solaris that introduced the concept and they were copied by GNU ld, not the reverse) but I don't know what the linker options are. -- Andrew (irc:RhodiumToad)
Andrew Gierth <andrew@tao11.riddles.org.uk> writes: > "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: > Tom> This suggests that we're gonna need -Bsymbolic on Solaris too, > Tom> which agrees with your remark upthread. Unfortunately, with all > Tom> the Solaris buildfarm members being hors de C99 right now, there's > Tom> no easy way to test. > damselfly and rover_firefly are running derivatives of OpenSolaris, and > they failed as expected. Ah, so I see. > Using a version script should be possible on Solaris and derivatives (I > believe it was Solaris that introduced the concept and they were copied > by GNU ld, not the reverse) but I don't know what the linker options > are. Me neither. If somebody using one of those platforms wants to offer a better solution, that's fine, but for today let's go with -Bsymbolic. I pushed a patch accordingly. regards, tom lane