Thread: Interesting fail when migrating Pg from Ubuntu Bionic to Focal
Hi, I know it's going to be most likely due to glibc and locales, but I found interesting case that I can't figure out how to fix. We have pg 12.6 on bionic. Works. Added focal replica (binary). Replicates OK, but then fails when I try to pg_dump -s. Error is: pg_dump: error: query failed: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. pg_dump: error: query was: SELECT p.tableoid, p.oid, p.proname, p.prolang, p.pronargs, p.proargtypes, p.prorettype, (SELECTpg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl, row_n FROM pg_catalog.unnest(coalesce(p.proacl,pg_catalog.acldefault('f',p.proowner)))WITH ORDINALITY AS perm(acl,row_n) WHERE NOT EXISTS( SELECT 1 FROM pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault('f',p.proowner))) AS init(init_acl)WHERE acl = init_acl)) as foo) AS proacl, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl,row_n FROM pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault('f',p.proowner))) WITH ORDINALITY AS initp(acl,row_n)WHERE NOT EXISTS ( SELECT 1 FROM pg_catalog.unnest(coalesce(p.proacl,pg_catalog.acldefault('f',p.proowner)))AS permp(orig_acl) WHERE acl = orig_acl)) as foo)AS rproacl, NULL AS initproacl, NULL AS initrproacl, p.pronamespace, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid= p.proowner) AS rolname FROM pg_proc p LEFT JOIN pg_init_privs pip ON (p.oid = pip.objoid AND pip.classoid = 'pg_proc'::regclassAND pip.objsubid = 0) WHERE p.prokind <> 'a' AND NOT EXISTS (SELECT 1 FROM pg_depend WHERE classid = 'pg_proc'::regclass AND objid = p.oid AND deptype = 'i') AND ( pronamespace != (SELECT oid FROM pg_namespace WHERE nspname = 'pg_catalog') OR EXISTS (SELECT 1 FROM pg_cast WHERE pg_cast.oid > 16383 AND p.oid = pg_cast.castfunc) OR EXISTS (SELECT 1 FROM pg_transform WHERE pg_transform.oid > 16383 AND (p.oid = pg_transform.trffromsql OR p.oid = pg_transform.trftosql)) OR p.proacl IS DISTINCT FROM pip.initprivs) Based on https://wiki.postgresql.org/wiki/Locale_data_changes I wrote: SELECT 'reindex index ' || indexrelid::regclass::text FROM (SELECT indexrelid, indrelid, indcollation[i] coll FROM pg_index, generate_subscripts(indcollation, 1) g(i)) s JOIN pg_collation c ON coll=c.oid WHERE collprovider IN ('d', 'c') AND collname NOT IN ('C', 'POSIX') \gexec and let it run (80k+ indexes). Afterwards - same problem. So I did "reindex system". And the query still fails. I tried running simple "select * from table", for each of: - pg_catalog.pg_roles - pg_catalog.pg_proc - pg_catalog.pg_init_privs - pg_catalog.pg_depend - pg_catalog.pg_namespace - pg_catalog.pg_cast - pg_catalog.pg_transform and it worked, so I'm kinda at loss here. I have test system, can test anything. Any idea on what could be the reason, and if the system is fixable afterwards? Ah, one more thing - straced backend when it was doing its thing. Last 50 lines: 15:01:31.229198 futex(0xfffe49b8cb3c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229233 futex(0xfffe49b8c41c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229268 futex(0xfffe49b8cdfc, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229303 futex(0xfffe49b8ce00, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229338 futex(0xfffe49b8ce08, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229379 futex(0xfffe49b8dde4, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229415 futex(0xfffe49b91524, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229450 futex(0xfffe49b8998c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229485 futex(0xfffe49b8cf9c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229521 futex(0xfffe49b882e4, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229556 futex(0xfffe49b8e548, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229596 futex(0xfffe49b8d75c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229634 futex(0xfffe49b8cfe8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229670 futex(0xfffe49b8dfe8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229705 futex(0xfffe49b8dff0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229740 futex(0xfffe49b8dff8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229775 futex(0xfffe49b8ddf0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229810 futex(0xfffe49b8e0a8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229845 futex(0xfffe49b8e200, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229881 futex(0xfffe49b8e534, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229925 futex(0xfffe49b8e734, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229961 futex(0xfffe49b8e7dc, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.229998 futex(0xfffe49b8e924, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230034 futex(0xfffe49b8ea6c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230070 futex(0xfffe49b8ea74, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230105 futex(0xfffe49b91514, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230141 futex(0xfffe49b8ad84, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230177 futex(0xfffe49b91518, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230212 futex(0xfffe49b9152c, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230247 futex(0xfffe49b91530, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230283 futex(0xfffe49b91718, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230323 futex(0xfffe49b91720, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230384 futex(0xfffe49b8e488, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000013> 15:01:31.230423 futex(0xfffe49b946cc, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230464 futex(0xfffe49b884d4, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230515 futex(0xfffe49b955c0, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.230624 futex(0xfffe49ba8738, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000013> 15:01:31.230678 futex(0xfffe49b91288, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000012> 15:01:31.231031 brk(0xaaaae513e000) = 0xaaaae513e000 <0.000016> 15:01:31.232480 brk(0xaaaae515f000) = 0xaaaae515f000 <0.000018> 15:01:31.239133 brk(0xaaaae5180000) = 0xaaaae5180000 <0.000019> 15:01:31.245463 brk(0xaaaae51a1000) = 0xaaaae51a1000 <0.000020> 15:01:31.245847 brk(0xaaaae51c2000) = 0xaaaae51c2000 <0.000016> 15:01:31.259644 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xffff80da8000 <0.000025> 15:01:31.259705 mmap(NULL, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xfffe4a36a000 <0.000014> 15:01:31.260012 futex(0xaaaae4f7fcf8, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000014> 15:01:31.260074 mprotect(0xfffe4a36a000, 36864, PROT_READ|PROT_EXEC) = 0 <0.000017> 15:01:31.260120 mprotect(0xffff80da8000, 4096, PROT_READ|PROT_EXEC) = 0 <0.000013> 15:01:31.260165 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xfffd80da80a8} --- 15:01:31.451341 +++ killed by SIGSEGV (core dumped) +++ depesz
Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal
From
hubert depesz lubaczewski
Date:
On Thu, Mar 03, 2022 at 04:04:28PM +0100, hubert depesz lubaczewski wrote: > and it worked, so I'm kinda at loss here. based on some talk on IRC, I was able to get stack trace from fail: (gdb) bt #0 0x0000fffe4a36e4d8 in ?? () #1 0x0000aaaabe03ffb8 in ExecProcNode (node=0xaaaae4f87cf8) at ./build/../src/include/executor/executor.h:242 #2 ExecMergeJoin (pstate=0xaaaae4f87da8) at ./build/../src/backend/executor/nodeMergejoin.c:656 #3 0x0000aaaabe03ffb8 in ExecProcNode (node=0xaaaae4f87ae8) at ./build/../src/include/executor/executor.h:242 #4 ExecMergeJoin (pstate=0xaaaae4f876c8) at ./build/../src/backend/executor/nodeMergejoin.c:656 #5 0x0000aaaabe039b1c in ExecProcNode (node=0xaaaae4f876c8) at ./build/../src/include/executor/executor.h:242 #6 ExecHashJoinImpl (parallel=false, pstate=0xaaaae4f87408) at ./build/../src/backend/executor/nodeHashjoin.c:262 #7 ExecHashJoin (pstate=0xaaaae4f87408) at ./build/../src/backend/executor/nodeHashjoin.c:591 #8 0x0000aaaabe01ed5c in ExecProcNode (node=0xaaaae4f87408) at ./build/../src/include/executor/executor.h:242 #9 ExecutePlan (execute_once=<optimized out>, dest=0xfffe49be0828, direction=<optimized out>, numberTuples=0, sendTuples=<optimizedout>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0xaaaae4f87408, estate=0xaaaae4ce4180) at ./build/../src/backend/executor/execMain.c:1632 #10 standard_ExecutorRun (queryDesc=0xaaaae4e9da70, direction=<optimized out>, count=0, execute_once=<optimized out>) at./build/../src/backend/executor/execMain.c:350 #11 0x0000ffff7d1981fc in pgss_ExecutorRun (queryDesc=0xaaaae4e9da70, direction=ForwardScanDirection, count=0, execute_once=true)at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:893 #12 0x0000ffff7d182688 in explain_ExecutorRun (queryDesc=0xaaaae4e9da70, direction=ForwardScanDirection, count=0, execute_once=true)at ./build/../contrib/auto_explain/auto_explain.c:320 #13 0x0000aaaabe1754c8 in PortalRunSelect (portal=portal@entry=0xaaaae4d476e0, forward=forward@entry=true, count=0, count@entry=9223372036854775807,dest=dest@entry=0xfffe49be0828) at ./build/../src/backend/tcop/pquery.c:938 #14 0x0000aaaabe176a64 in PortalRun (portal=portal@entry=0xaaaae4d476e0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,run_once=run_once@entry=true, dest=dest@entry=0xfffe49be0828, altdest=altdest@entry=0xfffe49be0828, completionTag=completionTag@entry=0xffffc89001f8 "") at ./build/../src/backend/tcop/pquery.c:779 #15 0x0000aaaabe172a18 in exec_simple_query ( query_string=query_string@entry=0xaaaae4c6c7b0 "SELECT\n p.tableoid,\n p.oid,\n p.proname,\n p.prolang,\n p.pronargs,\n p.proargtypes,\n p.prorettype,\n (\n SELECT\n", ' ' <repeats 12 times>, "pg_catalog.array_agg(aclORDER BY row_n)\n F"...) at ./build/../src/backend/tcop/postgres.c:1215 #16 0x0000aaaabe1736ac in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaae4cc9018, dbname=<optimized out>, username=<optimizedout>) at ./build/../src/backend/tcop/postgres.c:4271 #17 0x0000aaaabe0fa768 in BackendRun (port=0xaaaae4cb88e0, port=0xaaaae4cb88e0) at ./build/../src/backend/postmaster/postmaster.c:4510 #18 BackendStartup (port=0xaaaae4cb88e0) at ./build/../src/backend/postmaster/postmaster.c:4193 #19 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725 #20 0x0000aaaabe0fb74c in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1398 #21 0x0000aaaabde8c8a8 in main (argc=5, argv=0xaaaae4c65ea0) at ./build/../src/backend/main/main.c:228 And one more thing - if it matters - it's on aarch64 architecture (graviton ec2 in AWS) Best regards, depesz
Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal
From
hubert depesz lubaczewski
Date:
On Thu, Mar 03, 2022 at 04:11:56PM +0100, hubert depesz lubaczewski wrote: > On Thu, Mar 03, 2022 at 04:04:28PM +0100, hubert depesz lubaczewski wrote: > > and it worked, so I'm kinda at loss here. > > based on some talk on IRC, I was able to get stack trace from fail: Based on the stack trace I was able to get it to break using simple query: select p.proname, (SELECT rolname from pg_catalog.pg_roles where oid = p.proowner) from pg_proc p; I took a simple look at ranges of oid/prowner, and they look fine: =# select min(proowner), max(proowner), count(*) from pg_proc; min │ max │ count ─────┼─────┼─────── 10 │ 10 │ 2970 (1 row) 16:38:34 db: postgres@postgres, pid:1991057 =# select min(oid), max(oid), count(*) from pg_roles; min │ max │ count ─────┼───────────┼─────── 10 │ 310235824 │ 244 (1 row) Also, as I didn't mention it before: it's Pg 12.9. Best regards, depesz
Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal
From
hubert depesz lubaczewski
Date:
On Thu, Mar 03, 2022 at 05:39:21PM +0100, hubert depesz lubaczewski wrote: > On Thu, Mar 03, 2022 at 04:11:56PM +0100, hubert depesz lubaczewski wrote: > > On Thu, Mar 03, 2022 at 04:04:28PM +0100, hubert depesz lubaczewski wrote: > > > and it worked, so I'm kinda at loss here. > > > > based on some talk on IRC, I was able to get stack trace from fail: > > Based on the stack trace I was able to get it to break using simple > query: > select p.proname, (SELECT rolname from pg_catalog.pg_roles where oid = p.proowner) from pg_proc p; > > I took a simple look at ranges of oid/prowner, and they look fine: > > =# select min(proowner), max(proowner), count(*) from pg_proc; > min │ max │ count > ─────┼─────┼─────── > 10 │ 10 │ 2970 > (1 row) > > 16:38:34 db: postgres@postgres, pid:1991057 > =# select min(oid), max(oid), count(*) from pg_roles; > min │ max │ count > ─────┼───────────┼─────── > 10 │ 310235824 │ 244 > (1 row) > > > Also, as I didn't mention it before: it's Pg 12.9. OK. Traced it back to JIT. With JIT enabled: =# show jit; jit ───── on (1 row) =# explain select p.proname, (SELECT rolname from pg_catalog.pg_roles where oid = p.proowner) from pg_proc p; QUERY PLAN ────────────────────────────────────────────────────────────────────────────────────────────── Seq Scan on pg_proc p (cost=0.00..156507.84 rows=63264 width=128) SubPlan 1 -> Index Scan using pg_authid_oid_index on pg_authid (cost=0.14..2.36 rows=1 width=64) Index Cond: (oid = p.proowner) JIT: Functions: 8 Options: Inlining false, Optimization false, Expressions true, Deforming true (7 rows) =# select p.proname, (SELECT rolname from pg_catalog.pg_roles where oid = p.proowner) from pg_proc p; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. 16:42:44 canvas test, cluster 999, standalone db: @, pid:2014255 ?!> But when I disable jit - query works fine. versions of things that I think are relevant: =$ dpkg -l | grep -E 'llvm|clang|gcc|glibc' ii gcc 4:9.3.0-1ubuntu2 arm64 GNU C compiler ii gcc-10-base:arm64 10.3.0-1ubuntu1~20.04 arm64 GCC, the GNU Compiler Collection(base package) ii gcc-9 9.3.0-17ubuntu1~20.04 arm64 GNU C compiler ii gcc-9-base:arm64 9.3.0-17ubuntu1~20.04 arm64 GCC, the GNU Compiler Collection(base package) ii libgcc-9-dev:arm64 9.3.0-17ubuntu1~20.04 arm64 GCC support library (developmentfiles) ii libgcc-s1:arm64 10.3.0-1ubuntu1~20.04 arm64 GCC support library ii libllvm9:arm64 1:9.0.1-12 arm64 Modular compiler and toolchaintechnologies, runtime library Best regards, depesz
hubert depesz lubaczewski <depesz@depesz.com> writes: > OK. Traced it back to JIT. With JIT enabled: Hah, that's useful info. Seems like it must be incorrect code generated by JIT. > versions of things that I think are relevant: > =$ dpkg -l | grep -E 'llvm|clang|gcc|glibc' > ii gcc 4:9.3.0-1ubuntu2 arm64 GNU C compiler > ii gcc-10-base:arm64 10.3.0-1ubuntu1~20.04 arm64 GCC, the GNU Compiler Collection(base package) > ii gcc-9 9.3.0-17ubuntu1~20.04 arm64 GNU C compiler > ii gcc-9-base:arm64 9.3.0-17ubuntu1~20.04 arm64 GCC, the GNU Compiler Collection(base package) > ii libgcc-9-dev:arm64 9.3.0-17ubuntu1~20.04 arm64 GCC support library (developmentfiles) > ii libgcc-s1:arm64 10.3.0-1ubuntu1~20.04 arm64 GCC support library > ii libllvm9:arm64 1:9.0.1-12 arm64 Modular compiler and toolchaintechnologies, runtime library arm64, eh? I wonder if that's buggier than the Intel code paths. I tried and failed to reproduce this on Fedora 35 on aarch64, but that has what I think is a newer LLVM version: clang-13.0.0-3.fc35.aarch64 clang-libs-13.0.0-3.fc35.aarch64 clang-resource-filesystem-13.0.0-3.fc35.aarch64 gcc-11.2.1-9.fc35.aarch64 gcc-c++-11.2.1-9.fc35.aarch64 llvm-13.0.0-4.fc35.aarch64 llvm-devel-13.0.0-4.fc35.aarch64 llvm-libs-13.0.0-4.fc35.aarch64 llvm-static-13.0.0-4.fc35.aarch64 llvm-test-13.0.0-4.fc35.aarch64 Don't think I can readily install anything as old as LLVM 9 ... regards, tom lane
Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal
From
hubert depesz lubaczewski
Date:
On Fri, Mar 04, 2022 at 02:09:52PM -0500, Tom Lane wrote: > arm64, eh? I wonder if that's buggier than the Intel code paths. > > I tried and failed to reproduce this on Fedora 35 on aarch64, > but that has what I think is a newer LLVM version: I have suspicion that it also kinda depends on number of rows in there. When I deleted some schemas, which included some functions, the problem disappeared. I wasn't able to pinpoint specific thing, though, and when I called the pg_proc + pg_roles query for each separate row - it worked flawlessly. Best regards, depesz
hubert depesz lubaczewski <depesz@depesz.com> writes: > On Fri, Mar 04, 2022 at 02:09:52PM -0500, Tom Lane wrote: >> I tried and failed to reproduce this on Fedora 35 on aarch64, >> but that has what I think is a newer LLVM version: > I have suspicion that it also kinda depends on number of rows in there. > When I deleted some schemas, which included some functions, the problem > disappeared. > I wasn't able to pinpoint specific thing, though, and when I called the > pg_proc + pg_roles query for each separate row - it worked flawlessly. Mmm ... it might have just been that the planner chose not to use JIT when it thought there were fewer rows involved. Did you check with EXPLAIN that these cut-down cases still used JIT? regards, tom lane
On 3/4/22 17:03, Tom Lane wrote:
Mmm ... it might have just been that the planner chose not to use JIT when it thought there were fewer rows involved. Did you check with EXPLAIN that these cut-down cases still used JIT?
This is interesting and informative answer. How do I check whether JIT is used in the explain plan? Can you give me an example?
TIA
-- Mladen Gogala Database Consultant Tel: (347) 321-1217 https://dbwhisperer.wordpress.com
Mladen Gogala <gogala.mladen@gmail.com> writes: > On 3/4/22 17:03, Tom Lane wrote: >> Mmm ... it might have just been that the planner chose not to use >> JIT when it thought there were fewer rows involved. Did you check >> with EXPLAIN that these cut-down cases still used JIT? > This is interesting and informative answer. How do I check whether JIT > is used in the explain plan? Can you give me an example? If EXPLAIN prints some stuff about JIT, then JIT is going to be used, otherwise not. Here's an example from my (entirely unsuccessful) attempts to duplicate depesz's problem: =# explain select p.proname, (SELECT rolname from pg_catalog.pg_roles where oid = p.proowner) from pg_proc p; QUERY PLAN ---------------------------------------------------------------------------------------------- Seq Scan on pg_proc p (cost=0.00..393587.22 rows=47243 width=128) SubPlan 1 -> Index Scan using pg_authid_oid_index on pg_authid (cost=0.28..8.30 rows=1 width=64) Index Cond: (oid = p.proowner) JIT: Functions: 8 Options: Inlining false, Optimization false, Expressions true, Deforming true (7 rows) regards, tom lane
Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal
From
hubert depesz lubaczewski
Date:
On Fri, Mar 04, 2022 at 05:03:14PM -0500, Tom Lane wrote: > hubert depesz lubaczewski <depesz@depesz.com> writes: > > On Fri, Mar 04, 2022 at 02:09:52PM -0500, Tom Lane wrote: > >> I tried and failed to reproduce this on Fedora 35 on aarch64, > >> but that has what I think is a newer LLVM version: > > > I have suspicion that it also kinda depends on number of rows in there. > > When I deleted some schemas, which included some functions, the problem > > disappeared. > > I wasn't able to pinpoint specific thing, though, and when I called the > > pg_proc + pg_roles query for each separate row - it worked flawlessly. > > Mmm ... it might have just been that the planner chose not to use > JIT when it thought there were fewer rows involved. Did you check > with EXPLAIN that these cut-down cases still used JIT? Hi, I tore these boxes down, so can't check immediately, but I think I remember that you're right - single-row queries didn't use JIT. Best regards, depesz
hubert depesz lubaczewski <depesz@depesz.com> writes: > On Fri, Mar 04, 2022 at 05:03:14PM -0500, Tom Lane wrote: >> Mmm ... it might have just been that the planner chose not to use >> JIT when it thought there were fewer rows involved. Did you check >> with EXPLAIN that these cut-down cases still used JIT? > I tore these boxes down, so can't check immediately, but I think > I remember that you're right - single-row queries didn't use JIT. FWIW, I went to the trouble of installing Ubuntu Focal on my raspberry pi to see if I could duplicate this, and I couldn't. However, what you get from a fresh install now is $ dpkg -l | grep libllvm ii libllvm10:arm64 1:10.0.0-4ubuntu1 arm64 Modular compiler and toolchain technologies,runtime library not 9.0.1. I also found that Fedora 31/aarch64 is still downloadable, and that does contain LLVM 9 ... and I could not reproduce it there either. So I've run out of things to try. It still seems like a bug in a specific point release of LLVM is a possible explanation, especially given that Debian went so far as to replace that release in an LTS distro. (The gcc-related packages are all exactly the same versions as you show.) I'd believe that data corruption could be a contributing factor too, except that you were able to read all the rows without JIT. regards, tom lane
Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal
From
hubert depesz lubaczewski
Date:
On Sun, Mar 06, 2022 at 11:10:00AM -0500, Tom Lane wrote: > > I tore these boxes down, so can't check immediately, but I think > > I remember that you're right - single-row queries didn't use JIT. Got focal box up. Loaded schema for Pg. Initially select didn't break anything, but when I tuned down jit_above_cost so that it will kick in - got fails immediately. > FWIW, I went to the trouble of installing Ubuntu Focal on my > raspberry pi to see if I could duplicate this, and I couldn't. > However, what you get from a fresh install now is > $ dpkg -l | grep libllvm > ii libllvm10:arm64 1:10.0.0-4ubuntu1 arm64 Modular compiler and toolchain technologies,runtime library > not 9.0.1. I also found that Fedora 31/aarch64 is still downloadable, and > that does contain LLVM 9 ... and I could not reproduce it there either. Soo... plot thickens. Looks that pg 12 supplied by pgdg required libllvm9: =$ apt-cache show postgresql-12 | grep -E '^(Package|Version|Depends):' Package: postgresql-12 Version: 12.9-2.pgdg20.04+1 Depends: ..., libllvm9 (>= 1:9~svn298832-1~), ... Package: postgresql-12 Version: 12.9-0ubuntu0.20.04.1 Depends: ..., libllvm10 (>= 1:9~svn298832-1~), ... Package: postgresql-12 Version: 12.2-4 Depends: ..., libllvm10 (>= 1:9~svn298832-1~), ... Newer pg12 (12.10) from pgdg also depends on llvm9. Perhaps changing the deb to use/depend-on llvm10 would solve the problem, for now I'm not sure how to do it. Reached to Christoph about it. Best regards, depesz
hubert depesz lubaczewski <depesz@depesz.com> writes: > Got focal box up. Loaded schema for Pg. > Initially select didn't break anything, but when I tuned down > jit_above_cost so that it will kick in - got fails immediately. Hmph. I tried two more builds: * downgraded llvm to llvm-9, compiled PG 12 from source * installed postgresql-12.10-1.pgdg20.04+1 from pgdg repo Neither of those configurations fail for me, so either it's been fixed since 12.9, or (more likely) there is something to your test case beyond what you've mentioned. (I guess a long-shot possibility is that my raspberry pi 3B+ is sufficiently different hardware from your box as to not see the problem. Doubt that though.) Can you create a self-contained test case? regards, tom lane
Hi, On 2022-03-07 16:11:28 +0100, hubert depesz lubaczewski wrote: > On Sun, Mar 06, 2022 at 11:10:00AM -0500, Tom Lane wrote: > > > I tore these boxes down, so can't check immediately, but I think > > > I remember that you're right - single-row queries didn't use JIT. > > Got focal box up. Loaded schema for Pg. > > Initially select didn't break anything, but when I tuned down > jit_above_cost so that it will kick in - got fails immediately. Could you set jit_debugging_support=on and show a backtrace with that? > > FWIW, I went to the trouble of installing Ubuntu Focal on my > > raspberry pi to see if I could duplicate this, and I couldn't. > > However, what you get from a fresh install now is > > $ dpkg -l | grep libllvm > > ii libllvm10:arm64 1:10.0.0-4ubuntu1 arm64 Modular compiler and toolchaintechnologies, runtime library > > not 9.0.1. I also found that Fedora 31/aarch64 is still downloadable, and > > that does contain LLVM 9 ... and I could not reproduce it there either. > > Soo... plot thickens. > > Looks that pg 12 supplied by pgdg required libllvm9: What are the libllvm package versions on which you can reproduce the crash? > Newer pg12 (12.10) from pgdg also depends on llvm9. Perhaps changing the deb to > use/depend-on llvm10 would solve the problem, for now I'm not sure how to do > it. Reached to Christoph about it. It'd require rebuilding. It's possible that the problem is one of the patches applied by ubuntu to llvm and that's why others can't reproduce so far. There's quite a few... Greetings, Andres Freund
Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal
From
hubert depesz lubaczewski
Date:
On Mon, Mar 07, 2022 at 12:22:26PM -0500, Tom Lane wrote: > Neither of those configurations fail for me, so either > it's been fixed since 12.9, or (more likely) there is > something to your test case beyond what you've mentioned. Upgraded to 12.10 from pgdg, same problem. > (I guess a long-shot possibility is that my raspberry pi 3B+ > is sufficiently different hardware from your box as to not > see the problem. Doubt that though.) My "hardware" is AWS EC2 graviton instance (c6g.2xlarge). > Can you create a self-contained test case? I tried, but it seems that number of functions/users is a factor, and I wasn't able to make simple test that would replicate the issue. I guess I will just disable JIT on arm for now, and be done with it. Best regards, depesz
Re: Interesting fail when migrating Pg from Ubuntu Bionic to Focal
From
hubert depesz lubaczewski
Date:
On Mon, Mar 07, 2022 at 09:54:22AM -0800, Andres Freund wrote: > > Initially select didn't break anything, but when I tuned down > > jit_above_cost so that it will kick in - got fails immediately. > Could you set jit_debugging_support=on and show a backtrace with that? Here you go: Program received signal SIGSEGV, Segmentation fault. 0x0000fffe7587b6c8 in deform_0_8 () (gdb) bt #0 0x0000fffe7587b6c8 in deform_0_8 () #1 0x0000fffe7587b5f4 in evalexpr_0_7 () #2 0x0000aaaab67e6f7c in ExecEvalExprSwitchContext (isNull=0xfffff7956217, econtext=0xaaaaf6014608, state=0xaaaaf6014ab0)at ./build/../src/include/executor/executor.h:316 #3 ExecProject (projInfo=0xaaaaf6014aa8) at ./build/../src/include/executor/executor.h:350 #4 ExecScan (node=<optimized out>, accessMtd=0xaaaab68051b8 <SeqNext>, recheckMtd=0xaaaab6805248 <SeqRecheck>) at ./build/../src/backend/executor/execScan.c:239 #5 0x0000aaaab67ded5c in ExecProcNode (node=0xaaaaf60143b8) at ./build/../src/include/executor/executor.h:242 #6 ExecutePlan (execute_once=<optimized out>, dest=0xaaaaf5fc15e8, direction=<optimized out>, numberTuples=0, sendTuples=<optimizedout>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0xaaaaf60143b8, estate=0xaaaaf5e01360) at ./build/../src/backend/executor/execMain.c:1632 #7 standard_ExecutorRun (queryDesc=0xaaaaf5ef4130, direction=<optimized out>, count=0, execute_once=<optimized out>) at./build/../src/backend/executor/execMain.c:350 #8 0x0000ffffa87751fc in pgss_ExecutorRun (queryDesc=0xaaaaf5ef4130, direction=ForwardScanDirection, count=0, execute_once=true)at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:893 #9 0x0000ffffa875f688 in explain_ExecutorRun (queryDesc=0xaaaaf5ef4130, direction=ForwardScanDirection, count=0, execute_once=true)at ./build/../contrib/auto_explain/auto_explain.c:320 #10 0x0000aaaab69354c8 in PortalRunSelect (portal=portal@entry=0xaaaaf5e83ef0, forward=forward@entry=true, count=0, count@entry=9223372036854775807,dest=dest@entry=0xaaaaf5fc15e8) at ./build/../src/backend/tcop/pquery.c:938 #11 0x0000aaaab6936a64 in PortalRun (portal=portal@entry=0xaaaaf5e83ef0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,run_once=run_once@entry=true, dest=dest@entry=0xaaaaf5fc15e8, altdest=altdest@entry=0xaaaaf5fc15e8, completionTag=completionTag@entry=0xfffff7956848 "") at ./build/../src/backend/tcop/pquery.c:779 #12 0x0000aaaab6932a18 in exec_simple_query (query_string=query_string@entry=0xaaaaf5d9af90 "select p.proname, (SELECT rolnamefrom pg_catalog.pg_roles where oid = p.proowner) from pg_proc p;") at ./build/../src/backend/tcop/postgres.c:1215 #13 0x0000aaaab69336ac in PostgresMain (argc=<optimized out>, argv=argv@entry=0xaaaaf5df6f18, dbname=<optimized out>, username=<optimizedout>) at ./build/../src/backend/tcop/postgres.c:4271 #14 0x0000aaaab68ba768 in BackendRun (port=0xaaaaf5df0910, port=0xaaaaf5df0910) at ./build/../src/backend/postmaster/postmaster.c:4510 #15 BackendStartup (port=0xaaaaf5df0910) at ./build/../src/backend/postmaster/postmaster.c:4193 #16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725 #17 0x0000aaaab68bb74c in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1398 #18 0x0000aaaab664c8a8 in main (argc=5, argv=0xaaaaf5d94ea0) at ./build/../src/backend/main/main.c:228 > > > FWIW, I went to the trouble of installing Ubuntu Focal on my > > > raspberry pi to see if I could duplicate this, and I couldn't. > > > However, what you get from a fresh install now is > > > $ dpkg -l | grep libllvm > > > ii libllvm10:arm64 1:10.0.0-4ubuntu1 arm64 Modular compiler and toolchaintechnologies, runtime library > > > not 9.0.1. I also found that Fedora 31/aarch64 is still downloadable, and > > > that does contain LLVM 9 ... and I could not reproduce it there either. > > > > Soo... plot thickens. > > > > Looks that pg 12 supplied by pgdg required libllvm9: > What are the libllvm package versions on which you can reproduce the crash? Hmm .. not sure I fully understand. Here are all packages that I think are relevant: root@c999-pg1302:~# dpkg -l | grep -iE 'postgresql-12|llvm|clang' ii libllvm9:arm64 1:9.0.1-12 arm64 Modular compiler and toolchaintechnologies, runtime library ii postgresql-12 12.9-2.pgdg20.04+1 arm64 The World's Most Advanced OpenSource Relational Database ii postgresql-12-dbgsym 12.9-2.pgdg20.04+1 arm64 debug symbols for postgresql-12 ii postgresql-12-pg-collkey 0.5.1-1insops1 arm64 ICU collation function wrapperfor PostgreSQL 12 ii postgresql-12-postgis-2.5 2.5.5+dfsg-1.pgdg20.04+2 arm64 Geographic objects support forPostgreSQL 12 ii postgresql-12-postgis-2.5-scripts 2.5.5+dfsg-1.pgdg20.04+2 all Geographic objects support forPostgreSQL 12 -- SQL scripts ii postgresql-12-repack 1.4.7-2.pgdg20.04+1 arm64 reorganize tables in PostgreSQLdatabases with minimal locks ii postgresql-12-system-stats 1.0.0-1 arm64 A packaged version of EnterpriseDB'ssystem_stats postgres extension Hope it helps. Best regards, depesz
Re: hubert depesz lubaczewski > Soo... plot thickens. > > Looks that pg 12 supplied by pgdg required libllvm9: > > =$ apt-cache show postgresql-12 | grep -E '^(Package|Version|Depends):' > Package: postgresql-12 > Version: 12.9-2.pgdg20.04+1 > Depends: ..., libllvm9 (>= 1:9~svn298832-1~), ... > > Package: postgresql-12 > Version: 12.9-0ubuntu0.20.04.1 > Depends: ..., libllvm10 (>= 1:9~svn298832-1~), ... > > Package: postgresql-12 > Version: 12.2-4 > Depends: ..., libllvm10 (>= 1:9~svn298832-1~), ... > > > Newer pg12 (12.10) from pgdg also depends on llvm9. Perhaps changing the deb to > use/depend-on llvm10 would solve the problem, for now I'm not sure how to do > it. Reached to Christoph about it. Thanks for spotting that. The problem turned out me being not smart enough to determine the newest llvm version installed; the not-quite-correct Makefile code failed on "9" < "10". I have now fixed that problem. Updated postgresql-{11,12,13,14} packages using libllvm10 instead of libllvm9 are available in focal-pgdg-testing. Since this is potentially a breaking change, I'll not push these live immediately but will give people a chance to test these until next week. Christoph
Re: hubert depesz lubaczewski > Soo... plot thickens. > > Looks that pg 12 supplied by pgdg required libllvm9: > > =$ apt-cache show postgresql-12 | grep -E '^(Package|Version|Depends):' > Package: postgresql-12 > Version: 12.9-2.pgdg20.04+1 > Depends: ..., libllvm9 (>= 1:9~svn298832-1~), ... > > Package: postgresql-12 > Version: 12.9-0ubuntu0.20.04.1 > Depends: ..., libllvm10 (>= 1:9~svn298832-1~), ... > > Package: postgresql-12 > Version: 12.2-4 > Depends: ..., libllvm10 (>= 1:9~svn298832-1~), ... > > > Newer pg12 (12.10) from pgdg also depends on llvm9. Perhaps changing the deb to > use/depend-on llvm10 would solve the problem, for now I'm not sure how to do > it. Reached to Christoph about it. Thanks for spotting that. The problem turned out me being not smart enough to determine the newest llvm version installed; the not-quite-correct Makefile code failed on "9" < "10". I have now fixed that problem. Updated postgresql-{11,12,13,14} packages using libllvm10 instead of libllvm9 are available in focal-pgdg-testing. Since this is potentially a breaking change, I'll not push these live immediately but will give people a chance to test these until next week. Christoph