Thread: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16045
Logged by:          Hans Buschmann
Email address:      buschmann@nidsa.net
PostgreSQL version: 12.0
Operating system:   Windows 10 64bit
Description:

I just did a pg_upgrade from pg 11.5 to pg 12.0 on my development machine under Windows 64bit (both distributions from EDB).

cpsdb=# select version ();
                          version
------------------------------------------------------------
 PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
(1 row)

The pg_upgrade with --link went flawlessly. I started (only!) the new server 12.0 and could connect and access individual databases.

As recommended by the resulting analyze_new_cluster.bat I tried a full vacuumdb with:

  "N:/pgsql/bin/vacuumdb" -U postgres --all --analyze-only

which crashed with:

  vacuumdb: vacuuming database "cpsdb"
  vacuumdb: error: vacuuming of table "admin.q_tbl_archiv" in database "cpsdb" failed: ERROR:  compressed data is corrupted

I connected to the database through psql and looked at the table "admin.q_tbl_archiv":

cpsdb=# \d+ q_tbl_archiv;
                                          Table "admin.q_tbl_archiv"
      Column      |                Type                | Collation | Nullable | Default | Storage  | Stats target | Description
------------------+------------------------------------+-----------+----------+---------+----------+--------------+-------------
 table_name       | information_schema.sql_identifier  |           |          |         | plain    |              |
 column_name      | information_schema.sql_identifier  |           |          |         | plain    |              |
 ordinal_position | information_schema.cardinal_number |           |          |         | plain    |              |
 col_qualifier    | text                               |           |          |         | extended |              |
 id_column        | information_schema.sql_identifier  |           |          |         | plain    |              |
 id_default       | information_schema.character_data  |           |          |         | extended |              |
Access method: heap

When trying to select * from q_tbl_archiv I got:

cpsdb=# select * from q_tbl_archiv;
ERROR:  invalid memory alloc request size 18446744073709551613

This table was created a long time back under 9.5 or 9.6 with the following (here truncated) command:

create table q_tbl_archiv as
with qseason as (
select table_name, column_name, ordinal_position
,replace(column_name,'_season','') as col_qualifier
-- ,'id_'||replace(column_name,'_season','') as id_column
from information_schema.columns
where column_name like '%_season'
and ordinal_position < 10
and table_name in (
 'table1'
,'table2'
-- here truncated:
-- ... (all tables of mine having columns like xxx_season)
-- to reproduce, change to your own tablenames in a test database
)
order by table_name
)
select qs.*, c.column_name as id_column, c.column_default as id_default
from qseason qs
left join information_schema.columns c
  on c.table_name = qs.table_name and c.column_name like 'id_%';

Until now this table was always restored without error when migrating to a new major version through pg_dump/initdb/pg_restore.

To verify the integrity of the table, I restored the dump taken with pg_dump from pg 11.5 just before the pg_upgrade to another machine. The restore and analyze went OK, and select * from q_tbl_archiv showed all tuples, e.g. (edited):

cpsdb_dev=# select * from q_tbl_archiv;
        table_name        | column_name  | ordinal_position | col_qualifier | id_column |              id_default
--------------------------+--------------+------------------+---------------+-----------+---------------------------------------
 table1                   | chm_season   |                2 | chm           |           |
 table2                   | cs_season    |                2 | cs            | id_cs     | nextval('table2_id_cs_seq'::regclass)
...

In conclusion, this looks to me like an error/omission in pg_upgrade: it does not seem to handle these tables derived from information_schema correctly, resulting in failures in the upgraded database.

For me this error is not so crucial, because the table is only used for administrative purposes and can easily be restored from backup. But I want to share my findings for the sake of other users of pg_upgrade.

Thanks for investigating!

Hans Buschmann
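[As an aside: the reported allocation size 18446744073709551613 is exactly 2^64 - 3, i.e. a small negative length that was reinterpreted as an unsigned 64-bit value - the usual signature of a misread varlena length word rather than a genuinely huge request. A quick sketch (Python, purely illustrative):]

```python
import struct

# The failing request size from the error message above.
reported = 18446744073709551613

# Reinterpret the same 64-bit pattern as a signed integer: it is just -3.
(signed,) = struct.unpack('<q', struct.pack('<Q', reported))
print(signed)  # -3

# Equivalently, it is 2^64 - 3 - a negative length wrapped around.
assert reported == 2**64 - 3
```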
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Tue, Oct 08, 2019 at 05:08:53PM +0000, PG Bug reporting form wrote:

>In conclusion, this seems to me like an error/omission of pg_upgrade.

There's clearly something bad happening. It's a bit strange, though. Had this been a data corruption issue, I'd expect the pg_dump to fail too, but it succeeds.

>It seems to handle these specially derived tables from information_schema
>not correctly, resulting in failures of the upgraded database.

Well, I don't see how that should make any difference. It's a CTAS, and that should create a regular table, so that's not an issue. I wonder if there were some changes to the data types involved, but that would essentially be a break in the on-disk format, and we're careful about not doing that ...

>For me, this error is not so crucial, because this table is only used for
>administrative purposes and can easily be restored from backup.
>
>But I want to share my findings for the sake of other users of pg_upgrade.

OK, thanks. Could you maybe set log_error_verbosity = verbose before invoking the vacuum (you can set that in that session)? That should give us more details about where exactly the error is triggered.

Even better, if you could attach a debugger to the session, set breakpoints on the locations triggering 'invalid memory alloc request size' and then show the backtrace (obviously, that's more complicated).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
FWIW I can reproduce this - it's enough to do this on the 11 cluster:

create table q_tbl_archiv as
with qseason as (
select table_name,column_name, ordinal_position
,replace(column_name,'_season','') as col_qualifier
-- ,'id_'||replace(column_name,'_season','') as id_column
from information_schema.columns
order by table_name
)
select qs.*,c.column_name as id_column, c.column_default as id_default
from qseason qs
left join information_schema.columns c on c.table_name=qs.table_name and c.column_name like 'id_%';

and then

  analyze q_tbl_archiv

which produces a backtrace like this:

No symbol "stats" in current context.
(gdb) bt
#0  0x0000746095262951 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1  0x0000000000890a8e in varstrfastcmp_locale (a1p=0x17716b4 "per_language\a", len1=<optimized out>, a2p=0x176af28 '\177' <repeats 136 times>, "\021\004", len2=-4, ssup=<optimized out>, ssup=<optimized out>) at varlena.c:2320
#2  0x0000000000890cb1 in varlenafastcmp_locale (x=24581808, y=24555300, ssup=0x7ffc649463f0) at varlena.c:2219
#3  0x00000000005b73b4 in ApplySortComparator (ssup=0x7ffc649463f0, isNull2=false, datum2=<optimized out>, isNull1=false, datum1=<optimized out>) at ../../../src/include/utils/sortsupport.h:224
#4  compare_scalars (a=<optimized out>, b=<optimized out>, arg=0x7ffc649463e0) at analyze.c:2700
#5  0x00000000008f9953 in qsort_arg (a=a@entry=0x178fdc0, n=<optimized out>, n@entry=2158, es=es@entry=16, cmp=cmp@entry=0x5b7390 <compare_scalars>, arg=arg@entry=0x7ffc649463e0) at qsort_arg.c:140
#6  0x00000000005b86a6 in compute_scalar_stats (stats=0x176a208, fetchfunc=<optimized out>, samplerows=<optimized out>, totalrows=2158) at analyze.c:2273
#7  0x00000000005b9d95 in do_analyze_rel (onerel=onerel@entry=0x74608c00d3e8, params=params@entry=0x7ffc64946970, va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>, relpages=22, inh=inh@entry=false, in_outer_xact=false, elevel=13) at analyze.c:529
#8  0x00000000005bb2c9 in analyze_rel (relid=<optimized out>, relation=<optimized out>, params=params@entry=0x7ffc64946970, va_cols=0x0, in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:260
#9  0x000000000062c7b0 in vacuum (relations=0x1727120, params=params@entry=0x7ffc64946970, bstrategy=<optimized out>, bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=true) at vacuum.c:413
#10 0x000000000062cd49 in ExecVacuum (pstate=pstate@entry=0x16c9518, vacstmt=vacstmt@entry=0x16a82b8, isTopLevel=isTopLevel@entry=true) at vacuum.c:199
#11 0x00000000007a6d64 in standard_ProcessUtility (pstmt=0x16a8618, queryString=0x16a77a8 "", context=<optimized out>, params=0x0, queryEnv=0x0, dest=0x16a8710, completionTag=0x7ffc64946cb0 "") at utility.c:670
#12 0x00000000007a4006 in PortalRunUtility (portal=0x170f368, pstmt=0x16a8618, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=0x16a8710, completionTag=0x7ffc64946cb0 "") at pquery.c:1175
#13 0x00000000007a4b61 in PortalRunMulti (portal=portal@entry=0x170f368, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x16a8710, altdest=altdest@entry=0x16a8710, completionTag=completionTag@entry=0x7ffc64946cb0 "") at pquery.c:1321
#14 0x00000000007a5864 in PortalRun (portal=portal@entry=0x170f368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x16a8710, altdest=altdest@entry=0x16a8710, completionTag=0x7ffc64946cb0 "") at pquery.c:796
#15 0x00000000007a174e in exec_simple_query (query_string=0x16a77a8 "") at postgres.c:1215

Looking at compute_scalar_stats, the "stats" parameter does not seem particularly healthy:

(gdb) p *stats
$3 = {attr = 0x10, attrtypid = 12, attrtypmod = 0, attrtype = 0x1762e00, attrcollid = 356, anl_context = 0x7f7f7f7e00000000, compute_stats = 0x100, minrows = 144, extra_data = 0x1762e00, stats_valid = false, stanullfrac = 0, stawidth = 0, stadistinct = 0, stakind = {0, 0, 0, 0, 0}, staop = {0, 0, 0, 0, 0}, stacoll = {0, 0, 0, 0, 0}, numnumbers = {0, 0, 0, 0, 0}, stanumbers = {0x0, 0x0, 0x0, 0x0, 0x0}, numvalues = {0, 0, 0, 0, 2139062142}, stavalues = {0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f}, statypid = {2139062143, 2139062143, 2139062143, 2139062143, 2139062143}, statyplen = {32639, 32639, 32639, 32639, 32639}, statypbyval = {127, 127, 127, 127, 127}, statypalign = "\177\177\177\177\177", tupattnum = 2139062143, rows = 0x7f7f7f7f7f7f7f7f, tupDesc = 0x7f7f7f7f7f7f7f7f, exprvals = 0x8, exprnulls = 0x4, rowstride = 24522240}

Not sure about the root cause yet.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
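[As an aside on the struct dump above: the repeated 0x7f bytes (pointers like 0x7f7f7f7f7f7f7f7f, ints like 2139062143 and 32639) look like the 0x7F fill pattern that assert-enabled builds write over pfree'd memory (CLOBBER_FREED_MEMORY), which would mean the struct is being read after it was freed. A quick decode, purely illustrative:]

```python
# Decode the suspicious constants from the gdb dump, assuming they are
# runs of 0x7F fill bytes (the pattern cassert builds write over freed
# memory via CLOBBER_FREED_MEMORY).
fill = b"\x7f" * 8

as_u32 = int.from_bytes(fill[:4], "little")  # matches statypid / tupattnum
as_u16 = int.from_bytes(fill[:2], "little")  # matches the statyplen slots

print(hex(as_u32), as_u32)  # 0x7f7f7f7f 2139062143
print(hex(as_u16), as_u16)  # 0x7f7f 32639
```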
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> FWIW I can reproduce this - it's enough to do this on the 11 cluster

I failed to reproduce any problem from your example, but I was trying in C locale on a Linux machine. What environment are you testing?

			regards, tom lane
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Wed, Oct 09, 2019 at 03:59:07PM +0200, Tomas Vondra wrote:
>FWIW I can reproduce this - it's enough to do this on the 11 cluster
>
>Not sure about the root cause yet.

OK, a couple more observations - the table schema looks like this:

                        Table "public.q_tbl_archiv"
      Column      |                Type                | Collation | Nullable | Default
------------------+------------------------------------+-----------+----------+---------
 table_name       | information_schema.sql_identifier  |           |          |
 column_name      | information_schema.sql_identifier  |           |          |
 ordinal_position | information_schema.cardinal_number |           |          |
 col_qualifier    | text                               |           |          |
 id_column        | information_schema.sql_identifier  |           |          |
 id_default       | information_schema.character_data  |           |          |

and I can successfully do this:

test=# analyze q_tbl_archiv (table_name, column_name, ordinal_position, id_column, id_default);
ANALYZE

but as soon as I include the col_qualifier column, it fails:

test=# analyze q_tbl_archiv (table_name, column_name, ordinal_position, id_column, id_default, col_qualifier);
ERROR:  compressed data is corrupted

But it fails differently (with the segfault) when analyzing just that one column:

test=# analyze q_tbl_archiv (col_qualifier);
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
Moreover, there are some other interesting failures - I can do

  select max(table_name) from q_tbl_archiv;
  select max(column_name) from q_tbl_archiv;
  select max(ordinal_position) from q_tbl_archiv;

but as soon as I try doing that with col_qualifier, it crashes and burns:

  select max(col_qualifier) from q_tbl_archiv;

The backtrace is rather strange in this case (a lot of missing calls, etc.). However, when called for the next two columns, it still crashes, but the backtraces look somewhat saner:

  select max(id_column) from q_tbl_archiv;

Program received signal SIGSEGV, Segmentation fault.
0x00007db3186c6617 in __strlen_avx2 () from /lib64/libc.so.6
(gdb) bt
#0  0x00007db3186c6617 in __strlen_avx2 () from /lib64/libc.so.6
#1  0x0000000000894ced in cstring_to_text (s=0x7db32ce38935 <error: Cannot access memory at address 0x7db32ce38935>) at varlena.c:173
#2  name_text (fcinfo=<optimized out>) at varlena.c:3573
#3  0x000000000063860d in ExecInterpExpr (state=0x1136900, econtext=0x1135128, isnull=<optimized out>) at execExprInterp.c:649
#4  0x000000000064f699 in ExecEvalExprSwitchContext (isNull=0x7ffcfd8f3b2f, econtext=<optimized out>, state=<optimized out>) at ../../../src/include/executor/executor.h:307
#5  advance_aggregates (aggstate=0x1134ef0, aggstate=0x1134ef0) at nodeAgg.c:679
#6  agg_retrieve_direct (aggstate=0x1134ef0) at nodeAgg.c:1847
#7  ExecAgg (pstate=0x1134ef0) at nodeAgg.c:1572
#8  0x000000000063b58b in ExecProcNode (node=0x1134ef0) at ../../../src/include/executor/executor.h:239
#9  ExecutePlan (execute_once=<optimized out>, dest=0x1144248, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1134ef0, estate=0x1134c98) at execMain.c:1646
#10 standard_ExecutorRun (queryDesc=0x1094f18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#11 0x00000000007a43cc in PortalRunSelect (portal=0x10da368, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#12 0x00000000007a5958 in PortalRun (portal=portal@entry=0x10da368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x1144248, altdest=altdest@entry=0x1144248, completionTag=0x7ffcfd8f3db0 "") at pquery.c:770
#13 0x00000000007a177e in exec_simple_query (query_string=0x10727a8 "select max(id_column) from q_tbl_archiv ;") at postgres.c:1215
#14 0x00000000007a2f3f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x109e400, dbname=<optimized out>, username=<optimized out>) at postgres.c:4236
#15 0x00000000007237ce in BackendRun (port=0x1097c30, port=0x1097c30) at postmaster.c:4437
#16 BackendStartup (port=0x1097c30) at postmaster.c:4128
#17 ServerLoop () at postmaster.c:1704
#18 0x000000000072458e in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x106c350) at postmaster.c:1377
#19 0x000000000047d101 in main (argc=3, argv=0x106c350) at main.c:228

  select max(id_default) from q_tbl_archiv;

Program received signal SIGABRT, Aborted.
0x00007db3185a1e35 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007db3185a1e35 in raise () from /lib64/libc.so.6
#1  0x00007db31858c895 in abort () from /lib64/libc.so.6
#2  0x00000000008b4470 in ExceptionalCondition (conditionName=conditionName@entry=0xabe49e "1", errorType=errorType@entry=0x907128 "unrecognized TOAST vartag", fileName=fileName@entry=0xa4965b "execTuples.c", lineNumber=lineNumber@entry=971) at assert.c:54
#3  0x00000000006466d3 in slot_deform_heap_tuple (natts=6, offp=0x1135170, tuple=<optimized out>, slot=0x1135128) at execTuples.c:985
#4  tts_buffer_heap_getsomeattrs (slot=0x1135128, natts=<optimized out>) at execTuples.c:676
#5  0x00000000006489fc in slot_getsomeattrs_int (slot=slot@entry=0x1135128, attnum=6) at execTuples.c:1877
#6  0x00000000006379a3 in slot_getsomeattrs (attnum=<optimized out>, slot=0x1135128) at ../../../src/include/executor/tuptable.h:345
#7  ExecInterpExpr (state=0x11364b0, econtext=0x1134cd8, isnull=<optimized out>) at execExprInterp.c:441
#8  0x000000000064f699 in ExecEvalExprSwitchContext (isNull=0x7ffcfd8f3b2f, econtext=<optimized out>, state=<optimized out>) at ../../../src/include/executor/executor.h:307
#9  advance_aggregates (aggstate=0x1134aa0, aggstate=0x1134aa0) at nodeAgg.c:679
#10 agg_retrieve_direct (aggstate=0x1134aa0) at nodeAgg.c:1847
#11 ExecAgg (pstate=0x1134aa0) at nodeAgg.c:1572
#12 0x000000000063b58b in ExecProcNode (node=0x1134aa0) at ../../../src/include/executor/executor.h:239
#13 ExecutePlan (execute_once=<optimized out>, dest=0x11439d8, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1134aa0, estate=0x1134848) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x1094f18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#15 0x00000000007a43cc in PortalRunSelect (portal=0x10da368, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#16 0x00000000007a5958 in PortalRun (portal=portal@entry=0x10da368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x11439d8, altdest=altdest@entry=0x11439d8, completionTag=0x7ffcfd8f3db0 "") at pquery.c:770
#17 0x00000000007a177e in exec_simple_query (query_string=0x10727a8 "select max(id_default) from q_tbl_archiv ;") at postgres.c:1215
#18 0x00000000007a2f3f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x109e4f0, dbname=<optimized out>, username=<optimized out>) at postgres.c:4236
#19 0x00000000007237ce in BackendRun (port=0x10976f0, port=0x10976f0) at postmaster.c:4437
#20 BackendStartup (port=0x10976f0) at postmaster.c:4128
#21 ServerLoop () at postmaster.c:1704
#22 0x000000000072458e in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x106c350) at postmaster.c:1377
#23 0x000000000047d101 in main (argc=3, argv=0x106c350) at main.c:228

It's quite puzzling, though. If I had to guess, I'd say it's some sort of memory management issue (either we're corrupting it somehow, or perhaps using it after pfree).

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
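[As an aside on the backtraces above: the name_text/strlen crash would also be consistent with the stored bytes being laid out as one type while the server reads them as another. On little-endian machines, a short varlena datum carries its length in a 1-byte header (low bit set, upper 7 bits = total length), whereas a "name" is a fixed-size NUL-terminated buffer with no length header at all. A purely illustrative sketch of the two readings, not PostgreSQL source:]

```python
NAMEDATALEN = 64  # PostgreSQL's compiled-in size for type "name"

def short_varlena(payload: bytes) -> bytes:
    """Encode a short (1-byte header) varlena datum, little-endian layout:
    low bit set marks the 1-byte form, upper 7 bits hold the total length
    (header byte included)."""
    total = len(payload) + 1
    assert total <= 0x7F
    return bytes([(total << 1) | 1]) + payload

def read_as_varlena(buf: bytes) -> bytes:
    """Decode under the varlena reading: length comes from the header."""
    total = buf[0] >> 1
    return buf[1:total]

def read_as_name(buf: bytes) -> bytes:
    """Decode under the "name" reading: scan for a terminating NUL."""
    return buf.split(b"\x00", 1)[0]

datum = short_varlena(b"cs")
print(read_as_varlena(datum))  # b'cs' - correct under the varlena reading
# Reading the same bytes as a "name" finds no NUL inside the datum, so the
# scan continues into whatever follows - in C, strlen() walks off the end
# of the allocation, much like the crash in cstring_to_text above.
```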
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Wed, Oct 09, 2019 at 10:07:01AM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> FWIW I can reproduce this - it's enough to do this on the 11 cluster
>
>I failed to reproduce any problem from your example, but I was trying
>in C locale on a Linux machine.  What environment are you testing?

test=# show lc_collate ;
 lc_collate
------------
 C.UTF-8
(1 row)

I can reproduce this pretty easily like this:

1) build 11

  git checkout REL_11_STABLE
  ./configure --prefix=/home/user/pg-11 --enable-debug --enable-cassert && make -s clean && make -s -j4 install

2) build 12

  git checkout REL_12_STABLE
  ./configure --prefix=/home/user/pg-12 --enable-debug --enable-cassert && make -s clean && make -s -j4 install

3) create the 11 cluster

  /home/user/pg-11/bin/pg_ctl -D /tmp/data-11 init
  /home/user/pg-11/bin/pg_ctl -D /tmp/data-11 -l /tmp/pg-11.log start
  /home/user/pg-11/bin/createdb test
  /home/user/pg-11/bin/psql test

4) create the table

  create table q_tbl_archiv as
  with qseason as (
  select table_name,column_name, ordinal_position
  ,replace(column_name,'_season','') as col_qualifier
  -- ,'id_'||replace(column_name,'_season','') as id_column
  from information_schema.columns
  order by table_name
  )
  select qs.*,c.column_name as id_column, c.column_default as id_default
  from qseason qs
  left join information_schema.columns c on c.table_name=qs.table_name and c.column_name like 'id_%';

5) shut down the 11 cluster

  /home/user/pg-11/bin/pg_ctl -D /tmp/data-11 stop

6) init the 12 cluster

  /home/user/pg-12/bin/pg_ctl -D /tmp/data-12 init

7) do the pg_upgrade thing

  /home/user/pg-12/bin/pg_upgrade -b /home/user/pg-11/bin -B /home/user/pg-12/bin -d /tmp/data-11 -D /tmp/data-12 -k

8) start the 12 cluster

  /home/user/pg-12/bin/pg_ctl -D /tmp/data-12 -l /tmp/pg-12.log start

9) kaboom

  /home/user/pg-12/bin/psql test -c "analyze q_tbl_archiv"

On my system (Fedora 30 on x86_64) this reliably results in a crash (and various other crashes, as demonstrated in my previous message).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
AW: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Hans Buschmann
Date:
Hi Tomas,

Nice that you could reproduce it. This was just the way I followed.

For your info, here are my non-standard config params:

                name                |         current_setting
------------------------------------+---------------------------------
 application_name                   | psql
 auto_explain.log_analyze           | on
 auto_explain.log_min_duration      | 0
 auto_explain.log_nested_statements | on
 client_encoding                    | WIN1252
 cluster_name                       | HB_DEV
 data_checksums                     | on
 DateStyle                          | ISO, DMY
 default_text_search_config         | pg_catalog.german
 dynamic_shared_memory_type         | windows
 effective_cache_size               | 8GB
 lc_collate                         | C
 lc_ctype                           | German_Germany.1252
 lc_messages                        | C
 lc_monetary                        | German_Germany.1252
 lc_numeric                         | German_Germany.1252
 lc_time                            | German_Germany.1252
 log_destination                    | stderr
 log_directory                      | N:/ZZ_log/pg_log_hbdev
 log_error_verbosity                | verbose
 log_file_mode                      | 0640
 log_line_prefix                    | WHB %a %t %i %e %2l:>
 log_statement                      | mod
 log_temp_files                     | 0
 log_timezone                       | CET
 logging_collector                  | on
 maintenance_work_mem               | 128MB
 max_connections                    | 100
 max_stack_depth                    | 2MB
 max_wal_size                       | 1GB
 min_wal_size                       | 80MB
 pg_stat_statements.max             | 5000
 pg_stat_statements.track           | all
 random_page_cost                   | 1
 search_path                        | public, archiv, ablage, admin
 server_encoding                    | UTF8
 server_version                     | 12.0
 shared_buffers                     | 1GB
 shared_preload_libraries           | auto_explain,pg_stat_statements
 temp_buffers                       | 32MB
 TimeZone                           | CET
 transaction_deferrable             | off
 transaction_isolation              | read committed
 transaction_read_only              | off
 update_process_title               | off
 wal_buffers                        | 16MB
 wal_segment_size                   | 16MB
 work_mem                           | 32MB
(48 rows)

Indeed, the database has UTF8 encoding.
The extended error log (I have auto_explain set):

WHB psql 2019-10-09 15:45:03 CEST XX000 7:> ERROR: XX000: invalid memory alloc request size 18446744073709551613
WHB psql 2019-10-09 15:45:03 CEST XX000 8:> LOCATION: palloc, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\utils\mmgr\mcxt.c:934
WHB psql 2019-10-09 15:45:03 CEST XX000 9:> STATEMENT: select * from q_tbl_archiv;
WHB vacuumdb 2019-10-09 15:46:42 CEST 00000 1:> LOG: 00000: duration: 0.022 ms plan:
        Query Text: SELECT pg_catalog.set_config('search_path', '', false);
        Result (cost=0.00..0.01 rows=1 width=32) (actual time=0.014..0.015 rows=1 loops=1)
WHB vacuumdb 2019-10-09 15:46:42 CEST 00000 2:> LOCATION: explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:42 CEST 00000 3:> LOG: 00000: duration: 0.072 ms plan:
        Query Text: SELECT datname FROM pg_database WHERE datallowconn ORDER BY 1;
        Sort (cost=1.16..1.16 rows=1 width=64) (actual time=0.063..0.064 rows=14 loops=1)
          Sort Key: datname
          Sort Method: quicksort Memory: 26kB
          -> Seq Scan on pg_database (cost=0.00..1.15 rows=1 width=64) (actual time=0.018..0.022 rows=14 loops=1)
               Filter: datallowconn
               Rows Removed by Filter: 1
WHB vacuumdb 2019-10-09 15:46:42 CEST 00000 4:> LOCATION: explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:43 CEST 00000 1:> LOG: 00000: duration: 0.027 ms plan:
        Query Text: SELECT pg_catalog.set_config('search_path', '', false);
        Result (cost=0.00..0.01 rows=1 width=32) (actual time=0.012..0.013 rows=1 loops=1)
WHB vacuumdb 2019-10-09 15:46:43 CEST 00000 2:> LOCATION: explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:43 CEST 00000 3:> LOG: 00000: duration: 1.036 ms plan:
        Query Text: SELECT c.relname, ns.nspname FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace ns ON c.relnamespace OPERATOR(pg_catalog.=) ns.oid LEFT JOIN pg_catalog.pg_class t ON c.reltoastrelid OPERATOR(pg_catalog.=) t.oid WHERE c.relkind OPERATOR(pg_catalog.=) ANY (array['r', 'm']) ORDER BY c.relpages DESC;
        Sort (cost=56.56..56.59 rows=13 width=132) (actual time=0.843..0.854 rows=320 loops=1)
          Sort Key: c.relpages DESC
          Sort Method: quicksort Memory: 110kB
          -> Hash Join (cost=1.23..56.32 rows=13 width=132) (actual time=0.082..0.649 rows=320 loops=1)
               Hash Cond: (c.relnamespace = ns.oid)
               -> Seq Scan on pg_class c (cost=0.00..55.05 rows=13 width=76) (actual time=0.034..0.545 rows=320 loops=1)
                    Filter: ((relkind)::text = ANY ('{r,m}'::text[]))
                    Rows Removed by Filter: 950
               -> Hash (cost=1.10..1.10 rows=10 width=68) (actual time=0.022..0.022 rows=10 loops=1)
                    Buckets: 1024 Batches: 1 Memory Usage: 9kB
                    -> Seq Scan on pg_namespace ns (cost=0.00..1.10 rows=10 width=68) (actual time=0.010..0.011 rows=10 loops=1)
WHB vacuumdb 2019-10-09 15:46:43 CEST 00000 4:> LOCATION: explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:43 CEST 00000 5:> LOG: 00000: duration: 0.011 ms plan:
        Query Text: SELECT pg_catalog.set_config('search_path', '', false);
        Result (cost=0.00..0.01 rows=1 width=32) (actual time=0.008..0.008 rows=1 loops=1)
WHB vacuumdb 2019-10-09 15:46:43 CEST 00000 6:> LOCATION: explain_ExecutorEnd, d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB 2019-10-09 15:47:01 CEST 00000 22:> LOG: 00000: server process (PID 4708) was terminated by exception 0xC0000005
WHB 2019-10-09 15:47:01 CEST 00000 23:> DETAIL: Failed process was running: ANALYZE admin.q_tbl_archiv;
WHB 2019-10-09 15:47:01 CEST 00000 24:> HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
WHB 2019-10-09 15:47:01 CEST 00000 25:> LOCATION: LogChildExit, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\postmaster\postmaster.c:3670
WHB 2019-10-09 15:47:01 CEST 00000 26:> LOG: 00000: terminating any other active server processes
WHB 2019-10-09 15:47:01 CEST 00000 27:> LOCATION: HandleChildCrash, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\postmaster\postmaster.c:3400
WHB psql 2019-10-09 15:47:01 CEST 57P02 10:> WARNING: 57P02: terminating connection because of crash of another server process
WHB psql 2019-10-09 15:47:01 CEST 57P02 11:> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
WHB psql 2019-10-09 15:47:01 CEST 57P02 12:> HINT: In a moment you should be able to reconnect to the database and repeat your command.
WHB psql 2019-10-09 15:47:01 CEST 57P02 13:> LOCATION: quickdie, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\tcop\postgres.c:2717
WHB 2019-10-09 15:47:02 CEST 57P02 3:> WARNING: 57P02: terminating connection because of crash of another server process
WHB 2019-10-09 15:47:02 CEST 57P02 4:> DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
WHB 2019-10-09 15:47:02 CEST 57P02 5:> HINT: In a moment you should be able to reconnect to the database and repeat your command.
WHB 2019-10-09 15:47:02 CEST 57P02 6:> LOCATION: quickdie, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\tcop\postgres.c:2717
WHB 2019-10-09 15:47:02 CEST 00000 28:> LOG: 00000: all server processes terminated; reinitializing
WHB 2019-10-09 15:47:02 CEST 00000 29:> LOCATION: PostmasterStateMachine, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\postmaster\postmaster.c:3912
WHB 2019-10-09 15:47:02 CEST 00000 1:> LOG: 00000: database system was interrupted; last known up at 2019-10-09 15:46:03 CEST
WHB 2019-10-09 15:47:02 CEST 00000 2:> LOCATION: StartupXLOG, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\access\transam\xlog.c:6277

The table was imported successively by pg_dump/pg_restore from the previous versions into PG11, the same thing I did on the other machine (PG 11.5). On this test machine I could successfully export the table with pg_dump -t. On the broken PG12 cluster I succeeded in recreating a similar table with the original CREATE TABLE statements: no errors.

On the upgraded PG12 cluster, I tried to select only the first column (select table_name from q_tbl_archiv) and got erroneous results (first 2 entries shown):

cpsdb=# select table_name from q_tbl_archiv;
          table_name
---------------------------------------------
 \x11chemmat\x17chm_season
 !collectionsheet\x15cs_season

It seems that the length bytes are present in the output.

Hope this information helps.

Hans Buschmann
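[Editorial aside: the stray bytes reported above line up exactly with PostgreSQL's 1-byte varlena headers. Under PG11 each sql_identifier value was stored as a short varlena whose first byte is (total length including the header) << 1 with the low bit set (on little-endian builds); PG12 reads the same bytes as fixed-length name values, so the headers leak into the output. A small sanity-check sketch, my own reconstruction rather than something from the thread:]

```python
def short_varlena_header(payload: bytes) -> int:
    """1-byte varlena header PostgreSQL uses for short values on
    little-endian builds: (total length incl. the header byte) << 1,
    with the low bit set to mark the 1-byte-header form."""
    total_len = len(payload) + 1
    return (total_len << 1) | 0x01

# The garbage above matches this encoding byte for byte:
assert short_varlena_header(b"chemmat") == 0x11          # "\x11chemmat"
assert short_varlena_header(b"chm_season") == 0x17       # "\x17chm_season"
assert short_varlena_header(b"collectionsheet") == 0x21  # "!" == chr(0x21)
assert short_varlena_header(b"cs_season") == 0x15        # "\x15cs_season"

# The absurd palloc request earlier in the log is likewise consistent with
# a small negative varlena length (here -3, computed from garbage bytes)
# reinterpreted as an unsigned 64-bit size_t:
assert -3 & (2**64 - 1) == 18446744073709551613
```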
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
Well, I think I found the root cause. It's because of 7c15cef86d, which changed the definition of sql_identifier so that it's a domain over name instead of varchar. So we now have this:

SELECT typname, typlen FROM pg_type WHERE typname = 'sql_identifier';
-[ RECORD 1 ]--+---------------
typname        | sql_identifier
typlen         | 64

instead of this:

-[ RECORD 1 ]--+---------------
typname        | sql_identifier
typlen         | -1

Unfortunately, that seems very much like a break of the on-disk format, and after pg_upgrade any table containing sql_identifier columns is pretty much guaranteed to be badly mangled. For example, the first row from the table used in the original report looks like this on PostgreSQL 11:

test=# select ctid, * from q_tbl_archiv limit 1;
-[ RECORD 1 ]----+--------------------------
ctid             | (0,1)
table_name       | _pg_foreign_data_wrappers
column_name      | foreign_data_wrapper_name
ordinal_position | 5
col_qualifier    | foreign_data_wrapper_name
id_column        |
id_default       |

while on PostgreSQL 12 after pg_upgrade it looks like this:

test=# select ctid, table_name, column_name, ordinal_position from q_tbl_archiv limit 1;
-[ RECORD 1 ]----+---------------------------------------------------------
ctid             | (0,1)
table_name       | 5_pg_foreign_data_wrappers5foreign_data_wrapper_name\x05
column_name      | _data_wrapper_name
ordinal_position | 0

Not sure what to do about this :-(

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
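[Editorial aside: for anyone wondering whether their own pre-upgrade cluster is exposed, a minimal detection query might look like the sketch below. This is my own simplified version: it only catches columns declared directly as sql_identifier, not domains over it or composite types containing it, and the relkind list and schema exclusions are my own choices.]

```sql
-- Sketch only: list user-table/matview columns whose declared type is
-- exactly information_schema.sql_identifier.  Does NOT catch domains
-- over it, or composite types containing it.
SELECT n.nspname, c.relname, a.attname
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON c.relnamespace = n.oid
JOIN pg_catalog.pg_attribute a ON a.attrelid = c.oid
WHERE NOT a.attisdropped
  AND a.atttypid = 'information_schema.sql_identifier'::pg_catalog.regtype
  AND c.relkind IN ('r', 'm')
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');
```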
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Well, I think I found the root cause. It's because of 7c15cef86d, which
> changed the definition of sql_identifier so that it's a domain over name
> instead of varchar.

Ah...

> Not sure what to do about this :-(

Fortunately, there should be close to zero people with user tables depending on sql_identifier. I think we should just add a test in pg_upgrade that refuses to upgrade if there are any such columns. It won't be the first such restriction.

			regards, tom lane
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Wed, Oct 09, 2019 at 07:18:45PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Well, I think I found the root cause. It's because of 7c15cef86d, which
>> changed the definition of sql_identifier so that it's a domain over name
>> instead of varchar.
>
>Ah...
>
>> Not sure what to do about this :-(
>
>Fortunately, there should be close to zero people with user tables
>depending on sql_identifier. I think we should just add a test in
>pg_upgrade that refuses to upgrade if there are any such columns.
>It won't be the first such restriction.
>

Hmmm, yeah. I agree the number of people using sql_identifier in user tables is low, but OTOH we got this report within a week after release, so maybe it's higher than we think.

Another option would be to teach pg_upgrade to switch the columns to 'text' or 'varchar'; I'm not sure whether that's possible or how much work it would be.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
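[Editorial aside: the column-switching idea can also be applied manually by a user on the old cluster before upgrading. A hedged sketch against the table from this report, using the three sql_identifier columns shown in its \d+ output; note that ALTER COLUMN ... TYPE rewrites the table:]

```sql
-- Sketch: on the PG11 cluster, before running pg_upgrade, convert the
-- sql_identifier columns to plain text so no on-disk sql_identifier
-- values remain.  This rewrites the table, so expect it to take time
-- on large tables.
ALTER TABLE admin.q_tbl_archiv
    ALTER COLUMN table_name TYPE text,
    ALTER COLUMN column_name TYPE text,
    ALTER COLUMN id_column TYPE text;
```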
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Wed, Oct 09, 2019 at 07:18:45PM -0400, Tom Lane wrote:
>> Fortunately, there should be close to zero people with user tables
>> depending on sql_identifier. I think we should just add a test in
>> pg_upgrade that refuses to upgrade if there are any such columns.
>> It won't be the first such restriction.

> Hmmm, yeah. I agree the number of people using sql_identifier in user
> tables is low, but OTOH we got this report within a week after release,
> so maybe it's higher than we think.

True.

> Another option would be to teach pg_upgrade to switch the columns to
> 'text' or 'varchar', not sure if that's possible or how much work would
> that be.

I think it'd be a mess --- the actual hacking would have to happen in pg_dump, I think, and it'd be a kluge because pg_dump doesn't normally understand what server version its output is going to. So we'd more or less have to control it through a new pg_dump switch that pg_upgrade would use. Ick.

Also, even if we did try to silently convert such columns that way, I bet we'd get other bug reports about "why'd my columns suddenly change type?". So I'd rather force the user to be involved.

			regards, tom lane
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Andres Freund
Date:
On 2019-10-09 19:41:54 -0400, Tom Lane wrote:
> Also, even if we did try to silently convert such columns that way,
> I bet we'd get other bug reports about "why'd my columns suddenly
> change type?". So I'd rather force the user to be involved.

+1
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Wed, Oct 09, 2019 at 06:48:13PM -0700, Andres Freund wrote:
>On 2019-10-09 19:41:54 -0400, Tom Lane wrote:
>> Also, even if we did try to silently convert such columns that way,
>> I bet we'd get other bug reports about "why'd my columns suddenly
>> change type?". So I'd rather force the user to be involved.
>
>+1

Fair enough, attached is a patch doing that, I think. Maybe the file should be named differently, as it contains other objects than just tables.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Fair enough, attached is a patch doing that, I think. Maybe the file
> should be named differently, as it contains other objects than just
> tables.

Seems about right, though I notice it will not detect domains over sql_identifier. How much do we care about that?

To identify such domains, I think we'd need something like WHERE atttypid IN (recursive-WITH-query), which makes me nervous. We did support those starting with 8.4, which is as far back as pg_upgrade will go, so in theory it should work. But I think we had bugs with such cases in old releases. Do we want to assume that the source server has been updated enough to avoid any such bugs? The expense of such a query might be daunting, too.

			regards, tom lane
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Thu, Oct 10, 2019 at 10:19:12AM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Fair enough, attached is a patch doing that, I think. Maybe the file
>> should be named differently, as it contains other objects than just
>> tables.
>
>Seems about right, though I notice it will not detect domains over
>sql_identifier. How much do we care about that?
>
>To identify such domains, I think we'd need something like
>WHERE atttypid IN (recursive-WITH-query), which makes me nervous.
>We did support those starting with 8.4, which is as far back as
>pg_upgrade will go, so in theory it should work. But I think we
>had bugs with such cases in old releases. Do we want to assume
>that the source server has been updated enough to avoid any such
>bugs? The expense of such a query might be daunting, too.
>

Not sure. Regarding bugs, I think it's fine to assume the users are running a sufficiently recent version - they may not be, but then they're probably subject to various other bugs (data corruption, query bugs). If they're not, then they'll either get false positives (in which case they'll be forced to update) or false negatives (which is just as if we did nothing).

For the query cost, I think we can assume the domain hierarchies are not particularly deep (in practice I'd expect just domains directly on the sql_identifier type). And I doubt people are using that very widely; it's probably more like this report - ad-hoc CTAS, so just a couple of items. So I wouldn't expect it to be a huge deal in most cases. But even if it takes a second or two, it's a one-time cost.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Thu, Oct 10, 2019 at 10:19:12AM -0400, Tom Lane wrote:
>> To identify such domains, I think we'd need something like
>> WHERE atttypid IN (recursive-WITH-query), which makes me nervous.
>> We did support those starting with 8.4, which is as far back as
>> pg_upgrade will go, so in theory it should work. But I think we
>> had bugs with such cases in old releases. Do we want to assume
>> that the source server has been updated enough to avoid any such
>> bugs? The expense of such a query might be daunting, too.

> For the query cost, I think we can assume the domain hierarchies are not
> particularly deep (in practice I'd expect just domains directly on the
> sql_identifier type). And I doubt people are using that very widely,
> it's probably more like this report - ad-hoc CTAS, so just a couple of
> items. So I wouldn't expect it to be a huge deal in most cases. But even
> if it takes a second or two, it's a one-time cost.

What I was worried about was the planner possibly trying to apply the atttypid restriction as a scan qual using a subplan, which might be rather awful. But it doesn't look like that happens. I get a hash semijoin to the CTE output, in all branches back to 8.4, on this trial query:

explain
with recursive sqlidoids(toid) as (
select 'information_schema.sql_identifier'::pg_catalog.regtype as toid
union
select oid from pg_catalog.pg_type, sqlidoids
  where typtype = 'd' and typbasetype = sqlidoids.toid
)
SELECT n.nspname, c.relname, a.attname
FROM pg_catalog.pg_class c,
     pg_catalog.pg_namespace n,
     pg_catalog.pg_attribute a
WHERE c.oid = a.attrelid AND
      NOT a.attisdropped AND
      a.atttypid in (select toid from sqlidoids) AND
      c.relkind IN ('r','v','i') and
      c.relnamespace = n.oid AND
      n.nspname !~ '^pg_temp_' AND
      n.nspname !~ '^pg_toast_temp_' AND
      n.nspname NOT IN ('pg_catalog', 'information_schema');

			regards, tom lane
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Thu, Oct 10, 2019 at 04:14:20PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Thu, Oct 10, 2019 at 10:19:12AM -0400, Tom Lane wrote:
>>> To identify such domains, I think we'd need something like
>>> WHERE atttypid IN (recursive-WITH-query), which makes me nervous.
>>> We did support those starting with 8.4, which is as far back as
>>> pg_upgrade will go, so in theory it should work. But I think we
>>> had bugs with such cases in old releases. Do we want to assume
>>> that the source server has been updated enough to avoid any such
>>> bugs? The expense of such a query might be daunting, too.
>
>> For the query cost, I think we can assume the domain hierarchies are not
>> particularly deep (in practice I'd expect just domains directly on the
>> sql_identifier type). And I doubt people are using that very widely,
>> it's probably more like this report - ad-hoc CTAS, so just a couple of
>> items. So I wouldn't expect it to be a huge deal in most cases. But even
>> if it takes a second or two, it's a one-time cost.
>
>What I was worried about was the planner possibly trying to apply the
>atttypid restriction as a scan qual using a subplan, which might be rather
>awful. But it doesn't look like that happens.

OK.
> I get a hash semijoin to
> the CTE output, in all branches back to 8.4, on this trial query:
>
> explain
> with recursive sqlidoids(toid) as (
> select 'information_schema.sql_identifier'::pg_catalog.regtype as toid
> union
> select oid from pg_catalog.pg_type, sqlidoids
>   where typtype = 'd' and typbasetype = sqlidoids.toid
> )
> SELECT n.nspname, c.relname, a.attname
> FROM pg_catalog.pg_class c,
>      pg_catalog.pg_namespace n,
>      pg_catalog.pg_attribute a
> WHERE c.oid = a.attrelid AND
>       NOT a.attisdropped AND
>       a.atttypid in (select toid from sqlidoids) AND
>       c.relkind IN ('r','v','i') and
>       c.relnamespace = n.oid AND
>       n.nspname !~ '^pg_temp_' AND
>       n.nspname !~ '^pg_toast_temp_' AND
>       n.nspname NOT IN ('pg_catalog', 'information_schema');
>

I think that's not quite sufficient - the problem is that we can have domains and composite types on sql_identifier, in some arbitrary order, and the recursive CTE won't handle that the way it's written - it will miss domains on composite types containing sql_identifier. And we have quite a few composite types in the information schema, so maybe someone created a domain on one of those (however unlikely it may seem).

I think this recursive CTE does it correctly:

WITH RECURSIVE oids AS (
    -- the type itself
    SELECT 'information_schema.sql_identifier'::regtype AS oid
    UNION ALL
    SELECT * FROM (
        -- domains on the type
        WITH x AS (SELECT oid FROM oids)
        SELECT t.oid
          FROM pg_catalog.pg_type t, x
         WHERE typbasetype = x.oid AND typtype = 'd'
        UNION
        -- composite types containing the type
        SELECT t.oid
          FROM pg_catalog.pg_type t,
               pg_catalog.pg_class c,
               pg_catalog.pg_attribute a,
               x
         WHERE t.typtype = 'c' AND
               t.oid = c.reltype AND
               c.oid = a.attrelid AND
               a.atttypid = x.oid
    ) foo
)

I had to use a CTE within the CTE, because the recursive 'oids' CTE can be referenced only once in the recursive term, but we have two subqueries there. Maybe there's a better solution.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
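[Editorial aside: to see why the second arm of the recursion matters, here is a toy model, entirely my own with invented names, of the reachability computation: a type is affected if it is sql_identifier itself, a domain over an affected type, or a composite type containing an affected type. A domains-only recursion misses a domain declared over a composite that contains sql_identifier.]

```python
# Toy dependency graph (hypothetical names): my_dom is a domain over the
# composite type comp1, and comp1 has an attribute of type sql_identifier.
domains = {"my_dom": "comp1"}                        # domain -> base type
composites = {"comp1": ["sql_identifier", "int4"]}   # composite -> attr types

def affected(include_composites: bool) -> set:
    """Fixpoint of 'reaches sql_identifier' over domain edges, and
    optionally composite-membership edges, mirroring the recursive CTE."""
    reached = {"sql_identifier"}
    changed = True
    while changed:
        changed = False
        for dom, base in domains.items():
            if base in reached and dom not in reached:
                reached.add(dom)
                changed = True
        if include_composites:
            for comp, attrs in composites.items():
                if comp not in reached and any(a in reached for a in attrs):
                    reached.add(comp)
                    changed = True
    return reached

assert "my_dom" in affected(include_composites=True)      # full closure finds it
assert "my_dom" not in affected(include_composites=False) # domains-only misses it
```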
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
OK, here is an updated patch, with the recursive CTE. I've done a fair amount of testing on it on older versions (up to 9.4), and it seems to work just fine.

Another thing that I noticed is that the query does not need to look at RELKIND_COMPOSITE_TYPE, because we only really care about cases where sql_identifier is stored on disk. A composite type alone does not do that, and the CTE includes OIDs of composite types that we then check against relations and matviews.

Barring objections, I'll push this early next week.

BTW the query (including the RELKIND_COMPOSITE_TYPE part) was essentially just a lightly-massaged copy of old_9_6_check_for_unknown_data_type_usage, so that seems wrong too. The comment explicitly says:

 * Also check composite types, in case they are used for table columns.

but even a simple "create type c as (a unknown, b int)" without any table using it is enough to trigger the failure. But maybe it's intentional, not sure.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> here is an updated patch, with the recursive CTE. I've done a fair
> amount of testing on it on older versions (up to 9.4), and it seems to
> work just fine.

Might be a good idea to exclude attisdropped columns in the part of the recursive query that's looking for sql_identifier columns of composite types. I'm not sure if composites can have dropped columns today, but even if they can't it seems like a wise bit of future-proofing. (We'll no doubt have occasion to use this logic again...)

Looks good other than that nit.

> BTW the query (including the RELKIND_COMPOSITE_TYPE) was essentially just
> a lightly-massaged copy of old_9_6_check_for_unknown_data_type_usage, so
> that seems wrong too.

Yeah, we should back-port this logic into that check too, IMO.

			regards, tom lane
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Sun, Oct 13, 2019 at 02:26:48PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> here is an updated patch, with the recursive CTE. I've done a fair
>> amount of testing on it on older versions (up to 9.4), and it seems to
>> work just fine.
>
>Might be a good idea to exclude attisdropped columns in the part of the
>recursive query that's looking for sql_identifier columns of composite
>types. I'm not sure if composites can have dropped columns today,
>but even if they can't it seems like a wise bit of future-proofing.
>(We'll no doubt have occasion to use this logic again...)
>

Hmm? How could that be safe? Let's say we have a composite type with a sql_identifier column, it's used in a table with data, and we drop the column. We need the pg_type information to parse the existing data, so how could we skip attisdropped columns?

>Looks good other than that nit.
>
>> BTW the query (including the RELKIND_COMPOSITE_TYPE) was essentially just
>> a lightly-massaged copy of old_9_6_check_for_unknown_data_type_usage, so
>> that seems wrong too.
>
>Yeah, we should back-port this logic into that check too, IMO.
>

You mean the recursive CTE, the removal of RELKIND_COMPOSITE_TYPE, or the proposed change w.r.t. dropped columns?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Sun, Oct 13, 2019 at 02:26:48PM -0400, Tom Lane wrote:
>> Might be a good idea to exclude attisdropped columns in the part of the
>> recursive query that's looking for sql_identifier columns of composite
>> types. I'm not sure if composites can have dropped columns today,

[ I checked this, they can ]

>> but even if they can't it seems like a wise bit of future-proofing.
>> (We'll no doubt have occasion to use this logic again...)

> Hmm? How could that be safe? Let's say we have a composite type with a
> sql_identifier column, it's used in a table with data, and we drop the
> column. We need the pg_type information to parse the existing data, so
> how could we skip attisdropped columns?

It works exactly like it does for table rowtypes.

regression=# create type cfoo as (f1 int, f2 int, f3 int);
CREATE TYPE
regression=# alter type cfoo drop attribute f2;
ALTER TYPE
regression=# select attname,atttypid,attisdropped,attlen,attalign from pg_attribute where attrelid = 'cfoo'::regclass;
           attname            | atttypid | attisdropped | attlen | attalign
------------------------------+----------+--------------+--------+----------
 f1                           |       23 | f            |      4 | i
 ........pg.dropped.2........ |        0 | t            |      4 | i
 f3                           |       23 | f            |      4 | i
(3 rows)

All we need to skip over the dead data is attlen/attalign, which are preserved in pg_attribute even if the pg_type row is gone.

As this example shows, you don't really *have* to check attisdropped because atttypid will be set to zero. But the latter is just a defense measure in case somebody forgets to check attisdropped; you're not supposed to forget that.

			regards, tom lane
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Mon, Oct 14, 2019 at 10:16:40AM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Hmm? How could that be safe? Let's say we have a composite type with a
>> sql_identifier column, it's used in a table with data, and we drop the
>> column. We need the pg_type information to parse the existing data, so
>> how could we skip attisdropped columns?
>
>It works exactly like it does for table rowtypes.
>
>regression=# create type cfoo as (f1 int, f2 int, f3 int);
>CREATE TYPE
>regression=# alter type cfoo drop attribute f2;
>ALTER TYPE
>regression=# select attname,atttypid,attisdropped,attlen,attalign from pg_attribute where attrelid = 'cfoo'::regclass;
>           attname            | atttypid | attisdropped | attlen | attalign
>------------------------------+----------+--------------+--------+----------
> f1                           |       23 | f            |      4 | i
> ........pg.dropped.2........ |        0 | t            |      4 | i
> f3                           |       23 | f            |      4 | i
>(3 rows)
>
>All we need to skip over the dead data is attlen/attalign, which are
>preserved in pg_attribute even if the pg_type row is gone.
>
>As this example shows, you don't really *have* to check attisdropped
>because atttypid will be set to zero. But the latter is just a
>defense measure in case somebody forgets to check attisdropped;
>you're not supposed to forget that.
>

Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for clarifying, I'll polish and push the fix shortly.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
> ...
>
> Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
> clarifying, I'll polish and push the fix shortly.
>

I've pushed and backpatched the fix. Attached are similar fixes for the existing pg_upgrade checks for the pg_catalog.line and pg_catalog.unknown types, which have the same issues with composite types and domains.

There are some additional details & examples in the commit messages.

I've kept this as two patches primarily because of backpatching - the line fix should go back to 9.4, while the unknown fix is for 10 only.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Justin Pryzby
Date:
On Tue, Oct 15, 2019 at 02:18:17AM +0200, Tomas Vondra wrote:
> On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
> >...
> >
> >Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
> >clarifying, I'll polish and push the fix shortly.

Perhaps it'd be worth creating a test for the on-disk format?

Like a table with a column for each core type, which is either SELECTed from after pg_upgrade, or whose pg_dump output is compared before and after.

Justin
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Mon, Oct 14, 2019 at 11:41:18PM -0500, Justin Pryzby wrote:
>On Tue, Oct 15, 2019 at 02:18:17AM +0200, Tomas Vondra wrote:
>> On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
>> >...
>> >
>> >Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
>> >clarifying, I'll polish and push the fix shortly.
>
>Perhaps it'd be worth creating a test for the on-disk format?
>
>Like a table with a column for each core type, which is either SELECTed from
>after pg_upgrade, or whose pg_dump output is compared before and after.
>

IMO that would be useful - we now have a couple of these checks for different data types (line, unknown, sql_identifier), with a couple of combinations each. And I've been looking at whether we do similar pg_upgrade tests already, but I haven't found anything.

I mean, we do pg_upgrade the cluster used for the regression tests, but here we need to test a number of cases that are meant to abort the pg_upgrade. So we'd need a number of separate pg_upgrade runs to test that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Tue, Oct 15, 2019 at 02:18:17AM +0200, Tomas Vondra wrote:
>On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
>>...
>>
>>Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
>>clarifying, I'll polish and push the fix shortly.
>>
>
>I've pushed and backpatched the fix. Attached are similar fixes for the
>existing pg_upgrade checks for pg_catalog.line and pg_catalog.unknown
>types, which have the same issues with composite types and domains.
>
>There are some additional details & examples in the commit messages.
>
>I've kept this in two patches primarily because of backpatching - the
>line fix should go back up to 9.4, the unknown is for 10.
>

I've just committed and pushed both fixes, after some minor corrections.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> I've just committed and pushed both fixes after some minor corrections.

Not quite right in 9.6 and before, according to crake. Looks like some issue with the CppAsString2'd constants? Did we even have CppAsString2 that far back?

			regards, tom lane
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tom Lane
Date:
I wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> I've just committed and pushed both fixes after some minor corrections.

> Not quite right in 9.6 and before, according to crake. Looks like
> some issue with the CppAsString2'd constants? Did we even have
> CppAsString2 that far back?

Yeah, we did. On closer inspection I suspect that we need to #include some other file to get the RELKIND_ constants in the old branches.

			regards, tom lane
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Wed, Oct 16, 2019 at 03:26:42PM +0200, Tom Lane wrote:
>I wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> I've just committed and pushed both fixes after some minor corrections.
>
>> Not quite right in 9.6 and before, according to crake. Looks like
>> some issue with the CppAsString2'd constants? Did we even have
>> CppAsString2 that far back?
>
>Yeah, we did. On closer inspection I suspect that we need to #include
>some other file to get the RELKIND_ constants in the old branches.
>

Oh! Looking.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Tomas Vondra
Date:
On Wed, Oct 16, 2019 at 03:26:42PM +0200, Tom Lane wrote:
>I wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> I've just committed and pushed both fixes after some minor corrections.
>
>> Not quite right in 9.6 and before, according to crake. Looks like
>> some issue with the CppAsString2'd constants? Did we even have
>> CppAsString2 that far back?
>
>Yeah, we did. On closer inspection I suspect that we need to #include
>some other file to get the RELKIND_ constants in the old branches.
>

Yeah, the #include of the pg_class.h catalog header was missing in the pre-10 branches. It compiled just fine, so I didn't notice that during the backpatching :-(

Fixed, let's see if the buildfarm is happy with that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12
From
Bruce Momjian
Date:
On Tue, Oct 15, 2019 at 02:18:17AM +0200, Tomas Vondra wrote:
> On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
> > ...
> >
> > Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
> > clarifying, I'll polish and push the fix shortly.
> >
>
> I've pushed and backpatched the fix. Attached are similar fixes for the
> existing pg_upgrade checks for pg_catalog.line and pg_catalog.unknown
> types, which have the same issues with composite types and domains.

This commit added old_11_check_for_sql_identifier_data_type_usage(), but it did not use the clearer database error list format added to the master branch in commit 1634d36157. Attached is a patch to fix this, which I have committed.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +