Thread: BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

BUG #16045: vacuum_db crash and illegal memory alloc after pg_upgrade from PG11 to PG12

From: PG Bug reporting form

The following bug has been logged on the website:

Bug reference:      16045
Logged by:          Hans Buschmann
Email address:      buschmann@nidsa.net
PostgreSQL version: 12.0
Operating system:   Windows 10 64bit
Description:

I just did a pg_upgrade from pg 11.5 to pg 12.0 on my development machine
under Windows 64bit (both distributions from EDB).

cpsdb=# select version ();
                          version
------------------------------------------------------------
 PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
(1 row)

The pg_upgrade with --link went flawlessly, I started (only!) the new server
12.0 and could connect and access individual databases.

As recommended by the resulting analyze_new_cluster.bat I tried a full
vacuumdb with:

"N:/pgsql/bin/vacuumdb" -U postgres --all --analyze-only

which crashed with
vacuumdb: vacuuming database "cpsdb"
vacuumdb: error: vacuuming of table "admin.q_tbl_archiv" in database "cpsdb"
failed: ERROR:  compressed data is corrupted

I connected to the database through psql and looked at the table
"admin.q_tbl_archiv"

cpsdb=# \d+ q_tbl_archiv;
                                                  Table "admin.q_tbl_archiv"
      Column      |                Type                | Collation | Nullable | Default | Storage  | Stats target | Description 
------------------+------------------------------------+-----------+----------+---------+----------+--------------+-------------
 table_name       | information_schema.sql_identifier  |           |          |         | plain    |              | 
 column_name      | information_schema.sql_identifier  |           |          |         | plain    |              | 
 ordinal_position | information_schema.cardinal_number |           |          |         | plain    |              | 
 col_qualifier    | text                               |           |          |         | extended |              | 
 id_column        | information_schema.sql_identifier  |           |          |         | plain    |              | 
 id_default       | information_schema.character_data  |           |          |         | extended |              | 
Access method: heap

When trying to select * from q_tbl_archiv I got:

cpsdb=# select * from q_tbl_archiv;
ERROR:  invalid memory alloc request size 18446744073709551613

This table was created a long time back under 9.5 or 9.6 with the following
command (truncated here):


create table q_tbl_archiv as
with
qseason as (
select table_name,column_name, ordinal_position 
,replace(column_name,'_season','') as col_qualifier
-- ,'id_'||replace(column_name,'_season','') as id_column
from information_schema.columns 
where 
column_name like '%_season'
and ordinal_position < 10
and table_name in (
 'table1'
,'table2'
-- here truncated:
-- ... (here were all tables of mine having columns like xxx_season)
-- to reproduce, change to your own tablenames in a test database
)
order by table_name
)
select qs.*,c.column_name as id_column, c.column_default as id_default
from 
    qseason qs
    left join information_schema.columns c on c.table_name=qs.table_name and
c.column_name like 'id_%'
;

Until now this table had always been restored without error when migrating to
a new major version through pg_dump/initdb/pg_restore.

To verify the integrity of the table I restored the dump taken with pg_dump
from pg 11.5 just before the pg_upgrade onto another machine.

The restore and analyze went OK and select * from q_tbl_archiv showed all
tuples, e.g. (edited):

cpsdb_dev=# select * from q_tbl_archiv;
        table_name        | column_name  | ordinal_position | col_qualifier | id_column |                        id_default                        
--------------------------+--------------+------------------+---------------+-----------+----------------------------------------------------------
 table1                   | chm_season   |                2 | chm           |           | 
 table2                   | cs_season    |                2 | cs            | id_cs     | nextval('table2_id_cs_seq'::regclass)
...

In conclusion, this seems to me like an error/omission of pg_upgrade.

It seems not to handle these tables derived from information_schema types
correctly, resulting in failures in the upgraded database.

For me, this error is not so crucial, because this table is only used for
administrative purposes and can easily be restored from backup.

But I want to share my findings for the sake of other users of pg_upgrade.

Thanks for investigating!

Hans Buschmann


On Tue, Oct 08, 2019 at 05:08:53PM +0000, PG Bug reporting form wrote:
>The following bug has been logged on the website:
>
>Bug reference:      16045
>Logged by:          Hans Buschmann
>Email address:      buschmann@nidsa.net
>PostgreSQL version: 12.0
>Operating system:   Windows 10 64bit
>Description:
>
>I just did a pg_upgrade from pg 11.5 to pg 12.0 on my development machine
>under Windows 64bit (both distributions from EDB).
>
>cpsdb=# select version ();
>                          version
>------------------------------------------------------------
> PostgreSQL 12.0, compiled by Visual C++ build 1914, 64-bit
>(1 row)
>
>The pg_upgrade with --link went flawlessly, I started (only!) the new server
>12.0 and could connect and access individual databases.
>
>As recommended by the resulting analyze_new_cluster.bat I tried a full
>vacuumdb with:
>
>"N:/pgsql/bin/vacuumdb" -U postgres --all --analyze-only
>
>which crashed with
>vacuumdb: vacuuming database "cpsdb"
>vacuumdb: error: vacuuming of table "admin.q_tbl_archiv" in database "cpsdb"
>failed: ERROR:  compressed data is corrupted
>
>I connected to the database through psql and looked at the table
>"admin.q_tbl_archiv"
>
>cpsdb=# \d+ q_tbl_archiv;
>                                 Table "admin.q_tbl_archiv"
>      Column      |                Type                | Collation | Nullable | Default | Storage  | Stats target | Description 
>------------------+------------------------------------+-----------+----------+---------+----------+--------------+-------------
> table_name       | information_schema.sql_identifier  |           |          |         | plain    |              | 
> column_name      | information_schema.sql_identifier  |           |          |         | plain    |              | 
> ordinal_position | information_schema.cardinal_number |           |          |         | plain    |              | 
> col_qualifier    | text                               |           |          |         | extended |              | 
> id_column        | information_schema.sql_identifier  |           |          |         | plain    |              | 
> id_default       | information_schema.character_data  |           |          |         | extended |              | 
>Access method: heap
>
>When trying to select * from q_tbl_archiv I got:
>
>cpsdb=# select * from q_tbl_archiv;
>ERROR:  invalid memory alloc request size 18446744073709551613
>
>This table was created a long time back under 9.5 or 9.6 with the following
>command (truncated here):
>
>
>create table q_tbl_archiv as
>with
>qseason as (
>select table_name,column_name, ordinal_position
>,replace(column_name,'_season','') as col_qualifier
>-- ,'id_'||replace(column_name,'_season','') as id_column
>from information_schema.columns
>where
>column_name like '%_season'
>and ordinal_position < 10
>and table_name in (
> 'table1'
>,'table2'
>-- here truncated:
>-- ... (here were all tables of mine having columns like xxx_season)
>-- to reproduce, change to your own tablenames in a test database
>)
>order by table_name
>)
>select qs.*,c.column_name as id_column, c.column_default as id_default
>from
>    qseason qs
>    left join information_schema.columns c on c.table_name=qs.table_name and
>c.column_name like 'id_%'
>;
>
>Until now this table had always been restored without error when migrating to
>a new major version through pg_dump/initdb/pg_restore.
>
>To verify the integrity of the table I restored the dump taken with pg_dump
>from pg 11.5 just before the pg_upgrade onto another machine.
>
>The restore and analyze went OK and select * from q_tbl_archiv showed all
>tuples, e.g. (edited):
>
>cpsdb_dev=# select * from q_tbl_archiv;
>        table_name        | column_name  | ordinal_position | col_qualifier | id_column |                        id_default                        
>--------------------------+--------------+------------------+---------------+-----------+----------------------------------------------------------
> table1                   | chm_season   |                2 | chm           |           | 
> table2                   | cs_season    |                2 | cs            | id_cs     | nextval('table2_id_cs_seq'::regclass)
>...
>
>In conclusion, this seems to me like an error/omission of pg_upgrade.
>

There's clearly something bad happening. It's a bit strange, though. Had
this been a data corruption issue, I'd expect the pg_dump to fail too,
but it succeeds.

>It seems not to handle these tables derived from information_schema types
>correctly, resulting in failures in the upgraded database.
>

Well, I don't see how that should make any difference. It's a CTAS, which
creates a regular table, so that's not the issue. I wonder if there were
some changes to the data types involved, but that would essentially be a
break in the on-disk format, and we're careful about not doing that ...

>For me, this error is not so crucial, because this table is only used for
>administrative purposes and can easily be restored from backup.
>
>But I want to share my findings for the sake of other users of pg_upgrade.
>

OK, thanks. Could you maybe set

 log_error_verbosity = verbose

before invoking the vacuum (you can set that in that session)? That
should give us more details about where exactly the error is triggered.
Even better, if you could attach a debugger to the session, set
breakpoints on locations triggering 'invalid memory alloc request size'
and then show the backtrace (obviously, that's more complicated).
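
For instance, something like this in the same psql session (a minimal
sketch; the table name is taken from your report, and the verbose LOCATION
details then show up in the server log):

  SET log_error_verbosity = verbose;
  ANALYZE admin.q_tbl_archiv;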


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



FWIW I can reproduce this - it's enough to do this on the 11 cluster

create table q_tbl_archiv as
with
qseason as (
select table_name,column_name, ordinal_position
,replace(column_name,'_season','') as col_qualifier
-- ,'id_'||replace(column_name,'_season','') as id_column
from information_schema.columns
order by table_name
)
select qs.*,c.column_name as id_column, c.column_default as id_default
from
        qseason qs
        left join information_schema.columns c on c.table_name=qs.table_name and
c.column_name like 'id_%';


and then

    analyze q_tbl_archiv

which produces backtrace like this:

No symbol "stats" in current context.
(gdb) bt
#0  0x0000746095262951 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1  0x0000000000890a8e in varstrfastcmp_locale (a1p=0x17716b4 "per_language\a", len1=<optimized out>, a2p=0x176af28 '\177' <repeats 136 times>, "\021\004", len2=-4, ssup=<optimized out>, ssup=<optimized out>) at varlena.c:2320
#2  0x0000000000890cb1 in varlenafastcmp_locale (x=24581808, y=24555300, ssup=0x7ffc649463f0) at varlena.c:2219
#3  0x00000000005b73b4 in ApplySortComparator (ssup=0x7ffc649463f0, isNull2=false, datum2=<optimized out>, isNull1=false, datum1=<optimized out>) at ../../../src/include/utils/sortsupport.h:224
#4  compare_scalars (a=<optimized out>, b=<optimized out>, arg=0x7ffc649463e0) at analyze.c:2700
#5  0x00000000008f9953 in qsort_arg (a=a@entry=0x178fdc0, n=<optimized out>, n@entry=2158, es=es@entry=16, cmp=cmp@entry=0x5b7390 <compare_scalars>, arg=arg@entry=0x7ffc649463e0) at qsort_arg.c:140
#6  0x00000000005b86a6 in compute_scalar_stats (stats=0x176a208, fetchfunc=<optimized out>, samplerows=<optimized out>, totalrows=2158) at analyze.c:2273
#7  0x00000000005b9d95 in do_analyze_rel (onerel=onerel@entry=0x74608c00d3e8, params=params@entry=0x7ffc64946970, va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>, relpages=22, inh=inh@entry=false, in_outer_xact=false, elevel=13) at analyze.c:529
#8  0x00000000005bb2c9 in analyze_rel (relid=<optimized out>, relation=<optimized out>, params=params@entry=0x7ffc64946970, va_cols=0x0, in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:260
#9  0x000000000062c7b0 in vacuum (relations=0x1727120, params=params@entry=0x7ffc64946970, bstrategy=<optimized out>, bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=true) at vacuum.c:413
#10 0x000000000062cd49 in ExecVacuum (pstate=pstate@entry=0x16c9518, vacstmt=vacstmt@entry=0x16a82b8, isTopLevel=isTopLevel@entry=true) at vacuum.c:199
#11 0x00000000007a6d64 in standard_ProcessUtility (pstmt=0x16a8618, queryString=0x16a77a8 "", context=<optimized out>, params=0x0, queryEnv=0x0, dest=0x16a8710, completionTag=0x7ffc64946cb0 "") at utility.c:670
#12 0x00000000007a4006 in PortalRunUtility (portal=0x170f368, pstmt=0x16a8618, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=0x16a8710, completionTag=0x7ffc64946cb0 "") at pquery.c:1175
#13 0x00000000007a4b61 in PortalRunMulti (portal=portal@entry=0x170f368, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x16a8710, altdest=altdest@entry=0x16a8710, completionTag=completionTag@entry=0x7ffc64946cb0 "") at pquery.c:1321
#14 0x00000000007a5864 in PortalRun (portal=portal@entry=0x170f368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x16a8710, altdest=altdest@entry=0x16a8710, completionTag=0x7ffc64946cb0 "") at pquery.c:796
#15 0x00000000007a174e in exec_simple_query (query_string=0x16a77a8 "") at postgres.c:1215

Looking at compute_scalar_stats, the "stats" parameter does not seem
particularly healthy:

(gdb) p *stats
$3 = {attr = 0x10, attrtypid = 12, attrtypmod = 0, attrtype = 0x1762e00, attrcollid = 356,
  anl_context = 0x7f7f7f7e00000000, compute_stats = 0x100, minrows = 144, extra_data = 0x1762e00,
  stats_valid = false, stanullfrac = 0, stawidth = 0, stadistinct = 0, stakind = {0, 0, 0, 0, 0},
  staop = {0, 0, 0, 0, 0}, stacoll = {0, 0, 0, 0, 0}, numnumbers = {0, 0, 0, 0, 0},
  stanumbers = {0x0, 0x0, 0x0, 0x0, 0x0}, numvalues = {0, 0, 0, 0, 2139062142},
  stavalues = {0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f, 0x7f7f7f7f7f7f7f7f},
  statypid = {2139062143, 2139062143, 2139062143, 2139062143, 2139062143},
  statyplen = {32639, 32639, 32639, 32639, 32639}, statypbyval = {127, 127, 127, 127, 127},
  statypalign = "\177\177\177\177\177", tupattnum = 2139062143, rows = 0x7f7f7f7f7f7f7f7f,
  tupDesc = 0x7f7f7f7f7f7f7f7f, exprvals = 0x8, exprnulls = 0x4, rowstride = 24522240}

Not sure about the root cause yet, although the repeating 0x7f bytes look
like the 0x7F pattern that CLOBBER_FREED_MEMORY (enabled by --enable-cassert)
writes into freed chunks, so it smells like something is reading already
pfree'd memory.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> FWIW I can reproduce this - it's enough to do this on the 11 cluster

I failed to reproduce any problem from your example, but I was trying
in C locale on a Linux machine.  What environment are you testing?

            regards, tom lane



On Wed, Oct 09, 2019 at 03:59:07PM +0200, Tomas Vondra wrote:
>FWIW I can reproduce this - it's enough to do this on the 11 cluster
>
>create table q_tbl_archiv as
>with
>qseason as (
>select table_name,column_name, ordinal_position
>,replace(column_name,'_season','') as col_qualifier
>-- ,'id_'||replace(column_name,'_season','') as id_column
>from information_schema.columns
>order by table_name
>)
>select qs.*,c.column_name as id_column, c.column_default as id_default
>from
>       qseason qs
>       left join information_schema.columns c on c.table_name=qs.table_name and
>c.column_name like 'id_%';
>
>
>and then
>
>   analyze q_tbl_archiv
>
>Not sure about the root cause yet.
>

OK, a couple more observations - the table schema looks like this:

                              Table "public.q_tbl_archiv"
      Column      |                Type                | Collation | Nullable | Default 
------------------+------------------------------------+-----------+----------+---------
 table_name       | information_schema.sql_identifier  |           |          | 
 column_name      | information_schema.sql_identifier  |           |          | 
 ordinal_position | information_schema.cardinal_number |           |          | 
 col_qualifier    | text                               |           |          | 
 id_column        | information_schema.sql_identifier  |           |          | 
 id_default       | information_schema.character_data  |           |          | 

and I can successfully do this:

  test=# analyze q_tbl_archiv (table_name, column_name, ordinal_position, id_column, id_default);
  ANALYZE

but as soon as I include the col_qualifier column, it fails:

  test=# analyze q_tbl_archiv (table_name, column_name, ordinal_position, id_column, id_default, col_qualifier);
  ERROR:  compressed data is corrupted

But it fails differently (with the segfault) when analyzing just the one
column:

  test=# analyze q_tbl_archiv (col_qualifier);
  server closed the connection unexpectedly
      This probably means the server terminated abnormally
      before or while processing the request.
  The connection to the server was lost. Attempting reset: Succeeded.

Moreover, there are some other interesting failures - I can do

  select max(table_name) from q_tbl_archiv;
  select max(column_name) from q_tbl_archiv;
  select max(ordinal_position) from q_tbl_archiv;

but as soon as I try doing that with col_qualifier, it crashes and
burns:

  select max(col_qualifier) from q_tbl_archiv;

The backtrace is rather strange in this case (a lot of missing calls,
etc.). However, when called for the next two columns, it still crashes,
but the backtraces look somewhat saner:

  select max(id_column) from q_tbl_archiv;

Program received signal SIGSEGV, Segmentation fault.
0x00007db3186c6617 in __strlen_avx2 () from /lib64/libc.so.6
(gdb) bt
#0  0x00007db3186c6617 in __strlen_avx2 () from /lib64/libc.so.6
#1  0x0000000000894ced in cstring_to_text (s=0x7db32ce38935 <error: Cannot access memory at address 0x7db32ce38935>) at varlena.c:173
#2  name_text (fcinfo=<optimized out>) at varlena.c:3573
#3  0x000000000063860d in ExecInterpExpr (state=0x1136900, econtext=0x1135128, isnull=<optimized out>) at execExprInterp.c:649
#4  0x000000000064f699 in ExecEvalExprSwitchContext (isNull=0x7ffcfd8f3b2f, econtext=<optimized out>, state=<optimized out>) at ../../../src/include/executor/executor.h:307
#5  advance_aggregates (aggstate=0x1134ef0, aggstate=0x1134ef0) at nodeAgg.c:679
#6  agg_retrieve_direct (aggstate=0x1134ef0) at nodeAgg.c:1847
#7  ExecAgg (pstate=0x1134ef0) at nodeAgg.c:1572
#8  0x000000000063b58b in ExecProcNode (node=0x1134ef0) at ../../../src/include/executor/executor.h:239
#9  ExecutePlan (execute_once=<optimized out>, dest=0x1144248, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1134ef0, estate=0x1134c98) at execMain.c:1646
#10 standard_ExecutorRun (queryDesc=0x1094f18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#11 0x00000000007a43cc in PortalRunSelect (portal=0x10da368, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#12 0x00000000007a5958 in PortalRun (portal=portal@entry=0x10da368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x1144248, altdest=altdest@entry=0x1144248, completionTag=0x7ffcfd8f3db0 "") at pquery.c:770
#13 0x00000000007a177e in exec_simple_query (query_string=0x10727a8 "select max(id_column) from q_tbl_archiv ;") at postgres.c:1215
#14 0x00000000007a2f3f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x109e400, dbname=<optimized out>, username=<optimized out>) at postgres.c:4236
#15 0x00000000007237ce in BackendRun (port=0x1097c30, port=0x1097c30) at postmaster.c:4437
#16 BackendStartup (port=0x1097c30) at postmaster.c:4128
#17 ServerLoop () at postmaster.c:1704
#18 0x000000000072458e in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x106c350) at postmaster.c:1377
#19 0x000000000047d101 in main (argc=3, argv=0x106c350) at main.c:228

  select max(id_default) from q_tbl_archiv;

Program received signal SIGABRT, Aborted.
0x00007db3185a1e35 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007db3185a1e35 in raise () from /lib64/libc.so.6
#1  0x00007db31858c895 in abort () from /lib64/libc.so.6
#2  0x00000000008b4470 in ExceptionalCondition (conditionName=conditionName@entry=0xabe49e "1", errorType=errorType@entry=0x907128 "unrecognized TOAST vartag", fileName=fileName@entry=0xa4965b "execTuples.c", lineNumber=lineNumber@entry=971) at assert.c:54
#3  0x00000000006466d3 in slot_deform_heap_tuple (natts=6, offp=0x1135170, tuple=<optimized out>, slot=0x1135128) at execTuples.c:985
#4  tts_buffer_heap_getsomeattrs (slot=0x1135128, natts=<optimized out>) at execTuples.c:676
#5  0x00000000006489fc in slot_getsomeattrs_int (slot=slot@entry=0x1135128, attnum=6) at execTuples.c:1877
#6  0x00000000006379a3 in slot_getsomeattrs (attnum=<optimized out>, slot=0x1135128) at ../../../src/include/executor/tuptable.h:345
#7  ExecInterpExpr (state=0x11364b0, econtext=0x1134cd8, isnull=<optimized out>) at execExprInterp.c:441
#8  0x000000000064f699 in ExecEvalExprSwitchContext (isNull=0x7ffcfd8f3b2f, econtext=<optimized out>, state=<optimized out>) at ../../../src/include/executor/executor.h:307
#9  advance_aggregates (aggstate=0x1134aa0, aggstate=0x1134aa0) at nodeAgg.c:679
#10 agg_retrieve_direct (aggstate=0x1134aa0) at nodeAgg.c:1847
#11 ExecAgg (pstate=0x1134aa0) at nodeAgg.c:1572
#12 0x000000000063b58b in ExecProcNode (node=0x1134aa0) at ../../../src/include/executor/executor.h:239
#13 ExecutePlan (execute_once=<optimized out>, dest=0x11439d8, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1134aa0, estate=0x1134848) at execMain.c:1646
#14 standard_ExecutorRun (queryDesc=0x1094f18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#15 0x00000000007a43cc in PortalRunSelect (portal=0x10da368, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#16 0x00000000007a5958 in PortalRun (portal=portal@entry=0x10da368, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x11439d8, altdest=altdest@entry=0x11439d8, completionTag=0x7ffcfd8f3db0 "") at pquery.c:770
#17 0x00000000007a177e in exec_simple_query (query_string=0x10727a8 "select max(id_default) from q_tbl_archiv ;") at postgres.c:1215
#18 0x00000000007a2f3f in PostgresMain (argc=<optimized out>, argv=argv@entry=0x109e4f0, dbname=<optimized out>, username=<optimized out>) at postgres.c:4236
#19 0x00000000007237ce in BackendRun (port=0x10976f0, port=0x10976f0) at postmaster.c:4437
#20 BackendStartup (port=0x10976f0) at postmaster.c:4128
#21 ServerLoop () at postmaster.c:1704
#22 0x000000000072458e in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x106c350) at postmaster.c:1377
#23 0x000000000047d101 in main (argc=3, argv=0x106c350) at main.c:228


It's quite puzzling, though. If I had to guess, I'd say it's some sort
of memory management issue (either we're corrupting it somehow, or
perhaps using it after pfree).

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Wed, Oct 09, 2019 at 10:07:01AM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> FWIW I can reproduce this - it's enough to do this on the 11 cluster
>
>I failed to reproduce any problem from your example, but I was trying
>in C locale on a Linux machine.  What environment are you testing?
>
>            regards, tom lane

test=# show lc_collate ;
 lc_collate 
------------
 C.UTF-8
(1 row)


I can reproduce this pretty easily like this:

1) build 11

git checkout REL_11_STABLE
./configure --prefix=/home/user/pg-11 --enable-debug --enable-cassert && make -s clean && make -s -j4 install

2) build 12

git checkout REL_12_STABLE
./configure --prefix=/home/user/pg-12 --enable-debug --enable-cassert && make -s clean && make -s -j4 install

3) create the 11 cluster

/home/user/pg-11/bin/pg_ctl -D /tmp/data-11 init
/home/user/pg-11/bin/pg_ctl -D /tmp/data-11 -l /tmp/pg-11.log start
/home/user/pg-11/bin/createdb test
/home/user/pg-11/bin/psql test

4) create the table

    create table q_tbl_archiv as
    with
    qseason as (
    select table_name,column_name, ordinal_position
    ,replace(column_name,'_season','') as col_qualifier
    -- ,'id_'||replace(column_name,'_season','') as id_column
    from information_schema.columns
    order by table_name
    )
    select qs.*,c.column_name as id_column, c.column_default as id_default
    from
            qseason qs
            left join information_schema.columns c on c.table_name=qs.table_name and
    c.column_name like 'id_%';

5) shut down the 11 cluster

  /home/user/pg-11/bin/pg_ctl -D /tmp/data-11 stop

6) init 12 cluster

  /home/user/pg-12/bin/pg_ctl -D /tmp/data-12 init

7) do the pg_upgrade thing

  /home/user/pg-12/bin/pg_upgrade -b /home/user/pg-11/bin -B /home/user/pg-12/bin -d /tmp/data-11 -D /tmp/data-12 -k

8) start 12 cluster

  /home/user/pg-12/bin/pg_ctl -D /tmp/data-12 -l /tmp/pg-12.log start

9) kabooom

  /home/user/pg-12/bin/psql test -c "analyze q_tbl_archiv"


On my system (Fedora 30 on x86_64) this reliably results in a crash (and
various other failures, as demonstrated in my previous message).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Hi Tomas,

Nice that you could reproduce it.
That is exactly the procedure I followed.

For your info, here are my non-standard config params:

                name                |         current_setting
------------------------------------+---------------------------------
 application_name                   | psql
 auto_explain.log_analyze           | on
 auto_explain.log_min_duration      | 0
 auto_explain.log_nested_statements | on
 client_encoding                    | WIN1252
 cluster_name                       | HB_DEV
 data_checksums                     | on
 DateStyle                          | ISO, DMY
 default_text_search_config         | pg_catalog.german
 dynamic_shared_memory_type         | windows
 effective_cache_size               | 8GB
 lc_collate                         | C
 lc_ctype                           | German_Germany.1252
 lc_messages                        | C
 lc_monetary                        | German_Germany.1252
 lc_numeric                         | German_Germany.1252
 lc_time                            | German_Germany.1252
 log_destination                    | stderr
 log_directory                      | N:/ZZ_log/pg_log_hbdev
 log_error_verbosity                | verbose
 log_file_mode                      | 0640
 log_line_prefix                    | WHB %a %t %i %e %2l:>
 log_statement                      | mod
 log_temp_files                     | 0
 log_timezone                       | CET
 logging_collector                  | on
 maintenance_work_mem               | 128MB
 max_connections                    | 100
 max_stack_depth                    | 2MB
 max_wal_size                       | 1GB
 min_wal_size                       | 80MB
 pg_stat_statements.max             | 5000
 pg_stat_statements.track           | all
 random_page_cost                   | 1
 search_path                        | public, archiv, ablage, admin
 server_encoding                    | UTF8
 server_version                     | 12.0
 shared_buffers                     | 1GB
 shared_preload_libraries           | auto_explain,pg_stat_statements
 temp_buffers                       | 32MB
 TimeZone                           | CET
 transaction_deferrable             | off
 transaction_isolation              | read committed
 transaction_read_only              | off
 update_process_title               | off
 wal_buffers                        | 16MB
 wal_segment_size                   | 16MB
 work_mem                           | 32MB
(48 rows)

Indeed, the database has UTF8 encoding.

The extended error log (I have auto_explain set):



WHB psql 2019-10-09 15:45:03 CEST  XX000  7:> ERROR:  XX000: invalid memory alloc request size 18446744073709551613
WHB psql 2019-10-09 15:45:03 CEST  XX000  8:> LOCATION:  palloc,
d:\pginstaller_12.auto\postgres.windows-x64\src\backend\utils\mmgr\mcxt.c:934
WHB psql 2019-10-09 15:45:03 CEST  XX000  9:> STATEMENT:  select * from q_tbl_archiv;
WHB vacuumdb 2019-10-09 15:46:42 CEST  00000  1:> LOG:  00000: duration: 0.022 ms  plan:
    Query Text: SELECT pg_catalog.set_config('search_path', '', false);
    Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.014..0.015 rows=1 loops=1)
WHB vacuumdb 2019-10-09 15:46:42 CEST  00000  2:> LOCATION:  explain_ExecutorEnd,
d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:42 CEST  00000  3:> LOG:  00000: duration: 0.072 ms  plan:
    Query Text: SELECT datname FROM pg_database WHERE datallowconn ORDER BY 1;
    Sort  (cost=1.16..1.16 rows=1 width=64) (actual time=0.063..0.064 rows=14 loops=1)
      Sort Key: datname
      Sort Method: quicksort  Memory: 26kB
      ->  Seq Scan on pg_database  (cost=0.00..1.15 rows=1 width=64) (actual time=0.018..0.022 rows=14 loops=1)
            Filter: datallowconn
            Rows Removed by Filter: 1
WHB vacuumdb 2019-10-09 15:46:42 CEST  00000  4:> LOCATION:  explain_ExecutorEnd,
d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  1:> LOG:  00000: duration: 0.027 ms  plan:
    Query Text: SELECT pg_catalog.set_config('search_path', '', false);
    Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.012..0.013 rows=1 loops=1)
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  2:> LOCATION:  explain_ExecutorEnd,
d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  3:> LOG:  00000: duration: 1.036 ms  plan:
    Query Text: SELECT c.relname, ns.nspname FROM pg_catalog.pg_class c
     JOIN pg_catalog.pg_namespace ns ON c.relnamespace OPERATOR(pg_catalog.=) ns.oid
     LEFT JOIN pg_catalog.pg_class t ON c.reltoastrelid OPERATOR(pg_catalog.=) t.oid
     WHERE c.relkind OPERATOR(pg_catalog.=) ANY (array['r', 'm'])
     ORDER BY c.relpages DESC;
    Sort  (cost=56.56..56.59 rows=13 width=132) (actual time=0.843..0.854 rows=320 loops=1)
      Sort Key: c.relpages DESC
      Sort Method: quicksort  Memory: 110kB
      ->  Hash Join  (cost=1.23..56.32 rows=13 width=132) (actual time=0.082..0.649 rows=320 loops=1)
            Hash Cond: (c.relnamespace = ns.oid)
            ->  Seq Scan on pg_class c  (cost=0.00..55.05 rows=13 width=76) (actual time=0.034..0.545 rows=320 loops=1)
                  Filter: ((relkind)::text = ANY ('{r,m}'::text[]))
                  Rows Removed by Filter: 950
            ->  Hash  (cost=1.10..1.10 rows=10 width=68) (actual time=0.022..0.022 rows=10 loops=1)
                  Buckets: 1024  Batches: 1  Memory Usage: 9kB
                  ->  Seq Scan on pg_namespace ns  (cost=0.00..1.10 rows=10 width=68) (actual time=0.010..0.011 rows=10
loops=1)
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  4:> LOCATION:  explain_ExecutorEnd,
d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  5:> LOG:  00000: duration: 0.011 ms  plan:
    Query Text: SELECT pg_catalog.set_config('search_path', '', false);
    Result  (cost=0.00..0.01 rows=1 width=32) (actual time=0.008..0.008 rows=1 loops=1)
WHB vacuumdb 2019-10-09 15:46:43 CEST  00000  6:> LOCATION:  explain_ExecutorEnd,
d:\pginstaller_12.auto\postgres.windows-x64\contrib\auto_explain\auto_explain.c:415
WHB  2019-10-09 15:47:01 CEST  00000 22:> LOG:  00000: server process (PID 4708) was terminated by exception 0xC0000005
WHB  2019-10-09 15:47:01 CEST  00000 23:> DETAIL:  Failed process was running: ANALYZE admin.q_tbl_archiv;
WHB  2019-10-09 15:47:01 CEST  00000 24:> HINT:  See C include file "ntstatus.h" for a description of the hexadecimal
value.
WHB  2019-10-09 15:47:01 CEST  00000 25:> LOCATION:  LogChildExit,
d:\pginstaller_12.auto\postgres.windows-x64\src\backend\postmaster\postmaster.c:3670
WHB  2019-10-09 15:47:01 CEST  00000 26:> LOG:  00000: terminating any other active server processes
WHB  2019-10-09 15:47:01 CEST  00000 27:> LOCATION:  HandleChildCrash,
d:\pginstaller_12.auto\postgres.windows-x64\src\backend\postmaster\postmaster.c:3400
WHB psql 2019-10-09 15:47:01 CEST  57P02 10:> WARNING:  57P02: terminating connection because of crash of another server process
WHB psql 2019-10-09 15:47:01 CEST  57P02 11:> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
WHB psql 2019-10-09 15:47:01 CEST  57P02 12:> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WHB psql 2019-10-09 15:47:01 CEST  57P02 13:> LOCATION:  quickdie, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\tcop\postgres.c:2717
WHB  2019-10-09 15:47:02 CEST  57P02  3:> WARNING:  57P02: terminating connection because of crash of another server process
WHB  2019-10-09 15:47:02 CEST  57P02  4:> DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
WHB  2019-10-09 15:47:02 CEST  57P02  5:> HINT:  In a moment you should be able to reconnect to the database and repeat your command.
WHB  2019-10-09 15:47:02 CEST  57P02  6:> LOCATION:  quickdie, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\tcop\postgres.c:2717
WHB  2019-10-09 15:47:02 CEST  00000 28:> LOG:  00000: all server processes terminated; reinitializing
WHB  2019-10-09 15:47:02 CEST  00000 29:> LOCATION:  PostmasterStateMachine, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\postmaster\postmaster.c:3912
WHB  2019-10-09 15:47:02 CEST  00000  1:> LOG:  00000: database system was interrupted; last known up at 2019-10-09 15:46:03 CEST
WHB  2019-10-09 15:47:02 CEST  00000  2:> LOCATION:  StartupXLOG, d:\pginstaller_12.auto\postgres.windows-x64\src\backend\access\transam\xlog.c:6277


The table was imported successively by pg_dump/pg_restore from the previous
versions into pg 11.

This is the same as what I did on the other machine (pg 11.5). On this test
machine I could successfully export the table with pg_dump -t.

On the broken PG12 cluster I succeeded in recreating a similar table with the
original create table statement: no errors.

Under the upgraded PG12, I tried to select only the first column (select
table_name from q_tbl_archiv) and got erroneous results (first 2 entries
shown):

cpsdb=# select table_name from q_tbl_archiv;
                 table_name
---------------------------------------------
 \x11chemmat\x17chm_season
 !collectionsheet\x15cs_season

It seems that the varlena length bytes are present in the output.

Hope this information helps.

Hans Buschmann




Well, I think I found the root cause. It's because of 7c15cef86d, which
changed the definition of sql_identifier so that it's a domain over name
instead of varchar. So we now have this:

  SELECT typname, typlen FROM pg_type WHERE typname = 'sql_identifier';

  -[ RECORD 1 ]--+---------------
  typname        | sql_identifier
  typlen         | 64

instead of this

  -[ RECORD 1 ]--+---------------
  typname        | sql_identifier
  typlen         | -1

Unfortunately, that seems very much like a break of on-disk format, and
after pg_upgrade any table containing sql_identifier columns is pretty
much guaranteed to be badly mangled. For example, the first row from the
table used in the original report looks like this on PostgreSQL 11:

  test=# select ctid, * from q_tbl_archiv limit 1;
  -[ RECORD 1 ]----+--------------------------
  ctid             | (0,1)
  table_name       | _pg_foreign_data_wrappers
  column_name      | foreign_data_wrapper_name
  ordinal_position | 5
  col_qualifier    | foreign_data_wrapper_name
  id_column        | 
  id_default       | 

while on PostgreSQL 12 after pg_upgrade it looks like this

  test=# select ctid, table_name, column_name, ordinal_position from q_tbl_archiv limit 1;
  -[ RECORD 1 ]----+---------------------------------------------------------
  ctid             | (0,1)
  table_name       | 5_pg_foreign_data_wrappers5foreign_data_wrapper_name\x05
  column_name      | _data_wrapper_name
  ordinal_position | 0

Incidentally, that also explains the "length bytes" Hans saw: the old rows
store the values as varlena, but the new definition reads the same bytes as
a fixed-length name. The leading '5' above is the short varlena header byte
0x35, i.e. (0x35 >> 1) - 1 = 25 payload bytes, which is exactly the length
of "_pg_foreign_data_wrappers".

Not sure what to do about this :-(
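
In the meantime, something like this should spot affected user columns
before an upgrade (a rough sketch; it ignores domains over the type):

  SELECT n.nspname, c.relname, a.attname
  FROM pg_catalog.pg_class c
       JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
       JOIN pg_catalog.pg_attribute a ON a.attrelid = c.oid
  WHERE a.atttypid = 'information_schema.sql_identifier'::pg_catalog.regtype
    AND NOT a.attisdropped
    AND c.relkind IN ('r', 'm')
    AND n.nspname NOT IN ('pg_catalog', 'information_schema');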


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Well, I think I found the root cause. It's because of 7c15cef86d, which
> changed the definition of sql_identifier so that it's a domain over name
> instead of varchar.

Ah...

> Not sure what to do about this :-(

Fortunately, there should be close to zero people with user tables
depending on sql_identifier.  I think we should just add a test in
pg_upgrade that refuses to upgrade if there are any such columns.
It won't be the first such restriction.

            regards, tom lane



On Wed, Oct 09, 2019 at 07:18:45PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Well, I think I found the root cause. It's because of 7c15cef86d, which
>> changed the definition of sql_identifier so that it's a domain over name
>> instead of varchar.
>
>Ah...
>
>> Not sure what to do about this :-(
>
>Fortunately, there should be close to zero people with user tables
>depending on sql_identifier.  I think we should just add a test in
>pg_upgrade that refuses to upgrade if there are any such columns.
>It won't be the first such restriction.
>

Hmmm, yeah.  I agree the number of people using sql_identifier in user
tables is low, but OTOH we got this report within a week after release,
so maybe it's higher than we think.

Another option would be to teach pg_upgrade to switch the columns to
'text' or 'varchar'; not sure if that's possible or how much work that
would be.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Wed, Oct 09, 2019 at 07:18:45PM -0400, Tom Lane wrote:
>> Fortunately, there should be close to zero people with user tables
>> depending on sql_identifier.  I think we should just add a test in
>> pg_upgrade that refuses to upgrade if there are any such columns.
>> It won't be the first such restriction.

> Hmmm, yeah.  I agree the number of people using sql_identifier in user
> tables is low, but OTOH we got this report within a week after release,
> so maybe it's higher than we think.

True.

> Another option would be to teach pg_upgrade to switch the columns to
>'text' or 'varchar'; not sure if that's possible or how much work that
>would be.

I think it'd be a mess --- the actual hacking would have to happen in
pg_dump, I think, and it'd be a kluge because pg_dump doesn't normally
understand what server version its output is going to.  So we'd more
or less have to control it through a new pg_dump switch that pg_upgrade
would use.  Ick.

Also, even if we did try to silently convert such columns that way,
I bet we'd get other bug reports about "why'd my columns suddenly
change type?".  So I'd rather force the user to be involved.

            regards, tom lane



On 2019-10-09 19:41:54 -0400, Tom Lane wrote:
> Also, even if we did try to silently convert such columns that way,
> I bet we'd get other bug reports about "why'd my columns suddenly
> change type?".  So I'd rather force the user to be involved.

+1



On Wed, Oct 09, 2019 at 06:48:13PM -0700, Andres Freund wrote:
>On 2019-10-09 19:41:54 -0400, Tom Lane wrote:
>> Also, even if we did try to silently convert such columns that way,
>> I bet we'd get other bug reports about "why'd my columns suddenly
>> change type?".  So I'd rather force the user to be involved.
>
>+1

Fair enough, attached is a patch doing that, I think. Maybe the file
should be named differently, as it contains other objects than just
tables.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> Fair enough, attached is a patch doing that, I think. Maybe the file
> should be named differently, as it contains other objects than just
> tables.

Seems about right, though I notice it will not detect domains over
sql_identifier.  How much do we care about that?

To identify such domains, I think we'd need something like
WHERE atttypid IN (recursive-WITH-query), which makes me nervous.
We did support those starting with 8.4, which is as far back as
pg_upgrade will go, so in theory it should work.  But I think we
had bugs with such cases in old releases.  Do we want to assume
that the source server has been updated enough to avoid any such
bugs?  The expense of such a query might be daunting, too.

            regards, tom lane



On Thu, Oct 10, 2019 at 10:19:12AM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> Fair enough, attached is a patch doing that, I think. Maybe the file
>> should be named differently, as it contains other objects than just
>> tables.
>
>Seems about right, though I notice it will not detect domains over
>sql_identifier.  How much do we care about that?
>
>To identify such domains, I think we'd need something like
>WHERE atttypid IN (recursive-WITH-query), which makes me nervous.
>We did support those starting with 8.4, which is as far back as
>pg_upgrade will go, so in theory it should work.  But I think we
>had bugs with such cases in old releases.  Do we want to assume
>that the source server has been updated enough to avoid any such
>bugs?  The expense of such a query might be daunting, too.
>

Not sure.

Regarding bugs, I think it's fine to assume the users are running
sufficiently recent version - they may not, but then they're probably
subject to various other bugs (data corruption, queries). If they're
not, then they'll either get false positives (in which case they'll be
forced to update) or false negatives (which is just as if we did
nothing).

For the query cost, I think we can assume the domain hierarchies are not
particularly deep (in practice I'd expect just domains directly on the
sql_identifier type). And I doubt people are using that very widely,
it's probably more like this report - ad-hoc CTAS, so just a couple of
items. So I wouldn't expect it to be a huge deal in most cases. But even
if it takes a second or two, it's a one-time cost.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Thu, Oct 10, 2019 at 10:19:12AM -0400, Tom Lane wrote:
>> To identify such domains, I think we'd need something like
>> WHERE atttypid IN (recursive-WITH-query), which makes me nervous.
>> We did support those starting with 8.4, which is as far back as
>> pg_upgrade will go, so in theory it should work.  But I think we
>> had bugs with such cases in old releases.  Do we want to assume
>> that the source server has been updated enough to avoid any such
>> bugs?  The expense of such a query might be daunting, too.

> For the query cost, I think we can assume the domain hierarchies are not
> particularly deep (in practice I'd expect just domains directly on the
> sql_identifier type). And I doubt people are using that very widely,
> it's probably more like this report - ad-hoc CTAS, so just a couple of
> items. So I wouldn't expect it to be a huge deal in most cases. But even
> if it takes a second or two, it's a one-time cost.

What I was worried about was the planner possibly trying to apply the
atttypid restriction as a scan qual using a subplan, which might be rather
awful.  But it doesn't look like that happens.  I get a hash semijoin to
the CTE output, in all branches back to 8.4, on this trial query:

explain
with recursive sqlidoids(toid) as (
select 'information_schema.sql_identifier'::pg_catalog.regtype as toid
union
select oid from pg_catalog.pg_type, sqlidoids
  where typtype = 'd' and typbasetype = sqlidoids.toid
)                       
SELECT n.nspname, c.relname, a.attname 
FROM    pg_catalog.pg_class c, 
        pg_catalog.pg_namespace n, 
        pg_catalog.pg_attribute a 
WHERE   c.oid = a.attrelid AND 
        NOT a.attisdropped AND 
        a.atttypid in (select toid from sqlidoids) AND
        c.relkind IN ('r','v','i') and
        c.relnamespace = n.oid AND 
        n.nspname !~ '^pg_temp_' AND 
        n.nspname !~ '^pg_toast_temp_' AND 
        n.nspname NOT IN ('pg_catalog', 'information_schema');

            regards, tom lane



On Thu, Oct 10, 2019 at 04:14:20PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Thu, Oct 10, 2019 at 10:19:12AM -0400, Tom Lane wrote:
>>> To identify such domains, I think we'd need something like
>>> WHERE atttypid IN (recursive-WITH-query), which makes me nervous.
>>> We did support those starting with 8.4, which is as far back as
>>> pg_upgrade will go, so in theory it should work.  But I think we
>>> had bugs with such cases in old releases.  Do we want to assume
>>> that the source server has been updated enough to avoid any such
>>> bugs?  The expense of such a query might be daunting, too.
>
>> For the query cost, I think we can assume the domain hierarchies are not
>> particularly deep (in practice I'd expect just domains directly on the
>> sql_identifier type). And I doubt people are using that very widely,
>> it's probably more like this report - ad-hoc CTAS, so just a couple of
>> items. So I wouldn't expect it to be a huge deal in most cases. But even
>> if it takes a second or two, it's a one-time cost.
>
>What I was worried about was the planner possibly trying to apply the
>atttypid restriction as a scan qual using a subplan, which might be rather
>awful.  But it doesn't look like that happens. 

OK.

> I get a hash semijoin to
>the CTE output, in all branches back to 8.4, on this trial query:
>
>explain
>with recursive sqlidoids(toid) as (
>select 'information_schema.sql_identifier'::pg_catalog.regtype as toid
>union
>select oid from pg_catalog.pg_type, sqlidoids
>  where typtype = 'd' and typbasetype = sqlidoids.toid
>)
>SELECT n.nspname, c.relname, a.attname
>FROM    pg_catalog.pg_class c,
>        pg_catalog.pg_namespace n,
>        pg_catalog.pg_attribute a
>WHERE   c.oid = a.attrelid AND
>        NOT a.attisdropped AND
>        a.atttypid in (select toid from sqlidoids) AND
>        c.relkind IN ('r','v','i') and
>        c.relnamespace = n.oid AND
>        n.nspname !~ '^pg_temp_' AND
>        n.nspname !~ '^pg_toast_temp_' AND
>        n.nspname NOT IN ('pg_catalog', 'information_schema');
>

I think that's not quite sufficient - the problem is that we can have
domains and composite types on sql_identifier, in some arbitrary order.
And the recursive CTE won't handle that the way it's written - it will
miss domains on composite types containing sql_identifier. And we have
quite a few of them in the information schema, so maybe someone created
a domain on one of those (however unlikely it may seem).

I think this recursive CTE does it correctly:

WITH RECURSIVE oids AS (
    -- type itself
    SELECT 'information_schema.sql_identifier'::regtype AS oid
    UNION ALL
    SELECT * FROM (
        -- domains on the type
        WITH x AS (SELECT oid FROM oids)
        SELECT t.oid FROM pg_catalog.pg_type t, x WHERE typbasetype = x.oid AND typtype = 'd'
        UNION
        -- composite types containing the type
        SELECT t.oid FROM pg_catalog.pg_type t, pg_catalog.pg_class c, pg_catalog.pg_attribute a, x
        WHERE t.typtype = 'c' AND
              t.oid = c.reltype AND
              c.oid = a.attrelid AND
              a.atttypid = x.oid
    ) foo
) 

I had to use a CTE within a CTE, because the recursive 'oids' CTE can be
referenced only once, but we have two subqueries there. Maybe there's a
better solution.
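
To be clear, the idea is that this CTE then feeds the same attribute scan
as in your trial query, roughly like this (untested sketch, reusing your
other clauses):

  WITH RECURSIVE oids AS (
      ... as above ...
  )
  SELECT n.nspname, c.relname, a.attname
  FROM pg_catalog.pg_class c,
       pg_catalog.pg_namespace n,
       pg_catalog.pg_attribute a
  WHERE c.oid = a.attrelid AND
        NOT a.attisdropped AND
        a.atttypid IN (SELECT oid FROM oids) AND
        c.relkind IN ('r', 'v', 'i') AND
        c.relnamespace = n.oid AND
        n.nspname NOT IN ('pg_catalog', 'information_schema');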


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



OK,

here is an updated patch, with the recursive CTE. I've done a fair
amount of testing on it on older versions (back to 9.4), and it seems to
work just fine.

Another thing that I noticed is that the query does not need to look at
RELKIND_COMPOSITE_TYPE, because we only really care about cases when
sql_identifier is stored on-disk. Composite type alone does not do that,
and the CTE includes OIDs of composite types that we then check against
relations and matviews.

Barring objections, I'll push this early next week.


BTW the query (including the RELKIND_COMPOSITE_TYPE) was essentially just
a lightly-massaged copy of old_9_6_check_for_unknown_data_type_usage, so
that seems wrong too. The comment explicitly says:

 * Also check composite types, in case they are used for table columns.

but even a simple "create type c as (a unknown, b int)" without any
table using it is enough to trigger the failure. But maybe it's
intentional, not sure.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> here is an updated patch, with the recursive CTE. I've done a fair
> amount of testing on it on older versions (up to 9.4), and it seems to
> work just fine.

Might be a good idea to exclude attisdropped columns in the part of the
recursive query that's looking for sql_identifier columns of composite
types.  I'm not sure if composites can have dropped columns today,
but even if they can't it seems like a wise bit of future-proofing.
(We'll no doubt have occasion to use this logic again...)

Looks good other than that nit.

> BTW the query (including the RELKIND_COMPOSITE_TYPE) was essentially just
> a lightly-massaged copy of old_9_6_check_for_unknown_data_type_usage, so
> that seems wrong too.

Yeah, we should back-port this logic into that check too, IMO.

            regards, tom lane



On Sun, Oct 13, 2019 at 02:26:48PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> here is an updated patch, with the recursive CTE. I've done a fair
>> amount of testing on it on older versions (up to 9.4), and it seems to
>> work just fine.
>
>Might be a good idea to exclude attisdropped columns in the part of the
>recursive query that's looking for sql_identifier columns of composite
>types.  I'm not sure if composites can have dropped columns today,
>but even if they can't it seems like a wise bit of future-proofing.
>(We'll no doubt have occasion to use this logic again...)
>

Hmm? How could that be safe? Let's say we have a composite type with a
sql_identifier column, it's used in a table with data, and we drop the
column. We need the pg_type information to parse the existing data, so how
could we skip attisdropped columns?

>Looks good other than that nit.
>
>> BTW the query (including the RELKIND_COMPOSITE_TYPE) was essentially just
>> a lightly-massaged copy of old_9_6_check_for_unknown_data_type_usage, so
>> that seems wrong too.
>
>Yeah, we should back-port this logic into that check too, IMO.
>

You mean the recursive CTE, removal of RELKIND_COMPOSITE_TYPE or the
proposed change w.r.t. dropped columns?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Sun, Oct 13, 2019 at 02:26:48PM -0400, Tom Lane wrote:
>> Might be a good idea to exclude attisdropped columns in the part of the
>> recursive query that's looking for sql_identifier columns of composite
>> types.  I'm not sure if composites can have dropped columns today,

[ I checked this, they can ]

>> but even if they can't it seems like a wise bit of future-proofing.
>> (We'll no doubt have occasion to use this logic again...)

> Hmm? How could that be safe? Let's say we have a composite type with a
> sql_identifier column, it's used in a table with data, and we drop the
> column. We need the pg_type information to parse the existing data, so how
> could we skip attisdropped columns?

It works exactly like it does for table rowtypes.

regression=# create type cfoo as (f1 int, f2 int, f3 int);
CREATE TYPE
regression=# alter type cfoo drop attribute f2;
ALTER TYPE
regression=# select attname,atttypid,attisdropped,attlen,attalign from pg_attribute where attrelid = 'cfoo'::regclass;
           attname            | atttypid | attisdropped | attlen | attalign
------------------------------+----------+--------------+--------+----------
 f1                           |       23 | f            |      4 | i
 ........pg.dropped.2........ |        0 | t            |      4 | i
 f3                           |       23 | f            |      4 | i
(3 rows)

All we need to skip over the dead data is attlen/attalign, which are
preserved in pg_attribute even if the pg_type row is gone.

As this example shows, you don't really *have* to check attisdropped
because atttypid will be set to zero.  But the latter is just a
defense measure in case somebody forgets to check attisdropped;
you're not supposed to forget that.
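
For instance, a scan like this (illustrative query, not from the patch)
gives the same rows with or without the attisdropped test, precisely
because a dropped column's atttypid is zeroed:

select a.attrelid::regclass, a.attname
from pg_catalog.pg_attribute a
where a.atttypid = 'information_schema.sql_identifier'::regtype
  and not a.attisdropped;  -- belt and braces; atttypid = 0 already excludes dropped columns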

            regards, tom lane



On Mon, Oct 14, 2019 at 10:16:40AM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Sun, Oct 13, 2019 at 02:26:48PM -0400, Tom Lane wrote:
>>> Might be a good idea to exclude attisdropped columns in the part of the
>>> recursive query that's looking for sql_identifier columns of composite
>>> types.  I'm not sure if composites can have dropped columns today,
>
>[ I checked this, they can ]
>
>>> but even if they can't it seems like a wise bit of future-proofing.
>>> (We'll no doubt have occasion to use this logic again...)
>
>> Hmm? How could that be safe? Let's say we have a composite type with a
>> sql_identifier column, it's used in a table with data, and we drop the
>> column. We need the pg_type information to parse the existing data, so how
>> could we skip attisdropped columns?
>
>It works exactly like it does for table rowtypes.
>
>regression=# create type cfoo as (f1 int, f2 int, f3 int);
>CREATE TYPE
>regression=# alter type cfoo drop attribute f2;
>ALTER TYPE
>regression=# select attname,atttypid,attisdropped,attlen,attalign from pg_attribute where attrelid = 'cfoo'::regclass;
>           attname            | atttypid | attisdropped | attlen | attalign
>------------------------------+----------+--------------+--------+----------
> f1                           |       23 | f            |      4 | i
> ........pg.dropped.2........ |        0 | t            |      4 | i
> f3                           |       23 | f            |      4 | i
>(3 rows)
>
>All we need to skip over the dead data is attlen/attalign, which are
>preserved in pg_attribute even if the pg_type row is gone.
>
>As this example shows, you don't really *have* to check attisdropped
>because atttypid will be set to zero.  But the latter is just a
>defense measure in case somebody forgets to check attisdropped;
>you're not supposed to forget that.
>

Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
clarifying, I'll polish and push the fix shortly.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
> ...
>
>Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
>clarifying, I'll polish and push the fix shortly.
>

I've pushed and backpatched the fix. Attached are similar fixes for the
existing pg_upgrade checks for pg_catalog.line and pg_catalog.unknown
types, which have the same issues with composite types and domains.
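
To illustrate the domain case (hypothetical example, names made up): a
check that merely compares atttypid to the base type misses this, even
though the table stores line data on disk:

create domain dline as pg_catalog.line;
create table t (c dline);  -- atttypid is the domain's OID, not 'line'::regtype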

There are some additional details & examples in the commit messages.

I've kept this as two patches primarily because of backpatching - the
line fix needs to go back to 9.4, while the unknown fix only goes back
to 10.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment
On Tue, Oct 15, 2019 at 02:18:17AM +0200, Tomas Vondra wrote:
> On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
> >...
> >
> >Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
> >clarifying, I'll polish and push the fix shortly.

Perhaps it'd be worth creating a test for the on-disk format?

Like a table with a column for each core type, which is either SELECTed
from after pg_upgrade, or whose pg_dump output is compared before and
after.
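
Something like this, say (hypothetical sketch - table name and column
list made up):

CREATE TABLE core_type_coverage (
    c_bool boolean,
    c_int  integer,
    c_text text,
    c_line line
    -- ... one column per core type
);
INSERT INTO core_type_coverage VALUES (true, 1, 'x', '{1,2,3}');
-- after pg_upgrade: SELECT * from it, or diff pg_dump output taken before/after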

Justin



On Mon, Oct 14, 2019 at 11:41:18PM -0500, Justin Pryzby wrote:
>On Tue, Oct 15, 2019 at 02:18:17AM +0200, Tomas Vondra wrote:
>> On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
>> >...
>> >
>> >Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
>> >clarifying, I'll polish and push the fix shortly.
>
>Perhaps it'd be worth creating a test for the on-disk format?
>
>Like a table with a column for each core type, which is either SELECTed
>from after pg_upgrade, or whose pg_dump output is compared before and
>after.
>

IMO that would be useful - we now have several of these checks for
different data types (line, unknown, sql_identifier), each with several
combinations to cover. I've been looking for similar pg_upgrade tests,
but I haven't found anything. We do run pg_upgrade on the cluster used
for the regression tests, but here we need to test a number of cases
that are each meant to make pg_upgrade abort, so we'd need a separate
pg_upgrade run for each of them.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Tue, Oct 15, 2019 at 02:18:17AM +0200, Tomas Vondra wrote:
>On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
>>...
>>
>>Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
>>clarifying, I'll polish and push the fix shortly.
>>
>
>I've pushed and backpatched the fix. Attached are similar fixes for the
>existing pg_upgrade checks for pg_catalog.line and pg_catalog.unknown
>types, which have the same issues with composite types and domains.
>
>There are some additional details & examples in the commit messages.
>
>I've kept this in two patches primarily because of backpatching - the
>line fix should go back up to 9.4, the unknown is for 10.
>

I've just committed and pushed both fixes after some minor corrections.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> I've just committed and pushed both fixes after some minor corrections.

Not quite right in 9.6 and before, according to crake.  Looks like
some issue with the CppAsString2'd constants?  Did we even have
CppAsString2 that far back?

            regards, tom lane



I wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> I've just committed and pushed both fixes after some minor corrections.

> Not quite right in 9.6 and before, according to crake.  Looks like
> some issue with the CppAsString2'd constants?  Did we even have
> CppAsString2 that far back?

Yeah, we did.  On closer inspection I suspect that we need to #include
some other file to get the RELKIND_ constants in the old branches.

            regards, tom lane



On Wed, Oct 16, 2019 at 03:26:42PM +0200, Tom Lane wrote:
>I wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> I've just committed and pushed both fixes after some minor corrections.
>
>> Not quite right in 9.6 and before, according to crake.  Looks like
>> some issue with the CppAsString2'd constants?  Did we even have
>> CppAsString2 that far back?
>
>Yeah, we did.  On closer inspection I suspect that we need to #include
>some other file to get the RELKIND_ constants in the old branches.
>

Oh! Looking.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Wed, Oct 16, 2019 at 03:26:42PM +0200, Tom Lane wrote:
>I wrote:
>> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>> I've just committed and pushed both fixes after some minor corrections.
>
>> Not quite right in 9.6 and before, according to crake.  Looks like
>> some issue with the CppAsString2'd constants?  Did we even have
>> CppAsString2 that far back?
>
>Yeah, we did.  On closer inspection I suspect that we need to #include
>some other file to get the RELKIND_ constants in the old branches.
>

Yeah, the #include of pg_class.h was missing in the pre-10 branches. It
compiled just fine - with the RELKIND_ macros undefined, CppAsString2()
simply stringifies the macro names themselves, so the query text was
silently wrong - which is why I didn't notice it during the
backpatching :-(

Fixed, let's see if the buildfarm is happy with that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Tue, Oct 15, 2019 at 02:18:17AM +0200, Tomas Vondra wrote:
> On Mon, Oct 14, 2019 at 06:35:38PM +0200, Tomas Vondra wrote:
> > ...
> > 
> > Aha! I forgot we copy the necessary stuff into pg_attribute. Thanks for
> > clarifying, I'll polish and push the fix shortly.
> > 
> 
> I've pushed and backpatched the fix. Attached are similar fixes for the
> existing pg_upgrade checks for pg_catalog.line and pg_catalog.unknown
> types, which have the same issues with composite types and domains.

This commit added old_11_check_for_sql_identifier_data_type_usage(), but
it did not use the clearer database error list format added to the
master branch in commit 1634d36157.  Attached is a patch to fix this,
which I have committed.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +

Attachment