Thread: BUG #18692: Segmentation fault when extending a varchar column with a gist index with custom signal length

The following bug has been logged on the website:

Bug reference:      18692
Logged by:          Nicolas Maus
Email address:      nicolas.maus@bertelsmann.de
PostgreSQL version: 16.4
Operating system:   SLES 15.5
Description:

When extending a varchar column with a gist index with a custom signal
length the Postgres server crashes with a segmentation fault.
Tested on 16.4 and 14.13 on SLES 15.5

How to reproduce:

create extension if not exists pg_trgm;
create table contains_trgm (trgm_column varchar(200));
-- important: this bug only occurs when a custom signal length (siglen=x) is
provided
create index gist_index on contains_trgm using gist (trgm_column
public.gist_trgm_ops (siglen=32));

-- will trigger segmentation fault (new column size must be bigger than old
one):
alter table contains_trgm alter column trgm_column type varchar(768);

Log output:
2024-11-06 14:55:09.806 CET   : LOG:  00000: server process (PID 130312) was
terminated by signal 11: Segmentation fault
2024-11-06 14:55:09.806 CET   : DETAIL:  Failed process was running: alter
table contains_trgm alter column trgm_column type varchar(768);
2024-11-06 14:55:09.806 CET   : LOCATION:  LogChildExit, postmaster.c:3689
2024-11-06 14:55:09.807 CET   : LOG:  00000: terminating any other active
server processes
2024-11-06 14:55:09.807 CET   : LOCATION:  HandleChildCrash,
postmaster.c:3490
2024-11-06 14:55:09.823 CET 172.28.110.149(56320) trigram_test ggdbpadm:
FATAL:  57P03: the database system is in recovery mode
2024-11-06 14:55:09.823 CET 172.28.110.149(56320) trigram_test ggdbpadm:
LOCATION:  ProcessStartupPacket, postmaster.c:2358
2024-11-06 14:55:09.875 CET   : LOG:  00000: all server processes
terminated; reinitializing
2024-11-06 14:55:09.875 CET   : LOCATION:  PostmasterStateMachine,
postmaster.c:3950
2024-11-06 14:55:09.947 CET   : LOG:  00000: database system was
interrupted; last known up at 2024-11-06 14:52:20 CET
2024-11-06 14:55:09.947 CET   : LOCATION:  StartupXLOG, xlog.c:5113
2024-11-06 14:55:10.064 CET   : LOG:  00000: database system was not
properly shut down; automatic recovery in progress
2024-11-06 14:55:10.064 CET   : LOCATION:  InitWalRecovery,
xlogrecovery.c:926
2024-11-06 14:55:10.072 CET   : LOG:  00000: redo starts at 0/8C45FFF8
2024-11-06 14:55:10.072 CET   : LOCATION:  PerformWalRecovery,
xlogrecovery.c:1689
2024-11-06 14:55:10.110 CET   : LOG:  00000: invalid record length at
0/8C8BFA20: expected at least 24, got 0
2024-11-06 14:55:10.110 CET   : LOCATION:  ReadRecord, xlogrecovery.c:3137
2024-11-06 14:55:10.113 CET   : LOG:  00000: redo done at 0/8C8BF908 system
usage: CPU: user: 0.00 s, system: 0.01 s, elapsed: 0.04 s
2024-11-06 14:55:10.113 CET   : LOCATION:  PerformWalRecovery,
xlogrecovery.c:1827
2024-11-06 14:55:10.119 CET   : LOG:  00000: checkpoint starting:
end-of-recovery immediate wait
2024-11-06 14:55:10.119 CET   : LOCATION:  LogCheckpointStart, xlog.c:6251
2024-11-06 14:55:10.181 CET   : LOG:  00000: checkpoint complete: wrote 957
buffers (5.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.041 s,
sync=0.011 s, total=0.063 s; sync files=303, longest=0.005 s, average=0.001
s; distance=4478 kB, estimate=4478 kB; lsn=0/8C8BFA20, redo lsn=0/8C8BFA20
2024-11-06 14:55:10.181 CET   : LOCATION:  LogCheckpointEnd, xlog.c:6351
2024-11-06 14:55:10.337 CET   : LOG:  00000: database system is ready to
accept connections
2024-11-06 14:55:10.337 CET   : LOCATION:  process_pm_child_exit,
postmaster.c:3110


PG Bug reporting form <noreply@postgresql.org> writes:
> When extending a varchar column with a gist index with a custom signal
> length the Postgres server crashes with a segmentation fault.

What this shows is that CompareOpclassOptions has never been tested
at all except with null inputs.  It's trying to call array_eq
without providing an FmgrInfo.

            regards, tom lane



On Wed, Nov 6, 2024 at 7:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
> > When extending a varchar column with a gist index with a custom signal
> > length the Postgres server crashes with a segmentation fault.
>
> What this shows is that CompareOpclassOptions has never been tested
> at all except with null inputs.  It's trying to call array_eq
> without providing an FmgrInfo.

Thank you for inviting me to the thread.  I'm on and will provide a fix shortly.

------
Regards,
Alexander Korotkov
Supabase



Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Thu, Nov 7, 2024 at 8:47 AM Tender Wang <tndrwang@gmail.com> wrote:
>> Based on Tom's analysis, I  provide a POC patch. I'm not sure if it is right
>> to use DEFAULT_COLLATION_OID in the patch.

> Thank you.  But I'm not sure about DEFAULT_COLLATION_OID.

I don't quite trust that either.  But since we only care about
equality, wouldn't it be OK to use C_COLLATION_OID?

> Therefore, we can compare two text[] just with datumIsEqual().
> Attached patch implements this.

I think this is nonsense.  What about toasted datums, or even
just short-header ones?  The one coming from an on-disk tuple
is pretty likely to be short-header for plausible sizes of
the options, but the one we just constructed in memory will
not be.

            regards, tom lane





Alexander Korotkov <aekorotkov@gmail.com> 于2024年11月8日周五 10:22写道:
On Fri, Nov 8, 2024 at 4:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > On Thu, Nov 7, 2024 at 8:47 AM Tender Wang <tndrwang@gmail.com> wrote:
> >> Based on Tom's analysis, I  provide a POC patch. I'm not sure if it is right
> >> to use DEFAULT_COLLATION_OID in the patch.
>
> > Thank you.  But I'm not sure about DEFAULT_COLLATION_OID.
>
> I don't quite trust that either.  But since we only care about
> equality, wouldn't it be OK to use C_COLLATION_OID?
>
> > Therefore, we can compare two text[] just with datumIsEqual().
> > Attached patch implements this.
>
> I think this is nonsense.  What about toasted datums, or even
> just short-header ones?  The one coming from an on-disk tuple
> is pretty likely to be short-header for plausible sizes of
> the options, but the one we just constructed in memory will
> not be.

You are correct.  I quickly skim trough the sources and didn't find a
function which compares detoasted contents of Datums excluding
headers.  So, yes, comparison using C-collation seems the most
reasonable option.

No objection from me. 


--
Thanks,
Tender Wang
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Fri, Nov 8, 2024 at 4:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think this is nonsense.  What about toasted datums, or even
>> just short-header ones?

> You are correct.  I quickly skim trough the sources and didn't find a
> function which compares detoasted contents of Datums excluding
> headers.

Right, there isn't one AFAIK.  For context, datumIsEqual() is mainly
meant to support cases such as comparison of Const nodes in equal(),
where

  a) false match is unacceptable, but false no-match will at worst
     cause some performance loss;
  b) the two inputs have probably come through similar paths (eg,
     both are fresh from the parser, or both were on-disk), so that
     any toasting effects will be the same.
Also
  c) for many of the callers, it'd be unsafe to perform disk
     access to allow accurate comparison of out-of-line values.

Problem (a) would be all right for this case, and (c) too,
but not so much (b).

You can certainly imagine writing a more aggressive routine
that can cope with detoasting, but I don't think we've yet
found a case where that seemed worth the trouble.  Partial
solutions such as "cope with short-header but not compressed
values" have even narrower possible use-cases.

> So, yes, comparison using C-collation seems the most
> reasonable option.

WFM.  But I'm concerned that you couldn't build a test case
where the comparison fails.  Seems like we need one, in view
of this bug having escaped detection for so long.

            regards, tom lane



On Fri, Nov 8, 2024 at 7:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > So, yes, comparison using C-collation seems the most
> > reasonable option.
>
> WFM.  But I'm concerned that you couldn't build a test case
> where the comparison fails.  Seems like we need one, in view
> of this bug having escaped detection for so long.

AFAICS, the only use case of CheckIndexCompatible() is ALTER TYPE.  In
that case we only serialize and deserialize opclass options without
any involvement of opclass or data type.  So, with the current usage
scenario CompareOpclassOptions() is only a paranoid check, but
CheckIndexCompatible() has to do it due to its general semantic.

Thus, if we have to build a test case where CompareOpclassOptions()
fails, I guess we need to write a new module for src/test/modules,
which would call CheckIndexCompatible() under different circumstances.

------
Regards,
Alexander Korotkov
Supabase



Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Fri, Nov 8, 2024 at 7:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> WFM.  But I'm concerned that you couldn't build a test case
>> where the comparison fails.  Seems like we need one, in view
>> of this bug having escaped detection for so long.

> AFAICS, the only use case of CheckIndexCompatible() is ALTER TYPE.  In
> that case we only serialize and deserialize opclass options without
> any involvement of opclass or data type.  So, with the current usage
> scenario CompareOpclassOptions() is only a paranoid check, but
> CheckIndexCompatible() has to do it due to its general semantic.

OK.  Well, let's settle for having a test case that at least runs
the code, even if only the success path is taken.

With the release freeze bearing down on us, do you want to wait
till after the releases to fix this?  It seems non-urgent given
how long it took for anyone to notice.

            regards, tom lane



On Fri, Nov 8, 2024 at 4:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > On Fri, Nov 8, 2024 at 7:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> WFM.  But I'm concerned that you couldn't build a test case
> >> where the comparison fails.  Seems like we need one, in view
> >> of this bug having escaped detection for so long.
>
> > AFAICS, the only use case of CheckIndexCompatible() is ALTER TYPE.  In
> > that case we only serialize and deserialize opclass options without
> > any involvement of opclass or data type.  So, with the current usage
> > scenario CompareOpclassOptions() is only a paranoid check, but
> > CheckIndexCompatible() has to do it due to its general semantic.
>
> OK.  Well, let's settle for having a test case that at least runs
> the code, even if only the success path is taken.
>
> With the release freeze bearing down on us, do you want to wait
> till after the releases to fix this?  It seems non-urgent given
> how long it took for anyone to notice.

Thank you for noticing this.  I also think there is no urgency.  And I
will wait till the next release.

------
Regards,
Alexander Korotkov
Supabase