Thread: v12.0: segfault in reindex CONCURRENTLY
One of our servers crashed last night like this:

< 2019-10-10 22:31:02.186 EDT postgres >STATEMENT:  REINDEX INDEX CONCURRENTLY child.eric_umts_rnc_utrancell_hsdsch_eul_201910_site_idx
< 2019-10-10 22:31:02.399 EDT  >LOG:  server process (PID 29857) was terminated by signal 11: Segmentation fault
< 2019-10-10 22:31:02.399 EDT  >DETAIL:  Failed process was running: REINDEX INDEX CONCURRENTLY child.eric_umts_rnc_utrancell_hsdsch_eul_201910_site_idx
< 2019-10-10 22:31:02.399 EDT  >LOG:  terminating any other active server processes

ts=# \d+ child.eric_umts_rnc_utrancell_hsdsch_eul_201910_site_idx
Index "child.eric_umts_rnc_utrancell_hsdsch_eul_201910_site_idx"
 Column  |  Type   | Key? | Definition | Storage | Stats target
---------+---------+------+------------+---------+--------------
 site_id | integer | yes  | site_id    | plain   |
btree, for table "child.eric_umts_rnc_utrancell_hsdsch_eul_201910"

That's an index on a table partition, but not itself a child of a relkind=I
index.

Unfortunately, there was no core file, and I'm still trying to reproduce it.

I can't see that the table was INSERTed into during the reindex... But looks
like it was SELECTed from, and the report finished within 1sec of the crash:

(2019-10-10 22:30:50,485 - p1604 t140325365622592 - INFO): PID 1604 finished running report; est=None rows=552; cols=83;[...] duration:12

postgres=# SELECT log_time, pid, session_id, left(message,99), detail FROM postgres_log_2019_10_10_2200 WHERE pid=29857 OR (log_time BETWEEN '2019-10-10 22:31:02.18' AND '2019-10-10 22:31:02.4' AND NOT message~'crash of another') ORDER BY log_time LIMIT 9;

 2019-10-10 22:30:24.441-04 | 29857 | 5d9fe93f.74a1 | temporary file: path "base/pgsql_tmp/pgsql_tmp29857.0.sharedfileset/0.0", size 3096576 |
 2019-10-10 22:30:24.442-04 | 29857 | 5d9fe93f.74a1 | temporary file: path "base/pgsql_tmp/pgsql_tmp29857.0.sharedfileset/1.0", size 2809856 |
 2019-10-10 22:30:24.907-04 | 29857 | 5d9fe93f.74a1 | process 29857 still waiting for ShareLock on virtual transaction 30/103010 after 333.078 ms | Process holding the lock: 29671. Wait queue: 29857.
 2019-10-10 22:31:02.186-04 | 29857 | 5d9fe93f.74a1 | process 29857 acquired ShareLock on virtual transaction 30/103010 after 37611.995 ms |
 2019-10-10 22:31:02.186-04 | 29671 | 5d9fe92a.73e7 | duration: 50044.778 ms  statement: SELECT fn, sz FROM                             +|
                            |       |               |         (SELECT file_name fn, file_size_bytes sz,                                +|
 2019-10-10 22:31:02.399-04 |  1161 | 5d9cad9e.489  | terminating any other active server processes |
 2019-10-10 22:31:02.399-04 |  1161 | 5d9cad9e.489  | server process (PID 29857) was terminated by signal 11: Segmentation fault | Failed process was running: REINDEX INDEX CONCURRENTLY child.eric_umts_rnc_utrancell_hsdsch_eul_201910_site_idx

Justin
On Fri, Oct 11, 2019 at 07:44:46PM -0500, Justin Pryzby wrote:
> That's an index on a table partition, but not itself a child of a relkind=I
> index.

Interesting.  Testing with a partition tree, and indexes on leaves which do
not have dependencies with a parent, I cannot reproduce anything.  Perhaps
you have some concurrent operations going on?

> Unfortunately, there was no core file, and I'm still trying to reproduce it.

Forgot to set ulimit -c?  Having a backtrace would surely help.
--
Michael
Attachment
On Sun, Oct 13, 2019 at 06:06:43PM +0900, Michael Paquier wrote:
> On Fri, Oct 11, 2019 at 07:44:46PM -0500, Justin Pryzby wrote:
> > Unfortunately, there was no core file, and I'm still trying to reproduce it.
>
> Forgot to set ulimit -c?  Having a backtrace would surely help.

Fortunately (?) another server crashed last night.  (Doesn't appear to be
relevant, but this table has no inheritance/partition-ness.)

Looks like it's a race condition and dereferencing *holder=NULL.  The first
crash was probably the same bug, due to report query running during "reindex
CONCURRENTLY", and probably finished at nearly the same time as another locker.

Relevant code introduced here:

commit ab0dfc961b6a821f23d9c40c723d11380ce195a6
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   Tue Apr 2 15:18:08 2019 -0300

    Report progress of CREATE INDEX operations

Needs to be conditionalized (as anticipated by the comment):

+			if (holder)
				pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID,
											 holder->pid);

Core was generated by `postgres: postgres ts [local] REINDEX '.
Program terminated with signal 11, Segmentation fault.
#0  WaitForLockersMultiple (locktags=locktags@entry=0x1d30548, lockmode=lockmode@entry=5, progress=progress@entry=true) at lmgr.c:911
#1  0x00000000005c2ac8 in ReindexRelationConcurrently (relationOid=relationOid@entry=17618, options=options@entry=0) at indexcmds.c:3090
#2  0x00000000005c328a in ReindexIndex (indexRelation=<optimized out>, options=0, concurrent=<optimized out>) at indexcmds.c:2352
#3  0x00000000007657fe in standard_ProcessUtility (pstmt=pstmt@entry=0x1d05468, queryString=queryString@entry=0x1d046e0 "REINDEX INDEX CONCURRENTLY loaded_cdr_files_filename", context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0x1d05548, completionTag=completionTag@entry=0x7ffc05e6c0a0 "") at utility.c:787
#4  0x00007f21517204ef in pgss_ProcessUtility (pstmt=0x1d05468, queryString=0x1d046e0 "REINDEX INDEX CONCURRENTLY loaded_cdr_files_filename", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1d05548, completionTag=0x7ffc05e6c0a0 "") at pg_stat_statements.c:1006
#5  0x0000000000762816 in PortalRunUtility (portal=0x1d7a4e0, pstmt=0x1d05468, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=0x1d05548, completionTag=0x7ffc05e6c0a0 "") at pquery.c:1175
#6  0x0000000000763267 in PortalRunMulti (portal=portal@entry=0x1d7a4e0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x1d05548, altdest=altdest@entry=0x1d05548, completionTag=completionTag@entry=0x7ffc05e6c0a0 "") at pquery.c:1328
#7  0x0000000000763e45 in PortalRun (portal=<optimized out>, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x1d05548, altdest=0x1d05548, completionTag=0x7ffc05e6c0a0 "") at pquery.c:796
#8  0x000000000075ff45 in exec_simple_query (query_string=<optimized out>) at postgres.c:1215
#9  0x0000000000761212 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:4236
#10 0x0000000000483d02 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4431
#11 BackendStartup (port=0x1d2b340) at postmaster.c:4122
#12 ServerLoop () at postmaster.c:1704
#13 0x00000000006f0b1f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x1cff280) at postmaster.c:1377
#14 0x0000000000484c93 in main (argc=3, argv=0x1cff280) at main.c:228

bt f
#0  WaitForLockersMultiple (locktags=locktags@entry=0x1d30548, lockmode=lockmode@entry=5, progress=progress@entry=true) at lmgr.c:911
        holder = 0x0
        lockholders = 0x1d9b778
        holders = <optimized out>
        lc = 0x1d9bf80
        total = <optimized out>
        done = 1
#1  0x00000000005c2ac8 in ReindexRelationConcurrently (relationOid=relationOid@entry=17618, options=options@entry=0) at indexcmds.c:3090
        heapRelationIds = 0x1d30360
        indexIds = 0x1d303b0
        newIndexIds = <optimized out>
        relationLocks = <optimized out>
        lockTags = <optimized out>
        lc = 0x0
        lc2 = 0x0
        private_context = <optimized out>
        oldcontext = <optimized out>
        relkind = 105 'i'
        relationName = 0x0
        relationNamespace = 0x0
        ru0 = {tv = {tv_sec = 30592544, tv_usec = 7232025}, ru = {ru_utime = {tv_sec = 281483566645394, tv_usec = 75668733820930}, ru_stime = {tv_sec = 0, tv_usec = 30592272}, {ru_maxrss = 0, __ru_maxrss_word = 0}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 105, __ru_idrss_word = 105}, {ru_isrss = -926385342574214656, __ru_isrss_word = -926385342574214656}, {ru_minflt = 8924839, __ru_minflt_word = 8924839}, {ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 17618, __ru_nswap_word = 17618}, {ru_inblock = 139781327898864, __ru_inblock_word = 139781327898864}, {ru_oublock = 30430312, __ru_oublock_word = 30430312}, {ru_msgsnd = 139781327898864, __ru_msgsnd_word = 139781327898864}, {ru_msgrcv = 139781327898864, __ru_msgrcv_word = 139781327898864}, {ru_nsignals = 139781327898864, __ru_nsignals_word = 139781327898864}, {ru_nvcsw = 139781327898864, __ru_nvcsw_word = 139781327898864}, {ru_nivcsw = 139781327898864, __ru_nivcsw_word = 139781327898864}}}
        __func__ = "ReindexRelationConcurrently"
#2  0x00000000005c328a in ReindexIndex (indexRelation=<optimized out>, options=0, concurrent=<optimized out>) at indexcmds.c:2352
        state = {concurrent = true, locked_table_oid = 17608}
        indOid = 17618
        irel = <optimized out>
        persistence = 112 'p'
...
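[For reference, a rough sketch of the wait loop in question with the proposed guard in place.
This is approximated from the backtrace and the commit cited above rather than copied from
lmgr.c, so treat the exact names and placement as illustrative:]

	/* Sketch: inside WaitForLockersMultiple(), for each blocking virtual xact */
	while (VirtualTransactionIdIsValid(*lockholders))
	{
		/* If requested, publish who we're going to wait for. */
		if (progress)
		{
			PGPROC	   *holder = BackendIdGetProc(lockholders->backendId);

			/*
			 * The blocking backend may have exited between the validity
			 * check above and this lookup, in which case BackendIdGetProc()
			 * returns NULL -- hence the proposed conditional.
			 */
			if (holder)
				pgstat_progress_update_param(PROGRESS_WAITFOR_CURRENT_PID,
											 holder->pid);
		}

		/* Sleep until the blocking transaction is gone. */
		VirtualXactLock(*lockholders, true);
		lockholders++;

		if (progress)
			pgstat_progress_update_param(PROGRESS_WAITFOR_DONE, ++done);
	}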
On 2019-Oct-13, Justin Pryzby wrote:

> Looks like it's a race condition and dereferencing *holder=NULL.  The first
> crash was probably the same bug, due to report query running during "reindex
> CONCURRENTLY", and probably finished at nearly the same time as another locker.

Ooh, right, makes sense.  There's another spot with the same mistake ...
this patch should fix it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Sun, Oct 13, 2019 at 03:10:21PM -0300, Alvaro Herrera wrote:
> On 2019-Oct-13, Justin Pryzby wrote:
>
> > Looks like it's a race condition and dereferencing *holder=NULL.  The first
> > crash was probably the same bug, due to report query running during "reindex
> > CONCURRENTLY", and probably finished at nearly the same time as another locker.
>
> Ooh, right, makes sense.  There's another spot with the same mistake ...
> this patch should fix it.

I would maybe chop off the 2nd sentence, since conditionalizing indicates that
we do actually care.

+	 * If requested, publish who we're going to wait for.  This is not
+	 * 100% accurate if they're already gone, but we don't care.

Justin
On 2019-Oct-13, Justin Pryzby wrote:

> On Sun, Oct 13, 2019 at 03:10:21PM -0300, Alvaro Herrera wrote:
> > On 2019-Oct-13, Justin Pryzby wrote:
> >
> > > Looks like it's a race condition and dereferencing *holder=NULL.  The first
> > > crash was probably the same bug, due to report query running during "reindex
> > > CONCURRENTLY", and probably finished at nearly the same time as another locker.
> >
> > Ooh, right, makes sense.  There's another spot with the same mistake ...
> > this patch should fix it.
>
> I would maybe chop off the 2nd sentence, since conditionalizing indicates that
> we do actually care.
>
> +	 * If requested, publish who we're going to wait for.  This is not
> +	 * 100% accurate if they're already gone, but we don't care.

True.  And we can copy the resulting comment to the other spot.

(FWIW I expect the crash is possible not just in reindex but also in
CREATE INDEX CONCURRENTLY.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Sun, Oct 13, 2019 at 04:18:34PM -0300, Alvaro Herrera wrote:
> True.  And we can copy the resulting comment to the other spot.
>
> (FWIW I expect the crash is possible not just in reindex but also in
> CREATE INDEX CONCURRENTLY.)

I need to think about that, but shouldn't we have a way to reproduce that
case rather reliably with an isolation test?  The patch looks good to me;
these are also the two places I spotted yesterday after a quick lookup.  The
only other caller is isTempNamespaceInUse(), which does its thing correctly.
--
Michael
Attachment
On Mon, Oct 14, 2019 at 08:57:16AM +0900, Michael Paquier wrote:
> I need to think about that, but shouldn't we have a way to reproduce
> that case rather reliably with an isolation test?  The patch looks
> good to me; these are also the two places I spotted yesterday after a
> quick lookup.  The only other caller is isTempNamespaceInUse(), which
> does its thing correctly.

Actually, reindex-concurrently.spec stresses that, except that in order to
reproduce the failure we need to close the connection exactly in the waiting
loop before sending the progress report, but after looking at
VirtualTransactionIdIsValid.  Using a debugger and a simple breakpoint I can
easily reproduce the crash, but we'd need more to make that test case
deterministic, like a termination with the correct timing.

So, Alvaro, your patch looks good to me.  Could you apply it?
--
Michael
Attachment
On 2019-Oct-15, Michael Paquier wrote:

> So, Alvaro, your patch looks good to me.  Could you apply it?

Thanks, pushed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Oct 13, 2019 at 04:18:34PM -0300, Alvaro Herrera wrote:
> (FWIW I expect the crash is possible not just in reindex but also in
> CREATE INDEX CONCURRENTLY.)

FWIW, for sake of list archives, and for anyone running v12 hoping to avoid
crashing, I believe we hit this for DROP INDEX CONCURRENTLY, although I don't
have the backtrace to prove it.

Justin
On Wed, Oct 16, 2019 at 09:53:56AM -0300, Alvaro Herrera wrote:
> Thanks, pushed.

Thanks, Alvaro.
--
Michael
Attachment
On Wed, Oct 16, 2019 at 04:11:46PM -0500, Justin Pryzby wrote:
> On Sun, Oct 13, 2019 at 04:18:34PM -0300, Alvaro Herrera wrote:
>> (FWIW I expect the crash is possible not just in reindex but also in
>> CREATE INDEX CONCURRENTLY.)
>
> FWIW, for sake of list archives, and for anyone running v12 hoping to avoid
> crashing, I believe we hit this for DROP INDEX CONCURRENTLY, although I don't
> have the backtrace to prove it.

You may not have a backtrace, but I think that you are right:
WaitForLockers() gets called in index_drop() with progress reporting
enabled.  index_drop() would also be taken by REINDEX CONCURRENTLY
through performMultipleDeletions(), but we cannot know if it gets used
for REINDEX CONCURRENTLY or for DROP INDEX CONCURRENTLY as it goes
through the central deletion machinery, so we have to mark progress
reporting as true anyway.  Maybe that's worth a comment in index_drop
when calling WaitForLockers() because it is not actually that obvious,
say like that:

@@ -2157,7 +2157,10 @@ index_drop(Oid indexId, bool concurrent, bool concurrent_lock_mode)

         /*
          * Wait till every transaction that saw the old index state has
-         * finished.
+         * finished.  Progress reporting is enabled here for REINDEX
+         * CONCURRENTLY, but not for DROP INDEX CONCURRENTLY.  Track
+         * the progress through WaitForLockers() anyway, the information
+         * will not show up if using DROP INDEX CONCURRENTLY.
          */
         WaitForLockers(heaplocktag, AccessExclusiveLock, true);

Thoughts?
--
Michael
Attachment
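[For context on how the progress flag reaches the code that crashed: WaitForLockers() is
essentially a single-locktag convenience wrapper around WaitForLockersMultiple(), so the
boolean passed by index_drop() flows straight into the reporting path fixed above.  A
sketch of that wrapper, reconstructed from memory of lmgr.c, so the exact body may differ:]

	void
	WaitForLockers(LOCKTAG heaplocktag, LOCKMODE lockmode, bool progress)
	{
		/* Wrap the single tag in a one-element list and delegate. */
		List	   *l = list_make1(&heaplocktag);

		WaitForLockersMultiple(l, lockmode, progress);
		list_free(l);
	}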
On 2019-Oct-17, Michael Paquier wrote:

> You may not have a backtrace, but I think that you are right:
> WaitForLockers() gets called in index_drop() with progress reporting
> enabled.  index_drop() would also be taken by REINDEX CONCURRENTLY
> through performMultipleDeletions(), but we cannot know if it gets used
> for REINDEX CONCURRENTLY or for DROP INDEX CONCURRENTLY as it goes
> through the central deletion machinery, so we have to mark progress
> reporting as true anyway.  Maybe that's worth a comment in index_drop
> when calling WaitForLockers() because it is not actually that obvious,
> say like that:

Hmm, I wonder if it isn't the right solution to set 'progress' to false
in that spot, instead.  index_drop says it must only be called by the
dependency machinery; are we depending on that to pass-through the need
to update progress status?  I'm going over that code now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Oct 17, 2019 at 05:33:22AM -0300, Alvaro Herrera wrote:
> Hmm, I wonder if it isn't the right solution to set 'progress' to false
> in that spot, instead.  index_drop says it must only be called by the
> dependency machinery; are we depending on that to pass-through the need
> to update progress status?  I'm going over that code now.

pgstat_progress_end_command() is done for REINDEX CONCURRENTLY after the
concurrent drop, so it made sense to me to still report any PID REINDEX
CONC is waiting for at this stage.
--
Michael
Attachment
On 2019-Oct-17, Michael Paquier wrote:

> On Thu, Oct 17, 2019 at 05:33:22AM -0300, Alvaro Herrera wrote:
> > Hmm, I wonder if it isn't the right solution to set 'progress' to false
> > in that spot, instead.  index_drop says it must only be called by the
> > dependency machinery; are we depending on that to pass-through the need
> > to update progress status?  I'm going over that code now.
>
> pgstat_progress_end_command() is done for REINDEX CONCURRENTLY after
> the concurrent drop, so it made sense to me to still report any PID
> REINDEX CONC is waiting for at this stage.

Yeah, okay.  So let's talk about your proposed new comment.  First,
there are two spots where WaitForLockers is called in index_drop and
you're proposing to patch the second one.  I think we should patch the
first one and reference that one from the second one.  I propose
something like this (sorry for crude pasting):

	 * Note: the reason we use actual lock acquisition here, rather than
	 * just checking the ProcArray and sleeping, is that deadlock is
	 * possible if one of the transactions in question is blocked trying
	 * to acquire an exclusive lock on our table.  The lock code will
	 * detect deadlock and error out properly.
	 *
	 * Note: we report progress through WaitForLockers() unconditionally
	 * here, even though it will only be used by REINDEX CONCURRENTLY and
	 * not DROP INDEX CONCURRENTLY.
	 */

and then

	/*
	 * Wait till every transaction that saw the old index state has
-	 * finished.
+	 * finished.  See above about progress reporting.
	 */

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Oct 17, 2019 at 06:56:48AM -0300, Alvaro Herrera wrote:
> On 2019-Oct-17, Michael Paquier wrote:
>> pgstat_progress_end_command() is done for REINDEX CONCURRENTLY after
>> the concurrent drop, so it made sense to me to still report any PID
>> REINDEX CONC is waiting for at this stage.
>
> Yeah, okay.  So let's talk about your proposed new comment.  First,
> there are two spots where WaitForLockers is called in index_drop and
> you're proposing to patch the second one.  I think we should patch the
> first one and reference that one from the second one.  I propose
> something like this (sorry for crude pasting):
>
> <comments>

What you are proposing here sounds fine to me.  Perhaps you would prefer
to adjust the code yourself?
--
Michael
Attachment
On 2019-Oct-18, Michael Paquier wrote:

> What you are proposing here sounds fine to me.  Perhaps you would
> prefer to adjust the code yourself?

Sure thing, thanks, done :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Oct 18, 2019 at 07:30:37AM -0300, Alvaro Herrera wrote:
> Sure thing, thanks, done :-)

Thanks, Alvaro.
--
Michael