Re: pgsql: Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURR - Mailing list pgsql-committers

From Tomas Vondra
Subject Re: pgsql: Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURR
Date
Msg-id df7b4c0b-7d92-f03f-75c4-9e08b269a716@enterprisedb.com
Whole thread Raw
In response to pgsql: Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURR  (Noah Misch <noah@leadboat.com>)
Responses Re: pgsql: Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURR
List pgsql-committers
On 10/24/21 03:40, Noah Misch wrote:
> Avoid race in RelationBuildDesc() affecting CREATE INDEX CONCURRENTLY.
> 
> CIC and REINDEX CONCURRENTLY assume backends see their catalog changes
> no later than each backend's next transaction start.  That failed to
> hold when a backend absorbed a relevant invalidation in the middle of
> running RelationBuildDesc() on the CIC index.  Queries that use the
> resulting index can silently fail to find rows.  Fix this for future
> index builds by making RelationBuildDesc() loop until it finishes
> without accepting a relevant invalidation.  It may be necessary to
> reindex to recover from past occurrences; REINDEX CONCURRENTLY suffices.
> Back-patch to 9.6 (all supported versions).
> 
> Noah Misch and Andrey Borodin, reviewed (in earlier versions) by Andres
> Freund.
> 
> Discussion: https://postgr.es/m/20210730022548.GA1940096@gust.leadboat.com
> 

Unfortunately, this seems to have broken CLOBBER_CACHE_ALWAYS builds. 
Since this commit, initdb never completes due to infinite retrying over 
and over (on the first RelationBuildDesc call).

We have a CLOBBER_CACHE_ALWAYS buildfarm machine "avocet", and that 
currently looks like this (top):

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ 
COMMAND
  2626 buildfa+  20   0  202888  21416  20084 R 98.34 0.531 151507:16 

/home/buildfarm/avocet/buildroot/REL9_6_STABLE/pgsql.build/tmp_install/home/buildfarm/avocet/buildroot/REL9_6_STABLE/inst/bin/postgres

--boot -x1 -F

Yep, that's 151507 minutes, i.e. 104 days in initdb :-/


I haven't looked at this very closely yet, but it seems the whole 
problem is we do this at the very beginning:

   in_progress_list[in_progress_offset].invalidated = false;

   /*
    * find the tuple in pg_class corresponding to the given relation id
    */
   pg_class_tuple = ScanPgRelation(targetRelId, true, false);

which seems entirely self-defeating, because ScanPgRelation acquires a 
lock (on pg_class), which accepts invalidations, which invalidates 
system caches (in clobber_cache_always), which sets promptly sets

   in_progress_list[in_progress_offset].invalidated = false;

guaranteeing an infinite loop.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-committers by date:

Previous
From: Robert Haas
Date:
Subject: pgsql: Remove MaxBackends variable in favor of GetMaxBackends() functio
Next
From: noreply@postgresql.org
Date:
Subject: pgsql: Tag refs/tags/REL_12_10 was created