Thread: heap_inplace_lock vs. autovacuum w/ LOCKTAG_TUPLE

heap_inplace_lock vs. autovacuum w/ LOCKTAG_TUPLE

From
Noah Misch
Date:
intra-grant-inplace-db.spec got a novel failure today:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sarus&dt=2024-10-26%2014%3A08%3A58

The isolationtester_waiting query is supposed to detect that step vac2 is
blocked.  vac2 didn't finish within the timeout, but isolationtester_waiting
never considered it blocked.  Most-relevant postmaster.log lines:

===
2024-10-26 14:22:51.791 UTC [2444667:6] isolation/intra-grant-inplace-db/s1 LOG:  statement: BEGIN;
2024-10-26 14:22:51.791 UTC [2444667:7] isolation/intra-grant-inplace-db/s1 LOG:  statement: 
        GRANT TEMP ON DATABASE isolation_regression TO regress_temp_grantee;
    
2024-10-26 14:22:51.791 UTC [2444668:6] isolation/intra-grant-inplace-db/s2 LOG:  statement: VACUUM (FREEZE);
2024-10-26 14:22:51.813 UTC [2444663:1] LOG:  skipping analyze of "pg_attribute" --- lock not available
2024-10-26 14:22:51.815 UTC [2444666:7] isolation/intra-grant-inplace-db/control connection LOG:  execute
isolationtester_waiting:SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{2444667,2444668,2444669}')
 
2024-10-26 14:22:51.815 UTC [2444666:8] isolation/intra-grant-inplace-db/control connection DETAIL:  Parameters: $1 =
'2444668'
[... omitting most other lines from PID 2444666: >37000 isolationtester_waiting calls ...]
2024-10-26 14:22:51.851 UTC [2444663:2] LOG:  skipping analyze of "pg_class" --- lock not available
2024-10-26 14:22:51.892 UTC [2444663:3] LOG:  skipping analyze of "pg_depend" --- lock not available
2024-10-26 14:23:52.833 UTC [2444663:4] ERROR:  canceling autovacuum task
2024-10-26 14:23:52.833 UTC [2444663:5] CONTEXT:  while updating tuple (0,20) in relation "pg_database"
2024-10-26 14:24:52.768 UTC [2447791:1] ERROR:  canceling autovacuum task
2024-10-26 14:24:52.768 UTC [2447791:2] CONTEXT:  while updating tuple (0,20) in relation "pg_database"
2024-10-26 14:25:52.777 UTC [2451018:1] ERROR:  canceling autovacuum task
2024-10-26 14:25:52.777 UTC [2451018:2] CONTEXT:  while updating tuple (0,20) in relation "pg_database"
2024-10-26 14:26:52.789 UTC [2454071:1] ERROR:  canceling autovacuum task
2024-10-26 14:26:52.789 UTC [2454071:2] CONTEXT:  while updating tuple (0,20) in relation "pg_database"
2024-10-26 14:27:16.322 UTC [2442306:1] LOG:  checkpoint starting: time
2024-10-26 14:27:52.798 UTC [2457311:1] ERROR:  canceling autovacuum task
2024-10-26 14:27:52.798 UTC [2457311:2] CONTEXT:  while updating tuple (0,20) in relation "pg_database"
2024-10-26 14:28:51.755 UTC [2444666:37889] isolation/intra-grant-inplace-db/control connection LOG:  execute
isolationtester_waiting:SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{2444667,2444668,2444669}')
 
2024-10-26 14:28:51.755 UTC [2444666:37890] isolation/intra-grant-inplace-db/control connection DETAIL:  Parameters: $1
='2444668'
 
2024-10-26 14:28:51.810 UTC [2463672:1] [unknown] LOG:  connection received: host=[local]
2024-10-26 14:28:51.812 UTC [2444668:7] isolation/intra-grant-inplace-db/s2 ERROR:  canceling statement due to user
request
2024-10-26 14:28:51.812 UTC [2444668:8] isolation/intra-grant-inplace-db/s2 CONTEXT:  while scanning block 0 of
relation"pg_catalog.pg_database"
 
2024-10-26 14:28:51.812 UTC [2444668:9] isolation/intra-grant-inplace-db/s2 STATEMENT:  VACUUM (FREEZE);
===

I would have expected these locking-relevant events:

- GRANT stamped a pg_database tuple with an xmax.
- auto-analyze sought to update the same tuple, so it acquired LOCKTAG_TUPLE
  and entered XactLockTableWait(xmax-from-GRANT).
- Non-auto VACUUM from step vac2 blocked on acquiring LOCKTAG_TUPLE.
  pg_isolation_test_session_is_blocked() considers this not blocked, since
  blocking on autovacuum doesn't count for test output purposes.
- Deadlock detector notices auto-analyze is blocking vac2, hence "canceling
  autovacuum task" after 1s.
- vac2 finishes its LOCKTAG_TUPLE acquisition and enters
  XactLockTableWait(xmax-from-GRANT).
- pg_isolation_test_session_is_blocked() now considers vac2 blocked.

The timeout suggests some subset of that didn't happen.  It's odd that a new
auto-analyze starts every minute, each of which exits due to blocking vac2.  I
likely got something wrong in commit aac2c9b or its parent (2024-09-24).  I'll
work on reproducing this.

Thanks,
nm



Re: heap_inplace_lock vs. autovacuum w/ LOCKTAG_TUPLE

From
Alexander Lakhin
Date:
Hello Noah,

27.10.2024 07:09, Noah Misch wrote:
> On Sat, Oct 26, 2024 at 11:49:36AM -0700, Noah Misch wrote:
>> intra-grant-inplace-db.spec got a novel failure today:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sarus&dt=2024-10-26%2014%3A08%3A58
>>
>> The isolationtester_waiting query is supposed to detect that step vac2 is
>> blocked.  vac2 didn't finish within the timeout, but isolationtester_waiting
>> never considered it blocked.
>> ... work on reproducing this.

FWIW, there was a similar failure in August: [1], and I also could not
reproduce that locally, yet wrote a preliminary analysis at [2] in the
Unsorted section, in the hope to see it again and continue investigation.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=iguana&dt=2024-08-29%2013%3A57%3A57
[2] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures

Best regards,
Alexander