On 2022-Jan-10, Andrey Borodin wrote:
> 3. The tuple seems to be updated in a high-contention concurrency
> trigger function, autovacuum keeks in ~20-30 seconds after the message
> in logs
Hmm, I bet this is related.
> [ 2022-01-10 09:07:17.671 MSK [unknown],????,????_????s,310759,XX001 ]:ERROR:  t_xmin 696079792 is uncommitted in
tuple(1419011,109) to be updated in table "????s_statistics"
 
> [ 2022-01-10 09:07:17.671 MSK [unknown],????,????_????s,310759,XX001 ]:CONTEXT:  SQL statement "UPDATE
????_????s.????s_statisticsos
 
>              SET ????_????_found_ts = COALESCE(os.????_????_found_ts, NOW()),
>                  last_????_found_ts = NOW(),
>                  num_????s = os.num_????s + 1
>              WHERE ????_id = NEW.????_id"
>         PL/pgSQL function statistics__update_from_new_????() line 3 at SQL statement
> [ 2022-01-10 09:07:17.671 MSK [unknown],????,????_????s,310759,XX001 ]:STATEMENT:  
>         INSERT INTO ????_????s.????s_????s AS ????s 
> 
> 4. t_xmin is relatevely new, not ancient
> 5. pageinspect shows dead tuple after some time
> 6. no suspicious activity in logs nearby
> 7. vacuum (disable_page_skipping) and repack of indexes did not change anything
> 
> 
> I suspect this can be relatively new concurrency stuff. At least I never saw this before on clusters with clean
amcheckand heapcheck results.
 
Ah.  I've been thinking that it'd be some very old tuple that is in
trouble, but that seems to be proven false.  I think we should examine
the affected tuples more closely while they're in the state that causes
the problem.  Can you set up things so that pageinspect's
heap_page_items() is run on the broken page, before the problem
disappears?  Maybe extract the page number from the log line, have
get_raw_page() store the page in a separate table, so that we can run
heap_page_items() at leisure later.  I would also grep through
pg_waldump output to see what changes have been done to that tuple.
Maybe the reason the problem fixes itself is that something else deletes
the tuple.
> Alvaro, did you observe this on binaries from August 13 minor release or older?
Well, the only reports I have of this problem are with the original
error message that didn't give any clues at to what the problem was or
where to look for it, so I don't know if the xmin was recent or not.
-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/