Thread: Re: query hangs out

Re: query hangs out

From
Антон Глушаков
Date:
Thanks for the advice.
I tried to remove all indexes and constraints from the table - it did not help.
I have a copy of the data (before truncate) - I can test any hypothesis


вт, 20 мая 2025 г. в 18:25, Laurenz Albe <laurenz.albe@cybertec.at>:
On Tue, 2025-05-20 at 16:48 +0300, Антон Глушаков wrote:
> I encountered a very strange behavior.
> For any query (even a simple count(*) to one specific table (a small 30MB table with 3 indexes,
> without any specific data types - everything is standard out of the box vanilla Postgres) -
> the query hangs dead. Waited more than 24 hours - the query did not complete).
>
>
> Similarly, the vacuum process to the table hangs.
> Only Kill -9 with a full restart helps
>
> I get a backtrace, from it - I then examined the pg_multixact directory, which at the time of
> the problem had swelled to 900MB and had several thousand files.
> I excluded long and inactive transactions, as well as prepared statements.
>
> The workaround in the end was this - truncate the table (it was successful), then vacuum freeze
> each DB, and after that the files from pg_multixact disappeared.
>
> What could it be? vacuum\freeze\mulitxact  settings are default.
> At the same time, the value pg_database.datminmxid=1
> Could the problem with the hang be related to the many old files in pg_multixact ? (judging by the backtrace - yes)

I can't say for certain, but I have seen cases like that where index corruption sent
processes into an endless loop.  Next time you could try to rebuild the indexes.

Yours,
Laurenz Albe

Re: query hangs out

From
Антон Глушаков
Date:
Thank you.
I have investigated the logs to find the cause.
The only thing I found is that the problem started after a failover (we use patroni with synchronous replication).
The problem is unlikely on the hardware level, since we use a private cloud with enterprise-level hardware, and the checksum verification in postgres did not find any problems.
If this is a bug in postgres, and hackers are willing to investigate the problem, I can provide an archive with pgdata

ср, 21 мая 2025 г. в 18:09, Laurenz Albe <laurenz.albe@cybertec.at>:
On Wed, 2025-05-21 at 16:27 +0300, Антон Глушаков wrote:
> > What do you see at (47,13)?
>
> I have restored the data before the freeze, here is the content (47.13)
> SELECT * FROM heap_page_items(get_raw_page('"InboxState"', 47)) where lp = 13;
> -[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------
> lp          | 13
> lp_off      | 4632
> lp_flags    | 1
> lp_len      | 100
> t_xmin      | 136385644
> t_xmax      | 136385520
> t_field3    | 0
> t_ctid      | (47,13)
> t_infomask2 | 11
> t_infomask  | 8337
> t_hoff      | 32
> t_bits      | 1111011000000000
> t_oid       |
> t_data      | \x3e8a7c00000000000100000090877a16b4b308dd9460898784c4af2dab692693d29bdf78bcf5153401000000ad43d6a5273608dd947a749769b943ab3cd8020001000000

That tuple should be visible.

So I'd say you should killed the other tuple with heap_force_kill().

Yours,
Laurenz Albe