Thread: Re: query hangs
Thanks for the advice.
I tried removing all indexes and constraints from the table; it did not help.
I have a copy of the data (from before the truncate), so I can test any hypothesis.
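For what it's worth, this is how I look at the multixact horizon (mxid_age() is available since 9.5):

SELECT datname, datminmxid, mxid_age(datminmxid)
FROM pg_database
ORDER BY mxid_age(datminmxid) DESC;

(As noted below, it reports datminmxid = 1 for this database.)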
Tue, 20 May 2025 at 18:25, Laurenz Albe <laurenz.albe@cybertec.at>:
On Tue, 2025-05-20 at 16:48 +0300, Антон Глушаков wrote:
> I encountered a very strange behavior.
> For any query (even a simple count(*)) against one specific table (a small 30MB table with 3 indexes,
> no unusual data types, everything standard, out-of-the-box vanilla Postgres),
> the query hangs dead. I waited more than 24 hours; the query did not complete.
>
>
> Similarly, VACUUM on the table hangs.
> Only kill -9 followed by a full restart helps.
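> (While it hangs, the backend's wait event can be inspected with something like:
> SELECT pid, state, wait_event_type, wait_event, query
> FROM pg_stat_activity WHERE state <> 'idle';)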
>
> I took a backtrace, and based on it I examined the pg_multixact directory, which at the time of
> the problem had grown to 900MB and contained several thousand files.
> I ruled out long-running and inactive transactions, as well as prepared transactions.
>
> In the end, the workaround was this: truncate the table (that succeeded), then VACUUM FREEZE
> each database, after which the files in pg_multixact disappeared.
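> (Concretely, that was something like vacuumdb --all --freeze, i.e. VACUUM (FREEZE); run in every database.)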
>
> What could it be? The vacuum/freeze/multixact settings are all at their defaults.
> At the same time, pg_database.datminmxid = 1.
> Could the hang be related to the many old files in pg_multixact? (Judging by the backtrace, yes.)
I can't say for certain, but I have seen cases like that where index corruption sent
processes into an endless loop. Next time you could try to rebuild the indexes.
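Before that, the amcheck extension can often detect such corruption without a rebuild; something along these lines, assuming plain btree indexes, with the affected table's name substituted:

CREATE EXTENSION IF NOT EXISTS amcheck;
-- verify every index on the table, and that all heap tuples are indexed
SELECT i.indexrelid::regclass AS indexname,
       bt_index_check(i.indexrelid::regclass, true)
FROM pg_index i
WHERE i.indrelid = 'yourtable'::regclass;  -- placeholder table name

(bt_index_check() with the heapallindexed argument needs v11 or later.)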
Yours,
Laurenz Albe
Thank you.
I investigated the logs to find the cause.
The only thing I found is that the problem started after a failover (we use Patroni with synchronous replication).
A hardware-level problem is unlikely, since we run in a private cloud on enterprise-grade hardware, and checksum verification in Postgres did not find any problems.
If this is a bug in Postgres and the hackers are willing to investigate it, I can provide an archive of the pgdata directory.
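(By checksum verification I mean the cluster-wide data checksums; with the server cleanly shut down they can be re-verified with something like:

pg_checksums --check -D /path/to/pgdata

with /path/to/pgdata pointing at the data directory.)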
Wed, 21 May 2025 at 18:09, Laurenz Albe <laurenz.albe@cybertec.at>:
On Wed, 2025-05-21 at 16:27 +0300, Антон Глушаков wrote:
> > What do you see at (47,13)?
>
> I restored the data from before the freeze; here is the content of (47,13):
> SELECT * FROM heap_page_items(get_raw_page('"InboxState"', 47)) where lp = 13;
> -[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------
> lp | 13
> lp_off | 4632
> lp_flags | 1
> lp_len | 100
> t_xmin | 136385644
> t_xmax | 136385520
> t_field3 | 0
> t_ctid | (47,13)
> t_infomask2 | 11
> t_infomask | 8337
> t_hoff | 32
> t_bits | 1111011000000000
> t_oid |
> t_data | \x3e8a7c00000000000100000090877a16b4b308dd9460898784c4af2dab692693d29bdf78bcf5153401000000ad43d6a5273608dd947a749769b943ab3cd8020001000000
That tuple should be visible.
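If you want to double-check, pageinspect in v13 and later can decode those flag bits:

SELECT * FROM heap_tuple_infomask_flags(8337, 11);

If I read them correctly, t_infomask = 8337 includes HEAP_XMAX_LOCK_ONLY, so the xmax only locked the tuple and never deleted it.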
So I'd say you should kill the other tuple with heap_force_kill().
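heap_force_kill() ships with the pg_surgery extension (v14 and later); roughly:

CREATE EXTENSION pg_surgery;
-- the TID below is a placeholder; use the item pointer of the offending tuple
SELECT heap_force_kill('"InboxState"'::regclass, ARRAY['(47,12)'::tid]);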
Yours,
Laurenz Albe