Thread: BUG #17330: EXPLAIN hangs and very long query plans
The following bug has been logged on the website:

Bug reference: 17330
Logged by: Strahinja Kustudic
Email address: strahinjak@nordeus.com
PostgreSQL version: 10.19
Operating system: CentOS 7

Description:

We had an issue with one of our production databases running Postgres 10.19 on CentOS 7. One of our most frequently executed queries started taking 3000ms+ to plan, while execution took only 1ms-3ms; when everything is working normally, planning takes around 1ms or less. On the replica, EXPLAIN (without ANALYZE, just EXPLAIN!) would not even finish: it would just hang forever. To be precise, we were running 10.10 at the time, but upgrading to 10.19 didn't help. We tried running ANALYZE on the whole database, but that didn't help either. In the end, what helped was running pg_repack on the whole DB. This was strange, because I thought the query planner uses table statistics and the index schema to decide on a plan, so it shouldn't need table/index data, but I don't know PG internals, so I might be wrong.

Because we have backups created by WAL-G, I restored the base backup with WALs from that time and managed to recreate the issue in Docker on my own laptop: EXPLAIN runs forever and never finishes, like the issue we had on the replica. This means I have a 100% repro of the issue and I can do whatever you tell me to do to find out what caused it. I already tried enabling `log_error_verbosity = verbose`, but the server doesn't print anything while EXPLAIN is running; I haven't tried leaving it to run for hours (but I might).

Is this a known bug in PG 10.x? Or would you like me to share more details about the table definitions, the exact query, or anything else you need?
On Thu, Dec 9, 2021 at 10:28 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> The following bug has been logged on the website:
> Bug reference: 17330
> Logged by: Strahinja Kustudic
> Email address: strahinjak@nordeus.com
> PostgreSQL version: 10.19
> Operating system: CentOS 7
> Description:
> The issue is that in Docker EXPLAIN runs forever, it never
> finishes, like the issue we had on the replica. This means I have a 100%
> repro of the issue and I can do whatever you tell me to do to find out what
> caused it.
Is this 100% CPU bound? 'perf top -p <pid>' would probably be my first tool to use here.
Also, it would be nice to know if an upgrade to a newer major version fixed the problem.
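To attach perf to the right process, the backend PID can be looked up from another session; a sketch (the exact query-text filter is an assumption):

```sql
-- Find the pid of the backend stuck in planning, so it can be profiled
-- from a shell with: perf top -p <pid>
SELECT pid, state, now() - query_start AS running_for, query
FROM pg_stat_activity
WHERE query ILIKE 'explain%';
```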
Cheers,
Jeff
On Thu, Dec 9, 2021 at 7:28 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> We had an issue with one of our production databases running Postgres 10.19
> on CentOS 7. One of the most often executed queries started having very long
> query plans of 3000ms+, while the execution duration was 1ms-3ms. Query
> plans when everything is working regularly were around 1ms or less. While on
> the replica it didn't even want to finish EXPLAIN (without ANALYZE, just
> EXPLAIN!). EXPLAIN would just hang forever. To be precise we were running
> 10.10 at that time, but upgrading to 10.19 didn't help. We tried running
> ANALYZE on the whole database, but that didn't help. In the end, what helped
> is running pg_repack on the whole DB. This was strange because I thought
> that the query planner is using table statistics and the index schema to
> determine what plan to run, it shouldn't need table/index data to make a
> plan, but I don't know PG internals, so I might be wrong.

Sounds like a problem with get_actual_variable_range(), which can scan indexes at plan time to determine minimum or maximum values.

This actually has been improved quite a bit since Postgres 10. So, as Jeff said, it seems like you might benefit from upgrading to a newer major version; v11 has improved things in this exact area.

--
Peter Geoghegan
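The get_actual_variable_range() behavior can be sketched with a made-up repro (table and column names are hypothetical; exact timings will vary). The planner probes the index for the actual minimum/maximum of a column used in a range qual, and a mass delete leaves it wading through dead entries at one end of the index:

```sql
-- Hypothetical repro: a big table, then a mass delete at the high end
-- of an indexed column, with no vacuum yet.
CREATE TABLE t (id bigint PRIMARY KEY, payload text);
INSERT INTO t SELECT g, 'x' FROM generate_series(1, 10000000) g;
DELETE FROM t WHERE id > 5000000;   -- millions of dead entries at the top

-- Before autovacuum cleans up, planning a query with a range qual on id
-- makes get_actual_variable_range() scan past the dead entries, so even
-- plain EXPLAIN can be slow:
EXPLAIN SELECT * FROM t WHERE id > 4999999;

-- VACUUM (or REINDEX) removes the dead entries and planning is fast again:
VACUUM t;
```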
On Fri, Dec 10, 2021 at 2:31 AM Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Dec 9, 2021 at 10:28 AM PG Bug reporting form <noreply@postgresql.org> wrote:
>> The following bug has been logged on the website:
>> Bug reference: 17330
>> Logged by: Strahinja Kustudic
>> Email address: strahinjak@nordeus.com
>> PostgreSQL version: 10.19
>> Operating system: CentOS 7
>> Description:
>> The issue is that in Docker EXPLAIN runs forever, it never
>> finishes, like the issue we had on the replica. This means I have a 100%
>> repro of the issue and I can do whatever you tell me to do to find out what
>> caused it.
>
> Is this 100% CPU bound? 'perf top -p <pid>' would probably be my first tool to use here.
I forgot to mention this important detail. When it happened on the production DB it was CPU bound. It was about 70% user time and 30% system time. There was no iowait because most of the DB was in page cache.
Check out the screenshot of the CPU usage on the production DB in the attachment. Here is the timeline:
- 11:00 we started deleting a few million rows
- 11:34 delete finished and instantly CPU went to ~70% user and ~30% system time
- 12:55 we configured PgBouncer to allow only a few queries in parallel, after which system time went down, but user time was still 100%
On my Docker instance, when I execute EXPLAIN it starts reading a lot of data. The indexes of the biggest table the query reads total 50GB, so my guess is that it is reading those indexes.
I also let EXPLAIN in Docker run to completion and it took almost 500 seconds. Can index corruption cause this?
Regards,
Strahinja
[Attachment: CPU usage screenshot]
On Fri, Dec 10, 2021 at 2:36 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Sounds like a problem with get_actual_variable_range(), which can scan
> indexes at plan time to determine minimum or maximum values.
> This actually has been improved quite a bit since Postgres 10. So as
> Jeff said, seems like you might benefit from upgrading to a newer
> major version. v11 has improved things in this exact area.
On my Docker instance when I execute EXPLAIN it starts reading a lot of data. The indexes of the biggest table the query reads are 50GB, so my guess is that it reads those indexes.
I let EXPLAIN in Docker run to completion; it took almost 500 seconds and was reading data the whole time. After I reindexed the biggest table, EXPLAIN finished instantly. Can index corruption cause this?
Note that this started happening in production after we deleted a few million rows from the biggest table.
Regards,
Strahinja
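As an aside on the corruption question: the contrib module amcheck (available since Postgres 10) can verify B-tree invariants. It detects structural corruption, not bloat, so a clean result here combined with slow planning points at dead entries rather than a broken index. A sketch, using one of the index names from the plan later in the thread:

```sql
CREATE EXTENSION IF NOT EXISTS amcheck;
-- Checks B-tree invariants; raises an error if the index is corrupt,
-- returns silently if the structure is sound.
SELECT bt_index_check('mail_pkey'::regclass);
```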
Hi
On Fri, Dec 10, 2021 at 10:25, Strahinja Kustudić <strahinjak@nordeus.com> wrote:
> On Fri, Dec 10, 2021 at 2:36 AM Peter Geoghegan <pg@bowt.ie> wrote:
>> Sounds like a problem with get_actual_variable_range(), which can scan
>> indexes at plan time to determine minimum or maximum values.
>> This actually has been improved quite a bit since Postgres 10. So as
>> Jeff said, seems like you might benefit from upgrading to a newer
>> major version. v11 has improved things in this exact area.
>
> On my Docker instance when I execute EXPLAIN it starts reading a lot of data. The indexes of the biggest table the query reads are 50GB, so my guess is that it reads those indexes.
>
> I allowed EXPLAIN in Docker to finish and it took almost 500 seconds and it was reading data all the time. After I reindexed the biggest table, EXPLAIN finished instantly. Can the index corruption cause this?
The index can be bloated.
> Note that this started happening in production after we deleted a few million rows from the biggest table.
It is maybe the same issue.
Regards
Pavel
On Fri, Dec 10, 2021 at 10:53 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Hi
>
>> I allowed EXPLAIN in Docker to finish and it took almost 500 seconds and it was reading data all the time. After I reindexed the biggest table, EXPLAIN finished instantly. Can the index corruption cause this?
>
> The index can be bloated.
Is bloat just enough to make it slower, or the index needs to be corrupted?
>> Note that this started happening in production after we deleted a few million rows from the biggest table.
>
> It is maybe the same issue.
Looks similar because there were deletes after which query plans went to a few seconds.
I just upgraded Postgres to 12 with pg_upgrade and ran ANALYZE on the whole DB. The issue is similar, but EXPLAIN now finishes in 19 seconds instead of 500 seconds, which means the issue is still there. Here is the whole EXPLAIN ANALYZE output (I added ANALYZE so that it displays the planning time):
Sort (cost=1946.04..1946.16 rows=46 width=642) (actual time=10.929..10.932 rows=20 loops=1)
Sort Key: m.delivery_start DESC
Sort Method: quicksort Memory: 44kB
-> Nested Loop Left Join (cost=10.19..1944.77 rows=46 width=642) (actual time=2.305..10.401 rows=20 loops=1)
Join Filter: ((c.game_name = csu.game_name) AND (um.user_id = csu.user_id) AND (c.id = csu.campaign_id))
Filter: ((c.id IS NULL) OR (c.active AND (now() >= c.campaign_start) AND ((now() <= c.campaign_end) OR (c.campaign_end IS NULL)) AND ((c.segment_ids <@ '{5632,-2,2,5634,9731,5640,5641,9738,13,3599,5648,5655,13343,5
664,13344,13345,13346,13347,13348,13350,13351,13352,5673,13353,13354,1067,13355,13356,13357,13358,13359,5680,13360,6705,13361,13362,13363,13364,13365,13366,5687,13367,13368,13369,20538,13370,13371,13372,13373,6718,13374,190
07,19008,13378,5699,13379,13380,19013,3142,3143,19015,13385,5707,6731,20555,19019,3149,19021,3150,591,19023,592,19024,6737,19027,19028,5717,6743,17496,19033,13402,19035,6750,17502,19039,17504,19041,19043,19044,17508,6757,17
510,19047,104,19048,17512,17514,7788,6764,621,19053,623,19055,7792,625,19059,7796,1141,19061,19063,7800,19064,5241,19067,19068,3713,19073,1667,19075,1668,19079,19081,19083,19084,5262,19087,19088,8338,8339,7828,19093,20631,1
9095,20632,20633,12955,19099,19101,19103,19104,19107,19108,19113,19115,8876,19119,3760,19121,19123,19124,3785,4814,16605,4324,16622,16623,21253,21254,4359,21258,5901,5915,4900,22832,22835,22836,9013,3895,3902,1865,1866,1379
,1380,4978,7027,13685,14203,2428,2429,21887,21888,4485,22932,22933,21911,21417,432,434,20418,968,969,470,7642,991,992,10209,1512,1513,23024,9201,9207,23032,9208,9214}'::integer[]) OR (csu.user_id = '-XXXXXXXXXXXXXXXXXX'::b
igint))))
Rows Removed by Filter: 11
-> Nested Loop (cost=10.03..1910.89 rows=163 width=696) (actual time=2.029..10.092 rows=31 loops=1)
-> Hash Left Join (cost=9.46..695.80 rows=61 width=684) (actual time=0.729..5.991 rows=55 loops=1)
Hash Cond: ((m.game_name = c.game_name) AND (m.campaign_id = c.id))
-> Index Scan using mail_pkey on mail m (cost=0.27..686.29 rows=61 width=552) (actual time=0.297..5.524 rows=55 loops=1)
Filter: (active AND (game_name = 'topeleven'::game_name) AND (now() >= delivery_start) AND ((now() <= delivery_end) OR (delivery_end IS NULL)))
Rows Removed by Filter: 97
-> Hash (cost=7.83..7.83 rows=91 width=136) (actual time=0.411..0.412 rows=91 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 26kB
-> Seq Scan on campaign c (cost=0.00..7.83 rows=91 width=136) (actual time=0.265..0.371 rows=91 loops=1)
Filter: (game_name = 'topeleven'::game_name)
Rows Removed by Filter: 55
-> Index Scan using user_mail_pkey on user_mail um (cost=0.57..19.88 rows=4 width=24) (actual time=0.074..0.074 rows=1 loops=55)
Index Cond: ((user_id = '-XXXXXXXXXXXXXXXXXX'::bigint) AND (game_name = 'topeleven'::game_name) AND (mail_id = m.id))
Filter: ((m.mail_type <> 'question'::mail_type) OR (mail_status <> 'completed'::mail_status))
-> Materialize (cost=0.15..28.18 rows=1 width=16) (actual time=0.009..0.009 rows=0 loops=31)
-> Index Only Scan using campaign_static_users_pkey on campaign_static_users csu (cost=0.15..28.17 rows=1 width=16) (actual time=0.266..0.266 rows=0 loops=1)
Index Cond: ((user_id = '-XXXXXXXXXXXXXXXXXX'::bigint) AND (game_name = 'topeleven'::game_name))
Heap Fetches: 0
Planning Time: 19408.862 ms
Execution Time: 10.978 ms
(27 rows)
Just to mention again, reindexing the biggest table fixes the issue, but I'm more curious why did this happen and how can we prevent it from happening again? Or is it already fixed so that it cannot happen in the future?
Regards,
Strahinja
On Fri, Dec 10, 2021 at 16:52, Strahinja Kustudić <strahinjak@nordeus.com> wrote:
> Is bloat just enough to make it slower, or the index needs to be corrupted?
Corruption and bloat are two very different things. A bloated index is slower; a corrupted index returns bad results.
> I just upgraded Postgres to 12 with pg_upgrade, did ANALYZE of the whole DB and the issue is similar, but EXPLAIN now finishes in 19 seconds instead of 500 seconds, which means the issue is still there.
>
> Just to mention again, reindexing the biggest table fixes the issue, but I'm more curious why did this happen and how can we prevent it from happening again? Or is it already fixed so that it cannot happen in the future?
Maybe VACUUM can solve this issue too.
A massive delete is a common issue - you can try to use partitioning, and then, instead of a massive delete, use DROP on the partition.
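A sketch of that approach with declarative partitioning (available since Postgres 10; the table name and ranges below are hypothetical):

```sql
-- Range-partition by the column you delete by (e.g. a date), then drop
-- whole partitions instead of running a mass DELETE:
CREATE TABLE events (id bigint, created_at date NOT NULL)
    PARTITION BY RANGE (created_at);
CREATE TABLE events_2021_11 PARTITION OF events
    FOR VALUES FROM ('2021-11-01') TO ('2021-12-01');
CREATE TABLE events_2021_12 PARTITION OF events
    FOR VALUES FROM ('2021-12-01') TO ('2022-01-01');

-- Dropping a partition is nearly instant and leaves no dead index
-- entries behind:
DROP TABLE events_2021_11;
-- Or keep the data around, detached from the parent:
-- ALTER TABLE events DETACH PARTITION events_2021_12;
```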
Regards
Pavel
Strahinja Kustudić <strahinjak@nordeus.com> writes:
> On Fri, Dec 10, 2021 at 10:53 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> pá 10. 12. 2021 v 10:25 odesílatel Strahinja Kustudić <strahinjak@nordeus.com> napsal:
>>> I allowed EXPLAIN in Docker to finish and it took almost 500 seconds and
>>> it was reading data all the time. After I reindexed the biggest table,
>>> EXPLAIN finished instantly. Can the index corruption cause this?
>> The index can be bloated.
> Is bloat just enough to make it slower, or the index needs to be corrupted?

bloat != corruption. The problem you're dealing with here is a whole lot of dead index entries at the very end of the index range (could be either the lowest or highest entries). The planner is wasting time scanning through those to find the extremal non-dead value for a range estimate. REINDEX got rid of the dead entries; a VACUUM would have done as well.

As noted upthread, more recent PG versions are a bit smarter about such cases.

regards, tom lane
On Fri, Dec 10, 2021 at 5:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> bloat != corruption. The problem you're dealing with here is a whole lot
> of dead index entries at the very end of the index range (could be either
> the lowest or highest entries). The planner is wasting time scanning
> through those to find the extremal non-dead value for a range estimate.
> REINDEX got rid of the dead entries; a VACUUM would have done as well.
I completely understand that bloat is different from corruption. It was just really strange that mass deletes caused this now, because we run mass deletes regularly on this DB and it has been running for a few years without any issues. You're right that VACUUM also fixes it: I just tried it inside Docker, and once VACUUM finished, the issue was gone instantly. The downside is that VACUUM took a long time to finish.
> As noted upthread, more recent PG versions are a bit smarter about
> such cases.
As far as I understand, there is not much we can do except upgrade to newer major versions, because on this DB we cannot partition the data in a way that would let us drop old partitions.
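Short of upgrading, it may also help to make autovacuum (or a scheduled manual VACUUM) run promptly after large deletes. A sketch, with the table name and threshold as assumptions to tune:

```sql
-- Run a manual vacuum right after each mass delete...
VACUUM (VERBOSE) mail;
-- ...or lower the per-table autovacuum threshold so dead index entries
-- are cleaned up sooner (the default scale factor is 0.2, i.e. 20% of
-- the table must be dead before autovacuum triggers):
ALTER TABLE mail SET (autovacuum_vacuum_scale_factor = 0.01);
```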
Thanks again for your help :)
Regards,
Strahinja