Thread: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

From

Jan Kort

Date:

18 May 2021, 12:46:24

Hi,

I tried this on both PostgreSQL 12.7 and 13.3, I tried both VACUUM ANALYZE and VACUUM FULL ANALYZE, they both seem to change a straightforward plan to a reversed and complex plan that make the query slow. The only thing that works is to do a pg_dump, pg_restore, but that is not feasible.

It seems to depend on specific data, I tried generating other variations of around 1M records, but so far the attached DB (also generated) is the only one where I could reproduce it, I can’t figure out what is so special about this data.

Given the following record counts:

GFO_ZAKEN: 1048587

GFO_ZAKEN_KOSTEN: 3

GFO_ZAKEN_TYPECODE: 4

GFO_ZAKEN.ZAAKTYPECODE_ID has a value 9 times and the rest is null

When I run this query:

EXPLAIN ANALYZE

SELECT GFO_ZAKEN_KOSTEN.ID AS GFO_ZAKEN_KOSTEN_ID,

GFO_ZAKEN.ID AS GFO_ZAKEN_ID,

GFO_ZAKEN_TYPECODE.OMSCHRIJVING AS ZAAKTYPECODE_ID

FROM TRIAL.GFO_ZAKEN_KOSTEN

JOIN TRIAL.GFO_ZAKEN ON GFO_ZAKEN_KOSTEN.GFO_ZAKEN_ID = GFO_ZAKEN.ID

JOIN TRIAL.GFO_ZAKEN_TYPECODE ON GFO_ZAKEN.ZAAKTYPECODE_ID = GFO_ZAKEN_TYPECODE.ID

It is taking 500ms or so, which I associate with a full table scan, but they are just simple referential constraints and corresponding indexes:

CREATE TABLE TRIAL.GFO_ZAKEN_TYPECODE (ID INTEGER PRIMARY KEY, OMSCHRIJVING CHARACTER VARYING(4000));

CREATE TABLE TRIAL.GFO_ZAKEN (ID INTEGER PRIMARY KEY, ZAAKTYPECODE_ID INTEGER,

CONSTRAINT ZAAKTYPECODE_IDC1 FOREIGN KEY (ZAAKTYPECODE_ID) REFERENCES TRIAL.GFO_ZAKEN_TYPECODE (ID));

CREATE INDEX GFO_ZAKENO18 ON TRIAL.GFO_ZAKEN USING BTREE (ZAAKTYPECODE_ID);

CREATE TABLE TRIAL.GFO_ZAKEN_KOSTEN (ID INTEGER PRIMARY KEY, GFO_ZAKEN_ID INTEGER,

CONSTRAINT GFO_ZAKEN_IDC1 FOREIGN KEY (GFO_ZAKEN_ID) REFERENCES TRIAL.GFO_ZAKEN (ID));

CREATE INDEX GFO_ZAKEN_KOSTENO14 ON TRIAL.GFO_ZAKEN_KOSTEN USING BTREE (GFO_ZAKEN_ID ASC NULLS LAST);

After pg_restore it gives a straightforward plan, starting with the gfo_zaken_kosten primary key and continuing with the join on gfo_zaken_kosten.gfo_zaken_id:

Nested Loop (cost=0.56..3.64 rows=1 width=524) (actual time=0.036..0.037 rows=1 loops=1)

-> Nested Loop (cost=0.43..3.48 rows=1 width=12) (actual time=0.030..0.030 rows=1 loops=1)

-> Seq Scan on gfo_zaken_kosten (cost=0.00..1.04 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=1)

Filter: (id = 13)

Rows Removed by Filter: 2

-> Index Scan using gfo_zakenp on gfo_zaken (cost=0.43..2.45 rows=1 width=8) (actual time=0.012..0.012 rows=1 loops=1)

Index Cond: (id = gfo_zaken_kosten.gfo_zaken_id)

-> Index Scan using gfo_zaken_typecodep on gfo_zaken_typecode (cost=0.13..0.15 rows=1 width=520) (actual time=0.005..0.005 rows=1 loops=1)

Index Cond: (id = gfo_zaken.zaaktypecode_id)

Planning Time: 1.538 ms

Execution Time: 0.095 ms

After VACUUM ANALYZE the plan becomes inefficient again, and does not start with the gfo_zaken_kosten primary key, the plan starts at the wrong end with an index scan on 1M rows:

Merge Join (cost=1.48..1.59 rows=1 width=159) (actual time=619.374..619.376 rows=1 loops=1)

Merge Cond: (gfo_zaken.id = gfo_zaken_kosten.gfo_zaken_id)

-> Nested Loop (cost=0.43..96503.47 rows=1048587 width=155) (actual time=0.022..619.359 rows=9 loops=1)

Join Filter: (gfo_zaken.zaaktypecode_id = gfo_zaken_typecode.id)

Rows Removed by Join Filter: 4194316

-> Index Scan using gfo_zakenp on gfo_zaken (cost=0.43..33587.23 rows=1048587 width=8) (actual time=0.006..141.167 rows=1048587 loops=1)

-> Materialize (cost=0.00..1.06 rows=4 width=155) (actual time=0.000..0.000 rows=4 loops=1048587)

-> Seq Scan on gfo_zaken_typecode (cost=0.00..1.04 rows=4 width=155) (actual time=0.011..0.012 rows=4 loops=1)

-> Sort (cost=1.05..1.05 rows=1 width=8) (actual time=0.012..0.012 rows=1 loops=1)

Sort Key: gfo_zaken_kosten.gfo_zaken_id

Sort Method: quicksort Memory: 25kB

-> Seq Scan on gfo_zaken_kosten (cost=0.00..1.04 rows=1 width=8) (actual time=0.008..0.008 rows=1 loops=1)

Filter: (id = 13)

Rows Removed by Filter: 2

Planning Time: 69.151 ms

Execution Time: 619.410 ms

So 6520 times slower after vacuum analyze then before.

Doing VACUUM ANALYZE again doesn’t make it better, only pg_dump + pg_restore will go back to the original plan, but then it will break again on the first VACUUM ANALYZE.

I attached a 800K test DB with all sensitive data removed.

I tried both the default config without changes, and the default config with our settings appended:

max_connections = 1000

shared_buffers = 512MB

effective_cache_size = 6GB

work_mem = 10485kB

maintenance_work_mem = 512MB

min_wal_size = 1GB

max_wal_size = 2GB

checkpoint_completion_target = 0.7

wal_buffers = 16MB

default_statistics_target = 100

random_page_cost = 1

wal_sync_method = open_datasync

fsync = on

synchronous_commit = off

Doing a VACUUM ANALYZE shouldn’t change a straightforward plan.

Regards,

Jan Kort

Attachment

trialdb.zip

Re: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

From

Pavel Stehule

Date:

18 May 2021, 14:02:09

út 18. 5. 2021 v 15:36 odesílatel Jan Kort <jan.kort@genetics.nl> napsal:

Hi,

I tried this on both PostgreSQL 12.7 and 13.3, I tried both VACUUM ANALYZE and VACUUM FULL ANALYZE, they both seem to change a straightforward plan to a reversed and complex plan that make the query slow. The only thing that works is to do a pg_dump, pg_restore, but that is not feasible.

It seems to depend on specific data, I tried generating other variations of around 1M records, but so far the attached DB (also generated) is the only one where I could reproduce it, I can’t figure out what is so special about this data.

Given the following record counts:

GFO_ZAKEN: 1048587
GFO_ZAKEN_KOSTEN: 3
GFO_ZAKEN_TYPECODE: 4
GFO_ZAKEN.ZAAKTYPECODE_ID has a value 9 times and the rest is null

When I run this query:

EXPLAIN ANALYZE
SELECT GFO_ZAKEN_KOSTEN.ID AS GFO_ZAKEN_KOSTEN_ID,
       GFO_ZAKEN.ID AS GFO_ZAKEN_ID,
       GFO_ZAKEN_TYPECODE.OMSCHRIJVING AS ZAAKTYPECODE_ID
FROM TRIAL.GFO_ZAKEN_KOSTEN
  JOIN TRIAL.GFO_ZAKEN ON GFO_ZAKEN_KOSTEN.GFO_ZAKEN_ID = GFO_ZAKEN.ID
JOIN TRIAL.GFO_ZAKEN_TYPECODE ON GFO_ZAKEN.ZAAKTYPECODE_ID = GFO_ZAKEN_TYPECODE.ID

It is taking 500ms or so, which I associate with a full table scan, but they are just simple referential constraints and corresponding indexes:

CREATE TABLE TRIAL.GFO_ZAKEN_TYPECODE (ID INTEGER PRIMARY KEY, OMSCHRIJVING CHARACTER VARYING(4000));
CREATE TABLE TRIAL.GFO_ZAKEN (ID INTEGER PRIMARY KEY, ZAAKTYPECODE_ID INTEGER,
CONSTRAINT ZAAKTYPECODE_IDC1 FOREIGN KEY (ZAAKTYPECODE_ID) REFERENCES TRIAL.GFO_ZAKEN_TYPECODE (ID));
CREATE INDEX GFO_ZAKENO18 ON TRIAL.GFO_ZAKEN USING BTREE (ZAAKTYPECODE_ID);
CREATE TABLE TRIAL.GFO_ZAKEN_KOSTEN (ID INTEGER PRIMARY KEY, GFO_ZAKEN_ID INTEGER,
CONSTRAINT GFO_ZAKEN_IDC1 FOREIGN KEY (GFO_ZAKEN_ID) REFERENCES TRIAL.GFO_ZAKEN (ID));
CREATE INDEX GFO_ZAKEN_KOSTENO14 ON TRIAL.GFO_ZAKEN_KOSTEN USING BTREE (GFO_ZAKEN_ID ASC NULLS LAST);

After pg_restore it gives a straightforward plan, starting with the gfo_zaken_kosten primary key and continuing with the join on gfo_zaken_kosten.gfo_zaken_id:

Nested Loop (cost=0.56..3.64 rows=1 width=524) (actual time=0.036..0.037 rows=1 loops=1)
-> Nested Loop (cost=0.43..3.48 rows=1 width=12) (actual time=0.030..0.030 rows=1 loops=1)
        -> Seq Scan on gfo_zaken_kosten (cost=0.00..1.04 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=1)
              Filter: (id = 13)
              Rows Removed by Filter: 2
        -> Index Scan using gfo_zakenp on gfo_zaken (cost=0.43..2.45 rows=1 width=8) (actual time=0.012..0.012 rows=1 loops=1)
              Index Cond: (id = gfo_zaken_kosten.gfo_zaken_id)
-> Index Scan using gfo_zaken_typecodep on gfo_zaken_typecode (cost=0.13..0.15 rows=1 width=520) (actual time=0.005..0.005 rows=1 loops=1)
        Index Cond: (id = gfo_zaken.zaaktypecode_id)
Planning Time: 1.538 ms
Execution Time: 0.095 ms

After VACUUM ANALYZE the plan becomes inefficient again, and does not start with the gfo_zaken_kosten primary key, the plan starts at the wrong end with an index scan on 1M rows:

Merge Join (cost=1.48..1.59 rows=1 width=159) (actual time=619.374..619.376 rows=1 loops=1)
Merge Cond: (gfo_zaken.id = gfo_zaken_kosten.gfo_zaken_id)
-> Nested Loop (cost=0.43..96503.47 rows=1048587 width=155) (actual time=0.022..619.359 rows=9 loops=1)
        Join Filter: (gfo_zaken.zaaktypecode_id = gfo_zaken_typecode.id)
        Rows Removed by Join Filter: 4194316
        -> Index Scan using gfo_zakenp on gfo_zaken (cost=0.43..33587.23 rows=1048587 width=8) (actual time=0.006..141.167 rows=1048587 loops=1)
        -> Materialize (cost=0.00..1.06 rows=4 width=155) (actual time=0.000..0.000 rows=4 loops=1048587)
              -> Seq Scan on gfo_zaken_typecode (cost=0.00..1.04 rows=4 width=155) (actual time=0.011..0.012 rows=4 loops=1)
-> Sort (cost=1.05..1.05 rows=1 width=8) (actual time=0.012..0.012 rows=1 loops=1)
        Sort Key: gfo_zaken_kosten.gfo_zaken_id
        Sort Method: quicksort Memory: 25kB
        -> Seq Scan on gfo_zaken_kosten (cost=0.00..1.04 rows=1 width=8) (actual time=0.008..0.008 rows=1 loops=1)
              Filter: (id = 13)
              Rows Removed by Filter: 2
Planning Time: 69.151 ms
Execution Time: 619.410 ms

This is not a bug. You can see very bad estimation on join. Postgres expects one value has the same probability in both tables. In your case, it is not true. Unfortunately, Postgres has not multi table statistics, so there is no easy solution. Usually you need to divide your query to two. And maybe you can check your data, why the predicate gfo_zaken.zaaktypecode_id = gfo_zaken_typecode.id is almost ivalid.

Regards

Pavel Stehule

So 6520 times slower after vacuum analyze then before.

Doing VACUUM ANALYZE again doesn’t make it better, only pg_dump + pg_restore will go back to the original plan, but then it will break again on the first VACUUM ANALYZE.

I attached a 800K test DB with all sensitive data removed.

I tried both the default config without changes, and the default config with our settings appended:

max_connections = 1000
shared_buffers = 512MB
effective_cache_size = 6GB
work_mem = 10485kB
maintenance_work_mem = 512MB
min_wal_size = 1GB
max_wal_size = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1
wal_sync_method = open_datasync
fsync = on
synchronous_commit = off

Doing a VACUUM ANALYZE shouldn’t change a straightforward plan.

Regards,

Jan Kort

RE: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

From

Jan Kort

Date:

18 May 2021, 16:22:24

Hi Pavel,

Thank you for the reply.

The query is normally efficient.

I have removed more of the clutter, for example the indexes aren't important in this specific case:

DROP INDEX TRIAL.GFO_ZAKENO18;

DROP INDEX TRIAL.GFO_ZAKEN_KOSTENO14;

The GFO_ZAKEN_TYPECODE part isn't important either, I can reproduce it with just:

EXPLAIN ANALYZE SELECT *

FROM TRIAL.GFO_ZAKEN_KOSTEN

JOIN TRIAL.GFO_ZAKEN ON GFO_ZAKEN_KOSTEN.GFO_ZAKEN_ID = GFO_ZAKEN.ID

WHERE GFO_ZAKEN_KOSTEN.ID = 13

So it should always pick the GFO_ZAKEN_KOSTEN.ID = 13 and follow the referential constraint in the JOIN now.

And in most cases it will do this.

But, not in the specific case I have with just 3 GFO_ZAKEN_KOSTEN records.

The slightest change in the data will cause the query to behave normal, the planner says 0.050ms again.

For example if I add 1 record with:

insert into trial.gfo_zaken_kosten (id, gfo_zaken_id) values (1000, 1000);

So you now have GFO_ZAKEN_KOSTEN:

ID GFO_ZAKEN_ID

11 98

12 98

13 1049601

1000 1000

After a VACUUM FULL ANALYZE the query goes back to normal.

Nested Loop (cost=0.43..9.50 rows=1 width=16) (actual time=0.038..0.039 rows=1 loops=1)

-> Seq Scan on gfo_zaken_kosten (cost=0.00..1.05 rows=1 width=8) (actual time=0.033..0.034 rows=1 loops=1)

Filter: (id = 13)

Rows Removed by Filter: 3

-> Index Scan using gfo_zaken_pkey on gfo_zaken (cost=0.43..8.45 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=1)

Index Cond: (id = gfo_zaken_kosten.gfo_zaken_id)

Planning Time: 0.138 ms

Execution Time: 0.050 ms

After removing the GFO_ZAKEN_KOSTEN with ID = 1000 again and running VACUUM FULL ANALYZE again it becomes slow again. Just to make sure I switched between the 3 and 4 records a few times and could reproduce this every time. I also tried different values for ID and GFO_ZAKEN_ID and more than 1 record, but anything 4 or higher worked, even 1M records in GFO_ZAKEN_KOSTEN was just as efficient.

Then I removed GFO_ZAKEN_KOSTEN record with ID = 11, giving 3 records, that fixed it too, so it isn't just 3 records that is the problem.

I put back the GFO_ZAKEN_KOSTEN record with ID = 11, giving 4 records, that worked.

Then I removed GFO_ZAKEN_KOSTEN record with ID = 1000, giving the original 3 records, it stopped working.

Then I removed GFO_ZAKEN_KOSTEN record with ID = 12, giving 2 records, that worked.

I put back GFO_ZAKEN_KOSTEN record with ID = 12, giving 3 records, that stopped working.

Then I removed GFO_ZAKEN_KOSTEN record with ID = 11, giving 2 records, that worked.

So my guess is that it is 3 records, where 2 have the same reference that causes the planner to chose a different route.

Met vriendelijke groet,

Jan Kort | Helpdesk Medewerker

Afdeling R&D, Projecten en Helpdesk

cid:image006.png@01D716A1.3ACBF5F0

E    jan.kort@genetics.nl
I    www.genetics.nl
T    (+31) 36 54 00 850
M    (+31) 6 485 138 32
A    Postbus 1268, 1300 BG Almere
B    Bouwmeesterweg 8, 1333 LC Almere

De informatie in dit bericht kan vertrouwelijk zijn. Zij is uitsluitend bestemd voor de geadresseerde. Indien u dit bericht onterecht ontvangt, wordt u verzocht de inhoud niet te gebruiken en de afzender direct te informeren door dit bericht te retourneren. Genetics is niet aansprakelijk voor de overdracht van de inhoud van dit e-mailbericht. Evenmin is Genetics aansprakelijk voor eventuele vertragingen.

Op alle overeenkomsten zijn onze algemene voorwaarden van Nederland ICT van toepassing. Deze zijn gedeponeerd bij de Kamer van Koophandel onder nummer 30174840, tenzij anders schriftelijk overeengekomen.

Genetics hecht veel waarde aan de kwaliteit en veiligheid van haar dienstverlening. Om dit proces te borgen is Genetics door Kiwa gecertificeerd voor ISO 9001, ISO/IEC 27001 en de AVG.

P Denk a.u.b. aan het milieu voordat u deze e-mail uitprint

Van: Pavel Stehule <pavel.stehule@gmail.com>
Verzonden: dinsdag 18 mei 2021 16:02
Aan: Jan Kort <jan.kort@genetics.nl>
CC: pgsql-bugs@lists.postgresql.org
Onderwerp: Re: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

út 18. 5. 2021 v 15:36 odesílatel Jan Kort <jan.kort@genetics.nl> napsal:

Hi,

I tried this on both PostgreSQL 12.7 and 13.3, I tried both VACUUM ANALYZE and VACUUM FULL ANALYZE, they both seem to change a straightforward plan to a reversed and complex plan that make the query slow. The only thing that works is to do a pg_dump, pg_restore, but that is not feasible.

It seems to depend on specific data, I tried generating other variations of around 1M records, but so far the attached DB (also generated) is the only one where I could reproduce it, I can’t figure out what is so special about this data.

Given the following record counts:

GFO_ZAKEN: 1048587
GFO_ZAKEN_KOSTEN: 3
GFO_ZAKEN_TYPECODE: 4
GFO_ZAKEN.ZAAKTYPECODE_ID has a value 9 times and the rest is null

When I run this query:

EXPLAIN ANALYZE
SELECT GFO_ZAKEN_KOSTEN.ID AS GFO_ZAKEN_KOSTEN_ID,
       GFO_ZAKEN.ID AS GFO_ZAKEN_ID,
       GFO_ZAKEN_TYPECODE.OMSCHRIJVING AS ZAAKTYPECODE_ID
FROM TRIAL.GFO_ZAKEN_KOSTEN
  JOIN TRIAL.GFO_ZAKEN ON GFO_ZAKEN_KOSTEN.GFO_ZAKEN_ID = GFO_ZAKEN.ID
JOIN TRIAL.GFO_ZAKEN_TYPECODE ON GFO_ZAKEN.ZAAKTYPECODE_ID = GFO_ZAKEN_TYPECODE.ID

It is taking 500ms or so, which I associate with a full table scan, but they are just simple referential constraints and corresponding indexes:

CREATE TABLE TRIAL.GFO_ZAKEN_TYPECODE (ID INTEGER PRIMARY KEY, OMSCHRIJVING CHARACTER VARYING(4000));
CREATE TABLE TRIAL.GFO_ZAKEN (ID INTEGER PRIMARY KEY, ZAAKTYPECODE_ID INTEGER,
CONSTRAINT ZAAKTYPECODE_IDC1 FOREIGN KEY (ZAAKTYPECODE_ID) REFERENCES TRIAL.GFO_ZAKEN_TYPECODE (ID));
CREATE INDEX GFO_ZAKENO18 ON TRIAL.GFO_ZAKEN USING BTREE (ZAAKTYPECODE_ID);
CREATE TABLE TRIAL.GFO_ZAKEN_KOSTEN (ID INTEGER PRIMARY KEY, GFO_ZAKEN_ID INTEGER,
CONSTRAINT GFO_ZAKEN_IDC1 FOREIGN KEY (GFO_ZAKEN_ID) REFERENCES TRIAL.GFO_ZAKEN (ID));
CREATE INDEX GFO_ZAKEN_KOSTENO14 ON TRIAL.GFO_ZAKEN_KOSTEN USING BTREE (GFO_ZAKEN_ID ASC NULLS LAST);

After pg_restore it gives a straightforward plan, starting with the gfo_zaken_kosten primary key and continuing with the join on gfo_zaken_kosten.gfo_zaken_id:

Nested Loop (cost=0.56..3.64 rows=1 width=524) (actual time=0.036..0.037 rows=1 loops=1)
-> Nested Loop (cost=0.43..3.48 rows=1 width=12) (actual time=0.030..0.030 rows=1 loops=1)
        -> Seq Scan on gfo_zaken_kosten (cost=0.00..1.04 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=1)
              Filter: (id = 13)
              Rows Removed by Filter: 2
        -> Index Scan using gfo_zakenp on gfo_zaken (cost=0.43..2.45 rows=1 width=8) (actual time=0.012..0.012 rows=1 loops=1)
              Index Cond: (id = gfo_zaken_kosten.gfo_zaken_id)
-> Index Scan using gfo_zaken_typecodep on gfo_zaken_typecode (cost=0.13..0.15 rows=1 width=520) (actual time=0.005..0.005 rows=1 loops=1)
        Index Cond: (id = gfo_zaken.zaaktypecode_id)
Planning Time: 1.538 ms
Execution Time: 0.095 ms

After VACUUM ANALYZE the plan becomes inefficient again, and does not start with the gfo_zaken_kosten primary key, the plan starts at the wrong end with an index scan on 1M rows:

Merge Join (cost=1.48..1.59 rows=1 width=159) (actual time=619.374..619.376 rows=1 loops=1)
Merge Cond: (gfo_zaken.id = gfo_zaken_kosten.gfo_zaken_id)
-> Nested Loop (cost=0.43..96503.47 rows=1048587 width=155) (actual time=0.022..619.359 rows=9 loops=1)
        Join Filter: (gfo_zaken.zaaktypecode_id = gfo_zaken_typecode.id)
        Rows Removed by Join Filter: 4194316
        -> Index Scan using gfo_zakenp on gfo_zaken (cost=0.43..33587.23 rows=1048587 width=8) (actual time=0.006..141.167 rows=1048587 loops=1)
        -> Materialize (cost=0.00..1.06 rows=4 width=155) (actual time=0.000..0.000 rows=4 loops=1048587)
              -> Seq Scan on gfo_zaken_typecode (cost=0.00..1.04 rows=4 width=155) (actual time=0.011..0.012 rows=4 loops=1)
-> Sort (cost=1.05..1.05 rows=1 width=8) (actual time=0.012..0.012 rows=1 loops=1)
        Sort Key: gfo_zaken_kosten.gfo_zaken_id
        Sort Method: quicksort Memory: 25kB
        -> Seq Scan on gfo_zaken_kosten (cost=0.00..1.04 rows=1 width=8) (actual time=0.008..0.008 rows=1 loops=1)
              Filter: (id = 13)
              Rows Removed by Filter: 2
Planning Time: 69.151 ms
Execution Time: 619.410 ms

Regards

Pavel Stehule

So 6520 times slower after vacuum analyze then before.

Doing VACUUM ANALYZE again doesn’t make it better, only pg_dump + pg_restore will go back to the original plan, but then it will break again on the first VACUUM ANALYZE.

I attached a 800K test DB with all sensitive data removed.

I tried both the default config without changes, and the default config with our settings appended:

max_connections = 1000
shared_buffers = 512MB
effective_cache_size = 6GB
work_mem = 10485kB
maintenance_work_mem = 512MB
min_wal_size = 1GB
max_wal_size = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1
wal_sync_method = open_datasync
fsync = on
synchronous_commit = off

Doing a VACUUM ANALYZE shouldn’t change a straightforward plan.

Regards,

Jan Kort

Attachment

Re: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

From

Pavel Stehule

Date:

18 May 2021, 16:54:51

út 18. 5. 2021 v 18:22 odesílatel Jan Kort <jan.kort@genetics.nl> napsal:

Hi Pavel,

Thank you for the reply.

The query is normally efficient.

I have removed more of the clutter, for example the indexes aren't important in this specific case:

DROP INDEX TRIAL.GFO_ZAKENO18;
DROP INDEX TRIAL.GFO_ZAKEN_KOSTENO14;

The GFO_ZAKEN_TYPECODE part isn't important either, I can reproduce it with just:

EXPLAIN ANALYZE SELECT *
  FROM TRIAL.GFO_ZAKEN_KOSTEN
JOIN TRIAL.GFO_ZAKEN ON GFO_ZAKEN_KOSTEN.GFO_ZAKEN_ID = GFO_ZAKEN.ID
WHERE GFO_ZAKEN_KOSTEN.ID = 13

So it should always pick the GFO_ZAKEN_KOSTEN.ID = 13 and follow the referential constraint in the JOIN now.

And in most cases it will do this.

But, not in the specific case I have with just 3 GFO_ZAKEN_KOSTEN records.

The slightest change in the data will cause the query to behave normal, the planner says 0.050ms again.

For example if I add 1 record with:

insert into trial.gfo_zaken_kosten (id, gfo_zaken_id) values (1000, 1000);

So you now have GFO_ZAKEN_KOSTEN:

ID GFO_ZAKEN_ID
11           98
12           98
13           1049601
1000       1000

After a VACUUM FULL ANALYZE the query goes back to normal.

Nested Loop (cost=0.43..9.50 rows=1 width=16) (actual time=0.038..0.039 rows=1 loops=1)
-> Seq Scan on gfo_zaken_kosten (cost=0.00..1.05 rows=1 width=8) (actual time=0.033..0.034 rows=1 loops=1)
        Filter: (id = 13)
        Rows Removed by Filter: 3
-> Index Scan using gfo_zaken_pkey on gfo_zaken (cost=0.43..8.45 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=1)
        Index Cond: (id = gfo_zaken_kosten.gfo_zaken_id)
Planning Time: 0.138 ms
Execution Time: 0.050 ms

After removing the GFO_ZAKEN_KOSTEN with ID = 1000 again and running VACUUM FULL ANALYZE again it becomes slow again. Just to make sure I switched between the 3 and 4 records a few times and could reproduce this every time. I also tried different values for ID and GFO_ZAKEN_ID and more than 1 record, but anything 4 or higher worked, even 1M records in GFO_ZAKEN_KOSTEN was just as efficient.

Then I removed GFO_ZAKEN_KOSTEN record with ID = 11, giving 3 records, that fixed it too, so it isn't just 3 records that is the problem.

I put back the GFO_ZAKEN_KOSTEN record with ID = 11, giving 4 records, that worked.

Then I removed GFO_ZAKEN_KOSTEN record with ID = 1000, giving the original 3 records, it stopped working.

Then I removed GFO_ZAKEN_KOSTEN record with ID = 12, giving 2 records, that worked.

I put back GFO_ZAKEN_KOSTEN record with ID = 12, giving 3 records, that stopped working.

Then I removed GFO_ZAKEN_KOSTEN record with ID = 11, giving 2 records, that worked.

So my guess is that it is 3 records, where 2 have the same reference that causes the planner to chose a different route.

I have no idea, but maybe the sampling in ANALYZE doesn't work well when there is very low cardinality.

You can check if related column statistics have valid value or not.

https://pages.saclay.inria.fr/maxime.buron/posts/checking-postgres-statistics.html

Regards

Pavel

Met vriendelijke groet,

Jan Kort | Helpdesk Medewerker
Afdeling R&D, Projecten en Helpdesk

E    jan.kort@genetics.nl
I    www.genetics.nl
T    (+31) 36 54 00 850
M    (+31) 6 485 138 32
A    Postbus 1268, 1300 BG Almere
B    Bouwmeesterweg 8, 1333 LC Almere

De informatie in dit bericht kan vertrouwelijk zijn. Zij is uitsluitend bestemd voor de geadresseerde. Indien u dit bericht onterecht ontvangt, wordt u verzocht de inhoud niet te gebruiken en de afzender direct te informeren door dit bericht te retourneren. Genetics is niet aansprakelijk voor de overdracht van de inhoud van dit e-mailbericht. Evenmin is Genetics aansprakelijk voor eventuele vertragingen.
Op alle overeenkomsten zijn onze algemene voorwaarden van Nederland ICT van toepassing. Deze zijn gedeponeerd bij de Kamer van Koophandel onder nummer 30174840, tenzij anders schriftelijk overeengekomen.
Genetics hecht veel waarde aan de kwaliteit en veiligheid van haar dienstverlening. Om dit proces te borgen is Genetics door Kiwa gecertificeerd voor ISO 9001, ISO/IEC 27001 en de AVG.

P Denk a.u.b. aan het milieu voordat u deze e-mail uitprint

Van: Pavel Stehule <pavel.stehule@gmail.com>
Verzonden: dinsdag 18 mei 2021 16:02
Aan: Jan Kort <jan.kort@genetics.nl>
CC: pgsql-bugs@lists.postgresql.org
Onderwerp: Re: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

Hi

út 18. 5. 2021 v 15:36 odesílatel Jan Kort <jan.kort@genetics.nl> napsal:
Hi,

I tried this on both PostgreSQL 12.7 and 13.3, I tried both VACUUM ANALYZE and VACUUM FULL ANALYZE, they both seem to change a straightforward plan to a reversed and complex plan that make the query slow. The only thing that works is to do a pg_dump, pg_restore, but that is not feasible.

It seems to depend on specific data, I tried generating other variations of around 1M records, but so far the attached DB (also generated) is the only one where I could reproduce it, I can’t figure out what is so special about this data.

Given the following record counts:

GFO_ZAKEN: 1048587
GFO_ZAKEN_KOSTEN: 3
GFO_ZAKEN_TYPECODE: 4
GFO_ZAKEN.ZAAKTYPECODE_ID has a value 9 times and the rest is null

When I run this query:

EXPLAIN ANALYZE
SELECT GFO_ZAKEN_KOSTEN.ID AS GFO_ZAKEN_KOSTEN_ID,
       GFO_ZAKEN.ID AS GFO_ZAKEN_ID,
       GFO_ZAKEN_TYPECODE.OMSCHRIJVING AS ZAAKTYPECODE_ID
FROM TRIAL.GFO_ZAKEN_KOSTEN
  JOIN TRIAL.GFO_ZAKEN ON GFO_ZAKEN_KOSTEN.GFO_ZAKEN_ID = GFO_ZAKEN.ID
JOIN TRIAL.GFO_ZAKEN_TYPECODE ON GFO_ZAKEN.ZAAKTYPECODE_ID = GFO_ZAKEN_TYPECODE.ID

It is taking 500ms or so, which I associate with a full table scan, but they are just simple referential constraints and corresponding indexes:

CREATE TABLE TRIAL.GFO_ZAKEN_TYPECODE (ID INTEGER PRIMARY KEY, OMSCHRIJVING CHARACTER VARYING(4000));
CREATE TABLE TRIAL.GFO_ZAKEN (ID INTEGER PRIMARY KEY, ZAAKTYPECODE_ID INTEGER,
CONSTRAINT ZAAKTYPECODE_IDC1 FOREIGN KEY (ZAAKTYPECODE_ID) REFERENCES TRIAL.GFO_ZAKEN_TYPECODE (ID));
CREATE INDEX GFO_ZAKENO18 ON TRIAL.GFO_ZAKEN USING BTREE (ZAAKTYPECODE_ID);
CREATE TABLE TRIAL.GFO_ZAKEN_KOSTEN (ID INTEGER PRIMARY KEY, GFO_ZAKEN_ID INTEGER,
CONSTRAINT GFO_ZAKEN_IDC1 FOREIGN KEY (GFO_ZAKEN_ID) REFERENCES TRIAL.GFO_ZAKEN (ID));
CREATE INDEX GFO_ZAKEN_KOSTENO14 ON TRIAL.GFO_ZAKEN_KOSTEN USING BTREE (GFO_ZAKEN_ID ASC NULLS LAST);

After pg_restore it gives a straightforward plan, starting with the gfo_zaken_kosten primary key and continuing with the join on gfo_zaken_kosten.gfo_zaken_id:

Nested Loop (cost=0.56..3.64 rows=1 width=524) (actual time=0.036..0.037 rows=1 loops=1)
-> Nested Loop (cost=0.43..3.48 rows=1 width=12) (actual time=0.030..0.030 rows=1 loops=1)
        -> Seq Scan on gfo_zaken_kosten (cost=0.00..1.04 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=1)
              Filter: (id = 13)
              Rows Removed by Filter: 2
        -> Index Scan using gfo_zakenp on gfo_zaken (cost=0.43..2.45 rows=1 width=8) (actual time=0.012..0.012 rows=1 loops=1)
              Index Cond: (id = gfo_zaken_kosten.gfo_zaken_id)
-> Index Scan using gfo_zaken_typecodep on gfo_zaken_typecode (cost=0.13..0.15 rows=1 width=520) (actual time=0.005..0.005 rows=1 loops=1)
        Index Cond: (id = gfo_zaken.zaaktypecode_id)
Planning Time: 1.538 ms
Execution Time: 0.095 ms

After VACUUM ANALYZE the plan becomes inefficient again, and does not start with the gfo_zaken_kosten primary key, the plan starts at the wrong end with an index scan on 1M rows:

Merge Join (cost=1.48..1.59 rows=1 width=159) (actual time=619.374..619.376 rows=1 loops=1)
Merge Cond: (gfo_zaken.id = gfo_zaken_kosten.gfo_zaken_id)
-> Nested Loop (cost=0.43..96503.47 rows=1048587 width=155) (actual time=0.022..619.359 rows=9 loops=1)
        Join Filter: (gfo_zaken.zaaktypecode_id = gfo_zaken_typecode.id)
        Rows Removed by Join Filter: 4194316
        -> Index Scan using gfo_zakenp on gfo_zaken (cost=0.43..33587.23 rows=1048587 width=8) (actual time=0.006..141.167 rows=1048587 loops=1)
        -> Materialize (cost=0.00..1.06 rows=4 width=155) (actual time=0.000..0.000 rows=4 loops=1048587)
              -> Seq Scan on gfo_zaken_typecode (cost=0.00..1.04 rows=4 width=155) (actual time=0.011..0.012 rows=4 loops=1)
-> Sort (cost=1.05..1.05 rows=1 width=8) (actual time=0.012..0.012 rows=1 loops=1)
        Sort Key: gfo_zaken_kosten.gfo_zaken_id
        Sort Method: quicksort Memory: 25kB
        -> Seq Scan on gfo_zaken_kosten (cost=0.00..1.04 rows=1 width=8) (actual time=0.008..0.008 rows=1 loops=1)
              Filter: (id = 13)
              Rows Removed by Filter: 2
Planning Time: 69.151 ms
Execution Time: 619.410 ms

This is not a bug. You can see very bad estimation on join. Postgres expects one value has the same probability in both tables. In your case, it is not true. Unfortunately, Postgres has not multi table statistics, so there is no easy solution. Usually you need to divide your query to two. And maybe you can check your data, why the predicate gfo_zaken.zaaktypecode_id = gfo_zaken_typecode.id is almost ivalid.

Regards

Pavel Stehule

So 6520 times slower after vacuum analyze then before.

Doing VACUUM ANALYZE again doesn’t make it better, only pg_dump + pg_restore will go back to the original plan, but then it will break again on the first VACUUM ANALYZE.

I attached a 800K test DB with all sensitive data removed.

I tried both the default config without changes, and the default config with our settings appended:

max_connections = 1000
shared_buffers = 512MB
effective_cache_size = 6GB
work_mem = 10485kB
maintenance_work_mem = 512MB
min_wal_size = 1GB
max_wal_size = 2GB
checkpoint_completion_target = 0.7
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 1
wal_sync_method = open_datasync
fsync = on
synchronous_commit = off

Doing a VACUUM ANALYZE shouldn’t change a straightforward plan.

Regards,

Jan Kort

Attachment

Re: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

From

David Rowley

Date:

19 May 2021, 00:35:05

On Wed, 19 May 2021 at 01:36, Jan Kort <jan.kort@genetics.nl> wrote:
> After VACUUM ANALYZE the plan becomes inefficient again, and does not start with the gfo_zaken_kosten primary key,
theplan starts at the wrong end with an index scan on 1M rows:

>
> Merge Join  (cost=1.48..1.59 rows=1 width=159) (actual time=619.374..619.376 rows=1 loops=1)
>   Merge Cond: (gfo_zaken.id = gfo_zaken_kosten.gfo_zaken_id)
>   ->  Nested Loop  (cost=0.43..96503.47 rows=1048587 width=155) (actual time=0.022..619.359 rows=9 loops=1)
>         Join Filter: (gfo_zaken.zaaktypecode_id = gfo_zaken_typecode.id)
>         Rows Removed by Join Filter: 4194316
>         ->  Index Scan using gfo_zakenp on gfo_zaken  (cost=0.43..33587.23 rows=1048587 width=8) (actual
time=0.006..141.167rows=1048587 loops=1)

>         ->  Materialize  (cost=0.00..1.06 rows=4 width=155) (actual time=0.000..0.000 rows=4 loops=1048587)
>               ->  Seq Scan on gfo_zaken_typecode  (cost=0.00..1.04 rows=4 width=155) (actual time=0.011..0.012 rows=4
loops=1)
>   ->  Sort  (cost=1.05..1.05 rows=1 width=8) (actual time=0.012..0.012 rows=1 loops=1)
>         Sort Key: gfo_zaken_kosten.gfo_zaken_id
>         Sort Method: quicksort  Memory: 25kB
>         ->  Seq Scan on gfo_zaken_kosten  (cost=0.00..1.04 rows=1 width=8) (actual time=0.008..0.008 rows=1 loops=1)
>               Filter: (id = 13)
>               Rows Removed by Filter: 2
> Planning Time: 69.151 ms
> Execution Time: 619.410 ms

It looks like something is a bit weird with the merge join costing
code. Merge join not a very good choice in this case as out of the 3
values you have in gfo_zaken_kosten, 2 are at the very start of the
sorted merge join input in the gfo_zaken table and the 3rd is right at
the end.  That means the merge join must read all of gfo_zaken to join
the 3 rows in gfo_zaken_kosten.

Here's a minimal case to reproduce:

drop table if exists million,three;
create table million (id int primary key);
create table three (id int primary key, million_id int not null);

insert into million select x from generate_series(1,1000000) x;
insert into three values(1,1),(2,1),(3,1000000);

analyze million,three;
explain analyze select * from million m inner join three t on m.id =
t.million_id;

Gives: Merge Join  (cost=1.49..1.56 rows=3 width=12)

The weird thing is that when I just put the two rows in the "three"
table, that Merge Join is the planner's last choice:

truncate three;
insert into three values(1,1),(3,1000000);
analyze three;

set enable_nestloop=0; set enable_hashjoin=0; set
max_parallel_Workers_per_Gather=0;
explain analyze select * from million m inner join three t on m.id =
t.million_id;

Gives me: Merge Join  (cost=1.46..32909.49 rows=2 width=12)

A total cost of 32909.49 is quite a bit higher than the 1.56 of when
the table had 3 rows. You'd expect the cost could only drop if we
removed a row.

The first choice is a parameterized nested loop:

Nested Loop  (cost=0.42..9.91 rows=2 width=12)

So it appears that the total cost of the merge join of 1.56 is pretty
unrealistic.

I'll debug this and see if I can see what's going on.

(I was a bit worried that this might have been down to the fairly new
code that uses the presence of foreign keys to help with join
selectivity estimations. It appears that's not the case.)

David

Re: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

From

David Rowley

Date:

19 May 2021, 02:13:09

On Wed, 19 May 2021 at 12:35, David Rowley <dgrowleyml@gmail.com> wrote:
> I'll debug this and see if I can see what's going on.

This is down to the fact that when there are 2 items in the "three"
table we have a histogram in the stats and we can get the upper bound
by just looking at the final histogram bucket.

However, when there are 3 items, or more accurately, there's a
duplicate, we don't build a histogram and just have a
most-common-values list instead.  I've not yet checked the logic for
what we include in the MCV list but it appears in this case we only
store the item that's duplicated, in this case, the value 1.  We
calculate the end bound of the merge join with that value, which is a
bad choice as that makes it appear that the end bound is very close to
or the same as the start bound, making the merge join seem very cheap.

truncate three;
insert into three values(1,1),(2,1),(3,1000000);
analyze three;
select most_common_vals,histogram_bounds from pg_stats where tablename
= 'three' and attname = 'million_id';

-[ RECORD 1 ]----+----
most_common_vals | {1}
histogram_bounds |

truncate three;
insert into three values(1,1),(3,1000000);
analyze three;
select most_common_vals,histogram_bounds from pg_stats where tablename
= 'three' and attname = 'million_id';

-[ RECORD 1 ]----+------------
most_common_vals |
histogram_bounds | {1,1000000}

I'm not really sure the best way to make this better.  It seems like
an unfortunate bad case. I'm not sure if it'd be better to try and be
more inclusive when building MCV lists. But then, what would the logic
be when we don't have enough buckets to store more items.

There is some code in get_variable_range() that's #ifdef'd out to get
the actual range, when possible. There's a comment explaining why we
don't do that.

/*
* XXX It's very tempting to try to use the actual column min and max, if
* we can get them relatively-cheaply with an index probe.  However, since
* this function is called many times during join planning, that could
* have unpleasant effects on planning speed.  Need more investigation
* before enabling this.
*/
#ifdef NOT_USED
if (get_actual_variable_range(root, vardata, sortop, collation, min, max))
return true;
#endif

For this case, it might be ok to have done that since we have
RestrictInfo.scansel_cache. It does not seem as likely that we'll end
up doing get_actual_variable_range too often. However, there are
likely other cases that are not as well cached which we could make too
slow if we did call get_actual_variable_range.

That makes me think the best fix would be to do something better
during ANALYZE and maybe try and include some more upper bound MCVs.
I'm not yet too sure what drawbacks there might be from doing that.

David

Re: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

From

David Rowley

Date:

19 May 2021, 07:51:58

On Wed, 19 May 2021 at 14:13, David Rowley <dgrowleyml@gmail.com> wrote:
> That makes me think the best fix would be to do something better
> during ANALYZE and maybe try and include some more upper bound MCVs.
> I'm not yet too sure what drawbacks there might be from doing that.

I had a quick look at the ANALYZE code and see that we only consider
items for the MCV list when they appear more than once in the analyzed
set.  When it comes to the histogram, we only consider making the
histogram if there are at least 2 items that are not covered in the
MCV list. The comment mentions:

/*
* Generate a histogram slot entry if there are at least two distinct
* values not accounted for in the MCV list.  (This ensures the
* histogram won't collapse to empty or a singleton.)
*/

So given we only have 2 distinct values in the "three" table and one
of those is tracked in the MCV list, there are not enough values
remaining to build a histogram.

It seems you've hit about the worst-case here. If you'd had 1 more
value in the gfo_zaken_kosten table then that would have been enough
to build a histogram. Or if the 98 value was not duplicated then we'd
not have built an MCV list and built a histogram with the two values
instead.

If you want a workaround, you could do:

alter table trial.gfo_zaken_kosten alter column gfo_zaken_id set statistics 0;
delete from pg_statistic where
starelid='trial.gfo_zaken_kosten'::regclass and staattnum=2;
analyze trial.gfo_zaken_kosten;

that's a bit dirty though. You'd need to do:

alter table trial.gfo_zaken_kosten alter column gfo_zaken_id set statistics -1;
analyze trial.gfo_zaken_kosten;

if that table was to ever change, else you might get bad plans due to
lack of stats.

David

RE: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

From

Jan Kort

Date:

19 May 2021, 10:55:20

Hi David,

Ok so the problem is that the MCV list is created, because there are groups of common values, but then this has no
effect,because there are not enough single values to create a histogram?

That is not such a rare case as I would have hoped, for example this would then also cause the same behavior:

values(1,1),(2,1),(3,1000000),(4,2),(5,2),(6,3),(7,3);

And it seems to do.

Ranges can then be longer too:

values(1,1),(2,1),(4,1),(3,1000000);

for some reason I can't reverse the ranges:

values(1,1000000),(2,1000000),(3,1);

I don’t understand why that doesn't create a bad plan, there seems to be order involved too.

I admit that this is still somewhat rare, but much less rare than I was hoping for.

A few more things:

Using text as primary key seems to perform good too, even if it keeps the same "merge" plan:

create table million (id text primary key); create table three (id text primary key, million_id text not null);

What the high value is and how close it is to the low value seems to matter:

values(1,1),(2,1),(3,100000);
values(1,1),(2,1),(3,10000);
values(1,1),(2,1),(3,1000);

Gets progressively less bad, again the plan is not changed, but the execution of it becomes more efficient.

It seems like it goes through the records from 1 till say 10000 one by one.

Regards,

Jan

-----Oorspronkelijk bericht-----
Van: David Rowley <dgrowleyml@gmail.com> 
Verzonden: woensdag 19 mei 2021 09:52
Aan: Jan Kort <jan.kort@genetics.nl>
CC: pgsql-bugs@lists.postgresql.org
Onderwerp: Re: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

On Wed, 19 May 2021 at 14:13, David Rowley <dgrowleyml@gmail.com> wrote:
> That makes me think the best fix would be to do something better 
> during ANALYZE and maybe try and include some more upper bound MCVs.
> I'm not yet too sure what drawbacks there might be from doing that.

I had a quick look at the ANALYZE code and see that we only consider items for the MCV list when they appear more than
oncein the analyzed set.  When it comes to the histogram, we only consider making the histogram if there are at least 2
itemsthat are not covered in the MCV list. The comment mentions:

/*
* Generate a histogram slot entry if there are at least two distinct
* values not accounted for in the MCV list.  (This ensures the
* histogram won't collapse to empty or a singleton.) */

So given we only have 2 distinct values in the "three" table and one of those is tracked in the MCV list, there are not
enoughvalues remaining to build a histogram.

It seems you've hit about the worst-case here. If you'd had 1 more value in the gfo_zaken_kosten table then that would
havebeen enough to build a histogram. Or if the 98 value was not duplicated then we'd not have built an MCV list and
builta histogram with the two values instead.

If you want a workaround, you could do:

alter table trial.gfo_zaken_kosten alter column gfo_zaken_id set statistics 0; delete from pg_statistic where
starelid='trial.gfo_zaken_kosten'::regclassand staattnum=2; analyze trial.gfo_zaken_kosten;

that's a bit dirty though. You'd need to do:

alter table trial.gfo_zaken_kosten alter column gfo_zaken_id set statistics -1; analyze trial.gfo_zaken_kosten;

if that table was to ever change, else you might get bad plans due to lack of stats.

David

Re: Query with straightforward plan changes and becomes 6520 times slower after VACUUM ANALYZE

From

David Rowley

Date:

19 May 2021, 11:46:53

On Wed, 19 May 2021 at 22:55, Jan Kort <jan.kort@genetics.nl> wrote:
> Ok so the problem is that the MCV list is created, because there are groups of common values, but then this has no
effect,because there are not enough single values to create a histogram? 

The problem is mainly that the join costing code for the merge join
looks at the statistics on the column to get the upper bound of the
merge.  Ideally this value would be 1000000 in my "three" table.
However, the MCV list sees the value of 1 as a most common value due
to it appearing twice. When we consider building a histogram with the
remaining values, we see we don't have enough to build one. We need at
least 2 values.  This means when the merge join costing code tries to
get the upper bound of the merge (ideally 1000000), it only has the
MCV list to go on, which just contains the value 1.  The 1000000 value
is not tracked at all. This makes it appear that the merge join will
start and end with the same value, which is very cheap, and it costs
that path accordingly.

> That is not such a rare case as I would have hoped, for example this would then also cause the same behavior:
>
> values(1,1),(2,1),(3,1000000),(4,2),(5,2),(6,3),(7,3);

Yeah, technically you can add enough duplicate values to fill the MCV
list then just not enough to build the histogram.  However, to
maintain the problem case, the MCVs that you do add will need to be
very early in the range.  Your last MCV here is 3, which would make
the estimated merge range of 1 to 3, which is not as cheap as 1 to 1

> And it seems to do.
>
> Ranges can then be longer too:
>
> values(1,1),(2,1),(4,1),(3,1000000);
>
> for some reason I can't reverse the ranges:
>
> values(1,1000000),(2,1000000),(3,1);
>
> I don’t understand why that doesn't create a bad plan, there seems to be order involved too.

That's because the 1000000 value will make it into the MCV list and
extend the estimated range of the merge join.

> Using text as primary key seems to perform good too, even if it keeps the same "merge" plan:

That will be because of the change in sort order between text and int.
e.g 1000000 sorts before 2, for example.  I don't think you should
work around the problem this way.

> What the high value is and how close it is to the low value seems to matter:

Yeah, that's because merge join can end by reading fewer values from
the "million" table.  When the highest value is 1000000 it needs to
read the entire "million" table.

> values(1,1),(2,1),(3,100000);
> values(1,1),(2,1),(3,10000);
> values(1,1),(2,1),(3,1000);
>
> Gets progressively less bad, again the plan is not changed, but the execution of it becomes more efficient.
>
> It seems like it goes through the records from 1 till say 10000 one by one.

This is how Merge Join works. See
https://en.wikipedia.org/wiki/Sort-merge_join . There's a c# example
there that shows how you have to consume a value from the side with
the lowest value until you find an equal value, where you join, or a
higher value, where you must consume the other side until you find
another equal value.  The join ends when you run out of values on one
side.

David