Friends,
I run PostgreSQL 12.3 on Windows. I have just discovered a pretty significant problem with PostgreSQL and my data. I have a large table, 500M rows and 50 columns, split into 3 partitions by year. In addition to the primary key, one of the columns, "instrument_ref", is indexed, and I do lookups on it.
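For context, the layout is roughly this (the column names and partition bounds below are made up to illustrate, not the real schema):

create table bigtable (
    id             bigint not null,
    instrument_ref bigint,
    trade_year     int not null,     -- illustrative partition key
    -- ... plus the other ~47 columns ...
    primary key (id, trade_year)     -- the PK must include the partition key
) partition by range (trade_year);

create table bigtable_2018 partition of bigtable for values from (2018) to (2019);
create table bigtable_2019 partition of bigtable for values from (2019) to (2020);
create table bigtable_2020 partition of bigtable for values from (2020) to (2021);

create index on bigtable (instrument_ref);

The lookup query looks like this: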
Select * from bigtable b where b.instrument_ref in (x,y,z,...)
limit 1000
This responded well, with sub-second response times, and it used the index on the column. However, when I changed it to:
Select * from bigtable b where b.instrument_ref in (x,y,z,...)
limit 10000 -- (notice 10K now)
The planner decided to do a full table scan of the entire 500M-row table! And that did not work very well. At first I had no clue why it did so, but when I disabled sequential scans the query returned immediately. Still, I should not have to do that.
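(By "disabled sequential scans" I mean something along these lines in the session:)

set enable_seqscan = off;
-- re-run the limit 10000 query here, then:
set enable_seqscan = on;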
I got my first hint of why this happens when I looked at the statistics for the column in question, "instrument_ref". The default_statistics_target is 500, and ANALYZE has been run:
select * from pg_stats where attname like 'instr%_ref'; -- n_distinct: 40,000
select count(distinct instrument_ref) from bigtable; -- Result: 33,385,922 (!!)
That is an astonishing difference of almost 1,000X.
When the planner thinks there are only 40K distinct values, it makes sense to switch to a table scan in order to fill the limit of 10,000 rows. But the estimate is wrong, very wrong, and the query takes hundreds of seconds instead of a few.
I have tried increasing the statistics target to 5000, and it helps, but it only reduces the error to 100X. Still crazy high.
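(The increase was done roughly like this, per column; raising default_statistics_target instead should have the same effect:)

alter table bigtable alter column instrument_ref set statistics 5000;
analyze bigtable;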
I understand that this is a known problem, and I have read previous posts about it, but I have never seen anyone report such a large difference factor.
I have considered these fixes:
- hardcode the statistics to a particular ratio of the total number of rows (see the sketch after this list)
- randomize the physical order of the rows, so that the sample does not suffer so much from page clustering. However, this probably has other implications
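For the first option I am thinking of something like this (the -0.067 is just the rough ratio 33.4M / 500M; and since the table is partitioned and I query through the parent, I believe n_distinct_inherited is the relevant option as well):

alter table bigtable alter column instrument_ref
    set (n_distinct = -0.067, n_distinct_inherited = -0.067);
-- a negative value means "this fraction of the row count";
-- the override only takes effect at the next analyze
analyze bigtable;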
Feel free to comment :)