Thread: Duplicate WHERE condition changes performance and plan

Duplicate WHERE condition changes performance and plan

From
"singh400@gmail.com"
Date:
Hi,

We have an odd issue where specifying the same WHERE clause twice causes PG to pick a much more efficient plan. We would like to know why.

Query A (this is the 'slow' query):
UPDATE problem_instance SET processed = false
FROM problem
WHERE problem.id = problem_instance.problem_id
AND problem.status != 2
AND processed = true;

Query B (this is the 'fast' query):
UPDATE problem_instance SET processed = false
FROM problem
WHERE problem.id = problem_instance.problem_id
AND problem.status != 2
AND problem.status != 2
AND processed = true;

The EXPLAIN ANALYZE for both queries can be found here:-
Query A: https://explain.depesz.com/s/lFuy
Query B: https://explain.depesz.com/s/Jqmv

The table definitions (including the indexes) can be found here:-
public.problem: https://gist.github.com/indy-singh/e90ee6d23d053d32c2564501720353df
public.problem_instance: https://gist.github.com/indy-singh/3c77096b91c89428752cf314d8e20286

Data stats:-
public.problem has around 10,000 rows and once the condition status != 2 is applied there are around 800 rows left.
public.problem_instance has around 592,000 rows and once the condition processed = true is applied there are around 370,000 rows left.

PG version:
PostgreSQL 9.5.19 on x86_64-pc-linux-gnu (Debian 9.5.19-1.pgdg90+1), compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit

-- SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE relname='TABLE_NAME'
Table metadata:-
public.problem: https://gist.github.com/indy-singh/ff34a3b6e45432ea4be2bf0b5038e0be
public.problem_instance: https://gist.github.com/indy-singh/a09fe66c8a8840b7661ce9726ebcab71

Last Vacuum:-
public.problem: 2020-04-14 23:11:47.51056+01
public.problem_instance: 2020-04-14 20:11:04.187138+01

Last Analyze:
public.problem: 2020-04-14 23:11:47.592878+01
public.problem_instance: 2020-04-14 20:11:04.508432+01

Server Configuration: https://gist.github.com/indy-singh/8386d59206af042d365e5cd49fbae68f

I tried my best getting all the information up front, please let me know if I missed anything.

Thanks,
Indy

Re: Duplicate WHERE condition changes performance and plan

From
David Rowley
Date:
On Thu, 16 Apr 2020 at 07:56, singh400@gmail.com <singh400@gmail.com> wrote:
> We have an odd issue where specifying the same WHERE clause twice causes PG to pick a much
> more efficient plan. We would like to know why.

> The EXPLAIN ANALYZE for both queries can be found here:-
> Query A: https://explain.depesz.com/s/lFuy
> Query B: https://explain.depesz.com/s/Jqmv

This is basically down to a poor join selectivity estimation. The
duplicate not-equal clause is not removed by the planner, so its
selectivity is taken into account twice, which reduces the estimated
selectivity of the table named "problem". With that lower estimate,
the query planner thinks a nested loop will be a more optimal plan,
which in this case it seems to be.

Join selectivity estimation can use the most-common-values (MCV)
lists, which you can see if you look at the pg_stats view for the
tables and columns involved in the join condition.  ID columns are
often not good candidates for an MCV list in the stats; in that case,
the ndistinct estimate will be used.  If there's no MCV list in the
stats, check that ndistinct is reasonably accurate. If there is an MCV
list, then you can make it bigger by increasing the statistics targets
on the join columns and running ANALYZE. Note: planning can become
slower when you increase the statistics targets.
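
For example, something like this would show whether the join columns
have an MCV list and what ndistinct looks like, and how to enlarge the
sample for one column (the column names are taken from your queries;
the target of 500 is only illustrative):

SELECT tablename, attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename IN ('problem', 'problem_instance')
AND attname IN ('id', 'problem_id');

ALTER TABLE problem_instance ALTER COLUMN problem_id SET STATISTICS 500;
ANALYZE problem_instance;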

Starting with PostgreSQL 9.6, foreign keys are also used to help with
join selectivity estimations. I see you have a suitable foreign key
from the schema you posted.  You might want to add that to the list of
reasons to upgrade.

David



Re: Duplicate WHERE condition changes performance and plan

From
"singh400@gmail.com"
Date:
> Starting with PostgreSQL 9.6, foreign keys are also used to help with
> join selectivity estimations. I see you have a suitable foreign key
> from the schema you posted.  You might want to add that to the list of
> reasons to upgrade.

Apologies for the delay in response. I've had "PostgreSQL 9.6.3,
compiled by Visual C++ build 1800, 64-bit" set up at home for a while,
and after importing the data across I'm still seeing the same
behaviour.

Even after upgrading my local install of PG to "PostgreSQL 12.2,
compiled by Visual C++ build 1914, 64-bit" I'm still seeing the
same behaviour.

Plans for PG12:-
Query A: https://explain.depesz.com/s/zrVD
Query B: https://explain.depesz.com/s/ZLWe

The settings for my home setup are left at default, nothing special.

Indy



Re: Duplicate WHERE condition changes performance and plan

From
Justin Pryzby
Date:
On Wed, Apr 15, 2020 at 08:55:53PM +0100, singh400@gmail.com wrote:
> We have an odd issue where specifying the same where clause twice causes PG
> to pick a much more efficent plan. We would like to know why.

> Query B (this is the 'fast' query):
> UPDATE problem_instance SET processed = false
> FROM problem
> WHERE problem.id = problem_instance.problem_id
> AND problem.status != 2
> AND problem.status != 2
> AND processed = true;

When you specify a redundant condition, it results in an underestimate, as
expected:

Index Scan using problem_id_idx1 on public.problem (cost=0.28..624.68 ROWS=73 width=14) (actual time=0.011..0.714 ROWS=841 loops=1)
    Filter: ((problem.status <> 2) AND (problem.status <> 2))

In this case, doing an index scan on problem_instance is apparently faster
than a seq scan.

I think the duplicate condition is fooling the planner, and by chance it's
giving a better plan.  That might indicate that your settings aren't ideal.
Maybe random_page_cost should be lower, which would encourage index scans.  If
you're using SSD storage, or if the DB is small compared with shared_buffers or
RAM, then random_page_cost should be closer to seq_page_cost.
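
For example, you could try a lower value in a single session first and
re-run EXPLAIN ANALYZE before persisting anything (the value 1.1 is
only an illustration, roughly what is often used for SSDs or a fully
cached database):

SET random_page_cost = 1.1;
-- and, if it helps:
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();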

How large are the indexes? problem_id_idx1 ?
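
For instance (the index name here is taken from the plan above):

SELECT pg_size_pretty(pg_relation_size('problem_id_idx1'));
SELECT pg_size_pretty(pg_table_size('problem_instance'));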

On Mon, Apr 20, 2020 at 01:50:17AM +0100, singh400@gmail.com wrote:
> Even after upgrading my local install of PG to "PostgreSQL 12.2,
> compiled by Visual C++ build 1914, 64-bit" and I'm still seeing the
> same behaviour.

> Server Configuration:
> https://gist.github.com/indy-singh/8386d59206af042d365e5cd49fbae68f
> shared_buffers     2GB     configuration file
> effective_cache_size     6GB     configuration file

Note, until v10, the documentation said this:

https://www.postgresql.org/docs/9.6/runtime-config-resource.html
|Also, on Windows, large values for shared_buffers aren't as effective. You may
|find better results keeping the setting relatively low and using the operating
|system cache more instead. The useful range for shared_buffers on Windows
|systems is generally from 64MB to 512MB.

It would be interesting to know

-- 
Justin



Re: Duplicate WHERE condition changes performance and plan

From
"singh400@gmail.com"
Date:
> If you're using SSD storage, or if the DB is small compared with shared_buffers or RAM, then random_page_cost should
> be closer to seq_page_cost.

I don't *think* we are using SSDs, but I'll need to confirm that.

> How large are the indexes? problem_id_idx1 ?

Using the query from here:
https://wiki.postgresql.org/wiki/Index_Maintenance#Index_size.2Fusage_statistics
Output here: https://gist.github.com/indy-singh/e33eabe5cc937043c93b42a8783b3bfb

I've set up a repo here where it is possible to reproduce the weird
behaviour I'm getting:-

https://github.com/indy-singh/postgres-duplicate-where-conditon

It contains the data (amended to remove any private information) as
well as the statements needed to recreate the tables, indices, and
constraints.

I think, after some trial and error, this is something to do with the
size of the table and the statistics. I've been trying to put together
a Short, Self Contained, Correct example (http://sscce.org/) and the
problem only appears when I fill problem_instance.message with junk,
but I have to do it in two steps as outlined in the README in the repo.

Indy



Re: Duplicate WHERE condition changes performance and plan

From
"singh400@gmail.com"
Date:
> I don't *think* we are using SSDs  but I'll need to confirm that though.

Confirmed we are not using SSDs but '10K RPM SAS in RAID-10.'

I've also been hunting for other queries that show this behaviour,
and I've found one. The PG settings/version will be different in this
example because the earlier example was for our internal CI/CD tool,
which is hosted on a local instance of PG. This example is directly
from our production servers.

Query C (slow):-
SELECT COUNT(1)
FROM proposal.proposal
INNER JOIN proposal.note ON proposal.note.proposal_reference = proposal.proposal.reference
WHERE 1 = 1
AND proposal.proposal.system_id = 11
AND proposal.proposal.legacy_organisation_id IN (6, 7, 11, 16, 18, 44,
200, 218, 233, 237, 259, 47)
AND proposal.proposal.has_been_anonymised = false
AND proposal.note.legacy_read_by IS NULL
AND proposal.note.type_id IN (1, 4, 9)
AND proposal.note.entry_time > '2020-04-01'
AND proposal.note.entry_time < '2020-05-01';

Query D (fast):-
SELECT COUNT(1)
FROM proposal.proposal
INNER JOIN proposal.note ON proposal.note.proposal_reference = proposal.proposal.reference
WHERE 1 = 1
AND proposal.proposal.system_id = 11
AND proposal.proposal.legacy_organisation_id IN (6, 7, 11, 16, 18, 44,
200, 218, 233, 237, 259, 47)
AND proposal.proposal.has_been_anonymised = false
AND proposal.proposal.has_been_anonymised = false
AND proposal.note.legacy_read_by IS NULL
AND proposal.note.type_id IN (1, 4, 9)
AND proposal.note.entry_time > '2020-04-01'
AND proposal.note.entry_time < '2020-05-01';

The EXPLAIN ANALYZE for both queries can be found here:-
Query C: https://explain.depesz.com/s/5Mbu
Query D: https://explain.depesz.com/s/jVnH

The table definitions (including the indexes) can be found here:-
proposal.proposal:
https://gist.github.com/indy-singh/6ccd86ff859e7cdad2ec1bf73a61445c
proposal.note: https://gist.github.com/indy-singh/6c1f85ad15cb92e138447a91d8cf3ecb

Data stats:-
proposal.proposal has 10,324,779 rows and once the table-specific
conditions are applied there are 39,223 rows left.
proposal.note has 2,897,698 rows and once the table-specific
conditions are applied there are 54,359 rows left.

PG version:
PostgreSQL 9.5.17 on x86_64-pc-linux-gnu (Debian 9.5.17-1.pgdg90+1),
compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit

-- SELECT relname, relpages, reltuples, relallvisible, relkind, relnatts, relhassubclass, reloptions, pg_table_size(oid) FROM pg_class WHERE relname='TABLE_NAME'
Table metadata:-
proposal.proposal:
https://gist.github.com/indy-singh/24e7ec8f3d4e2c3ac73f724cea52f9de
proposal.note: https://gist.github.com/indy-singh/104d6ec7ef8179461eb4f91c121615e0

Index Stats:-
proposal.proposal:
https://gist.github.com/indy-singh/1d41d15addb543bcdafc8641b9d7f036
proposal.note: https://gist.github.com/indy-singh/7a698dec98dd8ef2808345d1802e6b6a

Last Vacuum:-
proposal.proposal: Never
proposal.note: 2020-04-17 15:10:57.256013+01

Last Analyze:
proposal.proposal: Never
proposal.note: 2020-04-07 11:48:49.689622+01

Server Configuration:
https://gist.github.com/indy-singh/b19134873f266ee6ce2b9815504d130c

Indy



Re: Duplicate WHERE condition changes performance and plan

From
Michael Lewis
Date:
Why not vacuum analyze both tables to ensure stats are up to date? 
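
For example (using the production table names from the queries above):

VACUUM ANALYZE proposal.proposal;
VACUUM ANALYZE proposal.note;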

Have you customized default_statistics_target from the default of 100? It may be that 250 would give you a more complete sample of the table without increasing the size of the stats tables so much that planning time increases hugely.
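
A sketch of how that could be tried (250 is only an example value; it applies to the next ANALYZE):

ALTER SYSTEM SET default_statistics_target = 250;
SELECT pg_reload_conf();
ANALYZE proposal.proposal;
ANALYZE proposal.note;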

Do you know if any of these columns are correlated? If so, custom stats created with CREATE STATISTICS may help the planner make better decisions.
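
For instance (CREATE STATISTICS exists from PostgreSQL 10 onwards, so this would only apply to your v12 test setup; the choice of columns is a guess, not something derived from your data):

CREATE STATISTICS proposal_org_anon_stats (dependencies)
ON legacy_organisation_id, has_been_anonymised
FROM proposal.proposal;
ANALYZE proposal.proposal;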

I usually hesitate to put any boolean field in an index. Do you only need the proposal.has_been_anonymised = false values? If so, you could add that as a WHERE condition on the index instead of including it as the leading column (see the sketch below).
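
A minimal sketch of that kind of partial index (the index name and column list are hypothetical; adjust them to the columns your queries actually filter and join on):

CREATE INDEX proposal_active_idx
ON proposal.proposal (system_id, legacy_organisation_id, reference)
WHERE has_been_anonymised = false;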