Home > mailing lists

Re: Reducing planning time of large IN queries on primary key / unique columns - Mailing list pgsql-hackers

From	Souvik Bhattacherjee
Subject	Re: Reducing planning time of large IN queries on primary key / unique columns
Date	August 12, 2022 01:04:15
Msg-id	CAJCGbZO1FwytN51tdfxJBNyPzuXgCixXio6tyevT09+p=0OjXw@mail.gmail.com Whole thread Raw
In response to	Reducing planning time of large IN queries on primary key / unique columns (Souvik Bhattacherjee <pgsdbhacker@gmail.com>)
Responses	Re: Reducing planning time of large IN queries on primary key / unique columns
List	pgsql-hackers

Tree view

(Re-posting with better formatting)

Hi hackers,

At ServiceNow, we frequently encounter queries with very large IN lists where the number of elements in the IN list range from a

few hundred to several thousand. For a significant fraction of the queries, the IN clauses are constructed on primary key columns.

While planning these queries, Postgres query planner loops over every element in the IN clause, computing the selectivity of each

element and then uses that as an input to compute the total selectivity of the IN clause. For IN clauses on primary key or unique

columns, it is easy to see that the selectivity of the IN predicate is given by (number of elements in the IN clause / table cardinality)

and is independent of the selectivity of the individual elements. We use this observation to avoid computing the selectivities of the

individual elements. This results in an improvement in the planning time especially when the number of elements in the IN clause

is relatively large.

The table below demonstrates the improvement in planning time (averaged over 3 runs) for IN queries of the form

SELECT COUNT(*) FROM table_a WHERE sys_id IN ('000356e61b568510eabcca2b234bcb08', '00035846db2f24101ad7f256b9961925', ...).

Here sys_id is the primary key column of type VARCHAR(32) and the table cardinality of table_a is around 10M.

Number of IN list elements | Planning time w/o optimization (in ms) | Planning time w/ optimization (in ms) | Speedup

------------------------------------|--------------------------------------------------- | --------------------------------------------------|--------------

500 | 0.371 | 0.236 | 1.57

5000 | 2.019 | 0.874 | 2.31

50000 | 19.886 | 8.273 | 2.40

Similar to IN clauses, the selectivity of NOT IN clauses on a primary key or unique column can be computed by not computing the

selectivities of individual elements. The query used is of the form SELECT COUNT(*) FROM table_a WHERE sys_id NOT IN

('000356e61b568510eabcca2b234bcb08', '00035846db2f24101ad7f256b9961925', ...).

Number of NOT IN list elements | Planning time w/o optimization (in ms) | Planning time w/ optimization (in ms) | Speedup

-------------------------------------------|--------------------------------------------------- | --------------------------------------------------|--------------

500 | 0.380 | 0.248 | 1.53

5000 | 2.534 | 0.854 | 2.97

50000 | 21.316 | 9.009 | 2.36

We also obtain planning time of queries on a primary key column of type INTEGER with 10M elements for both IN and NOT in queries.

Number of IN list elements | Planning time w/o optimization (in ms) | Planning time w/ optimization (in ms) | Speedup

------------------------------------|--------------------------------------------------- | --------------------------------------------------|--------------

500 | 0.370 | 0.208 | 1.78

5000 | 1.998 | 0.816 | 2.45

50000 | 18.073 | 6.750 | 2.67

Number of NOT IN list elements | Planning time w/o optimization (in ms) | Planning time w/ optimization (in ms) | Speedup

-------------------------------------------|--------------------------------------------------- | --------------------------------------------------|--------------

500 | 0.342 | 0.203 | 1.68

5000 | 2.073 | 0.822 | 3.29

50000 |19.551 | 6.738 | 2.90

We see that the planning time of queries on unique columns are identical to that we observed for primary key columns.

The resulting patch file for the changes above is small and we are happy to polish it up and share.

Best,

Souvik Bhattacherjee

(ServiceNow)

On Thu, Aug 11, 2022 at 2:42 PM Souvik Bhattacherjee <pgsdbhacker@gmail.com> wrote:

Hi hackers,

At ServiceNow, we frequently encounter queries with very large IN lists where the number of elements in the IN list range from a few hundred to several thousand. For a significant fraction of the queries, the IN clauses are constructed on primary key columns. While planning these queries, Postgres query planner loops over every element in the IN clause, computing the selectivity of each element and then uses that as an input to compute the total selectivity of the IN clause. For IN clauses on primary key or unique columns, it is easy to see that the selectivity of the IN predicate is given by (number of elements in the IN clause / table cardinality) and is independent of the selectivity of the individual elements. We use this observation to avoid computing the selectivities of the individual elements. This results in an improvement in the planning time especially when the number of elements in the IN clause is relatively large.

The table below demonstrates the improvement in planning time (averaged over 3 runs) for IN queries of the form SELECT COUNT(*) FROM table_a WHERE sys_id IN ('000356e61b568510eabcca2b234bcb08', '00035846db2f24101ad7f256b9961925', ...). Here sys_id is the primary key column of type VARCHAR(32) and the table cardinality of table_a is around 10M.

Number of IN list elements
Planning time w/o optimization (in ms)
Planning time w/ optimization (in ms)
Speedup
500
0.371
0.236
1.57
5000
2.019
0.874
2.31
50000
19.886
8.273
2.40

Similar to IN clauses, the selectivity of NOT IN clauses on a primary key or unique column can be computed by not computing the selectivities of individual elements. The query used is of the form SELECT COUNT(*) FROM table_a WHERE sys_id NOT IN ('000356e61b568510eabcca2b234bcb08', '00035846db2f24101ad7f256b9961925', ...).

Number of NOT IN list elements
Planning time w/o optimization (in ms)
Planning time w/ optimization (in ms)
Speedup
500
0.380
0.248
1.53
5000
2.534
0.854
2.97
50000
21.316
9.009
2.36

We also obtain planning time of queries on a primary key column of type INTEGER with 10M elements for both IN and NOT in queries.

Number of IN list elements
Planning time w/o optimization (in ms)
Planning time w/ optimization (in ms)
Speedup
500
0.370
0.208
1.78
5000
1.998
0.816
2.45
50000
18.073
6.750
2.67

Number of NOT IN list elements
Planning time w/o optimization (in ms)
Planning time w/ optimization (in ms)
Speedup
500
0.342
0.203
1.68
5000
2.073
0.822
3.29
50000
19.551
6.738
2.90

We see that the planning time of queries on unique columns are identical to that we observed for primary key columns. The resulting patch file for the changes above is small and we are happy to polish it up and share.

Best,
Souvik Bhattacherjee
(ServiceNow)

pgsql-hackers by date:

From: Souvik Bhattacherjee
Date: 12 August 2022, 00:42:15
Subject: Reducing planning time of large IN queries on primary key / unique columns

From: Jacob Champion
Date: 12 August 2022, 01:28:31
Subject: Re: [PATCH] Expose port->authn_id to extensions and triggers

Re: Reducing planning time of large IN queries on primary key / unique columns - Mailing list pgsql-hackers

Previous

Next

Number of IN list elements	Planning time w/o optimization (in ms)	Planning time w/ optimization (in ms)	Speedup
500	0.371	0.236	1.57
5000	2.019	0.874	2.31
50000	19.886	8.273	2.40

Number of NOT IN list elements	Planning time w/o optimization (in ms)	Planning time w/ optimization (in ms)	Speedup
500	0.380	0.248	1.53
5000	2.534	0.854	2.97
50000	21.316	9.009	2.36