Re: Reducing planning time of large IN queries on primary key / unique columns - Mailing list pgsql-hackers

From Souvik Bhattacherjee
Subject Re: Reducing planning time of large IN queries on primary key / unique columns
Date
Msg-id CAJCGbZO1FwytN51tdfxJBNyPzuXgCixXio6tyevT09+p=0OjXw@mail.gmail.com
Whole thread Raw
In response to Reducing planning time of large IN queries on primary key / unique columns  (Souvik Bhattacherjee <pgsdbhacker@gmail.com>)
Responses Re: Reducing planning time of large IN queries on primary key / unique columns
List pgsql-hackers
(Re-posting with better formatting)

Hi hackers,

At ServiceNow, we frequently encounter queries with very large IN lists where the number of elements in the IN list range from a 

few hundred to several thousand. For a significant fraction of the queries, the IN clauses are constructed on primary key columns. 

While planning these queries, Postgres query planner loops over every element in the IN clause, computing the selectivity of each 

element and then uses that as an input to compute the total selectivity of the IN clause. For IN clauses on primary key or unique 

columns, it is easy to see that the selectivity of the IN predicate is given by (number of elements in the IN clause / table cardinality)

and is independent of the selectivity of the individual elements. We use this observation to avoid computing the selectivities of the

individual elements. This results in an improvement in the planning time especially when the number of elements in the IN clause 

is relatively large. 

 

The table below demonstrates the improvement in planning time (averaged over 3 runs) for IN queries of the form 

SELECT COUNT(*) FROM table_a WHERE sys_id IN ('000356e61b568510eabcca2b234bcb08', '00035846db2f24101ad7f256b9961925', ...). 

Here sys_id is the primary key column of type VARCHAR(32) and the table cardinality of table_a is around 10M.


Number of IN list elements | Planning time w/o optimization (in ms) | Planning time w/ optimization (in ms) | Speedup

------------------------------------|---------------------------------------------------  | --------------------------------------------------|--------------

500                                     | 0.371                                                     | 0.236                                                   | 1.57

5000                                   | 2.019                                                     | 0.874                                                   | 2.31

50000                                 | 19.886                                                   | 8.273                                                   | 2.40

 

Similar to IN clauses, the selectivity of NOT IN clauses on a primary key or unique column can be computed by not computing the 

selectivities of individual elements. The query used is of the form SELECT COUNT(*) FROM table_a WHERE sys_id NOT IN 

('000356e61b568510eabcca2b234bcb08', '00035846db2f24101ad7f256b9961925', ...).


Number of NOT IN list elements | Planning time w/o optimization (in ms) | Planning time w/ optimization (in ms) | Speedup

-------------------------------------------|---------------------------------------------------  | --------------------------------------------------|--------------

500                                              | 0.380                                                     | 0.248                                                   | 1.53

5000                                            | 2.534                                                     | 0.854                                                   | 2.97

50000                                          | 21.316                                                   | 9.009                                                   | 2.36

 

We also obtain planning time of queries on a primary key column of type INTEGER with 10M elements for both IN and NOT in queries.


Number of IN list elements | Planning time w/o optimization (in ms) | Planning time w/ optimization (in ms) | Speedup

------------------------------------|---------------------------------------------------  | --------------------------------------------------|--------------

500                                     | 0.370                                                     | 0.208                                                   | 1.78

5000                                   | 1.998                                                     | 0.816                                                   | 2.45

50000                                 | 18.073                                                   | 6.750                                                   | 2.67


Number of NOT IN list elements | Planning time w/o optimization (in ms) | Planning time w/ optimization (in ms) | Speedup

-------------------------------------------|---------------------------------------------------  | --------------------------------------------------|--------------

500                                              | 0.342                                                     | 0.203                                                   | 1.68

5000                                            | 2.073                                                     | 0.822                                                   | 3.29

50000                                          |19.551                                                   | 6.738                                                   | 2.90

 

We see that the planning time of queries on unique columns are identical to that we observed for primary key columns. 

The resulting patch file for the changes above is small and we are happy to polish it up and share.


Best,

Souvik Bhattacherjee

(ServiceNow)


On Thu, Aug 11, 2022 at 2:42 PM Souvik Bhattacherjee <pgsdbhacker@gmail.com> wrote:
Hi hackers,

At ServiceNow, we frequently encounter queries with very large IN lists where the number of elements in the IN list range from a few hundred to several thousand. For a significant fraction of the queries, the IN clauses are constructed on primary key columns. While planning these queries, Postgres query planner loops over every element in the IN clause, computing the selectivity of each element and then uses that as an input to compute the total selectivity of the IN clause. For IN clauses on primary key or unique columns, it is easy to see that the selectivity of the IN predicate is given by (number of elements in the IN clause / table cardinality) and is independent of the selectivity of the individual elements. We use this observation to avoid computing the selectivities of the individual elements. This results in an improvement in the planning time especially when the number of elements in the IN clause is relatively large. 

 

The table below demonstrates the improvement in planning time (averaged over 3 runs) for IN queries of the form SELECT COUNT(*) FROM table_a WHERE sys_id IN ('000356e61b568510eabcca2b234bcb08', '00035846db2f24101ad7f256b9961925', ...). Here sys_id is the primary key column of type VARCHAR(32) and the table cardinality of table_a is around 10M.

 

Number of IN list elements

Planning time w/o optimization (in ms)

Planning time w/ optimization (in ms)

Speedup

500

0.371

0.236

1.57

5000

2.019

0.874

2.31

50000

19.886

8.273

2.40

 

Similar to IN clauses, the selectivity of NOT IN clauses on a primary key or unique column can be computed by not computing the selectivities of individual elements. The query used is of the form SELECT COUNT(*) FROM table_a WHERE sys_id NOT IN ('000356e61b568510eabcca2b234bcb08', '00035846db2f24101ad7f256b9961925', ...).

 

Number of NOT IN list elements

Planning time w/o optimization (in ms)

Planning time w/ optimization (in ms)

Speedup

500

0.380

0.248

1.53

5000

2.534

0.854

2.97

50000

21.316

9.009

2.36

 

We also obtain planning time of queries on a primary key column of type INTEGER with 10M elements for both IN and NOT in queries.


Number of IN list elements

Planning time w/o optimization (in ms)

Planning time w/ optimization (in ms)

Speedup

500

0.370

0.208

1.78

5000

1.998

0.816

2.45

50000

18.073

6.750

2.67


 

Number of NOT IN list elements

Planning time w/o optimization (in ms)

Planning time w/ optimization (in ms)

Speedup

500

0.342

0.203

1.68

5000

2.073

0.822

3.29

50000

19.551

6.738

2.90

 

We see that the planning time of queries on unique columns are identical to that we observed for primary key columns. The resulting patch file for the changes above is small and we are happy to polish it up and share.


Best,

Souvik Bhattacherjee

(ServiceNow)

pgsql-hackers by date:

Previous
From: Souvik Bhattacherjee
Date:
Subject: Reducing planning time of large IN queries on primary key / unique columns
Next
From: Jacob Champion
Date:
Subject: Re: [PATCH] Expose port->authn_id to extensions and triggers