Thread: Performance and IN clauses

Performance and IN clauses

From: "Kynn Jones"
Date:
Hi.  I have a Perl script whose main loop generates thousands of SQL updates of the form

UPDATE edge SET keep = true WHERE node1 IN ( $node_list ) AND node2 = $node_id;

...where $node_list stands for a comma-separated list of integers, and $node_id stands for some integer.

The list represented by $node_list can be fairly long (on average it has around 900 entries, and can be as long as 30K entries), and I'm concerned about the performance cost of testing for inclusion in such a long list.  Is this done by a sequential search?  If so, is there a better way to write this query?  (FWIW, I have two indexes on the edge table using btree( node1 ) and btree( node2 ), respectively.)
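
For concreteness, here is a toy version of both forms -- the IN form I use
now, and a join against a VALUES list, which I gather the planner can treat
much like a small table (just an untested sketch with stand-in numbers, not
something I've benchmarked):

    -- current form: membership test against a literal list
    UPDATE edge SET keep = true
    WHERE node1 IN (101, 102, 103) AND node2 = 42;

    -- possible alternative: join against a VALUES list (PostgreSQL 8.2+)
    UPDATE edge SET keep = true
    FROM (VALUES (101), (102), (103)) AS v(id)
    WHERE edge.node1 = v.id AND edge.node2 = 42;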

Also, assuming that the optimal way to write the query depends on the length of $node_list, how can I estimate the "critical length" at which I should switch from one form of the query to the other?
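
(If it comes down to measuring empirically, I assume something like timing
each form at increasing list lengths with EXPLAIN ANALYZE would do, rolled
back so the table isn't actually modified:

    BEGIN;
    EXPLAIN ANALYZE
    UPDATE edge SET keep = true
    WHERE node1 IN (101, 102, 103) AND node2 = 42;
    ROLLBACK;  -- discard the update; only the timing was wanted

...but maybe there is a rule of thumb.)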

TIA!

Kynn

Re: Performance and IN clauses

From: Matthew Wakeling
Date:
On Tue, 18 Nov 2008, Kynn Jones wrote:
> Also, assuming that the optimal way to write the query depends on the length of $node_list, how can I estimate the
> "critical length" at which I should switch from one form of the query to the other?

In the past, I have found the fastest way to do this was to operate on
groups of a bit less than a thousand values, and issue one query per
group. Of course, Postgres may have improved since then, so I'll let more
knowledgeable people cover that for me.
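
The idea, sketched with toy-sized groups (in practice each IN list would hold
a few hundred to a thousand ids, built programmatically by the client):

    -- several small UPDATEs in one transaction instead of one huge IN list
    BEGIN;
    UPDATE edge SET keep = true WHERE node1 IN (101, 102, 103) AND node2 = 42;
    UPDATE edge SET keep = true WHERE node1 IN (104, 105, 106) AND node2 = 42;
    COMMIT;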

Matthew

--
 Heat is work, and work's a curse. All the heat in the universe, it's
 going to cool down, because it can't increase, then there'll be no
 more work, and there'll be perfect peace.      -- Michael Flanders

Re: Performance and IN clauses

From: tv@fuzzy.cz
Date:
I bet there is no 'critical' length - this is just another case of index
scan vs. seqscan. The efficiency depends on the size of the table / row,
amount of data in the table, variability of the column used in the IN
clause, etc.

If you split a query with 1000 items into 10 separate queries, the smaller
queries may each be faster, but the total time consumed may actually be
higher.
Something like

10 * (time of small query) + (time to combine them) > (time of large query)

If the performance of the 'split' solution is actually better than the
original query, it just means that the planner is not using an index scan
when it actually should. That means that either

(a) the planner is not smart enough
(b) it does not have current statistics for the table (run ANALYZE on the table)
(c) the statistics are not detailed enough (ALTER TABLE ... SET STATISTICS)
(d) the cost variables are not set properly (they do not match the hardware -
decrease the index scan cost / increase the seq scan cost)
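
For (b) and (c) that means something like this (the statistics target of 200
is just an illustrative value, not a recommendation):

    -- (b) refresh the planner's statistics for the table
    ANALYZE edge;

    -- (c) gather a more detailed histogram for the column used in the
    -- IN clause, then re-analyze so the new target takes effect
    ALTER TABLE edge ALTER COLUMN node1 SET STATISTICS 200;
    ANALYZE edge;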

regards
Tomas

> On Tue, 18 Nov 2008, Kynn Jones wrote:
>> Also, assuming that the optimal way to write the query depends on the
>> length of $node_list, how can I estimate the
>> "critical length" at which I should switch from one form of the query to
>> the other?
>
> In the past, I have found the fastest way to do this was to operate on
> groups of a bit less than a thousand values, and issue one query per
> group. Of course, Postgres may have improved since then, so I'll let more
> knowledgeable people cover that for me. Of course, Postgres may have
>
> Matthew
>
> --
>  Heat is work, and work's a curse. All the heat in the universe, it's
>  going to cool down, because it can't increase, then there'll be no
>  more work, and there'll be perfect peace.      -- Michael Flanders

Re: Performance and IN clauses

From: Mark Roberts
Date:
On Tue, 2008-11-18 at 17:38 +0100, tv@fuzzy.cz wrote:
> I bet there is no 'critical' length - this is just another case of index
> scan vs. seqscan. The efficiency depends on the size of the table / row,
> amount of data in the table, variability of the column used in the IN
> clause, etc.
>
> If you split a query with 1000 items into 10 separate queries, the
> smaller queries may each be faster, but the total time consumed may
> actually be higher. Something like
>
> 10 * (time of small query) + (time to combine them) > (time of large query)
>
> If the performance of the 'split' solution is actually better than the
> original query, it just means that the planner is not using an index scan
> when it actually should. That means that either
>
> (a) the planner is not smart enough
> (b) it does not have current statistics for the table (run ANALYZE on the table)
> (c) the statistics are not detailed enough (ALTER TABLE ... SET STATISTICS)
> (d) the cost variables are not set properly (they do not match the hardware -
> decrease the index scan cost / increase the seq scan cost)
>
> regards
> Tomas

I know that it's much faster (for us) to run many smaller queries than
one large query, and I think that it's primarily because of your reason
(a).  Most of our problems come from Pg misunderstanding the results of a
join and making a bad plan decision.  Batching dramatically reduces the
likelihood of this.

-Mark


Re: Performance and IN clauses

From: Tomas Vondra
Date:
> I know that it's much faster (for us) to run many smaller queries than
> one large query, and I think that it's primarily because of your reason
> (a).  Most of our problems come from Pg misunderstanding the results of a
> join and making a bad plan decision.  Batching dramatically reduces the
> likelihood of this.
>
> -Mark

Show us the plan (output of EXPLAIN ANALYZE), along with detailed info
about the table (number of rows / pages) and environment (amount of RAM,
etc.). Without this information it's impossible to tell which of the
choices is right.
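
For example (note that EXPLAIN ANALYZE actually executes the UPDATE, so wrap
it in a transaction and roll back if you don't want the effect; the values
here are stand-ins for your real query):

    BEGIN;
    EXPLAIN ANALYZE
    UPDATE edge SET keep = true WHERE node1 IN (101, 102, 103) AND node2 = 42;
    ROLLBACK;

    -- approximate row and page counts as the planner sees them
    SELECT relname, reltuples, relpages FROM pg_class WHERE relname = 'edge';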

In my experience the planner is a very good piece of software, but if
you feed it bad information about the environment / data to begin with,
you can hardly expect good plans.

Occasionally I provide consultancy to developers having problems with
PostgreSQL, and while most of the time (say 70% of the time) the
problems are caused by mistakes in SQL or incorrect design of the
system, problems with the PostgreSQL settings themselves are quite
common. I can't tell from your original post whether you are using the
default values or whether you have tuned the settings (and how).

For example if you don't set work_mem according to the memory actually
available, this may result in on-disk sorting even if you have plenty of
free memory. Or if you have fast drives, the default cost settings may
produce bad plans (an index scan may seem too expensive), etc. And of
course you may have data with complicated statistical properties, where
the default level of detail may not be sufficient (try increasing it
with ALTER TABLE ... SET STATISTICS for the column).
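
For illustration only - the values below are placeholders to adapt to your
hardware, not recommendations:

    -- allow larger in-memory sorts / hashes for this session
    SET work_mem = '64MB';

    -- on fast drives, make random I/O (and thus index scans) look cheaper
    SET random_page_cost = 2.0;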

Anyway - there may be a glitch / corner case in the planner of course,
but it's hard to tell without the EXPLAIN ANALYZE output.

regards
Tomas

Re: Performance and IN clauses

From: Tomas Vondra
Date:
Mark Roberts wrote:
> On Tue, 2008-11-18 at 17:38 +0100, tv@fuzzy.cz wrote:
>> I bet there is no 'critical' length - this is just another case of index
>> scan vs. seqscan. The efficiency depends on the size of the table / row,
>> amount of data in the table, variability of the column used in the IN
>> clause, etc.
>>
>> If you split a query with 1000 items into 10 separate queries, the
>> smaller queries may each be faster, but the total time consumed may
>> actually be higher. Something like
>>
>> 10 * (time of small query) + (time to combine them) > (time of large query)
>>
>> If the performance of the 'split' solution is actually better than the
>> original query, it just means that the planner is not using an index scan
>> when it actually should. That means that either
>>
>> (a) the planner is not smart enough
>> (b) it does not have current statistics for the table (run ANALYZE on the table)
>> (c) the statistics are not detailed enough (ALTER TABLE ... SET STATISTICS)
>> (d) the cost variables are not set properly (they do not match the hardware -
>> decrease the index scan cost / increase the seq scan cost)
>>
>> regards
>> Tomas
>
> I know that it's much faster (for us) to run many smaller queries than
> one large query, and I think that it's primarily because of your reason
> (a).  Most of our problems come from Pg misunderstanding the results of a
> join and making a bad plan decision.  Batching dramatically reduces the
> likelihood of this.

As I already said - even the smartest planner won't work without correct
input data. Have you tried fixing points (b), (c) and (d)?

Fixing them might improve the plans enough that you don't need the
batching at all.

regards
Tomas