Thread: IN() statement values order makes 2x performance hit

IN() statement values order makes 2x performance hit

From
Alexey Kupershtokh
Date:
Hello everybody!<br /><br /> I have found a performance issue with 2 equivalent queries stably taking different (~x2)
timeto finish. In just a few words it can be described like this: if you have a lot of values in an IN() statement, you
shouldput most heavy (specifying most number of rows) ids first.<br /> This is mostly just a bug submit, than looking
forhelp.<br /><br /> So this is what I have: <ul><li> RHEL<li> PostgreSQL 8.3.1<li> A table ext_feeder_item with ~4.6M
records.<br/><tt>kia=# \d+ ext_feeder_item;<br /> Table "public.ext_feeder_item"<br /> Column | Type | Modifiers |
Description<br/>
----------+--------------------------+--------------------------------------------------------------+-------------<br/>
id| bigint | not null default nextval('ext_feeder_item_id_seq'::regclass) |<br /> feed_id | bigint | not null |<br />
pub_date| timestamp with time zone | |<br /> Indexes:<br /> "ext_feeder_item_pkey" PRIMARY KEY, btree (id)<br />
"ext_feeder_item_feed_id_pub_date_idx"btree (feed_id, pub_date)<br /> "ext_feeder_item_idx" btree (feed_id)<br />
Triggers:<br/> ....<br /> Has OIDs: no</tt><tt><br /></tt><li>Statistics for the fields feed_id and pub_date are set to
1000;<li>Thetable have just been vacuumed and analyzed.<li>A simple query to the table:<br /><tt> SELECT<br /> id<br />
FROM<br/> ext_feeder_item AS i<br /> WHERE<br /> i.feed_id IN (...)<br /> ORDER BY pub_date DESC, id DESC<br /> LIMIT
11OFFSET 0;<br /><br /></tt>with many (~1200) ids in the IN() statement.<li>The count of rows distribution for these
ids(may be thought of as foreign keys in this table) is the following:<br /> id = 54461: ~180000 - actually the most
heavyid in the whole table.<br /> other ids: a single id at most specifies 2032 rows; 6036 rows total.<li>If I perform
aquery with<br /><tt>IN(54461, ...)</tt><br /> it stably (5 attempts) takes 4.5..4.7 secs. to perform.<br /><tt>QUERY
PLAN<br/> Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual time=4585.420..4585.452 rows=11 loops=1)<br />
 ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16) (actual time=4585.415..4585.425 rows=11 loops=1)<br />
       Sort Key: pub_date, id<br />         Sort Method:  top-N heapsort  Memory: 17kB<br />         ->  Bitmap Heap
Scanon ext_feeder_item i (cost=13832.40..1449341.79 rows=617228 width=16) (actual time=894.622..4260.441 rows=185625
loops=1)<br/>               Recheck Cond: (feed_id = ANY ('{54461, ...}'::bigint[]))<br />               ->  Bitmap
IndexScan on ext_feeder_item_idx (cost=0.00..13678.10 rows=617228 width=0) (actual time=884.686..884.686 rows=185625
loops=1)<br/>                     Index Cond: (feed_id = ANY ('{54461, ...}'::bigint[]))<br /> Total runtime: 4585.852
ms</tt><tt><br/></tt><li>If I perform a query with<br /><tt>IN(..., 54461)<br /></tt>it stably (5 attempts) takes
9.3..9.5secs. to perform.<br /><tt>QUERY PLAN<br /> Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual
time=9330.267..9330.298rows=11 loops=1)<br />   ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16) (actual
time=9330.263..9330.273rows=11 loops=1)<br />         Sort Key: pub_date, id<br />         Sort Method:  top-N
heapsort Memory: 17kB<br />         ->  Bitmap Heap Scan on ext_feeder_item i (cost=13832.40..1449341.79 rows=617228
width=16)(actual time=1018.401..8971.029 rows=185625 loops=1)<br />               Recheck Cond: (feed_id = ANY ('{...
,54461}'::bigint[]))<br/>               ->  Bitmap Index Scan on ext_feeder_item_idx (cost=0.00..13678.10
rows=617228width=0) (actual time=1008.791..1008.791 rows=185625 loops=1)<br />                     Index Cond: (feed_id
=ANY ('{... ,54461}'::bigint[]))<br /> Total runtime: 9330.729 ms<br /></tt></ul> I don't know what are the roots of
theproblem, but I think that some symptomatic healing could be applied: the PostgreSQL could sort the IDs due to the
statistics.<br/> So currently I tend to select the IDs from another table ordering them due to their weights: it's easy
forme thanks to denormalization.<br /><br /> Also I would expect from PostgreSQL that it sorted the values to make
indexscan more sequential, but this expectation already conflicts with the bug described above :)<br /> 

Re: IN() statement values order makes 2x performance hit

From
Oleg Bartunov
Date:
You may try contrib/intarray, which we developed specially for
denormalization.

Oleg
On Thu, 29 May 2008, Alexey Kupershtokh wrote:

> Hello everybody!
>
> I have found a performance issue with 2 equivalent queries stably taking
> different (~x2) time to finish. In just a few words it can be described
> like this: if you have a lot of values in an IN() statement, you should
> put most heavy (specifying most number of rows) ids first.
> This is mostly just a bug submit, than looking for help.
>
> So this is what I have:
>  *  RHEL
>  *  PostgreSQL 8.3.1
>  *  A table ext_feeder_item with ~4.6M records.
>     kia=# \d+ ext_feeder_item;
>     Table "public.ext_feeder_item"
>     Column | Type | Modifiers | Description
> ----------+--------------------------+------------------------------------------
>     --------------------+-------------
>     id | bigint | not null default
>     nextval('ext_feeder_item_id_seq'::regclass) |
>     feed_id | bigint | not null |
>     pub_date | timestamp with time zone | |
>     Indexes:
>     "ext_feeder_item_pkey" PRIMARY KEY, btree (id)
>     "ext_feeder_item_feed_id_pub_date_idx" btree (feed_id, pub_date)
>     "ext_feeder_item_idx" btree (feed_id)
>     Triggers:
>     ....
>     Has OIDs: no
>  *  Statistics for the fields feed_id and pub_date are set to 1000;
>  *  The table have just been vacuumed and analyzed.
>  *  A simple query to the table:
>     SELECT
>     id
>     FROM
>     ext_feeder_item AS i
>     WHERE
>     i.feed_id IN (...)
>     ORDER BY pub_date DESC, id DESC
>     LIMIT 11 OFFSET 0;
>
>     with many (~1200) ids in the IN() statement.
>  *  The count of rows distribution for these ids (may be thought of as
>     foreign keys in this table) is the following:
>     id = 54461: ~180000 - actually the most heavy id in the whole table.
>     other ids: a single id at most specifies 2032 rows; 6036 rows total.
>  *  If I perform a query with
>     IN(54461, ...)
>     it stably (5 attempts) takes 4.5..4.7 secs. to perform.
>     QUERY PLAN
>     Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual
>     time=4585.420..4585.452 rows=11 loops=1)
>       ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16)
>     (actual time=4585.415..4585.425 rows=11 loops=1)
>             Sort Key: pub_date, id
>             Sort Method:  top-N heapsort  Memory: 17kB
>             ->  Bitmap Heap Scan on ext_feeder_item i
>     (cost=13832.40..1449341.79 rows=617228 width=16) (actual
>     time=894.622..4260.441 rows=185625 loops=1)
>                   Recheck Cond: (feed_id = ANY ('{54461,
>     ...}'::bigint[]))
>                   ->  Bitmap Index Scan on ext_feeder_item_idx
>     (cost=0.00..13678.10 rows=617228 width=0) (actual
>     time=884.686..884.686 rows=185625 loops=1)
>                         Index Cond: (feed_id = ANY ('{54461,
>     ...}'::bigint[]))
>     Total runtime: 4585.852 ms
>  *  If I perform a query with
>     IN(..., 54461)
>     it stably (5 attempts) takes 9.3..9.5 secs. to perform.
>     QUERY PLAN
>     Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual
>     time=9330.267..9330.298 rows=11 loops=1)
>       ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16)
>     (actual time=9330.263..9330.273 rows=11 loops=1)
>             Sort Key: pub_date, id
>             Sort Method:  top-N heapsort  Memory: 17kB
>             ->  Bitmap Heap Scan on ext_feeder_item i
>     (cost=13832.40..1449341.79 rows=617228 width=16) (actual
>     time=1018.401..8971.029 rows=185625 loops=1)
>                   Recheck Cond: (feed_id = ANY ('{...
>     ,54461}'::bigint[]))
>                   ->  Bitmap Index Scan on ext_feeder_item_idx
>     (cost=0.00..13678.10 rows=617228 width=0) (actual
>     time=1008.791..1008.791 rows=185625 loops=1)
>                         Index Cond: (feed_id = ANY ('{...
>     ,54461}'::bigint[]))
>     Total runtime: 9330.729 ms
> I don't know what are the roots of the problem, but I think that some
> symptomatic healing could be applied: the PostgreSQL could sort the IDs
> due to the statistics.
> So currently I tend to select the IDs from another table ordering them
> due to their weights: it's easy for me thanks to denormalization.
>
> Also I would expect from PostgreSQL that it sorted the values to make
> index scan more sequential, but this expectation already conflicts with
> the bug described above :)
>
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: IN() statement values order makes 2x performance hit

From
Alexey Kupershtokh
Date:
Thanks for the response.
I've taken a look at this feature. But it seems unapplicable to my case:
this table is not a many2many relation which seems the most common case
of the intarray usage.
The table just stores an information about items (rss posts): what feeds
(rss) are they from, and their publication date. Id in the table is PK.

Oleg Bartunov wrote:
> You may try contrib/intarray, which we developed specially for
> denormalization.
>
> Oleg
> On Thu, 29 May 2008, Alexey Kupershtokh wrote:
>
>> Hello everybody!
>>
>> I have found a performance issue with 2 equivalent queries stably taking
>> different (~x2) time to finish. In just a few words it can be described
>> like this: if you have a lot of values in an IN() statement, you should
>> put most heavy (specifying most number of rows) ids first.
>> This is mostly just a bug submit, than looking for help.
>>
>> So this is what I have:
>>  *  RHEL
>>  *  PostgreSQL 8.3.1
>>  *  A table ext_feeder_item with ~4.6M records.
>>     kia=# \d+ ext_feeder_item;
>>     Table "public.ext_feeder_item"
>>     Column | Type | Modifiers | Description
>> ----------+--------------------------+------------------------------------------
>>
>>     --------------------+-------------
>>     id | bigint | not null default
>>     nextval('ext_feeder_item_id_seq'::regclass) |
>>     feed_id | bigint | not null |
>>     pub_date | timestamp with time zone | |
>>     Indexes:
>>     "ext_feeder_item_pkey" PRIMARY KEY, btree (id)
>>     "ext_feeder_item_feed_id_pub_date_idx" btree (feed_id, pub_date)
>>     "ext_feeder_item_idx" btree (feed_id)
>>     Triggers:
>>     ....
>>     Has OIDs: no
>>  *  Statistics for the fields feed_id and pub_date are set to 1000;
>>  *  The table have just been vacuumed and analyzed.
>>  *  A simple query to the table:
>>     SELECT
>>     id
>>     FROM
>>     ext_feeder_item AS i
>>     WHERE
>>     i.feed_id IN (...)
>>     ORDER BY pub_date DESC, id DESC
>>     LIMIT 11 OFFSET 0;
>>
>>     with many (~1200) ids in the IN() statement.
>>  *  The count of rows distribution for these ids (may be thought of as
>>     foreign keys in this table) is the following:
>>     id = 54461: ~180000 - actually the most heavy id in the whole table.
>>     other ids: a single id at most specifies 2032 rows; 6036 rows total.
>>  *  If I perform a query with
>>     IN(54461, ...)
>>     it stably (5 attempts) takes 4.5..4.7 secs. to perform.
>>     QUERY PLAN
>>     Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual
>>     time=4585.420..4585.452 rows=11 loops=1)
>>       ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16)
>>     (actual time=4585.415..4585.425 rows=11 loops=1)
>>             Sort Key: pub_date, id
>>             Sort Method:  top-N heapsort  Memory: 17kB
>>             ->  Bitmap Heap Scan on ext_feeder_item i
>>     (cost=13832.40..1449341.79 rows=617228 width=16) (actual
>>     time=894.622..4260.441 rows=185625 loops=1)
>>                   Recheck Cond: (feed_id = ANY ('{54461,
>>     ...}'::bigint[]))
>>                   ->  Bitmap Index Scan on ext_feeder_item_idx
>>     (cost=0.00..13678.10 rows=617228 width=0) (actual
>>     time=884.686..884.686 rows=185625 loops=1)
>>                         Index Cond: (feed_id = ANY ('{54461,
>>     ...}'::bigint[]))
>>     Total runtime: 4585.852 ms
>>  *  If I perform a query with
>>     IN(..., 54461)
>>     it stably (5 attempts) takes 9.3..9.5 secs. to perform.
>>     QUERY PLAN
>>     Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual
>>     time=9330.267..9330.298 rows=11 loops=1)
>>       ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16)
>>     (actual time=9330.263..9330.273 rows=11 loops=1)
>>             Sort Key: pub_date, id
>>             Sort Method:  top-N heapsort  Memory: 17kB
>>             ->  Bitmap Heap Scan on ext_feeder_item i
>>     (cost=13832.40..1449341.79 rows=617228 width=16) (actual
>>     time=1018.401..8971.029 rows=185625 loops=1)
>>                   Recheck Cond: (feed_id = ANY ('{...
>>     ,54461}'::bigint[]))
>>                   ->  Bitmap Index Scan on ext_feeder_item_idx
>>     (cost=0.00..13678.10 rows=617228 width=0) (actual
>>     time=1008.791..1008.791 rows=185625 loops=1)
>>                         Index Cond: (feed_id = ANY ('{...
>>     ,54461}'::bigint[]))
>>     Total runtime: 9330.729 ms
>> I don't know what are the roots of the problem, but I think that some
>> symptomatic healing could be applied: the PostgreSQL could sort the IDs
>> due to the statistics.
>> So currently I tend to select the IDs from another table ordering them
>> due to their weights: it's easy for me thanks to denormalization.
>>
>> Also I would expect from PostgreSQL that it sorted the values to make
>> index scan more sequential, but this expectation already conflicts with
>> the bug described above :)
>>
>>
>
>     Regards,
>         Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83