Thread: IN() statement value order makes a 2x performance hit
Hello everybody!

I have found a performance issue with 2 equivalent queries consistently taking different (~2x) time to finish. In a few words it can be described like this: if you have a lot of values in an IN() list, you should put the heaviest ids (those matching the most rows) first.
This is mostly a bug report rather than a request for help.

So this is what I have:
* RHEL
* PostgreSQL 8.3.1
* A table ext_feeder_item with ~4.6M records.

kia=# \d+ ext_feeder_item;
                                   Table "public.ext_feeder_item"
  Column  |           Type           |                          Modifiers                           | Description
----------+--------------------------+---------------------------------------------------------------+-------------
 id       | bigint                   | not null default nextval('ext_feeder_item_id_seq'::regclass) |
 feed_id  | bigint                   | not null                                                      |
 pub_date | timestamp with time zone |                                                               |
Indexes:
    "ext_feeder_item_pkey" PRIMARY KEY, btree (id)
    "ext_feeder_item_feed_id_pub_date_idx" btree (feed_id, pub_date)
    "ext_feeder_item_idx" btree (feed_id)
Triggers:
    ....
Has OIDs: no

* The statistics targets for the fields feed_id and pub_date are set to 1000.
* The table has just been vacuumed and analyzed.
* A simple query to the table:

    SELECT
        id
    FROM
        ext_feeder_item AS i
    WHERE
        i.feed_id IN (...)
    ORDER BY pub_date DESC, id DESC
    LIMIT 11 OFFSET 0;

  with many (~1200) ids in the IN() list.
* The distribution of row counts for these ids (they may be thought of as foreign keys in this table) is the following:
    id = 54461: ~180000 rows; actually the heaviest id in the whole table.
    other ids: a single id matches at most 2032 rows; 6036 rows in total.
* If I perform a query with

    IN (54461, ...)

  it consistently (5 attempts) takes 4.5..4.7 seconds.

    QUERY PLAN
     Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual time=4585.420..4585.452 rows=11 loops=1)
       ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16) (actual time=4585.415..4585.425 rows=11 loops=1)
             Sort Key: pub_date, id
             Sort Method:  top-N heapsort  Memory: 17kB
             ->  Bitmap Heap Scan on ext_feeder_item i  (cost=13832.40..1449341.79 rows=617228 width=16) (actual time=894.622..4260.441 rows=185625 loops=1)
                   Recheck Cond: (feed_id = ANY ('{54461, ...}'::bigint[]))
                   ->  Bitmap Index Scan on ext_feeder_item_idx  (cost=0.00..13678.10 rows=617228 width=0) (actual time=884.686..884.686 rows=185625 loops=1)
                         Index Cond: (feed_id = ANY ('{54461, ...}'::bigint[]))
     Total runtime: 4585.852 ms

* If I perform a query with

    IN (..., 54461)

  it consistently (5 attempts) takes 9.3..9.5 seconds.

    QUERY PLAN
     Limit  (cost=1463104.22..1463104.25 rows=11 width=16) (actual time=9330.267..9330.298 rows=11 loops=1)
       ->  Sort  (cost=1463104.22..1464647.29 rows=617228 width=16) (actual time=9330.263..9330.273 rows=11 loops=1)
             Sort Key: pub_date, id
             Sort Method:  top-N heapsort  Memory: 17kB
             ->  Bitmap Heap Scan on ext_feeder_item i  (cost=13832.40..1449341.79 rows=617228 width=16) (actual time=1018.401..8971.029 rows=185625 loops=1)
                   Recheck Cond: (feed_id = ANY ('{... ,54461}'::bigint[]))
                   ->  Bitmap Index Scan on ext_feeder_item_idx  (cost=0.00..13678.10 rows=617228 width=0) (actual time=1008.791..1008.791 rows=185625 loops=1)
                         Index Cond: (feed_id = ANY ('{... ,54461}'::bigint[]))
     Total runtime: 9330.729 ms

I don't know what the root of the problem is, but I think some symptomatic treatment could be applied: PostgreSQL could sort the ids according to the statistics.
So currently I tend to select the ids from another table, ordering them by their weights; that is easy for me thanks to denormalization (a sketch of this workaround follows below).

I would also expect PostgreSQL to sort the values itself to make the index scan more sequential, but that expectation already conflicts with the bug described above :)
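A minimal sketch of the workaround mentioned above, assuming a denormalized feed table that keeps a per-feed item count; the names ext_feeder_feed and items_count are hypothetical (only ext_feeder_item comes from the post), and the "..." placeholders stand for the roughly 1200 candidate ids:

    -- Step 1: fetch the candidate feed ids ordered by weight, heaviest first.
    -- ext_feeder_feed and items_count are assumed names for the denormalized table/counter.
    SELECT id
    FROM ext_feeder_feed
    WHERE id IN (54461 /* , ... the other candidate ids ... */)
    ORDER BY items_count DESC;

    -- Step 2: rebuild the IN() list in the application in that order,
    -- so the heaviest id comes first, and run the original query.
    SELECT id
    FROM ext_feeder_item AS i
    WHERE i.feed_id IN (54461 /* , ... remaining ids, heaviest first ... */)
    ORDER BY pub_date DESC, id DESC
    LIMIT 11 OFFSET 0;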
You may try contrib/intarray, which we developed specially for denormalization.

Oleg

On Thu, 29 May 2008, Alexey Kupershtokh wrote:
> I have found a performance issue with 2 equivalent queries consistently
> taking different (~2x) time to finish.
> [...]
> So currently I tend to select the ids from another table, ordering them
> by their weights; that is easy for me thanks to denormalization.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
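A minimal sketch of the kind of array-based denormalization contrib/intarray targets, assuming the module is installed; the table and column names here are hypothetical and not taken from the thread:

    -- Hypothetical many-to-many denormalization: related ids are collapsed into an
    -- integer[] column and indexed with the intarray GIN operator class.
    CREATE TABLE feed_tags (
        feed_id bigint PRIMARY KEY,
        tag_ids integer[] NOT NULL
    );

    CREATE INDEX feed_tags_tag_ids_idx
        ON feed_tags USING gin (tag_ids gin__int_ops);

    -- "Overlap" search: feeds whose tag_ids share at least one value with the list
    -- (the array values are only examples).
    SELECT feed_id
    FROM feed_tags
    WHERE tag_ids && ARRAY[54461, 54462];

This fits the many-to-many style of data intarray was built for, which is why it does not map directly onto the ext_feeder_item layout above.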
Thanks for the response. I've taken a look at this feature, but it seems inapplicable to my case: this table is not a many-to-many relation, which seems to be the most common use case for intarray. The table just stores information about items (RSS posts): which feeds (RSS) they come from and their publication date. The id column is the table's primary key.

Oleg Bartunov wrote:
> You may try contrib/intarray, which we developed specially for
> denormalization.
>
> Oleg
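For completeness, a sketch of how the per-feed weights could be computed straight from the items table described above if no denormalized counter were available; the id list is illustrative:

    -- Row count ("weight") per candidate feed, heaviest first; the result can be
    -- used to order the ids in the IN() list of the original query.
    SELECT feed_id, count(*) AS weight
    FROM ext_feeder_item
    WHERE feed_id IN (54461 /* , ... candidate ids ... */)
    GROUP BY feed_id
    ORDER BY weight DESC;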