Thread: min/max performance inequality.

min/max performance inequality.

From

Pawel Veselov

Date:

07 January 2015, 23:00:30

Hi.

I was wondering how come there is such a drastic difference between finding max and min. Seems like "index scan backwards" is really bad... The table is freshly re-indexed just in case. I added a count(*) in there, forcing the seq scan, and it's even better than the backwards index scan...

db=> EXPLAIN ANALYZE select min(rowdate) from r_agrio where blockid = 4814;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------

Result (cost=495.89..495.90 rows=1 width=0) (actual time=24.149..24.150 rows=1 loops=1)

InitPlan 1 (returns $0)

-> Limit (cost=0.00..495.89 rows=1 width=13) (actual time=24.139..24.140 rows=1 loops=1)

-> Index Scan using rowdate_r_agrio on r_agrio (cost=0.00..222160.24 rows=448 width=13) (actual time=24.137..24.137 rows=1 loops=1)

Index Cond: ((rowdate)::text IS NOT NULL)

Filter: (blockid = 4814::numeric)

Total runtime: 24.186 ms

(7 rows)

db=> EXPLAIN ANALYZE select max(rowdate) from r_agrio where blockid = 4814;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------------

Result (cost=495.89..495.90 rows=1 width=0) (actual time=926.032..926.033 rows=1 loops=1)

InitPlan 1 (returns $0)

-> Limit (cost=0.00..495.89 rows=1 width=13) (actual time=926.019..926.021 rows=1 loops=1)

-> Index Scan Backward using rowdate_r_agrio on r_agrio (cost=0.00..222160.24 rows=448 width=13) (actual time=926.017..926.017 rows=1 loops=1)

Index Cond: ((rowdate)::text IS NOT NULL)

Filter: (blockid = 4814::numeric)

Total runtime: 926.070 ms

(7 rows)

db=> EXPLAIN ANALYZE select count(*), max(rowdate) from r_agrio where blockid = 4814;

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------

Aggregate (cost=31585.18..31585.19 rows=1 width=13) (actual time=461.079..461.080 rows=1 loops=1)

-> Seq Scan on r_agrio (cost=0.00..31582.94 rows=448 width=13) (actual time=8.912..460.999 rows=15 loops=1)

Filter: (blockid = 4814::numeric)

Total runtime: 461.134 ms

(4 rows)

db=> \d r_agrio

Table "public.r_agrio"

Column | Type | Modifiers

-------------+-----------------------+--------------------

id | numeric(38,0) | not null

tagid | numeric(38,0) | not null

blockid | numeric(38,0) | not null

rowdate | character varying(15) | not null

count | numeric(38,0) | not null default 0

events | numeric(38,0) | not null default 0

devents | numeric(38,0) | not null default 0

duration | numeric(38,0) | not null default 0

device_type | numeric(38,0) | not null

placement | numeric(38,0) | not null default 0

unserved | numeric(38,0) | not null default 0

unconfirmed | numeric(38,0) | not null default 0

version | numeric(38,0) | not null default 1

Indexes:

"pk_r_agrio" PRIMARY KEY, btree (id)

"u_r_agrio" UNIQUE, btree (tagid, blockid, rowdate, device_type, placement)

"rowdate_r_agrio" btree (rowdate)

Re: min/max performance inequality.

From

Jeff Janes

Date:

07 January 2015, 23:34:56

On Wed, Jan 7, 2015 at 3:00 PM, Pawel Veselov <pawel.veselov@gmail.com> wrote:

Hi.

I was wondering how come there is such a drastic difference between finding max and min. Seems like "index scan backwards" is really bad... The table is freshly re-indexed just in case. I added a count(*) in there, forcing the seq scan, and it's even better than the backwards index scan...

db=> EXPLAIN ANALYZE select min(rowdate) from r_agrio where blockid = 4814;

It crawls the data in rowdate order (either forward or reverse) until it finds the first 4814. Crawling forward it finds 4814 very early. Crawling backwards it has to pass through a bunch of non-4814 before it finds the first 4814.

This fact doesn't show up in your EXPLAIN ANALYZE, but if you used a more modern version of postgresql (9.2 or above) there would be another line for "Rows Removed by Filter:" which would tell the story of what is going on.

If you have a composite index on (blockid, rowdate), it would help make this much faster, as it can go directly to the desired row.

Cheers,

Jeff

Re: min/max performance inequality.

From

Tom Lane

Date:

07 January 2015, 23:35:59

Pawel Veselov <pawel.veselov@gmail.com> writes:
> I was wondering how come there is such a drastic difference between finding
> max and min. Seems like "index scan backwards" is really bad...

It's probably an artifact of your data distribution, ie, the "blockid =
4814" condition is skipping lots of rows at one end of the index but few
or none at the other.

If you're concerned about the performance of this type of query, an index
on (blockid, rowdate) would work a lot better than the ones you've
provided.

            regards, tom lane

Re: min/max performance inequality.

From

Pawel Veselov

Date:

08 January 2015, 00:00:37

Thanks Jeff (and Tom)

On Wed, Jan 7, 2015 at 3:34 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

On Wed, Jan 7, 2015 at 3:00 PM, Pawel Veselov <pawel.veselov@gmail.com> wrote:
Hi.

I was wondering how come there is such a drastic difference between finding max and min. Seems like "index scan backwards" is really bad... The table is freshly re-indexed just in case. I added a count(*) in there, forcing the seq scan, and it's even better than the backwards index scan...

db=> EXPLAIN ANALYZE select min(rowdate) from r_agrio where blockid = 4814;

It crawls the data in rowdate order (either forward or reverse) until it finds the first 4814. Crawling forward it finds 4814 very early. Crawling backwards it has to pass through a bunch of non-4814 before it finds the first 4814.

This fact doesn't show up in your EXPLAIN ANALYZE, but if you used a more modern version of postgresql (9.2 or above) there would be another line for "Rows Removed by Filter:" which would tell the story of what is going on.

Yeah, there is 10x more rows on when going backwards

If you have a composite index on (blockid, rowdate), it would help make this much faster, as it can go directly to the desired row.

That does help a lot. So, when does postgres use a more-dimensional index, even if not all dimensions are engaged (as there is an index that involves those 2 fields, and more)? I definitely see it do that in some cases...

Even with that index, however, there is still a good difference in time (the interest is theoretical at this point, as I found a better way to extract that data anyway).

On a newer db.

db=> EXPLAIN analyze select min(rowdate) from r_agrio where blockid = 4814;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------

Result (cost=521.54..521.55 rows=1 width=0) (actual time=39.770..39.770 rows=1 loops=1)

InitPlan 1 (returns $0)

-> Limit (cost=0.55..521.54 rows=1 width=13) (actual time=39.765..39.766 rows=1 loops=1)

-> Index Scan using rowdate_r_agrio on r_agrio (cost=0.55..303738.47 rows=583 width=13) (actual time=39.763..39.763 rows=1 loops=1)

Index Cond: ((rowdate)::text IS NOT NULL)

Filter: (blockid = 4814::numeric)

Rows Removed by Filter: 37246

Total runtime: 39.798 ms

(8 rows)

db=> EXPLAIN analyze select max(rowdate) from r_agrio where blockid = 4814;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------

Result (cost=521.54..521.55 rows=1 width=0) (actual time=1497.377..1497.378 rows=1 loops=1)

InitPlan 1 (returns $0)

-> Limit (cost=0.55..521.54 rows=1 width=13) (actual time=1497.371..1497.372 rows=1 loops=1)

-> Index Scan Backward using rowdate_r_agrio on r_agrio (cost=0.55..303738.47 rows=583 width=13) (actual time=1497.370..1497.370 rows=1 loops=1)

Index Cond: ((rowdate)::text IS NOT NULL)

Filter: (blockid = 4814::numeric)

Rows Removed by Filter: 317739

Total runtime: 1497.407 ms

(8 rows)

db=> CREATE INDEX concurrently xxx on r_agrio(rowdate,blockid);

CREATE INDEX

db=> EXPLAIN analyze select min(rowdate) from r_agrio where blockid = 4814;

QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------

Result (cost=85.05..85.06 rows=1 width=0) (actual time=17.585..17.585 rows=1 loops=1)

InitPlan 1 (returns $0)

-> Limit (cost=0.43..85.05 rows=1 width=13) (actual time=17.580..17.581 rows=1 loops=1)

-> Index Only Scan using xxx on r_agrio (cost=0.43..37827.09 rows=447 width=13) (actual time=17.578..17.578 rows=1 loops=1)

Index Cond: ((rowdate IS NOT NULL) AND (blockid = 4814::numeric))

Heap Fetches: 0

Total runtime: 17.616 ms

(7 rows)

db=> EXPLAIN analyze select max(rowdate) from r_agrio where blockid = 4814;

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------

Result (cost=85.04..85.05 rows=1 width=0) (actual time=89.141..89.142 rows=1 loops=1)

InitPlan 1 (returns $0)

-> Limit (cost=0.43..85.04 rows=1 width=13) (actual time=89.135..89.136 rows=1 loops=1)

-> Index Only Scan Backward using xxx on r_agrio (cost=0.43..37823.09 rows=447 width=13) (actual time=89.134..89.134 rows=1 loops=1)

Index Cond: ((rowdate IS NOT NULL) AND (blockid = 4814::numeric))

Heap Fetches: 1

Total runtime: 89.173 ms

(7 rows)

Cheers,