Thread: How to speed up min/max(id) in 50M rows table?
Hi,
I have a table with some 50 million rows in PG 8.2. The table has indexes on the relevant columns. My problem is that almost everything I do with this table (which are actually very basic selects) is unbearably slow. For example:
select max(payment_id) from transactions
This takes 161 seconds.
The plan looks like this:
"Result (cost=0.37..0.38 rows=1 width=0) (actual time=184231.636..184231.638 rows=1 loops=1)"
" InitPlan"
" -> Limit (cost=0.00..0.37 rows=1 width=8) (actual time=184231.620..184231.622 rows=1 loops=1)"
" -> Index Scan Backward using trans_payment_id_index on transactions (cost=0.00..19144690.58 rows=51122691 width=8) (actual time=184231.613..184231.613 rows=1 loops=1)"
" Filter: (payment_id IS NOT NULL)"
"Total runtime: 184231.755 ms"
As shown in the plan, the index on the requested column "payment_id" is being used, but the query still takes quite a lot of time. If I use a WHERE clause in a similar query, the query seemingly runs forever, e.g.:
select min(time) from transactions where payment_id = 67
There are indexes on both the time (a timestamp with time zone) and payment_id (a bigint) columns. About 1 million rows satisfy the condition payment_id = 67. This query takes a totally unrealistic amount of time to execute (I have had it running for >30 minutes now on a machine with 8 GB and 4 cores @ 2.66 GHz, and it still isn't finished). With mpstat it becomes clear that the query is completely I/O bound (which is expected, of course). The machine I'm running this on has a fast RAID that can do about 320 MB/s.
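(A rough back-of-the-envelope estimate on my side, assuming ~5 ms per random disk seek and little caching: fetching the ~1 million matching rows one random heap page at a time would cost on the order of 1,000,000 x 5 ms ≈ 5,000 s, so the 320 MB/s of sequential throughput doesn't buy much as long as the access pattern is random.)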
Are these normal execution times for this number of rows and this hardware? Is there anything I can do to speed up these kinds of simple queries on huge tables?
Thanks in advance for all suggestions
>>> On Fri, Oct 12, 2007 at 3:41 PM, in message <BAY124-W503931258D0AE13AF37546F5A00@phx.gbl>, henk de wit <henk53602@hotmail.com> wrote:
> I have a table with some 50 millions rows in PG 8.2. The table has indexes
> on relevant columns. My problem is that most everything I do with this table
> (which are actually very basic selects) is unbearable slow.

Do you have autovacuum turned on? With what settings?

Do you run scheduled VACUUM ANALYZE?

What does the tail of the output from your last VACUUM ANALYZE VERBOSE look like?

-Kevin
> This query takes a totally unrealistic amount of time for execution (I have it running for >30 minutes now on a machine with 8GB and 4 cores@2.66Ghz, and it still isn't finished).
To correct myself: I looked at the wrong window earlier. When I typed the email, the query had in fact already finished and took about 18 minutes (which is still really long for such a query, of course), but "more than 30 minutes" was wrong.
> Do you have autovacuum turned on? With what settings?
Yes, I have it turned on. The settings are:
autovacuum on
autovacuum_analyze_scale_factor 0.1
autovacuum_analyze_threshold 250
autovacuum_freeze_max_age 200000000
autovacuum_naptime 1min
autovacuum_vacuum_cost_delay -1
autovacuum_vacuum_cost_limit -1
autovacuum_vacuum_scale_factor 0.2
autovacuum_vacuum_threshold 500
vacuum_cost_delay 0
vacuum_cost_limit 200
vacuum_cost_page_dirty 20
vacuum_cost_page_hit 1
vacuum_cost_page_miss 10
vacuum_freeze_min_age 100000000
> Do you run scheduled VACUUM ANALYZE?
This too, every night.
> What does the tail of the output from your last
> VACUUM ANALYZE VERBOSE look like?
I'll try to look it up and post it back here once I've found it.
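In the meantime, a quick sketch of what I intend to run to double-check that vacuum/analyze actually reach this table (this assumes the table is simply called "transactions" and that the 8.2 statistics view exposes the last vacuum/analyze timestamps):

-- when did (auto)vacuum / (auto)analyze last touch the table?
select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
from pg_stat_user_tables
where relname = 'transactions';

-- and to regenerate the verbose output you asked about:
vacuum analyze verbose transactions;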
> select payment_id from transactions order by payment_id desc limit 1;
This one is indeed instant: less than 50 ms. In my case I can't use it for max, though, because payment_id can be null (an unfortunate design choice). The other variant, however, didn't become instant; i.e. I tried:
select time from transactions where payment_id = 67 order by time asc limit 1;
But this one is still really slow.
Regards
henk de wit <henk53602@hotmail.com> writes:
> The plan looks like this:
> "Result (cost=0.37..0.38 rows=1 width=0) (actual time=184231.636..184231.638 rows=1 loops=1)"
> " InitPlan"
> "  -> Limit (cost=0.00..0.37 rows=1 width=8) (actual time=184231.620..184231.622 rows=1 loops=1)"
> "    -> Index Scan Backward using trans_payment_id_index on transactions (cost=0.00..19144690.58 rows=51122691 width=8) (actual time=184231.613..184231.613 rows=1 loops=1)"
> "      Filter: (payment_id IS NOT NULL)"
> "Total runtime: 184231.755 ms"

The only way I can see for that to be so slow is if you have a very large number of rows where payment_id is null --- is that the case?

There's not a lot you could do about that in existing releases :-(. In 8.3 it'll be possible to declare the index as NULLS FIRST, which moves the performance problem from the max end to the min end ...

> select min(time) from transactions where payment_id = 67
> There are indexes on both the time (a timestamp with time zone) and payment_id (a bigint) columns.

Creating indexes at random with no thought about how the system could use them is not a recipe for speeding up your queries. What you'd need to make this query fast is a double-column index on (payment_id, time) so that a forward scan on the items with payment_id = 67 would immediately find the minimum time entry. Neither of the single-column indexes offers any way to find the desired entry without scanning over lots of unrelated entries.

regards, tom lane
On Friday 12 October 2007, henk de wit <henk53602@hotmail.com> wrote:
> > select payment_id from transactions order by payment_id desc limit 1;
>
> This one is indeed instant! Less than 50ms. In my case I can't use it for
> max though because of the fact that payment_id can be null (which is an
> unfortunate design choice). The other variant however didn't become
> instant. I.e. I tried:
>
> select time from transactions where payment_id = 67 order by time asc
> limit 1;
>
> But this one is still really slow.

If you had a compound index on payment_id,time (especially with a WHERE time IS NOT NULL conditional) it would likely speed it up.

--
Ghawar is dying
I wrote:
> The only way I can see for that to be so slow is if you have a very
> large number of rows where payment_id is null --- is that the case?
> There's not a lot you could do about that in existing releases :-(.

Actually, there is a possibility if you are willing to change the query: make a partial index that excludes nulls.

Toy example:

regression=# create table fooey(f1 int);
CREATE TABLE
regression=# create index fooeyi on fooey(f1) where f1 is not null;
CREATE INDEX
regression=# explain select max(f1) from fooey;
                          QUERY PLAN
---------------------------------------------------------------
 Aggregate  (cost=36.75..36.76 rows=1 width=4)
   ->  Seq Scan on fooey  (cost=0.00..31.40 rows=2140 width=4)
(2 rows)

regression=# explain select max(f1) from fooey where f1 is not null;
                                           QUERY PLAN
-----------------------------------------------------------------------------------------------
 Result  (cost=0.03..0.04 rows=1 width=0)
   InitPlan
     ->  Limit  (cost=0.00..0.03 rows=1 width=4)
           ->  Index Scan Backward using fooeyi on fooey  (cost=0.00..65.55 rows=2129 width=4)
                 Filter: (f1 IS NOT NULL)
(5 rows)

Probably the planner ought to be smart enough to figure this out without the explicit WHERE in the query, but right now it isn't.

regards, tom lane
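P.S. Applied to your table that would be something along the lines of (pick whatever index name you like):

create index trans_payment_id_notnull_idx on transactions (payment_id) where payment_id is not null;

and then write the query with the explicit condition so the planner can match the partial index:

select max(payment_id) from transactions where payment_id is not null;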
> The only way I can see for that to be so slow is if you have a very
> large number of rows where payment_id is null --- is that the case?
The number of rows where payment_id is null is indeed large; it grows every day, up to about 1 million at the end of the so-called "payment period" (so currently about 1/100th of the table has nulls).
> In 8.3 it'll be possible to declare the index as NULLS FIRST, which
> moves the performance problem from the max end to the min end ...
Sounds interesting. I also noticed 8.3 is able to use an index for "is null". Luckily the 8.3 beta has just been released, so I'm going to set up a test system for 8.3 real soon to see what difference it makes for my particular dataset.
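If I read the 8.3 documentation correctly, that would be declared roughly like this (the index name is just my own invention):

create index trans_payment_id_nf_idx on transactions (payment_id nulls first);

so the nulls end up at the low end of the index, making max(payment_id) the fast case and min(payment_id) the slow one.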
> Creating indexes at random with no thought about how the system could
> use them is not a recipe for speeding up your queries. What you'd need
> to make this query fast is a double-column index on (payment_id, time)
> so that a forward scan on the items with payment_id = 67 would
> immediately find the minimum time entry. Neither of the single-column
> indexes offers any way to find the desired entry without scanning over
> lots of unrelated entries.
I see, that sounds very interesting too. As you might have noticed, I'm not an expert in this field, but I'm trying to learn. I was under the impression that the last few incarnations of PostgreSQL automatically combined single-column indexes for cases where a multi-column index would have been needed in earlier releases. But apparently this isn't true for every case, and I still have a lot to learn about PG.
Regards
henk de wit <henk53602@hotmail.com> writes:
> I see, that sounds very interesting too. As you might have noticed, I'm not
> an expert on this field but I'm trying to learn. I was under the impression
> that the last few incarnations of postgresql automatically combined single
> column indexes for cases where a multi-column index would be needed in
> earlier releases.

It's possible to combine independent indexes for resolving AND-type queries, but the combination process does not preserve ordering, so it's useless for this type of situation.

regards, tom lane
> It's possible to combine independent indexes for resolving AND-type
> queries, but the combination process does not preserve ordering, so
> it's useless for this type of situation.
Ok. Your suggestion about the index that leaves out the nulls worked great, btw; min/max is instant now. I'm going to try the double-column index next.
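If I understand you correctly, that would be something along these lines (the index name is just my own pick):

create index trans_payment_id_time_idx on transactions (payment_id, time);

so that

select min(time) from transactions where payment_id = 67;

can descend the index directly to payment_id = 67 and take the first time entry it finds there.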
I figure though that the double column index would not work for queries like:
select min(time) from transactions where payment_id is null
So for that situation I tried a specific partial index to see whether it would help, i.e.:
create index transactions__time_payment_id__null__idx on transactions(time) where payment_id is null;
But this does not really seem to help. It might be better, though, to see if I can refactor the DB design so it doesn't use nulls.
Regards
>select min(time) from transactions where payment_id is null
>So for that situation I tried whether a specific index helped, i.e. :
>create index transactions__time_payment_id__null__idx on transactions(time) where payment_id is null;
>But this does not really seem to help. It might be better to see if I can refactor the DB design though to not use nulls.
I was posting too fast again: the previous index -does- work, making the above query instant (<50 ms) :) I actually mistyped the test query before (it's rather late at my place); what I had really run was:
select min(payment_id) from transactions where payment_id is null
Although not useful, it's an interesting case: one would expect it to return quickly, but it takes 26 minutes to execute.