Thread: How to speed up min/max(id) in 50M rows table?
Hi,
I have a table with some 50 million rows in PG 8.2. The table has indexes on the relevant columns. My problem is that almost everything I do with this table (which are actually very basic selects) is unbearably slow. For example:
select max(payment_id) from transactions
This takes 161 seconds.
The plan looks like this:
"Result (cost=0.37..0.38 rows=1 width=0) (actual time=184231.636..184231.638 rows=1 loops=1)"
" InitPlan"
" -> Limit (cost=0.00..0.37 rows=1 width=8) (actual time=184231.620..184231.622 rows=1 loops=1)"
" -> Index Scan Backward using trans_payment_id_index on transactions (cost=0.00..19144690.58 rows=51122691 width=8) (actual time=184231.613..184231.613 rows=1 loops=1)"
" Filter: (payment_id IS NOT NULL)"
"Total runtime: 184231.755 ms"
As shown in the plan, the index on the requested column "payment_id" is being used, but the query still takes quite a lot of time. If I use a WHERE clause in a similar query, the query seemingly runs forever, e.g.:
select min(time) from transactions where payment_id = 67
There are indexes on both the time (a timestamp with time zone) and payment_id (a bigint) columns. About 1 million rows satisfy the condition payment_id = 67. This query takes a totally unrealistic amount of time to execute (I have had it running for >30 minutes now on a machine with 8 GB and 4 cores @ 2.66 GHz, and it still isn't finished). With mpstat it becomes clear that the query is completely I/O bound (which is expected, of course). The machine I'm running this on has a fast RAID that can do about 320 MB/s.
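(A rough back-of-the-envelope estimate on my side, assuming ~5 ms per random disk seek and little caching: fetching the ~1 million matching rows one random heap page at a time would cost on the order of 1,000,000 x 5 ms ≈ 5,000 s, so the 320 MB/s of sequential throughput doesn't buy much as long as the access pattern is random.)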
Are these normal execution times for this number of rows and this hardware? Is there anything I can do to speed up these kinds of simple queries on huge tables?
Thanks in advance for all suggestions
>>> On Fri, Oct 12, 2007 at 3:41 PM, in message <BAY124-W503931258D0AE13AF37546F5A00@phx.gbl>, henk de wit <henk53602@hotmail.com> wrote:
> I have a table with some 50 millions rows in PG 8.2. The table has indexes
> on relevant columns. My problem is that most everything I do with this table
> (which are actually very basic selects) is unbearable slow.

Do you have autovacuum turned on? With what settings?

Do you run scheduled VACUUM ANALYZE?

What does the tail of the output from your last VACUUM ANALYZE VERBOSE look like?

-Kevin
> This query takes a totally unrealistic amount of time for execution (I have it running for >30 minutes now on a machine with 8GB and 4 cores@2.66Ghz, and it still isn't finished).
To correct myself: I looked at the wrong window earlier. When I typed the email, the query had in fact already finished and took about 18 minutes (which is still really long for such a query, of course), but "more than 30 minutes" was wrong.
> Do you have autovacuum turned on? With what settings?
Yes, I have it turned on. The settings are:
autovacuum on
autovacuum_analyze_scale_factor 0.1
autovacuum_analyze_threshold 250
autovacuum_freeze_max_age 200000000
autovacuum_naptime 1min
autovacuum_vacuum_cost_delay -1
autovacuum_vacuum_cost_limit -1
autovacuum_vacuum_scale_factor 0.2
autovacuum_vacuum_threshold 500
vacuum_cost_delay 0
vacuum_cost_limit 200
vacuum_cost_page_dirty 20
vacuum_cost_page_hit 1
vacuum_cost_page_miss 10
vacuum_freeze_min_age 100000000
> Do you run scheduled VACUUM ANALYZE?
This too, every night.
> What does the tail of the output from your last
> VACUUM ANALYZE VERBOSE look like?
I'll try to look it up and post it back here once I've found it.
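In the meantime, a quick sketch of what I intend to run to double-check that vacuum/analyze actually reach this table (this assumes the table is simply called "transactions" and that the 8.2 statistics view exposes the last vacuum/analyze timestamps):

-- when did (auto)vacuum / (auto)analyze last touch the table?
select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
from pg_stat_user_tables
where relname = 'transactions';

-- and to regenerate the verbose output you asked about:
vacuum analyze verbose transactions;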
> select payment_id from transactions order by payment_id desc limit 1;
This one is indeed instant: less than 50 ms. In my case I can't use it for max, though, because payment_id can be null (an unfortunate design choice). The other variant, however, didn't become instant; i.e. I tried:
select time from transactions where payment_id = 67 order by time asc limit 1;
But this one is still really slow.
Regards
henk de wit <henk53602@hotmail.com> writes:
> The plan looks like this:
> "Result (cost=0.37..0.38 rows=1 width=0) (actual time=184231.636..184231.638 rows=1 loops=1)"
> " InitPlan"
> "  -> Limit (cost=0.00..0.37 rows=1 width=8) (actual time=184231.620..184231.622 rows=1 loops=1)"
> "    -> Index Scan Backward using trans_payment_id_index on transactions (cost=0.00..19144690.58 rows=51122691 width=8) (actual time=184231.613..184231.613 rows=1 loops=1)"
> "      Filter: (payment_id IS NOT NULL)"
> "Total runtime: 184231.755 ms"

The only way I can see for that to be so slow is if you have a very large number of rows where payment_id is null --- is that the case?

There's not a lot you could do about that in existing releases :-(. In 8.3 it'll be possible to declare the index as NULLS FIRST, which moves the performance problem from the max end to the min end ...

> select min(time) from transactions where payment_id = 67
> There are indexes on both the time (a timestamp with time zone) and payment_id (a bigint) columns.

Creating indexes at random with no thought about how the system could use them is not a recipe for speeding up your queries. What you'd need to make this query fast is a double-column index on (payment_id, time) so that a forward scan on the items with payment_id = 67 would immediately find the minimum time entry. Neither of the single-column indexes offers any way to find the desired entry without scanning over lots of unrelated entries.

regards, tom lane
On Friday 12 October 2007, henk de wit <henk53602@hotmail.com> wrote:
> > select payment_id from transactions order by payment_id desc limit 1;
>
> This one is indeed instant! Less than 50ms. In my case I can't use it for
> max though because of the fact that payment_id can be null (which is an
> unfortunate design choice). The other variant however didn't become
> instant. I.e. I tried:
>
> select time from transactions where payment_id = 67 order by time asc
> limit 1;
>
> But this one is still really slow.

If you had a compound index on payment_id,time (especially with a WHERE time IS NOT NULL conditional) it would likely speed it up.

--
Ghawar is dying
I wrote:
> The only way I can see for that to be so slow is if you have a very
> large number of rows where payment_id is null --- is that the case?
> There's not a lot you could do about that in existing releases :-(.

Actually, there is a possibility if you are willing to change the query: make a partial index that excludes nulls.

Toy example:

regression=# create table fooey(f1 int);
CREATE TABLE
regression=# create index fooeyi on fooey(f1) where f1 is not null;
CREATE INDEX
regression=# explain select max(f1) from fooey;
                          QUERY PLAN
---------------------------------------------------------------
 Aggregate  (cost=36.75..36.76 rows=1 width=4)
   ->  Seq Scan on fooey  (cost=0.00..31.40 rows=2140 width=4)
(2 rows)

regression=# explain select max(f1) from fooey where f1 is not null;
                                           QUERY PLAN
-----------------------------------------------------------------------------------------------
 Result  (cost=0.03..0.04 rows=1 width=0)
   InitPlan
     ->  Limit  (cost=0.00..0.03 rows=1 width=4)
           ->  Index Scan Backward using fooeyi on fooey  (cost=0.00..65.55 rows=2129 width=4)
                 Filter: (f1 IS NOT NULL)
(5 rows)

Probably the planner ought to be smart enough to figure this out without the explicit WHERE in the query, but right now it isn't.

regards, tom lane
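P.S. Applied to your table that would be something along the lines of (pick whatever index name you like):

create index trans_payment_id_notnull_idx on transactions (payment_id) where payment_id is not null;

and then write the query with the explicit condition so the planner can match the partial index:

select max(payment_id) from transactions where payment_id is not null;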
> The only way I can see for that to be so slow is if you have a very
> large number of rows where payment_id is null --- is that the case?
The number of rows where payment_id is null is indeed large; it grows every day, up to about 1 million at the end of the so-called "payment period" (so currently about 1/100th of the table has nulls).
> In 8.3 it'll be possible to declare the index as NULLS FIRST, which
> moves the performance problem from the max end to the min end ...
Sounds interesting. I also noticed 8.3 is able to use an index for "is null". Luckily the 8.3 beta has just been released, so I'm going to set up a test system for 8.3 real soon to see what difference it makes for my particular dataset.
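If I read the 8.3 documentation correctly, that would be declared roughly like this (the index name is just my own invention):

create index trans_payment_id_nf_idx on transactions (payment_id nulls first);

so the nulls end up at the low end of the index, making max(payment_id) the fast case and min(payment_id) the slow one.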
> Creating indexes at random with no thought about how the system could
> use them is not a recipe for speeding up your queries. What you'd need
> to make this query fast is a double-column index on (payment_id, time)
> so that a forward scan on the items with payment_id = 67 would
> immediately find the minimum time entry. Neither of the single-column
> indexes offers any way to find the desired entry without scanning over
> lots of unrelated entries.
I see, that sounds very interesting too. As you might have noticed, I'm not an expert in this field, but I'm trying to learn. I was under the impression that the last few incarnations of PostgreSQL automatically combined single-column indexes for cases where a multi-column index would have been needed in earlier releases. But apparently this isn't true for every case, and I still have a lot to learn about PG.
Regards
henk de wit <henk53602@hotmail.com> writes:
> I see, that sounds very interesting too. As you might have noticed, I'm not
> an expert on this field but I'm trying to learn. I was under the impression
> that the last few incarnations of postgresql automatically combined single
> column indexes for cases where a multi-column index would be needed in
> earlier releases.

It's possible to combine independent indexes for resolving AND-type queries, but the combination process does not preserve ordering, so it's useless for this type of situation.

regards, tom lane
> It's possible to combine independent indexes for resolving AND-type
> queries, but the combination process does not preserve ordering, so
> it's useless for this type of situation.
Ok. Your suggestion about the index that leaves out the nulls worked great, btw; min/max is instant now. I'm going to try the double-column index next.
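If I understand you correctly, that would be something along these lines (the index name is just my own pick):

create index trans_payment_id_time_idx on transactions (payment_id, time);

so that

select min(time) from transactions where payment_id = 67;

can descend the index directly to payment_id = 67 and take the first time entry it finds there.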
I figure though that the double column index would not work for queries like:
select min(time) from transactions where payment_id is null
So for that situation I tried a specific partial index to see whether it would help, i.e.:
create index transactions__time_payment_id__null__idx on transactions(time) where payment_id is null;
But this does not really seem to help. It might be better, though, to see if I can refactor the DB design so it doesn't use nulls.
Regards
>select min(time) from transactions where payment_id is null
>So for that situation I tried whether a specific index helped, i.e. :
>create index transactions__time_payment_id__null__idx on transactions(time) where payment_id is null;
>But this does not really seem to help. It might be better to see if I can refactor the DB design though to not use nulls.
I was posting too fast again: the previous index -does- work, making the above query instant (<50 ms) :) I actually mistyped the test query before (it's rather late at my place); what I had really run was:
select min(payment_id) from transactions where payment_id is null
Although not useful, it's an interesting case: one would expect it to return quickly, but it takes 26 minutes to execute.