Thread: Index / Performance issues

Index / Performance issues

From
Lucas Adamski
Date:
Hi all,
 
I've been using pgsql heavily for about 2 years now, and I keep running into some index-related weirdness that's rather puzzling.  This is for release 7.2.1, so if a more recent release has solved these, great!  Nevertheless:
 
I have a table with about 170,000 rows, each of them a network event.  I also have a serial8 primary key set up, with a corresponding (unique) btree index.  The primary key is basically sequential, being incremented at insert time.  The problems I've had revolve around selecting an individual entry, or trying to figure out the current maximum ID in the table.  In both cases, the results are rather counter-intuitive.  Examples below, with my comments inline.
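For reference, the setup described corresponds to something along these lines (only my_e_id appears in the thread; the other column names here are hypothetical placeholders):

```sql
-- Hypothetical sketch of the table described above; only my_e_id is
-- taken from the thread, the other columns are placeholders.
CREATE TABLE my_events (
    my_e_id   serial8 PRIMARY KEY,  -- int8 sequence; PRIMARY KEY creates the unique btree index
    event_ts  timestamp NOT NULL,
    details   text
);
```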

I've had this problem using functions such as max(), etc.  For example:

Obvious way, using max():

# explain analyze select max(my_e_id) from my_events;
Aggregate  (cost=68132.85..68132.85 rows=1 width=8) (actual time=16103.03..16103.03 rows=1 loops=1)
  ->  Seq Scan on my_events  (cost=0.00..67699.28 rows=173428 width=8) (actual time=0.09..15932.27 rows=173480 loops=1)
Total runtime: 16103.11 msec

Obtuse way, using ORDER BY DESC/LIMIT:

# explain analyze select my_e_id from my_events ORDER BY my_e_id DESC LIMIT 1;
Limit  (cost=0.00..1.48 rows=1 width=8) (actual time=36.02..36.03 rows=1 loops=1)
  ->  Index Scan Backward using my_events_pkey on my_events  (cost=0.00..256931.94 rows=173428 width=8) (actual time=36.02..36.02 rows=2 loops=1)
Total runtime: 36.09 msec

In this case, the obtuse way is faster... 446 times faster, in fact.  I'd understand if this were a corner case, but this has been the situation with every PGSQL db I've built.

Here's another example, just trying to pick out a single entry out of 170,000.
First, the simple approach (status quo):

# explain analyze select * from my_events WHERE my_e_id = 10800000;                                                                    
Seq Scan on my_events  (cost=0.00..68132.85 rows=1 width=771) (actual time=15916.75..16337.31 rows=1 loops=1)
Total runtime: 16337.42 msec

Pretty darned slow... (16 secs in fact, ouch).  So now let's try our idea of ordering the query in reverse and limiting it to 1 result (the limit is unnecessary; performance is identical without it):

# explain analyze select * from my_events WHERE my_e_id = 10800000 ORDER BY my_e_id DESC LIMIT 1;
Limit  (cost=68132.86..68132.86 rows=1 width=771) (actual time=16442.42..16442.43 rows=1 loops=1)
  ->  Sort  (cost=68132.86..68132.86 rows=1 width=771) (actual time=16442.42..16442.42 rows=1 loops=1)
        ->  Seq Scan on my_events  (cost=0.00..68132.85 rows=1 width=771) (actual time=16009.50..16441.91 rows=1 loops=1)
Total runtime: 16442.70 msec

Well, that's not any better... over a few runs, sometimes this was even slower than the status quo.  At this point there was only one thing left to try... put a <= in place of =, and see if it made a difference.

# explain analyze select * from my_events WHERE my_e_id <= 10800000 ORDER BY my_e_id DESC LIMIT 1;
Limit  (cost=0.00..5.52 rows=1 width=771) (actual time=474.40..474.42 rows=1 loops=1)
  ->  Index Scan Backward using my_events_pkey on my_events  (cost=0.00..257365.51 rows=46663 width=771) (actual time=474.39..474.41 rows=2 loops=1)
Total runtime: 474.55 msec

Oddly enough, it did... note the "Index Scan Backward"... finally!  So for whatever reason, the DB decides not to use an index scan unless there's a greater-than or less-than comparison operator in conjunction with an ORDER BY/LIMIT.  Now it takes half a second, instead of 16.

# explain analyze select * from my_events WHERE my_e_id >= 10800000 ORDER BY my_e_id LIMIT 1;
Limit  (cost=0.00..2.03 rows=1 width=771) (actual time=1379.74..1379.76 rows=1 loops=1)
  ->  Index Scan using my_events_pkey on my_events  (cost=0.00..257365.51 rows=126765 width=771) (actual time=1379.73..1379.75 rows=2 loops=1)
Total runtime: 1380.10 msec

Just for fun, run it in regular order (front to back, versus back to front, looking for >=).  Sure enough, still far better than the seq scan... 1.4 seconds vs 16.  So even the worst-case index scan is still far better than the default approach.  Note that I tried "set enable_seqscan=off", and it STILL insisted on scanning the table, only even slower this time.

Am I missing something really obvious?  Is there a proven way to consistently encourage it to use indexes for these sorts of (rather obvious) queries?

Several runs of the above showed some variation in run time, but the relative orders of magnitude in performance stayed pretty consistent.  I'm just confused as to why I have to go through such convoluted methods to force it to use the index when it's obviously a FAR more efficient route regardless of which order it scans in (forwards or backwards).  Any thoughts are appreciated.  Thanks!

  Lucas.

Re: Index / Performance issues

From
"scott.marlowe"
Date:
Hi Lucas, you are running into two fairly common postgresql tuning issues.

When you run max(), you are literally asking the database to look at every
value and find the highest one.  While "select max(field) from table"
seems like a simple one to optimize, "select max(field) from table
where id<=800000 and size='m'" isn't so obvious anymore.  As the
max() queries get more complex, the ability to optimize them quickly
disappears.

Things get more complex in a multi-user environment, where different folks
can see different rows.  While the ORDER BY/LIMIT solution seems like a
hack, it is actually asking the question in a more easily optimized way.
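Put concretely, the recommended rewrite is the one Lucas already stumbled on: replace the aggregate with an ORDER BY/LIMIT that a backward index scan can satisfy (table and column names taken from the examples above):

```sql
-- Aggregate form: forces a full sequential scan in 7.2.x:
SELECT max(my_e_id) FROM my_events;

-- Equivalent, index-friendly form: a backward index scan can return
-- the first (i.e. highest) key without touching the rest of the table:
SELECT my_e_id FROM my_events ORDER BY my_e_id DESC LIMIT 1;
```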

The second problem you're running into is that postgresql doesn't
automatically match an int8 column against an int4 constant, and it assumes
integer literals without '' around them are int4.  The easy solution is to
enclose your id number inside '' marks, so you have:

select * from table where 8bitintfield='123456789';

and that will let the planner resolve your number as int8 and consider the index.
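Applied to the equality lookup from the original message, that becomes:

```sql
-- The quoted literal starts out as an untyped constant, so the planner
-- can resolve it to int8 and match it against the primary-key index:
SELECT * FROM my_events WHERE my_e_id = '10800000';
```

On 7.2 this should show an Index Scan in EXPLAIN instead of the Seq Scan above (I haven't run it against that particular table, so treat this as the expected outcome rather than a measured one).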

On Fri, 7 Mar 2003, Lucas Adamski wrote:

> [original message quoted in full; snipped]


Re: Index / Performance issues

From
Greg Stark
Date:

"scott.marlowe" <scott.marlowe@ihs.com> writes:

> select * from table where 8bitintfield='123456789';

Or:

select * from table where 8bitintfield=123456789::int8


I'm not sure which is aesthetically more pleasing.

--
greg

Re: Index / Performance issues

From
"scott.marlowe"
Date:
On 7 Mar 2003, Greg Stark wrote:

>
>
> "scott.marlowe" <scott.marlowe@ihs.com> writes:
>
> > select * from table where 8bitintfield='123456789';
>
> Or:
>
> select * from table where 8bitintfield=123456789::int8
>
>
> I'm not sure which is aesthetically more pleasing.

The cast is self-documenting, so it's probably a better choice for most
setups.  On the other hand, it's not as likely to be portable.


Re: Index / Performance issues

From
Kevin Brown
Date:
scott.marlowe wrote:
> > select * from table where 8bitintfield=123456789::int8
> >
> >
> > I'm not sure which is aesthetically more pleasing.
>
> The cast is self documenting, so it's probably a better choice for most
> setups.  On the other hand, it's not as likely to be portable.

May as well make it as portable as possible, though:

select * from table where 8bitintfield = CAST(123456789 AS bigint)



--
Kevin Brown                          kevin@sysexperts.com