Thread: Query performance issue

Query performance issue

From: "Jonathan Gray"

We’re experiencing a query performance problem related to the planner and its ability to perform a specific type of merge.


We have created a test case (as attached, or here: http://www3.streamy.com/postgres/indextest.sql) which involves a hypothetical customer ordering system, with customers, orders, and customer groups.


If we want to retrieve a single customer's 10 most recent orders, sorted by date, we can use a double index on (customer, date); Postgres's query planner will use the double index with a backward index scan on the second indexed column (date).


However, if we want to retrieve a “customer class’s” 10 most recent orders, sorted by date, we are not able to get Postgres to use double indexes.


We have come to the conclusion that the fastest way to accomplish this type of query is to merge, in sorted order, each customer's set of orders (for which we can use the double index).  Using a heap to merge these ordered lists (until we reach the limit) seems the most algorithmically efficient approach we have found.  This is implemented in the attachment as a pl/pythonu function.
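For illustration, the heap merge described above can be sketched in plain Python (this is a minimal standalone illustration with a hypothetical data shape, not the attached pl/pythonu function, which also has to query the database for each customer's list):

```python
import heapq
from itertools import islice

def top_orders(per_customer_orders, limit=10):
    # Each input list is already sorted newest-first, e.g. (orderstamp,
    # orderid) tuples in the order a backward scan of the
    # (customer, date) index would return them for one customer.
    # heapq.merge consumes the inputs lazily, so once `limit` rows have
    # been produced no further rows are pulled from any list.
    merged = heapq.merge(*per_customer_orders, reverse=True)
    return list(islice(merged, limit))
```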


Another less algorithmically efficient solution, but faster in practice for many cases, is to fetch the full limit of orders from each customer, sort these by date, and return up to the limit.
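That simpler strategy looks roughly like this (`fetch_orders` is a hypothetical callable standing in for a per-customer query such as `SELECT orderstamp, orderid FROM indextest.orders WHERE customerid = $1 ORDER BY orderstamp DESC LIMIT $2`):

```python
def top_orders_presort(fetch_orders, customerids, limit=10):
    # Pull up to `limit` orders from every customer, then sort the
    # union and keep the newest `limit` overall.  O(n log n) in the
    # number of candidates, but a single pass and no heap bookkeeping.
    candidates = []
    for cid in customerids:
        candidates.extend(fetch_orders(cid, limit))
    candidates.sort(reverse=True)   # newest first
    return candidates[:limit]
```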


We are no masters of reading query plans, but for straight SQL queries the planner seems to yield two different types of plan.  They are fast in certain cases but break down in our typical use cases, where the number of orders per customer is sparse compared to the total number of orders across the date range.


We are interested in whether a mechanism internal to Postgres can accomplish this type of merging of indexed columns in sorted order.


If this cannot currently be accomplished (or if there is something we are missing about why it shouldn't be), we would appreciate any pointers on translating our Python heap approach into C functions integrated more closely with Postgres.  The Python function incurs large constant costs because of type conversions and repeated queries to the database.


Thanks for any help or direction.


Jonathan Gray / Miguel Simon

Attachment

Re: Query performance issue

From: Chris
Jonathan Gray wrote:
> We’re experiencing a query performance problem related to the planner
> and its ability to perform a specific type of merge.
>
>
>
> We have created a test case (as attached, or here:
> http://www3.streamy.com/postgres/indextest.sql) which involves a
> hypothetical customer ordering system, with customers, orders, and
> customer groups.
>
>
>
> If we want to retrieve a single customers 10 most recent orders, sorted
> by date, we can use a double index on (customer,date); Postgres’s query
> planner will use the double index with  a backwards index scan on the
> second indexed column (date).
>
>
>
> However, if we want to retrieve a “customer class’s” 10 most recent
> orders, sorted by date, we are not able to get Postgres to use double
> indexes.

You don't have any indexes on the 'customerclass' table.

Creating a foreign key doesn't create an index; you need to do that
separately.

Try

create index cc_customerid_class on indextest.customerclass(classid,
customerid);

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: Query performance issue

From: Chris
Chris wrote:
> Jonathan Gray wrote:
>> We’re experiencing a query performance problem related to the planner
>> and its ability to perform a specific type of merge.
>>
>>
>>
>> We have created a test case (as attached, or here:
>> http://www3.streamy.com/postgres/indextest.sql) which involves a
>> hypothetical customer ordering system, with customers, orders, and
>> customer groups.
>>
>>
>>
>> If we want to retrieve a single customers 10 most recent orders,
>> sorted by date, we can use a double index on (customer,date);
>> Postgres’s query planner will use the double index with  a backwards
>> index scan on the second indexed column (date).
>>
>>
>>
>> However, if we want to retrieve a “customer class’s” 10 most recent
>> orders, sorted by date, we are not able to get Postgres to use double
>> indexes.
>
> You don't have any indexes on the 'customerclass' table.
>
> Creating a foreign key doesn't create an index, you need to do that
> separately.
>
> Try
>
> create index cc_customerid_class on indextest.customerclass(classid,
> customerid);
>

It could also be that since you don't have very much data (10,000 rows),
postgres is ignoring the indexes because it'll be quicker to scan the
tables.

If you bump it up to say 100k rows, what happens?

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: Query performance issue

From: "Jonathan Gray"
Chris,

Creating indexes on the customerclass table does speed up the queries but
still does not create the plan we are looking for (using the double index
with a backward index scan on the orders table).

The plans we now get, with times on par or slightly better than with the
plpgsql hack, are:

  EXPLAIN ANALYZE
  SELECT o.orderid,o.orderstamp FROM indextest.orders o
  INNER JOIN indextest.customerclass cc ON (cc.classid = 2)
  WHERE o.customerid = cc.customerid ORDER BY o.orderstamp DESC LIMIT 5;


                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..176.65 rows=5 width=12) (actual time=0.930..3.675 rows=5 loops=1)
   ->  Nested Loop  (cost=0.00..46388.80 rows=1313 width=12) (actual time=0.927..3.664 rows=5 loops=1)
         ->  Index Scan Backward using orders_orderstamp_idx on orders o  (cost=0.00..6225.26 rows=141001 width=16) (actual time=0.015..0.957 rows=433 loops=1)
         ->  Index Scan using customerclass_customerid_idx on customerclass cc  (cost=0.00..0.27 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=433)
               Index Cond: (o.customerid = cc.customerid)
               Filter: (classid = 2)

And

  EXPLAIN ANALYZE
  SELECT o.orderid,o.orderstamp FROM indextest.orders o
  INNER JOIN indextest.customerclass cc ON (cc.classid = 2)
  WHERE o.customerid = cc.customerid ORDER BY o.orderstamp DESC LIMIT 100;

                                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1978.80..1979.05 rows=100 width=12) (actual time=6.167..6.448 rows=100 loops=1)
   ->  Sort  (cost=1978.80..1982.09 rows=1313 width=12) (actual time=6.165..6.268 rows=100 loops=1)
         Sort Key: o.orderstamp
         ->  Nested Loop  (cost=3.99..1910.80 rows=1313 width=12) (actual time=0.059..4.576 rows=939 loops=1)
               ->  Bitmap Heap Scan on customerclass cc  (cost=3.99..55.16 rows=95 width=4) (actual time=0.045..0.194 rows=95 loops=1)
                     Recheck Cond: (classid = 2)
                     ->  Bitmap Index Scan on customerclass_classid_idx  (cost=0.00..3.96 rows=95 width=0) (actual time=0.032..0.032 rows=95 loops=1)
                           Index Cond: (classid = 2)
               ->  Index Scan using orders_customerid_idx on orders o  (cost=0.00..19.35 rows=15 width=16) (actual time=0.006..0.025 rows=10 loops=95)
                     Index Cond: (o.customerid = cc.customerid)


As I said, this is a hypothetical test case we arrived at to describe our
situation as best we can in a simple form.  We're interested in potential
issues with the approach, why Postgres would not attempt something like it,
and how we might go about implementing it ourselves at a lower level than we
currently have (in SPI, libpq, etc.).

If it could be generalized then we could use it in cases where we aren't
pulling from just one table (the orders table) but rather trying to merge,
in sorted order, results from different conditions on different tables.
Right now we use something like the plpgsql or plpythonu functions in the
example and they outperform our regular SQL queries by a fairly significant
margin.

An example might be:

SELECT * FROM (
  (SELECT orderid, stamp FROM indextest.orders_usa WHERE customerid IN (<set of
customerids>) ORDER BY stamp DESC LIMIT 5) UNION
  (SELECT orderid, stamp FROM indextest.orders_can WHERE customerid IN (<set of
customerids>) ORDER BY stamp DESC LIMIT 5)
) AS x ORDER BY x.stamp DESC

Again, that's a general example, but some of my queries contain between 5 and
10 different sorted joins of this kind, and it would be helpful to have
something internal in postgres to handle it efficiently (do something just
like the above query, but without having to fetch the full LIMIT 5 from each
set; some kind of in-order merge/heap join?)
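To make the intent concrete, here is a sketch of that kind of in-order merge in plain Python.  It assumes each source is an iterator yielding (stamp, row) pairs newest-first (e.g. a cursor over one of the per-table queries); `merge_sorted_sources` is a hypothetical name, and negating the stamp for the heap assumes numeric stamps (real timestamps would need a different ordering key):

```python
import heapq

def merge_sorted_sources(sources, limit):
    # Lazily merge several iterators that each yield (stamp, row)
    # newest-first, stopping after `limit` rows total.  Unlike the
    # UNION query above, no source is ever read further than the
    # merged limit requires.
    iters = [iter(s) for s in sources]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            stamp, row = first
            heap.append((-stamp, i, row))   # negate: heapq is a min-heap
    heapq.heapify(heap)
    out = []
    while heap and len(out) < limit:
        _, i, row = heapq.heappop(heap)
        out.append(row)
        nxt = next(iters[i], None)          # refill from the same source
        if nxt is not None:
            heapq.heappush(heap, (-nxt[0], i, nxt[1]))
    return out
```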

Jonathan Gray

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Chris
Sent: Tuesday, July 24, 2007 1:51 AM
To: Jonathan Gray
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query performance issue

Chris wrote:
> Jonathan Gray wrote:
>> We're experiencing a query performance problem related to the planner
>> and its ability to perform a specific type of merge.
>>
>>
>>
>> We have created a test case (as attached, or here:
>> http://www3.streamy.com/postgres/indextest.sql) which involves a
>> hypothetical customer ordering system, with customers, orders, and
>> customer groups.
>>
>>
>>
>> If we want to retrieve a single customers 10 most recent orders,
>> sorted by date, we can use a double index on (customer,date);
>> Postgres's query planner will use the double index with  a backwards
>> index scan on the second indexed column (date).
>>
>>
>>
>> However, if we want to retrieve a "customer class's" 10 most recent
>> orders, sorted by date, we are not able to get Postgres to use double
>> indexes.
>
> You don't have any indexes on the 'customerclass' table.
>
> Creating a foreign key doesn't create an index, you need to do that
> separately.
>
> Try
>
> create index cc_customerid_class on indextest.customerclass(classid,
> customerid);
>

It could also be that since you don't have very much data (10,000) rows
- postgres is ignoring the indexes because it'll be quicker to scan the
tables.

If you bump it up to say 100k rows, what happens?

--
Postgresql & php tutorials
http://www.designmagick.com/



Re: Query performance issue

From: Chris
Jonathan Gray wrote:
> Chris,
>
> Creating indexes on the customerclass table does speed up the queries but
> still does not create the plan we are looking for (using the double index
> with a backward index scan on the orders table).

Stupid question - why is that particular plan your "goal" plan?

> The plans we now get, with times on par or slightly better than with the
> plpgsql hack, are:
>
>   EXPLAIN ANALYZE
>   SELECT o.orderid,o.orderstamp FROM indextest.orders o
>   INNER JOIN indextest.customerclass cc ON (cc.classid = 2)
>   WHERE o.customerid = cc.customerid ORDER BY o.orderstamp DESC LIMIT 5;

Didn't notice this before...

Shouldn't this be:

INNER JOIN indextest.customerclass cc ON (o.customerid = cc.customerid)
WHERE cc.classid = 2

i.e. join on the common field, not the classid one, which doesn't appear
in the 2nd table?

> As I said, this is a hypothetical test case we have arrived at that
> describes our situation as best as we can given a simple case.  We're
> interested in potential issues with the approach, why postgres would not
> attempt something like it, and how we might go about implementing it
> ourselves at a lower level than we currently have (in SPI, libpq, etc).
>
> If it could be generalized then we could use it in cases where we aren't
> pulling from just one table (the orders table) but rather trying to merge,
> in sorted order, results from different conditions on different tables.
> Right now we use something like the plpgsql or plpythonu functions in the
> example and they outperform our regular SQL queries by a fairly significant
> margin.

I'm sure if you posted the queries you are running with relevant info
you'd get some help ;)

--
Postgresql & php tutorials
http://www.designmagick.com/

Re: Query performance issue

From: "Jonathan Gray"
That particular plan is our goal because we've "hacked" it together to
perform better than the normal SQL plans.  Analytically it makes sense to
approach this particular problem in this way because it is relatively
invariant to the distributions and sizes of the tables (the only cost being
increased index size).

Also, changing around the query doesn't change the query plan at all.  The
planner is intelligent enough to figure out what it really needs to join on
despite my poor query writing.  I originally had it this way to ensure my
(customerid, orderstamp) conditions were in the correct order, but again it
appears not to matter.

I will try to get a more complex/sophisticated test case running.  I'm not
able to post my actual structure or queries but I'll try to produce a better
example of the other (multiple table) case tomorrow.

Thanks.

Jonathan Gray


-----Original Message-----
From: Chris [mailto:dmagick@gmail.com]
Sent: Tuesday, July 24, 2007 2:36 AM
To: Jonathan Gray
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Query performance issue

Jonathan Gray wrote:
> Chris,
>
> Creating indexes on the customerclass table does speed up the queries but
> still does not create the plan we are looking for (using the double index
> with a backward index scan on the orders table).

Stupid question - why is that particular plan your "goal" plan?

> The plans we now get, with times on par or slightly better than with the
> plpgsql hack, are:
>
>   EXPLAIN ANALYZE
>   SELECT o.orderid,o.orderstamp FROM indextest.orders o
>   INNER JOIN indextest.customerclass cc ON (cc.classid = 2)
>   WHERE o.customerid = cc.customerid ORDER BY o.orderstamp DESC LIMIT 5;

Didn't notice this before...

Shouldn't this be:

INNER JOIN indextest.customerclass cc ON (o.customerid = cc.customerid)
WHERE cc.classid = 2

ie join on the common field not the classid one which doesn't appear in
the 2nd table?

> As I said, this is a hypothetical test case we have arrived at that
> describes our situation as best as we can given a simple case.  We're
> interested in potential issues with the approach, why postgres would not
> attempt something like it, and how we might go about implementing it
> ourselves at a lower level than we currently have (in SPI, libpq, etc).
>
> If it could be generalized then we could use it in cases where we aren't
> pulling from just one table (the orders table) but rather trying to merge,
> in sorted order, results from different conditions on different tables.
> Right now we use something like the plpgsql or plpythonu functions in the
> example and they outperform our regular SQL queries by a fairly
significant
> margin.

I'm sure if you posted the queries you are running with relevant info
you'd get some help ;)

--
Postgresql & php tutorials
http://www.designmagick.com/