Thread: Query performance issue
We’re experiencing a query performance problem related to the planner and its ability to perform a specific type of merge.
We have created a test case (as attached, or here: http://www3.streamy.com/postgres/indextest.sql) which involves a hypothetical customer ordering system, with customers, orders, and customer groups.
If we want to retrieve a single customer's 10 most recent orders, sorted by date, we can use a double index on (customer,date); Postgres's query planner will use the double index with a backward index scan on the second indexed column (date).
However, if we want to retrieve a “customer class’s” 10 most recent orders, sorted by date, we are not able to get Postgres to use double indexes.
We have come to the conclusion that the fastest way to accomplish this type of query is to merge, in sorted order, each customer's set of orders (for which we can use the double index). Using a heap to merge these ordered lists (stopping once we reach the limit) seems the most algorithmically efficient approach we have found. This is implemented in the attachment as a pl/pythonu function.
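[Editor's sketch, not the attached pl/pythonu function itself: the heap merge described above can be illustrated in plain Python. The row shape (orderid, orderstamp) is assumed from the test case.]

```python
import heapq
from itertools import islice

def top_n_orders(per_customer, limit):
    # Each input list holds one customer's (orderid, orderstamp) rows,
    # already sorted newest-first -- exactly what a backward scan of a
    # (customer, date) index returns for that customer.
    # heapq.merge is lazy, so no list is read past what `limit` needs.
    merged = heapq.merge(*per_customer, key=lambda row: row[1], reverse=True)
    return list(islice(merged, limit))
```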
A less algorithmically efficient solution, though often faster in practice, is to fetch the full limit of orders from each customer, sort these by date, and return up to the limit.
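[Editor's sketch of that simpler batch strategy, with the same assumed (orderid, orderstamp) row shape:]

```python
def top_n_orders_batch(per_customer, limit):
    # Pull up to `limit` newest rows from each customer, sort the
    # combined batch once, and truncate. For k customers this sorts up
    # to k*limit rows instead of heap-merging them; a single
    # well-optimized sort often wins in practice when k is small.
    batch = [row for rows in per_customer for row in rows[:limit]]
    batch.sort(key=lambda row: row[1], reverse=True)
    return batch[:limit]
```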
We are no masters of reading query plans, but for straight SQL queries the planner seems to yield two different types of plan. They are fast in certain cases but break down in our typical use case, where the number of orders per customer is sparse compared to the total number of orders across the date range.
We are interested in whether a mechanism internal to Postgres can accomplish this type of merging of indexed columns in sorted order.
If this cannot currently be accomplished (or if there is something we are missing about why it shouldn't be), we would appreciate pointers on translating our Python heap approach into C functions integrated more closely with Postgres. The Python function incurs large constant costs because of type conversions and repeated queries to the database.
Thanks for any help or direction.
Jonathan Gray / Miguel Simon
Attachment
Jonathan Gray wrote:
> However, if we want to retrieve a "customer class's" 10 most recent
> orders, sorted by date, we are not able to get Postgres to use double
> indexes.

You don't have any indexes on the 'customerclass' table.

Creating a foreign key doesn't create an index, you need to do that
separately.

Try

create index cc_customerid_class on indextest.customerclass(classid, customerid);

--
Postgresql & php tutorials
http://www.designmagick.com/
Chris wrote:
> You don't have any indexes on the 'customerclass' table.
>
> Creating a foreign key doesn't create an index, you need to do that
> separately.
>
> Try
>
> create index cc_customerid_class on indextest.customerclass(classid,
> customerid);

It could also be that since you don't have very much data (10,000 rows),
postgres is ignoring the indexes because it'll be quicker to scan the
tables.

If you bump it up to say 100k rows, what happens?

--
Postgresql & php tutorials
http://www.designmagick.com/
Chris,

Creating indexes on the customerclass table does speed up the queries but
still does not create the plan we are looking for (using the double index
with a backward index scan on the orders table).

The plans we now get, with times on par or slightly better than with the
plpgsql hack, are:

EXPLAIN ANALYZE
SELECT o.orderid,o.orderstamp FROM indextest.orders o
INNER JOIN indextest.customerclass cc ON (cc.classid = 2)
WHERE o.customerid = cc.customerid ORDER BY o.orderstamp DESC LIMIT 5;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=0.00..176.65 rows=5 width=12) (actual time=0.930..3.675 rows=5 loops=1)
   ->  Nested Loop  (cost=0.00..46388.80 rows=1313 width=12) (actual time=0.927..3.664 rows=5 loops=1)
         ->  Index Scan Backward using orders_orderstamp_idx on orders o  (cost=0.00..6225.26 rows=141001 width=16) (actual time=0.015..0.957 rows=433 loops=1)
         ->  Index Scan using customerclass_customerid_idx on customerclass cc  (cost=0.00..0.27 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=433)
               Index Cond: (o.customerid = cc.customerid)
               Filter: (classid = 2)

And

EXPLAIN ANALYZE
SELECT o.orderid,o.orderstamp FROM indextest.orders o
INNER JOIN indextest.customerclass cc ON (cc.classid = 2)
WHERE o.customerid = cc.customerid ORDER BY o.orderstamp DESC LIMIT 100;

                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=1978.80..1979.05 rows=100 width=12) (actual time=6.167..6.448 rows=100 loops=1)
   ->  Sort  (cost=1978.80..1982.09 rows=1313 width=12) (actual time=6.165..6.268 rows=100 loops=1)
         Sort Key: o.orderstamp
         ->  Nested Loop  (cost=3.99..1910.80 rows=1313 width=12) (actual time=0.059..4.576 rows=939 loops=1)
               ->  Bitmap Heap Scan on customerclass cc  (cost=3.99..55.16 rows=95 width=4) (actual time=0.045..0.194 rows=95 loops=1)
                     Recheck Cond: (classid = 2)
                     ->  Bitmap Index Scan on customerclass_classid_idx  (cost=0.00..3.96 rows=95 width=0) (actual time=0.032..0.032 rows=95 loops=1)
                           Index Cond: (classid = 2)
               ->  Index Scan using orders_customerid_idx on orders o  (cost=0.00..19.35 rows=15 width=16) (actual time=0.006..0.025 rows=10 loops=95)
                     Index Cond: (o.customerid = cc.customerid)

As I said, this is a hypothetical test case we have arrived at that
describes our situation as best as we can given a simple case. We're
interested in potential issues with the approach, why postgres would not
attempt something like it, and how we might go about implementing it
ourselves at a lower level than we currently have (in SPI, libpq, etc).

If it could be generalized then we could use it in cases where we aren't
pulling from just one table (the orders table) but rather trying to merge,
in sorted order, results from different conditions on different tables.
Right now we use something like the plpgsql or plpythonu functions in the
example and they outperform our regular SQL queries by a fairly significant
margin.

An example might be:

SELECT * FROM (
  (SELECT orderid,stamp FROM indextest.orders_usa
   WHERE customerid = <setof customerids>
   ORDER BY stamp DESC LIMIT 5)
  UNION
  (SELECT orderid,stamp FROM indextest.orders_can
   WHERE customerid = <setof customerids>
   ORDER BY stamp DESC LIMIT 5)
) AS x ORDER BY x.stamp DESC

Again, that's a general example, but some of my queries contain between 5
and 10 different sorted joins of this kind, and it would be helpful to have
something internal in postgres to efficiently handle it (do something just
like the above query but without having to fetch the full LIMIT 5 from each
set; some kind of in-order merge/heap join?)
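[Editor's sketch of the in-order merge being asked for, with one lazy iterator standing in for a cursor on each sorted source; the (stamp, orderid) row shape and cursor names are assumptions. An explicit heap makes the key property visible: no source is read past what the global LIMIT requires.]

```python
import heapq

def merge_limit(cursors, limit):
    # Seed the heap with one row per cursor, then after each pop refill
    # only from the cursor that produced the popped row. Each cursor is
    # any iterator yielding (stamp, orderid) rows newest-first, e.g. one
    # per sorted subquery in the UNION example above.
    heap = []
    for i, cur in enumerate(cursors):
        row = next(cur, None)
        if row is not None:
            heap.append((-row[0], i, row))  # negate stamp -> max-heap
    heapq.heapify(heap)
    out = []
    while heap and len(out) < limit:
        _, i, row = heapq.heappop(heap)
        out.append(row)
        nxt = next(cursors[i], None)
        if nxt is not None:
            heapq.heappush(heap, (-nxt[0], i, nxt))
    return out
```

This is roughly the shape a C/SPI implementation would take: one open cursor per sorted input, a binary heap of current heads, and early termination at the limit.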
Jonathan Gray
Jonathan Gray wrote:
> Creating indexes on the customerclass table does speed up the queries but
> still does not create the plan we are looking for (using the double index
> with a backward index scan on the orders table).

Stupid question - why is that particular plan your "goal" plan?

> EXPLAIN ANALYZE
> SELECT o.orderid,o.orderstamp FROM indextest.orders o
> INNER JOIN indextest.customerclass cc ON (cc.classid = 2)
> WHERE o.customerid = cc.customerid ORDER BY o.orderstamp DESC LIMIT 5;

Didn't notice this before... Shouldn't this be:

INNER JOIN indextest.customerclass cc ON (o.customerid = cc.customerid)
WHERE cc.classid = 2

ie join on the common field, not the classid one which doesn't appear in
the 2nd table?

> Right now we use something like the plpgsql or plpythonu functions in the
> example and they outperform our regular SQL queries by a fairly
> significant margin.

I'm sure if you posted the queries you are running with relevant info
you'd get some help ;)

--
Postgresql & php tutorials
http://www.designmagick.com/
That particular plan is our goal because we've "hacked" it together to
perform better than the normal SQL plans. Analytically it makes sense to
approach this particular problem that way because it is relatively
invariant to the distributions and sizes of the tables (we only have to
deal with increased index size).

Also, changing the query around doesn't change the query plan at all. The
planner is intelligent enough to figure out what it really needs to join
on despite my poor query writing. I originally had it this way to ensure
my (customerid, orderstamp) conditions were in the correct order, but
again it appears not to matter.

I will try to get a more complex/sophisticated test case running. I'm not
able to post my actual structure or queries, but I'll try to produce a
better example of the other (multiple table) case tomorrow.

Thanks.

Jonathan Gray