Thread: Query not using index
Here's the query:

    SELECT
        cart_row_id
    FROM
        pa_shopping_cart
    WHERE
        order_id = 20;

Here's the schema:

       Attribute       |           Type           |       Modifier
-----------------------+--------------------------+---------------------
 cart_row_id           | integer                  | not null default
 product_id            | integer                  | not null
 color                 | character varying(100)   |
 size                  | character varying(100)   |
 style                 | character varying(100)   |
 order_id              | integer                  | not null
 time_added            | timestamp with time zone |
 voided_date           | timestamp with time zone |
 voided_by             | integer                  |
 expired_date          | timestamp with time zone |
 item_state            | character varying(50)    | default 'in_basket'
 received_back_date    | timestamp with time zone |
 price_charged         | numeric(30,6)            |
 price_refunded        | numeric(30,6)            |
 shipping_charged      | numeric(30,6)            |
 shipping_refunded     | numeric(30,6)            |
 price_tax_charged     | numeric(30,6)            |
 price_tax_refunded    | numeric(30,6)            |
 shipping_tax_charged  | numeric(30,6)            |
 shipping_tax_refunded | numeric(30,6)            |
 price_name            | character varying(30)    |
 refund_id             | integer                  |
 cs_comments           | text                     |
 price                 | numeric(30,6)            |
 ship_group_id         | integer                  |
 ship_package_id       | integer                  |
 delivery_date         | date                     |
 sentiment             | text                     |
 vendor_id             | integer                  |
 linkshare_sent        | bit(1)                   |
 mapped_to_address     | character(1)             |
 product_name          | character varying(200)   |

Indices: delivery_date_pa_shopping_cart_,
         pa_cart_by_item_state,
         pa_cart_by_order,
         pa_cart_by_product,
         pa_cart_row_order,
         pa_item_map_to_addr,
         pa_shop_cart_prod_ord_idx,
         pa_shopping_cart_pkey,
         ship_package_id_pa_shopping_car,
         vendor_id_pa_shopping_cart_key

There is an index on:
  just order_id
  just order_id and cart_row_id
  and a PK on cart_row_id

I don't understand why it's not using one of these indexes!
Please post your responses to the group - my email is down.

Thanks for any help!

-r
Have you vacuum analyzed recently, and what does EXPLAIN show for the query?

On Thu, 10 May 2001 ryan@paymentalliance.net wrote:

> SELECT cart_row_id FROM pa_shopping_cart WHERE order_id = 20;
> [ ... ]
> I don't understand why it's not using one of these indexes!
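[Spelled out, Stephan's suggestion looks like the sketch below, using the table and query from the original post. Note that in 7.x a plain VACUUM reclaims space but only VACUUM ANALYZE refreshes the statistics the planner relies on:]

    -- Refresh the planner's statistics for this table:
    VACUUM ANALYZE pa_shopping_cart;

    -- Then ask the optimizer which plan it would choose:
    EXPLAIN
    SELECT cart_row_id
    FROM pa_shopping_cart
    WHERE order_id = 20;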
I vacuum every half hour! Here is the output from EXPLAIN:

    NOTICE:  QUERY PLAN:

    Seq Scan on pa_shopping_cart  (cost=0.00..7237.94 rows=9420 width=296)

    EXPLAIN

Thanks!

On Thu, 10 May 2001 18:19:16 +0000 (UTC), sszabo@megazone23.bigpanda.com
(Stephan Szabo) wrote:

> Have you vacuum analyzed recently, and what does EXPLAIN show for the
> query?
Does that query really return 9420 rows? If so, a sequential scan is
probably better/faster than an index scan.

-Mitch

----- Original Message -----
From: <ryan@paymentalliance.net>
Sent: Thursday, May 10, 2001 9:22 AM

> I vacuum every half hour! Here is the output from EXPLAIN:
>
> Seq Scan on pa_shopping_cart  (cost=0.00..7237.94 rows=9420 width=296)
On Thu, May 10, 2001 at 01:22:56PM +0000, ryan@paymentalliance.net wrote:

> I vacuum every half hour! Here is the output from EXPLAIN:
>
> Seq Scan on pa_shopping_cart  (cost=0.00..7237.94 rows=9420 width=296)

Then try

    SET enable_seqscan TO off;
    EXPLAIN SELECT ...

and see what the cost is. Compare it with the value above.

Cheers,
Patrick
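[Patrick's comparison, spelled out with the query from the original post. The enable_seqscan setting only discourages sequential scans for the current session, so it is safe to toggle for testing:]

    -- Make sequential scans look prohibitively expensive to the planner:
    SET enable_seqscan TO off;

    EXPLAIN
    SELECT cart_row_id
    FROM pa_shopping_cart
    WHERE order_id = 20;

    -- Restore the default when done:
    SET enable_seqscan TO on;

If the index-scan cost shown with enable_seqscan off is higher than the 7237.94 above, the planner is at least behaving consistently with its (inflated) estimate of 9420 rows.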
No, the query usually returns between 0 and 5 rows. Usually not zero;
most often 1.

-r

On Thu, 10 May 2001 19:47:32 +0000 (UTC), mitch@venux.net
("Mitch Vincent") wrote:

> Does that query really return 9420 rows? If so, a sequential scan is
> probably better/faster than an index scan.
ryan@paymentalliance.net writes:

> No, the query usually returns between 0 and 5 rows. Usually not zero;
> most often 1.

Ah. You must have a few values that are far more frequent (like tens of
thousands of occurrences?) and these are throwing off the planner's
statistics. 7.2 will probably do better with this sort of data
distribution, but for now it's a difficult problem.

regards, tom lane
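[A quick way to check Tom's theory. This diagnostic query is not from the thread, just an illustration: list the order_id values with the largest row counts; a handful of huge counts would explain the planner's 9420-row estimate:]

    -- Hypothetical check: which order_id values dominate the table?
    SELECT order_id, count(*) AS n
    FROM pa_shopping_cart
    GROUP BY order_id
    ORDER BY n DESC
    LIMIT 10;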
On Thu, May 10, 2001 at 05:22:07PM -0400, Tom Lane wrote:

> Ah. You must have a few values that are far more frequent (like tens of
> thousands of occurrences?) and these are throwing off the planner's
> statistics.

I had a similar situation, where I had a lot of rows with 0's in them.
Changing those to NULLs worked wonders. The planner (or statistics
gatherer, or something) apparently takes notice of the distribution of
non-NULL values.

Chris

--
chris@mt.sri.com -----------------------------------------------------
Chris Jones                                   SRI International, Inc.
                                              www.sri.com
Chris Jones <chris@mt.sri.com> writes:

> > Ah. You must have a few values that are far more frequent (like tens
> > of thousands of occurrences?) and these are throwing off the
> > planner's statistics.
>
> I had a similar situation, where I had a lot of rows with 0's in them.
> Changing those to NULLs worked wonders.

Yes, if you have a lot of "dummy" values it's a good idea to represent
them as NULLs rather than some arbitrarily-chosen regular data value.
The planner does keep track of NULLs separately from everything else.

regards, tom lane
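[If the dummy values here were, say, order_id = 0 placeholders, the cleanup Chris and Tom describe would look something like the sketch below. The value 0 is an assumption for illustration only, and note that order_id is declared NOT NULL in the posted schema, so the constraint would have to be removed (or the faulty rows deleted) before NULLs could be stored:]

    -- Hypothetical: turn placeholder zeroes into NULLs so the planner's
    -- statistics reflect only real order ids. Requires order_id to be
    -- nullable, which it is not in the schema as posted.
    UPDATE pa_shopping_cart
    SET order_id = NULL
    WHERE order_id = 0;

    -- Refresh statistics so the planner sees the new distribution:
    VACUUM ANALYZE pa_shopping_cart;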
From: "Tom Lane" <tgl@sss.pgh.pa.us> > > I had a similar situation, where I had a lot of rows with 0's in > > them. Changing those to NULLs worked wonders. > > Yes, if you have a lot of "dummy" values it's a good idea to represent > them as NULLs rather than some arbitrarily-chosen regular data value. > The planner does keep track of NULLs separately from everything else. Is there a good reason why rdbms don't just keep a cache of decisions on this stuff. I realise SQL is supposed to be ad-hoc but in reality, it's the old 90:10 rule where a handful of queries get run consistently and where performance is important. Why doesn't PG (or any other system afaik) just have a first guess, run the query and then if the costs are horribly wrong cache the right result. I'm guessing there's a bloody good reason (TM) for it since query planning has got to be equivalent to least-cost path so NP (NP-Complete? I forget - too long out of college). - Richard Huxton
"Richard Huxton" <dev@archonet.com> writes: > Why doesn't PG (or any other system afaik) just have a first guess, run the > query and then if the costs are horribly wrong cache the right result. ?? Knowing that your previous guess was wrong doesn't tell you what the right answer is, especially not for the somewhat-different question that the next query is likely to provide. The real problem here is simply that PG hasn't been keeping adequately detailed statistics. I'm currently working on improving that for 7.2... see discussions over in pghackers if you are interested. regards, tom lane
> Is there a good reason why RDBMSs don't just keep a cache of decisions
> on this stuff?
>
> Why doesn't PG (or any other system, AFAIK) just have a first guess,
> run the query and then, if the costs are horribly wrong, cache the
> right plan?

I have asked about this before. Decisions about sequential/index scans
could theoretically be fed from the executor back to the optimizer for
later use.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
You and Stephan hit it right on the nose - our table had been populated
with thousands of faulty values. Once they were gone, the indexes are in
use and the DB is SPEEDING along 8)

Thanks for your help!!!

-r

On Thu, 10 May 2001 21:49:28 +0000 (UTC), Chris Jones wrote:

> I had a similar situation, where I had a lot of rows with 0's in them.
> Changing those to NULLs worked wonders.
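[For completeness: after the cleanup, the same EXPLAIN should pick an index. The plan below is illustrative only -- the cost figures and the planner's choice among the order_id indexes are invented, not taken from the thread:]

    EXPLAIN
    SELECT cart_row_id
    FROM pa_shopping_cart
    WHERE order_id = 20;

    -- Expected shape of the plan (numbers hypothetical):
    -- NOTICE:  QUERY PLAN:
    -- Index Scan using pa_cart_by_order on pa_shopping_cart
    --     (cost=0.00..8.14 rows=1 width=4)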
From: "Tom Lane" <tgl@sss.pgh.pa.us> > "Richard Huxton" <dev@archonet.com> writes: > > Why doesn't PG (or any other system afaik) just have a first guess, run the > > query and then if the costs are horribly wrong cache the right result. > > ?? Knowing that your previous guess was wrong doesn't tell you what the > right answer is, especially not for the somewhat-different question that > the next query is likely to provide. Surely if you used a seqscan on "where x=1" and only got 2 rows rather than the 3000 you were expecting the only alternative is to try an index? > The real problem here is simply that PG hasn't been keeping adequately > detailed statistics. I'm currently working on improving that for 7.2... > see discussions over in pghackers if you are interested. Thinking about it (along with Bruce's reply posted to the list) I guess the difference is whether you gather the statistics up-front during a vacuum, or build them as queries are used. You're always going to need *something* to base your first guess on anyway - the "learning" would only help you in those cases where the distribution of values wasn't a normal curve. Anyway, given that I'm up to my neck in work at the moment and I don't actually know what I'm talking about, I'll shut up and get back to keeping clients happy :-) - Richard Huxton
"Richard Huxton" <dev@archonet.com> writes: >> ?? Knowing that your previous guess was wrong doesn't tell you what the >> right answer is, especially not for the somewhat-different question that >> the next query is likely to provide. > Surely if you used a seqscan on "where x=1" and only got 2 rows rather than > the 3000 you were expecting the only alternative is to try an index? But if the next query is "where x=2", what do you do? Keep in mind that the data distributions people have been having trouble with are irregular: you can't conclude anything very reliable about x=2 based on what you know about x=1. > Thinking about it (along with Bruce's reply posted to the list) I guess the > difference is whether you gather the statistics up-front during a vacuum, or > build them as queries are used. Stats gathered as a byproduct of individual queries might be useful if you happen to get the exact same queries over again, but I doubt that a succession of such results should be expected to build up a picture that's complete enough to extrapolate to other queries. Stats gathered by ANALYZE have the merit that they come from a process that's designed specifically to give you a good statistical picture. regards, tom lane