Thread: Forcing query to use an index
Hello everyone,

I have a search query that does a whole bunch of LEFT OUTER JOINs between multiple tables, since this is a generic search and some records may not exist for certain customers (e.g. searching for the text "kate" should bring up people whose customer name, street address or email addresses match that word). This is for an ISP's customer management database.

Unfortunately one stage in the query keeps using a sequence scan rather than the index. Here are the "EXPLAIN ANALYZE" results for the 115-line SQL query.

Sort  (cost=6666.08..6666.08 rows=268 width=265) (actual time=949.00..949.00 rows=1 loops=1)
  ->  Aggregate  (cost=6487.84..6655.27 rows=268 width=265) (actual time=948.86..948.86 rows=1 loops=1)
        ->  Group  (cost=6487.84..6648.58 rows=2679 width=265) (actual time=948.70..948.70 rows=1 loops=1)
              ->  Sort  (cost=6487.84..6487.84 rows=2679 width=265) (actual time=948.66..948.66 rows=1 loops=1)
                    ->  Merge Join  (cost=6106.42..6335.30 rows=2679 width=265) (actual time=859.77..948.06 rows=1 loops=1)
                          ->  Merge Join  (cost=6101.24..6319.77 rows=2679 width=247) (actual time=554.11..674.17 rows=2679 loops=1)
                                ->  Index Scan using customer_id_key on customer c  (cost=0.00..129.63 rows=2679 width=156) (actual time=0.40..43.43 rows=2679 loops=1)
                                ->  Sort  (cost=6101.24..6101.24 rows=8117 width=91) (actual time=553.64..559.58 rows=8117 loops=1)
                                      ->  Seq Scan on customer_month_summary cms  (cost=0.00..5574.17 rows=8117 width=91) (actual time=258.03..477.11 rows=8117 loops=1)
                          ->  Sort  (cost=5.18..5.18 rows=77 width=18) (actual time=0.70..0.80 rows=77 loops=1)
                                ->  Seq Scan on emailaddress ea  (cost=0.00..2.77 rows=77 width=18) (actual time=0.08..0.35 rows=77 loops=1)
Total runtime: 951.70 msec

The table in question is "customer_month_summary"; it has 8117 rows. Essentially, there is one record in the customer_month_summary table for every month for every customer that has a cable modem (this doesn't include dial-up users). I'm matching on two columns in the summary table: CustomerID and MonthStart, which is the first day of the month in question. I also have an index on this table on "Customer_Month_Summary(MonthStart, CustomerID)".

It keeps trying to do a sequence scan, and while the query only takes 951 msec right now, the summary table will keep growing and I don't want performance to suffer a few months/years down the line (also, having to wait a second between search results is a bit too much). This is currently running on a dual-proc PIII-800 with 4G of RAM.

I've put a lot of effort into making this application very responsive, but it's this one query that keeps killing me. Any help you can provide would therefore be much appreciated.

--man
Michael A Nachbaur <mike@nachbaur.com>

"I used to hate writing assignments, but now I enjoy them. I realized that the purpose of writing is to inflate weak ideas, obscure poor reasoning and inhibit clarity. With a little practice, writing can be an intimidating and impenetrable fog!" -- Calvin
Michael,

> Unfortunately one stage in the query keeps using a sequence scan rather
> than the index. Here are the "EXPLAIN ANALYZE" results for the 115-line
> SQL query.

If you want any help, you'll need to post the query as well as the explain results.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
On Mon, 3 Mar 2003, Michael Nachbaur wrote:

> Hello everyone,
>
> I have a search query that does a whole bunch of LEFT OUTER JOINs
> between multiple tables, since this is a generic search and some
> records may not exist for certain customers (e.g. searching for the
> text "kate" should bring up people whose customer name, street address
> or email addresses match that word). This is for an ISP's customer
> management database.
>
> Unfortunately one stage in the query keeps using a sequence scan rather
> than the index. Here are the "EXPLAIN ANALYZE" results for the 115-line
> SQL query.
>
> Sort  (cost=6666.08..6666.08 rows=268 width=265) (actual time=949.00..949.00 rows=1 loops=1)
>   ->  Aggregate  (cost=6487.84..6655.27 rows=268 width=265) (actual time=948.86..948.86 rows=1 loops=1)
>         ->  Group  (cost=6487.84..6648.58 rows=2679 width=265) (actual time=948.70..948.70 rows=1 loops=1)
>               ->  Sort  (cost=6487.84..6487.84 rows=2679 width=265) (actual time=948.66..948.66 rows=1 loops=1)
>                     ->  Merge Join  (cost=6106.42..6335.30 rows=2679 width=265) (actual time=859.77..948.06 rows=1 loops=1)
>                           ->  Merge Join  (cost=6101.24..6319.77 rows=2679 width=247) (actual time=554.11..674.17 rows=2679 loops=1)
>                                 ->  Index Scan using customer_id_key on customer c  (cost=0.00..129.63 rows=2679 width=156) (actual time=0.40..43.43 rows=2679 loops=1)
>                                 ->  Sort  (cost=6101.24..6101.24 rows=8117 width=91) (actual time=553.64..559.58 rows=8117 loops=1)
>                                       ->  Seq Scan on customer_month_summary cms  (cost=0.00..5574.17 rows=8117 width=91) (actual time=258.03..477.11 rows=8117 loops=1)
>                           ->  Sort  (cost=5.18..5.18 rows=77 width=18) (actual time=0.70..0.80 rows=77 loops=1)
>                                 ->  Seq Scan on emailaddress ea  (cost=0.00..2.77 rows=77 width=18) (actual time=0.08..0.35 rows=77 loops=1)
> Total runtime: 951.70 msec
>
> The table in question is "customer_month_summary"; it has 8117 rows.

If you're hitting all the rows in the table, there's only a disadvantage to using an indexscan (right now, given the way data is stored). If you were returning some fraction of the rows, postgresql should hopefully switch to a different plan (depending on the estimated costs).
Michael Nachbaur <mike@nachbaur.com> writes:

>       ->  Merge Join  (cost=6106.42..6335.30 rows=2679 width=265) (actual time=859.77..948.06 rows=1 loops=1)
>             ->  Merge Join  (cost=6101.24..6319.77 rows=2679 width=247) (actual time=554.11..674.17 rows=2679 loops=1)
>                   ->  Index Scan using customer_id_key on customer c  (cost=0.00..129.63 rows=2679 width=156) (actual time=0.40..43.43 rows=2679 loops=1)
>                   ->  Sort  (cost=6101.24..6101.24 rows=8117 width=91) (actual time=553.64..559.58 rows=8117 loops=1)
>                         ->  Seq Scan on customer_month_summary cms  (cost=0.00..5574.17 rows=8117 width=91) (actual time=258.03..477.11 rows=8117 loops=1)

You should send the query as well, and \d customer_month_summary, so we can see how you defined your indexes.

There doesn't seem to be a filter on the scan, so it looks like postgres thinks you're actually reading in the entire table, which is normally faster with a sequential scan than an index scan. In fact I'm surprised it's doing an index scan on the other table and not a sequential scan.

Some things to try:

  set enable_seqscan = off

Then try your query again, see if postgres is right and it really is faster to do the sequential scan.

  set random_page_cost = 2

Or even lower values. I've also had some success with raising cpu_tuple_cost, though I'm unclear on whether that's actually a good approach or not.

Also, don't forget to do a vacuum full on these tables before doing testing for optimizations at this level. You can get some confusing results if your tables have lots of empty holes in them.

--
greg
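A rough sketch of that test sequence in a psql session, using the table names from this thread; the SELECT lines are only placeholders for the real 115-line search query:

    -- Compact the tables and refresh planner statistics before benchmarking.
    VACUUM FULL ANALYZE customer_month_summary;
    VACUUM FULL ANALYZE customer;

    -- Check whether the planner is right that a sequential scan is cheaper.
    SET enable_seqscan = off;
    EXPLAIN ANALYZE SELECT ...;      -- the actual search query goes here

    -- Separately, try cheaper random-I/O estimates with seq scans allowed again.
    SET enable_seqscan = on;
    SET random_page_cost = 2;
    EXPLAIN ANALYZE SELECT ...;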
On Monday, Mar 3, 2003, at 14:09 US/Pacific, Josh Berkus wrote:

> If you want any help, you'll need to post the query as well as the
> explain results.

Sorry, forgot to include the statement. Don't you just hate it when you do that? (Or forget to include an attachment to an email.)
[Attachment: the SQL query]
On Monday, Mar 3, 2003, at 14:09 US/Pacific, Stephan Szabo wrote:

> If you're hitting all the rows in the table, there's only a disadvantage
> to using an indexscan (right now, given the way data is stored). If you
> were returning some fraction of the rows, postgresql should hopefully
> switch to a different plan (depending on the estimated costs).

I should only ever get one match per record. This is the structure of the relevant tables:

  Customer --> Customer_Month_Summary
     ^
     |
  EmailAddress

There is one Customer_Month_Summary record per customer per month, so I should only get one record out of the database. Additionally, since there are about 2000 customer records, but multiple months in the summary table, I'll never use the entire results of the summary table in one query.
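As a concrete sketch, this is the lookup the summary index is meant to serve. The column order comes from the first post; I'm assuming the customer_month_summary_cm_idx name from the \d output later in the thread is that index, and the customer id is purely illustrative:

    -- The (MonthStart, CustomerID) index described in the original post:
    CREATE INDEX customer_month_summary_cm_idx
        ON customer_month_summary (monthstart, customerid);

    -- With both leading columns constrained, this should be a one-row index probe:
    SELECT *
      FROM customer_month_summary
     WHERE monthstart = DATE '2003-02-01'
       AND customerid = 42;          -- illustrative customer id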
On Monday, Mar 3, 2003, at 14:21 US/Pacific, Greg Stark wrote:

> You should send the query as well, and \d customer_month_summary, so we
> can see how you defined your indexes.

          Table "customer_month_summary"
    Column     |           Type           | Modifiers
---------------+--------------------------+-----------
 customerid    | integer                  |
 timestart     | timestamp with time zone |
 timeend       | timestamp with time zone |
 accountnum    | character varying(255)   |
 firstname     | character varying(255)   |
 lastname      | character varying(255)   |
 organization  | character varying(255)   |
 package       | character varying(255)   |
 up            | bigint                   |
 down          | bigint                   |
 maxup         | bigint                   |
 maxdown       | bigint                   |
 violatedsize  | bigint                   |
 maxtotal      | bigint                   |
 total         | bigint                   |
 violatedup    | bigint                   |
 violateddown  | bigint                   |
 violatedtotal | bigint                   |
 monthstart    | date                     |
Indexes: customer_month_summary_cid_idx,
         customer_month_summary_cm_idx,
         customer_month_summary_time_idx

> There doesn't seem to be a filter on the scan, so it looks like postgres
> thinks you're actually reading in the entire table, which is normally
> faster with a sequential scan than an index scan. In fact I'm surprised
> it's doing an index scan on the other table and not a sequential scan.
>
> Some things to try:
>
>   set enable_seqscan = off
>
> Then try your query again, see if postgres is right and it really is
> faster to do the sequential scan.

Sort  (cost=100014872.07..100014872.07 rows=268 width=265) (actual time=382.51..382.51 rows=6 loops=1)
  ->  Aggregate  (cost=100014693.83..100014861.27 rows=268 width=265) (actual time=381.93..382.33 rows=6 loops=1)
        ->  Group  (cost=100014693.83..100014854.57 rows=2679 width=265) (actual time=381.79..381.90 rows=6 loops=1)
              ->  Sort  (cost=100014693.83..100014693.83 rows=2679 width=265) (actual time=381.75..381.75 rows=6 loops=1)
                    ->  Merge Join  (cost=100000005.18..100014541.30 rows=2679 width=265) (actual time=38.21..381.13 rows=6 loops=1)
                          ->  Nested Loop  (cost=0.00..14525.77 rows=2679 width=247) (actual time=0.14..149.21 rows=2679 loops=1)
                                ->  Index Scan using customer_id_key on customer c  (cost=0.00..129.73 rows=2679 width=156) (actual time=0.06..36.92 rows=2679 loops=1)
                                ->  Index Scan using customer_month_summary_cm_idx on customer_month_summary cms  (cost=0.00..5.36 rows=1 width=91) (actual time=0.02..0.03 rows=1 loops=2679)
                          ->  Sort  (cost=100000005.18..100000005.18 rows=77 width=18) (actual time=0.44..0.56 rows=77 loops=1)
                                ->  Seq Scan on emailaddress ea  (cost=100000000.00..100000002.77 rows=77 width=18) (actual time=0.01..0.24 rows=77 loops=1)
Total runtime: 383.25 msec

So from the looks of things the index is way faster (querying 1 row rather than over 8000).

> set random_page_cost = 2
>
> Or even lower values.

Sort  (cost=6655.12..6655.12 rows=268 width=265) (actual time=902.75..902.76 rows=6 loops=1)
  ->  Aggregate  (cost=6476.88..6644.32 rows=268 width=265) (actual time=902.18..902.58 rows=6 loops=1)
        ->  Group  (cost=6476.88..6637.62 rows=2679 width=265) (actual time=902.01..902.13 rows=6 loops=1)
              ->  Sort  (cost=6476.88..6476.88 rows=2679 width=265) (actual time=901.97..901.98 rows=6 loops=1)
                    ->  Merge Join  (cost=6106.42..6324.34 rows=2679 width=265) (actual time=585.73..901.35 rows=6 loops=1)
                          ->  Merge Join  (cost=6101.24..6308.82 rows=2679 width=247) (actual time=549.53..667.35 rows=2679 loops=1)
                                ->  Index Scan using customer_id_key on customer c  (cost=0.00..118.77 rows=2679 width=156) (actual time=0.25..42.08 rows=2679 loops=1)
                                ->  Sort  (cost=6101.24..6101.24 rows=8117 width=91) (actual time=549.21..555.19 rows=8117 loops=1)
                                      ->  Seq Scan on customer_month_summary cms  (cost=0.00..5574.17 rows=8117 width=91) (actual time=252.90..472.29 rows=8117 loops=1)
                          ->  Sort  (cost=5.18..5.18 rows=77 width=18) (actual time=0.70..0.81 rows=77 loops=1)
                                ->  Seq Scan on emailaddress ea  (cost=0.00..2.77 rows=77 width=18) (actual time=0.08..0.35 rows=77 loops=1)
Total runtime: 905.47 msec

This is with enable_seqscan turned off, so it doesn't look like it's doing much good.

> I've also had some success with raising cpu_tuple_cost, though I'm
> unclear on whether that's actually a good approach or not.
>
> Also, don't forget to do a vacuum full on these tables before doing
> testing for optimizations at this level. You can get some confusing
> results if your tables have lots of empty holes in them.

IIRC this locks the tables, so I'll have to run this tonight so the users of this system don't race into my office to tar and feather me.

Does using "set enable_seqscan = off" impact anything else on the system, or is it connection-specific? Is this the recommended way of doing this, or is there another way of coercing Postgres into using indexes? I'm making these queries from Perl code running in an XSP page, so I'm not sure how flexible this option would be from a developer point of view.

Thanks for your help.
Michael,

Don't forget to run ANALYZE as well; this does not require a table lock.

> Does using "set enable_seqscan = off" impact anything else on the
> system, or is it connection-specific? Is this the recommended way of
> doing this, or is there another way of coercing Postgres into using
> indexes? I'm making these queries from Perl code running in an XSP
> page, so I'm not sure how flexible this option would be from a
> developer point of view.

Setting enable_seqscan=off is just for testing purposes, to see if the planner is making a wrong decision so that you can adjust your .conf params, query, and/or indexes to suit. The setting is *not* intended as a production solution.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
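On the scoping question: a plain SET only changes the current connection, not the whole server, and SET LOCAL (where available) confines the change to one transaction. A small sketch, with the SELECT standing in for the real query:

    SET enable_seqscan = off;          -- affects this session only
    EXPLAIN ANALYZE SELECT ...;        -- placeholder for the real search query
    RESET enable_seqscan;              -- back to the configured default

    BEGIN;
    SET LOCAL enable_seqscan = off;    -- reverts automatically at COMMIT/ROLLBACK
    EXPLAIN ANALYZE SELECT ...;
    COMMIT;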
On Monday, Mar 3, 2003, at 15:21 US/Pacific, Josh Berkus wrote:

> Michael,
>
> Don't forget to run ANALYZE as well; this does not require a table lock.

Yes, I have a cron job that runs a "vacuum analyze" once every other hour.

>> Does using "set enable_seqscan = off" impact anything else on the
>> system, or is it connection-specific? Is this the recommended way of
>> doing this, or is there another way of coercing Postgres into using
>> indexes? I'm making these queries from Perl code running in an XSP
>> page, so I'm not sure how flexible this option would be from a
>> developer point of view.
>
> Setting enable_seqscan=off is just for testing purposes, to see if the
> planner is making a wrong decision so that you can adjust your .conf
> params, query, and/or indexes to suit. The setting is *not* intended as
> a production solution.

*phew* Good. Now that I know the planner is making a bad choice, where do I go from here? Should I RTFM? Also, the planner-related values in the configuration file: are those machine-specific, or are the defaults the recommended values?
Michael,

Issues with your query:

  LEFT OUTER JOIN Customer_Month_Summary AS CMS
    ON ( C.ID = CMS.CustomerID AND CMS.MonthStart = DATE '2003-02-01' )

Check out the thread "Re: [SQL] OUTER JOIN with filter" in today's list; this relates to your problem. Then try your query as:

  LEFT OUTER JOIN ( SELECT * FROM Customer_Month_Summary
                    WHERE MonthStart = DATE '2003-02-01'
                  ) CMS ON C.ID = CMS.CustomerID

This may make better use of your index, because the planner will have a more accurate estimate of the number of rows returned from the outer join.

AND:

  AND ( C.Accountnum ~* 'kate'
        OR C.Firstname ~* 'kate'
        OR C.Lastname ~* 'kate'
        OR C.Organization ~* 'kate'
        OR C.Address ~* 'kate'
        OR C.Postal ~* 'kate'
        OR C.City ~* 'kate'
        OR EA.Name || '@' || JoinDomain(EA.Domain) ~* 'kate' )

This set of expressions has "seq scan" written all over it. I highly suggest that you try to find a way to turn these into anchored text searches, perhaps using functional indexes on lower(column).

Finally:

  OR CMS.Package ~* 'kate'

Per the above, this is why Postgres cannot use an index on your table; that is an unanchored text search which can *only* be indexed using FTS.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
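A sketch of the functional-index idea; the Lastname column comes from the query, but the index name is made up, and this only pays off once the search is anchored to the start of the string:

    -- Functional index on the lower-cased column:
    CREATE INDEX customer_lastname_lower_idx ON Customer (lower(Lastname));

    -- An anchored, case-insensitive prefix search can then use the index
    -- (depending on locale; some setups need a pattern opclass for LIKE):
    SELECT * FROM Customer WHERE lower(Lastname) LIKE 'kate%';

    -- An unanchored pattern such as '%kate%' or the regex ~* 'kate' cannot.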
On Monday, Mar 3, 2003, at 15:32 US/Pacific, Josh Berkus wrote:

> Check out the thread "Re: [SQL] OUTER JOIN with filter" in today's list;
> this relates to your problem. Then try your query as:

I'll read through this, thank you.

> LEFT OUTER JOIN ( SELECT * FROM Customer_Month_Summary
>                   WHERE MonthStart = DATE '2003-02-01'
>                 ) CMS ON C.ID = CMS.CustomerID

This works beautifully! Thanks.

> This set of expressions has "seq scan" written all over it. I highly
> suggest that you try to find a way to turn these into anchored text
> searches, perhaps using functional indexes on lower(column).

Hmm, I'll have to give this one some thought. It looks like there won't be any easy way to index those text columns, since the text I'm searching for could be anywhere within the column.

Thank you everyone for all the help!
One suggestion I'll make about your data model -- I'm not sure it would actually help this query, but might help elsewhere:

  WHERE ( C.Disabled > '2003-02-28'
          OR C.Disabled IS NULL
        )

Don't use NULL values like this. Most databases don't index NULLs (Oracle) or, even if they do, don't make "IS NULL" an indexable operation (postgres). There's been some talk of changing this in postgres, but even then it wouldn't be able to use an index for an OR clause like this.

If you used a very large date, like 9999-01-01, as your "not deactivated" value, then the constraint would be C.disabled > '2003-02-28' and postgres could use an index on "disabled".

Alternatively, if you have a disabled_flag and disabled_date, then you could have an index on (disabled_flag, disabled_date) and, uhm, there should be a way to use that index, though I'm not seeing it right now.

This won't matter at first, when 99% of your customers are active. And ideally in this query you find some way to use an index to find "kate" rather than doing a full table scan. But later, when 90% of the clients are disabled, then in a bigger batch job where you actually want to process every active record it could prevent postgres from having to dig through a table full of old inactive records.

> This may make better use of your index, because the planner will have a
> more accurate estimate of the number of rows returned from the outer join.
>
> AND:
>
> AND ( C.Accountnum ~* 'kate'
>       OR C.Firstname ~* 'kate'
>       OR C.Lastname ~* 'kate'
>       OR C.Organization ~* 'kate'
>       OR C.Address ~* 'kate'
>       OR C.Postal ~* 'kate'
>       OR C.City ~* 'kate'
>       OR EA.Name || '@' || JoinDomain(EA.Domain) ~* 'kate' )
>
> This set of expressions has "seq scan" written all over it. I highly
> suggest that you try to find a way to turn these into anchored text
> searches, perhaps using functional indexes on lower(column).

If you really need to find substring matches everywhere, you might want to look into the full text search module in contrib/tsearch. I haven't started using it yet, but I expect I will have to when I get to that part of my project.

> Finally:
>
> OR CMS.Package ~* 'kate'

*confusion*. Oooh. Yeah, this one is a big problem. It means it's not clear which end of the join to start with. Maybe it would be better to separate this into two separate queries: give the user the option to search for a user "kate" or a package "kate", but not both simultaneously.

--
greg
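A sketch of that sentinel-date idea; the Disabled column comes from the query, while the one-off UPDATE and the index name are illustrative:

    -- Use a far-future date instead of NULL to mean "not deactivated".
    UPDATE Customer SET Disabled = '9999-01-01' WHERE Disabled IS NULL;

    -- A plain btree index then covers the whole "still active" test:
    CREATE INDEX customer_disabled_idx ON Customer (Disabled);

    -- ... WHERE C.Disabled > '2003-02-28'   -- single indexable range condition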
>  ->  Merge Join  (cost=6106.42..6335.30 rows=2679 width=265) (actual time=859.77..948.06 rows=1 loops=1)

Actually, another problem: notice the big discrepancy between the estimated rows and the actual rows. That's because you have the big OR clause, so postgres figures there's a good chance one of the clauses will be true and estimates that a lot of rows will match. In fact, of course, they're all very selective and you'll usually get only a few records.

If you're stuck with the unanchored text search it will always do a full table scan, so it will never be lightning fast. But it would probably be a bit faster if you put a limit clause (on a subquery) on the table that's doing the full table scan. That will convince postgres that there won't be thousands of resulting records, which might convince it to do a nested loop. Also, as a beneficial side effect, it will limit the damage if one of your users does a search for "e"...

This only really helps if you can get rid of the OR CMS.package clause... otherwise it actually needs all the records in case they match a summary record with a kate package.

--
greg
I beg to differ.

A NULL field means not set. Having to use a workaround because the database does not index NULL is one thing, but making it a general rule is not. Having NULL indexed would also speed up things when "is null" is part of the query.

Until then...

JLL

Greg Stark wrote:
>
> One suggestion I'll make about your data model -- I'm not sure it would
> actually help this query, but might help elsewhere:
>
>   WHERE ( C.Disabled > '2003-02-28'
>           OR C.Disabled IS NULL
>         )
>
> Don't use NULL values like this. Most databases don't index NULLs (Oracle)
> or, even if they do, don't make "IS NULL" an indexable operation (postgres).
> There's been some talk of changing this in postgres, but even then it
> wouldn't be able to use an index for an OR clause like this.
>
> If you used a very large date, like 9999-01-01, as your "not deactivated"
> value, then the constraint would be C.disabled > '2003-02-28' and postgres
> could use an index on "disabled".
>
> Alternatively, if you have a disabled_flag and disabled_date, then you could
> have an index on (disabled_flag, disabled_date) and, uhm, there should be a
> way to use that index, though I'm not seeing it right now.
>
> This won't matter at first, when 99% of your customers are active. And
> ideally in this query you find some way to use an index to find "kate"
> rather than doing a full table scan. But later, when 90% of the clients are
> disabled, then in a bigger batch job where you actually want to process
> every active record it could prevent postgres from having to dig through a
> table full of old inactive records.
Jean-Luc Lachance <jllachan@nsd.ca> writes:

> I beg to differ.
>
> A NULL field means not set.

The best description for what NULL means is "unknown".

> Having to use a workaround because the database does not index null is
> one thing, but making it a general rule is not.

My natural inclination is to use exactly the representation he used. I've done so numerous times in the past. But using NULL sort of means "we don't know when this account might have been deactivated", which is why it leads to all these awkward OR clauses in his queries.

I admit using 9999-01-01 as a date gives me the willies. But it does match the way the field is used, and it produces nice clean index range lookups.

> Having NULL indexed would also speed up things when "is null" is part of
> the query.

No, it wouldn't. Not in his query. His query had

  disabled IS NULL OR disabled < ?

Even if "IS NULL" were indexable, this still wouldn't be an indexable clause.

Another option would be to use a functional index.

  CREATE FUNCTION disabled_as_of(timestamp with time zone)
  RETURNS timestamp with time zone
  AS 'SELECT coalesce($1, ''9999-01-01'')'
  LANGUAGE sql IMMUTABLE;

Then index disabled_as_of(disabled) and access it with

  disabled_as_of(disabled) > current_timestamp

(or perhaps it would be clearer as "NOT disabled_as_of(disabled) < current_timestamp").

Of course all this is just extra busywork to keep the ugly data representation out of the data model and hidden in the functional index. And it's not really very well hidden either.

--
greg
Greg Stark wrote:
>
> Jean-Luc Lachance <jllachan@nsd.ca> writes:
>
> > I beg to differ.
> >
> > A NULL field means not set.
>
> The best description for what NULL means is "unknown".

I agree.

> > Having to use a workaround because the database does not index null is
> > one thing, but making it a general rule is not.
>
> My natural inclination is to use exactly the representation he used.
> I've done so numerous times in the past. But using NULL sort of means "we
> don't know when this account might have been deactivated", which is why
> it leads to all these awkward OR clauses in his queries.
>
> I admit using 9999-01-01 as a date gives me the willies. But it does match
> the way the field is used, and it produces nice clean index range lookups.

I know it is quite far in the future, but it reminds me too much of the Y2K problem. One of my customers buried 99-12-31 in their code as an undefined date...

> > Having NULL indexed would also speed up things when "is null" is part of
> > the query.
>
> No, it wouldn't. Not in his query. His query had
>   disabled IS NULL OR disabled < ?
> Even if "IS NULL" were indexable, this still wouldn't be an indexable clause.

What?

  SELECT f FROM t WHERE f IS NULL OR f < '2003-03-04';

is the same as

  SELECT f FROM t WHERE f IS NULL
  UNION
  SELECT f FROM t WHERE f < '2003-03-04';

Surely the optimizer will know to use the index. Will it not???

JLL

> Another option would be to use a functional index.
>
>   CREATE FUNCTION disabled_as_of(timestamp with time zone)
>   RETURNS timestamp with time zone
>   AS 'SELECT coalesce($1, ''9999-01-01'')'
>   LANGUAGE sql IMMUTABLE;
>
> Then index disabled_as_of(disabled) and access it with
>
>   disabled_as_of(disabled) > current_timestamp
>
> (or perhaps it would be clearer as
> "NOT disabled_as_of(disabled) < current_timestamp")
>
> Of course all this is just extra busywork to keep the ugly data
> representation out of the data model and hidden in the functional index.
> And it's not really very well hidden either.
>
> --
> greg
Greg Stark <gsstark@MIT.EDU> writes:

> I admit using 9999-01-01 as a date gives me the willies. But it does match
> the way the field is used, and it produces nice clean index range lookups.

If you're willing to switch to type 'timestamp' then you could use the special value 'infinity' (or '-infinity' when you need it to sort before instead of after normal times).

There was some talk awhile back of making type 'date' support infinity as well. This'd cost one day at each end of the allowable range, which doesn't seem like a big loss. But no one's done the legwork.

			regards, tom lane
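A tiny sketch of what that looks like; the column usage is illustrative, and the point is simply that 'infinity' sorts after every finite timestamp, so the active-customer filter needs no IS NULL branch:

    -- 'infinity' is a legal timestamp value that compares greater than any
    -- normal time:
    SELECT TIMESTAMP WITH TIME ZONE 'infinity' > now();   -- true

    -- With Disabled stored as a timestamptz and 'infinity' meaning "never
    -- disabled", the filter becomes a single btree-indexable condition:
    --   WHERE C.Disabled > '2003-02-28'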