Thread: optimizer not using an index...
SELECT customers.usedby, customers.hostname, customers.email,
       customers.valid, customers.signuptime, pincodes.codenum,
       pincodes.code, subaccts.type, subaccts.batch, subaccts.renew,
       subaccts.active, subaccts.price, subaccts.rebill, domains.name,
       domains.client, domains.authtype, ibill.login, ibill.passwd,
       customers.transnum, customers.subscr
  FROM pincodes, subaccts, customers, domains, owners, ibill
 WHERE pincodes.codenum=customers.codenum
   AND pincodes.type=subaccts.type
   AND domains.client=subaccts.client
   AND domains.client=owners.client
   AND ibill.client=domains.client
   AND customers.email='caffeine@toodarkpark.org';

explain reveals that postgres ( 6.5.0 ) isnt using some of my indexes,
opting instead for a complete table scan ( and drastically slowing things
down ).

Hash Join  (cost=48272.95 rows=293465 width=222)
  ->  Nested Loop  (cost=20598.72 rows=376206 width=138)
        ->  Hash Join  (cost=58.73 rows=413 width=118)
              ->  Seq Scan on subaccts  (cost=17.51 rows=379 width=58)
              ->  Hash  (cost=24.84 rows=87 width=60)
                    ->  Hash Join  (cost=24.84 rows=87 width=60)
                          ->  Hash Join  (cost=14.56 rows=81 width=56)
                                ->  Seq Scan on ibill  (cost=3.64 rows=80 width=28)
                                ->  Hash  (cost=4.64 rows=80 width=28)
                                      ->  Seq Scan on domains  (cost=4.64 rows=80 width=28)
                          ->  Hash  (cost=3.81 rows=85 width=4)
                                ->  Seq Scan on owners  (cost=3.81 rows=85 width=4)
        ->  Index Scan using codes_type_idx on pincodes  (cost=49.73 rows=345235 width=20)
  ->  Hash  (cost=546.46 rows=8757 width=84)
        ->  Seq Scan on customers  (cost=546.46 rows=8757 width=84)

I have an index on customers.name, subaccts.type, ibill.client,
owners.client... every column thats being queried on. tables are all
vacuum analyze'ed ( the DEBUG notice shows that postgres does indeed
'see' the indexes, which are all btree's, btw ). customers table has
12k entries, pincodes has 350k, ibill has 80, domains has 80, owners
has 80, subaccts has 380.

doing a complete table scan on a column thats indexed isnt really that
nice, especially since there are 12,000 entries in it. why postgres
chooses to use table scans on other tables is also beyond me: "explain
select * from (ibill|domains|owners|subaccts) where client=1" uses the
proper index. the hash join and nested loop also bug me; thats a lot of
rows to cycle through.

interestingly, when querying on "pincodes.code" instead of
"customers.name", postgres does NOT use a full table scan; it uses the
proper indexes:

Hash Join  (cost=23.78 rows=5 width=222)
  ->  Seq Scan on ibill  (cost=3.64 rows=80 width=28)
  ->  Hash  (cost=16.37 rows=4 width=194)
        ->  Nested Loop  (cost=16.37 rows=4 width=194)
              ->  Nested Loop  (cost=10.22 rows=3 width=190)
                    ->  Nested Loop  (cost=6.12 rows=2 width=162)
                          ->  Nested Loop  (cost=4.07 rows=1 width=104)
                                ->  Index Scan using codes_code_idx on pincodes  (cost=2.02 rows=1 width=20)
                                ->  Index Scan using users_pkey on customers  (cost=2.05 rows=11226 width=84)
                          ->  Index Scan using types_pkey on subaccts  (cost=2.05 rows=379 width=58)
                    ->  Index Scan using doms_pkey on domains  (cost=2.05 rows=80 width=28)
              ->  Index Scan using owner_client_idx on owners  (cost=2.05 rows=85 width=4)

so what gives ? the two queries are 90% identical apart from the column
that's being keyed on ( customers.name -vs- pincodes.code ). any help
would be MOST appreciated.

---
Howie <caffeine@toodarkpark.org>   URL: http://www.toodarkpark.org
"The distance between insanity and genius is measured only by success."
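The variant that produces the second plan differs only in its final
restriction: it keys on pincodes.code instead of customers.email. A
sketch of it, with the select list trimmed for brevity and a made-up
code value:

SELECT pincodes.codenum, pincodes.code, customers.email, subaccts.type,
       domains.name, ibill.login
  FROM pincodes, subaccts, customers, domains, owners, ibill
 WHERE pincodes.codenum=customers.codenum
   AND pincodes.type=subaccts.type
   AND domains.client=subaccts.client
   AND domains.client=owners.client
   AND ibill.client=domains.client
   AND pincodes.code='ABC123';          -- hypothetical code value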
Howie <caffeine@toodarkpark.org> writes:
> explain reveals that postgres ( 6.5.0 ) isnt using some of my indexes,
> opting instead for a complete table scan ( and drastically slowing things
> down ).

Well, mumble.  The optimizer certainly needs work, but what's your
evidence that this query would be faster if done another way?  Hashing
the smaller tables, as it's doing, ought to be a pretty good strategy.

One way to check is to start your client with environment variable
setting PGOPTIONS="-fh" ("forbid hashjoin") to discourage the optimizer
from using hashes, then check the generated plan for the same query and
see what its actual runtime is.  That's likely to be a suboptimal plan
however, since it'll turn off *all* hashing.  The hash on customers is
probably the thing that's bothering you.

How many result rows do you actually get from this query?  If you
eliminate the customers table from the query, and just do the same join
among the remaining tables, how many rows do you get?  I suspect the
optimizer is drastically off in its estimate of ~300k result rows, and
that's contributing to the problem.

> doing a complete table scan on a column thats indexed isnt really that
> nice, especially since there are 12,000 entries in it.

But it's estimating it's going to have to probe that table 300k times,
which makes the hashjoin look mighty attractive...

> interestingly, when querying on "pincodes.code" instead of
> "customers.name", postgres does NOT use a full table scan; it uses the
> proper indexes:

> Hash Join  (cost=23.78 rows=5 width=222)

Note the drastic difference in the estimated result-row count; that's
undoubtedly what's changing the optimizer's choice of what to do.  You
haven't given us enough detail to understand why this query would be
(or at least seem) more selective than the other, however.

Anyway, this looks to me like it is probably a symptom of poor
selectivity estimation leading to bogus estimates of output row counts
leading to a nonoptimal plan choice.  I have been working on improving
the selectivity estimation for 6.6, and am looking for test cases to
check out the logic on.  Is your database small enough/non proprietary
enough that you could send me a dump?  Or could you strip it down to
a test case that still exhibits the same misbehavior?  If you don't
like either of those, perhaps you could grab a current snapshot,
install your data in a test postmaster, and report back on whether it
acts any different...

			regards, tom lane
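In concrete terms, the PGOPTIONS test looks something like the
following; the database name mydb is a placeholder:

$ PGOPTIONS="-fh" psql mydb     # "mydb" stands in for the real database
mydb=> EXPLAIN SELECT customers.email, pincodes.code
mydb->   FROM pincodes, subaccts, customers, domains, owners, ibill
mydb->   WHERE pincodes.codenum=customers.codenum
mydb->     AND pincodes.type=subaccts.type
mydb->     AND domains.client=subaccts.client
mydb->     AND domains.client=owners.client
mydb->     AND ibill.client=domains.client
mydb->     AND customers.email='caffeine@toodarkpark.org';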
On Fri, 27 Aug 1999, Tom Lane wrote:

> Howie <caffeine@toodarkpark.org> writes:
> > explain reveals that postgres ( 6.5.0 ) isnt using some of my indexes,
> > opting instead for a complete table scan ( and drastically slowing things
> > down ).
>
> Well, mumble.  The optimizer certainly needs work, but what's your
> evidence that this query would be faster if done another way?  Hashing
> the smaller tables, as it's doing, ought to be a pretty good strategy.

primary focus was on why pgsql decided to do a complete table scan of
customers ( searching for customer.email ) when there's an index on it;
surely an index scan would be loads faster than a complete table scan.

the other interesting item i ran into was why one query used the indexes
provided and the other didnt. granted, pincodes has at least 10x more
entries than customers does ( ~300k+ vs ~11k ).

> One way to check is to start your client with environment variable
> setting PGOPTIONS="-fh" ("forbid hashjoin") to discourage the optimizer
> from using hashes, then check the generated plan for the same query and
> see what its actual runtime is.  That's likely to be a suboptimal plan
> however, since it'll turn off *all* hashing.  The hash on customers is
> probably the thing that's bothering you.

that, and pgsql not using the indexes that the query on pincodes.code
uses. using -fh causes pgsql to use all the proper indexes, but still
beefs up 4 merge joins, seq scans, sorts, and nested loops:

[excerpt]
Merge Join  (cost=355979.91 rows=293465 width=222)
  ->  Seq Scan  (cost=342501.81 rows=376206 width=138)
        ->  Sort  (cost=342501.81 rows=376206 width=138)
              ->  Nested Loop  (cost=20616.32 rows=376206 width=138)
                    ->  Merge Join  (cost=76.34 rows=413 width=118)
                          ->  Merge Join  (cost=33.01 rows=87 width=60)
                                ->  Merge Join  (cost=20.28 rows=81 width=56)
( query on customers.email )

> How many result rows do you actually get from this query?

when querying on pincodes.code, i get 1 row back. when querying on
customers.email, i get 4 rows back. it's producing the correct results,
its just going about getting those results in a somewhat odd manner.

> If you eliminate the customers table from the query, and just do the
> same join among the remaining tables, how many rows do you get?

1, as expected.

Hash Join  (cost=21.73 rows=5 width=138)
  ->  Seq Scan on ibill  (cost=3.64 rows=80 width=28)
  ->  Hash  (cost=14.32 rows=4 width=110)
        ->  Nested Loop  (cost=14.32 rows=4 width=110)
              ->  Nested Loop  (cost=8.17 rows=3 width=106)
                    ->  Nested Loop  (cost=4.07 rows=2 width=78)
                          ->  Index Scan using codes_code_idx on pincodes  (cost=2.02 rows=1 width=20)
                          ->  Index Scan using types_pkey on subaccts  (cost=2.05 rows=379 width=58)
                    ->  Index Scan using doms_pkey on domains  (cost=2.05 rows=80 width=28)
              ->  Index Scan using owner_client_idx on owners  (cost=2.05 rows=85 width=4)

> I suspect the optimizer is drastically off in its estimate of ~300k
> result rows, and that's contributing to the problem.

yes, so why is it off for one query but right on target for the other ?
more importantly, why is it choosing to use indexes for one query yet
choosing to do complete table scans for the other ( even though the two
queries are almost identical ) ? answer that and i'll personally treat
you to a beer. :)

> > doing a complete table scan on a column thats indexed isnt really that
> > nice, especially since there are 12,000 entries in it.
>
> But it's estimating it's going to have to probe that table 300k times,
> which makes the hashjoin look mighty attractive...
why would it have to probe pincodes 300k times when there's a unique
index on pincodes ( pincodes.codenum ) and a unique index on customers
( customers.codenum ) ?

> > interestingly, when querying on "pincodes.code" instead of
> > "customers.name", postgres does NOT use a full table scan; it uses the
> > proper indexes:
>
> > Hash Join  (cost=23.78 rows=5 width=222)
>
> Note the drastic difference in the estimated result-row count; that's
> undoubtedly what's changing the optimizer's choice of what to do.  You
> haven't given us enough detail to understand why this query would be
> (or at least seem) more selective than the other, however.

im not sure it would; i placed indexes in ( what i thought were ) all
the proper places - pincodes.code, pincodes.codenum, customers.name,
customers.codenum... every column that's being queried on or joined on.
ideally, all the indexes would be used and querying on customers.email
would be ( query-plan wise ) almost identical to querying on
pincodes.code; only an index would change.

> Anyway, this looks to me like it is probably a symptom of poor
> selectivity estimation leading to bogus estimates of output row counts
> leading to a nonoptimal plan choice.  I have been working on improving
> the selectivity estimation for 6.6, and am looking for test cases to
> check out the logic on.  Is your database small enough/non proprietary
> enough that you could send me a dump?

i could send you a dump, but the db is fairly large; 7m uncompressed,
2m gzip -9'ed. none of the data is overly sensitive. i am, however, on
an overly lagged 28k8.

> Or could you strip it down to
> a test case that still exhibits the same misbehavior?

i havent tried trimming the db down. im not sure that would 'fix' the
problem if the optimizer is misguessing the number of rows it's going
to have to look at...

> If you don't
> like either of those, perhaps you could grab a current snapshot, install
> your data in a test postmaster, and report back on whether it acts any
> different...

over the weekend im planning on upgrading to 6.5.1, but i dont recall
seeing any changes to the optimizer in the changelog...

---
Howie <caffeine@toodarkpark.org>   URL: http://www.toodarkpark.org
"The distance between insanity and genius is measured only by success."
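For reference, the indexes described in this thread amount to roughly
the following. Only codes_type_idx, codes_code_idx, owner_client_idx,
and the *_pkey primary keys are named in the plans; the other index
names below are made up for illustration:

-- named in the EXPLAIN output above
CREATE INDEX codes_code_idx ON pincodes (code);
CREATE INDEX owner_client_idx ON owners (client);

-- described in the thread as unique but never named; names hypothetical
CREATE UNIQUE INDEX codes_codenum_idx ON pincodes (codenum);
CREATE UNIQUE INDEX custs_codenum_idx ON customers (codenum);

-- the email index assumed to exist; name hypothetical
CREATE INDEX custs_email_idx ON customers (email);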
Howie <caffeine@toodarkpark.org> writes:
> On Fri, 27 Aug 1999, Tom Lane wrote:
>> Well, mumble.  The optimizer certainly needs work, but what's your
>> evidence that this query would be faster if done another way?  Hashing
>> the smaller tables, as it's doing, ought to be a pretty good strategy.

> primary focus was on why pgsql decided to do a complete table scan of
> customers ( searching for customer.email ) when there's an index on it;
> surely an index scan would be loads faster than a complete table scan.

No, an index scan will be loads slower if most or all of the table has
to be visited.  If it were always faster, the optimizer would have a
much easier task ;-).  The join we are considering here is on
pincodes.codenum = customers.codenum, and since the system (mistakenly)
thinks it is going to be joining a lot of rows from pincodes, doing it
with a hash rather than an indexscan on customers.codenum looks good.

You bring up a good point though: why isn't it deciding to use the
restriction on customer.email for an indexscan of customers, using the
index on email rather than the one on codenum?  (I assume you have
both.)  It looks like the thing is misestimating the selectivity of
that restriction too: there are not going to be 8757 customer rows that
pass that WHERE clause --- but the system thinks that, and that's why
it's rejecting the indexscan as being slower than the sequential.  I
had missed that point before, being distracted by the even sillier row
count from the other subjoin.  There are clearly two different
selectivity mistakes being made here :-(

> the other interesting item i ran into was why one query used the indexes
> provided and the other didnt. granted, pincodes has at least 10x more
> entries than customers does ( ~300k+ vs ~11k ).

Right, and the issue seems to be that when you are restricting pincodes
with an extra WHERE clause, the estimated output row count is
realistic, whereas when you aren't restricting pincodes it ain't.  I'm
not sure why, yet.

>> But it's estimating it's going to have to probe that table 300k times,
>> which makes the hashjoin look mighty attractive...

> why would it have to probe pincodes 300k times when there's a unique index
> on pincodes ( pincodes.codenum ) and a unique index on customers (
> customers.codenum ) ?

It's guessing that it's going to get 300k rows out of the join of
pincodes to all the little tables (ie, about one row per pincode row;
it doesn't know there's only going to be one match), and then it thinks
it's going to have to look through most of the customers table for
match(es) to each of those rows.

If the selectivity estimates that the system was making were in the
right ballpark, then the hash join probably *would* be the right way to
do it.  The problem here seems to be that the estimated row counts are
so far off-base.  I'd like to find out why.

> i could send you a dump, but the db is fairly large; 7m uncompressed,
> 2m gzip -9'ed. none of the data is overly sensitive. i am, however, on
> an overly lagged 28k8.

2m gzipped is no problem at this end, but I'll bet we'd see the same
behavior with only half or a third of the data, if you can extract a
consistent subset easily...

> i havent tried trimming the db down. im not sure that would 'fix' the
> problem if the optimizer is misguessing the number of rows it's going
> to have to look at...

No, it wouldn't be a fix for your problem.  I was just thinking of
making a smaller, easier-to-ship test case that I could use to make
sure the problem is fixed for 6.6.
> over the weekend im planning on upgrading to 6.5.1, but i dont recall
> seeing any changes to the optimizer in the changelog...

There are not; this is work for 6.6.  The fixes involve storing more
and different info in the system statistics table, so there is no
prospect of back-patching them into 6.5.*.  (BTW, I think 6.5.2 will
be out shortly, so you might want to wait another week before
updating.)

			regards, tom lane
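A quick way to see the selectivity mistake Tom describes is to compare
the plan's estimate against an actual count. A sketch, using the value
from the original query:

-- the first plan estimated 8757 matching customers; the real number
-- is tiny ( the thread reports 4 result rows for the whole join )
SELECT count(*) FROM customers WHERE email='caffeine@toodarkpark.org';

-- the statistics the estimator reads come from here ( already run,
-- per the thread )
VACUUM ANALYZE customers;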