Thread: Are statistics gathered on function indexes?
Hi- Can someone tell me how the cost is estimated for retrieving a column based on a function that is indexed? The issue I'm wrestling with is a query which works nicely when based on a plain "LIKE field", but poorly using "LIKE upper(field)". There is an index on both. Since this field has only uppercase strings in it currently, I know that both indexes are equally selective, but the planner judges that "LIKE field" will return 1 row, and "LIKE upper(field)" will return 2168 rows. In both cases, the index is used, but in the next step, joining to another table, the regular version uses an index on the other table, while the "upper" version uses a seq scan. I'm guessing that the scan is used because if we are going after 2168 rows in the adjoining table, the index is no longer a good choice. I'm able to see the stats for the field using pg_stats, but don't see anything connected to the function, so I'm guessing that real stats aren't kept & some sort of default is used. perhaps I can modify this default. Also, even with 2168 rows to gather, my experience based on cases where several thousand rows really are returned indicates that the index would still be a good choice. Is there a way to make the planner favor index scans a bit more? (Other than the drastic set enable_seqscan to off.) Thanks -Nick -------------------------------------------------------------------------- Nick Fankhauser nickf@ontko.com Phone 1.765.935.4283 Fax 1.765.962.9788 Ray Ontko & Co. Software Consulting Services http://www.ontko.com/
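[For concreteness, a minimal sketch of the kind of setup Nick describes.
The table, column, and index names here are placeholders, not taken from
his schema:]

    CREATE TABLE t (field text);

    -- plain index and functional index on the same column
    CREATE INDEX t_field_idx       ON t (field);
    CREATE INDEX t_field_upper_idx ON t (upper(field));

    ANALYZE t;

    -- the plain predicate gets a per-column estimate from pg_stats ...
    EXPLAIN SELECT * FROM t WHERE field LIKE 'WILLIAMS%';
    -- ... while the functional predicate falls back to a default guess
    EXPLAIN SELECT * FROM t WHERE upper(field) LIKE 'WILLIAMS%';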
"Nick Fankhauser" <nickf@ontko.com> writes: > [see subject] Nope, they ain't. I agree they should be. > Can someone tell me how the cost is estimated for retrieving a column based > on a function that is indexed? It falls back to a default selectivity estimate, which is something like 1% or 0.5% (depending on which version you are running). > Also, even with 2168 rows to gather, my experience based on cases where > several thousand rows really are returned indicates that the index would > still be a good choice. Is there a way to make the planner favor index scans > a bit more? (Other than the drastic set enable_seqscan to off.) I'd suggest reducing random_page_cost; we've seen a number of anecdotal reports that the default of 4.0 is too high, though nothing systematic enough to refute the experiments I did to get that number awhile back. (IMHO anyway. Others may differ.) regards, tom lane
Tom-

Thanks for the info-

Based on your response plus some local issues, we're going to work around
this by simply creating another column containing the results of the
function and then index the column. That gets the results we want without
tweaking something we may regret later.

Stats for function indexes would be nice, so add our vote for it to
wherever such things are tallied to come up with priorities.

Regards,

-Nick

> -----Original Message-----
> From: Tom Lane
> Subject: Re: [ADMIN] Are statistics gathered on function indexes?
>
> It falls back to a default selectivity estimate, which is something
> like 1% or 0.5% (depending on which version you are running).
> [...]
> I'd suggest reducing random_page_cost; we've seen a number of anecdotal
> reports that the default of 4.0 is too high [...]
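[A minimal sketch of the workaround Nick describes. Names are placeholders,
and keeping the shadow column in sync via a trigger or application code is
left out:]

    ALTER TABLE t ADD COLUMN field_upper text;
    UPDATE t SET field_upper = upper(field);
    CREATE INDEX t_field_upper_col_idx ON t (field_upper);
    VACUUM ANALYZE t;

    -- queries then filter on the real column, which has real statistics:
    SELECT * FROM t WHERE field_upper LIKE 'WILLIAMS%';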
Tom, et al,

Yes, thanks.

Another thing that we noticed is that when LIKE is used on an index with a
constant value like 'WILLIAMS%', a full table scan occurs when the constant
is 'W%', but the index is used if the like string is 'WI%' or longer.

It seems to me that the selectivity of the string would vary with the length
of the string, perhaps as a fraction of the length of the field. In other
words, I would have expected the selectivity to vary something like this:

  W%         -> 0.1
  WI%        -> 0.01
  WIL%       -> 0.001
  WILL%      -> 0.0001
  WILLI%     -> 0.00001
  WILLIA%    -> 0.000001
  WILLIAM%   -> 0.0000001
  WILLIAMS%  -> 0.00000001

In other words, if I only give one letter, then I might expect to get about
1/10 of the table, and a full scan might make sense. But the cost should
continue to decline as I give longer and longer strings, up to the length
of the field.

Would this be a reasonable improvement to the optimizer?

Ray

> Based on your response plus some local issues, we're going to work around
> this by simply creating another column containing the results of the
> function and then index the column. [...]

----------------------------------------------------------------------
Ray Ontko                rayo@ontko.com
Phone 1.765.935.4283     Fax 1.765.962.9788
Ray Ontko & Co.          Software Consulting Services     http://www.ontko.com/
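[One way to read Ray's proposal as a formula; purely illustrative, and the
1/10-per-character base is his example, not anything the planner uses:]

    -- selectivity ~ (0.1) ^ (number of characters before the wildcard)
    select power(0.1, length('W'))        as sel_w,         -- 0.1
           power(0.1, length('WILLIAMS')) as sel_williams;  -- 1e-08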
Ray Ontko <rayo@ontko.com> writes:
> In other words, if I only give one letter, then I might expect
> to get about 1/10 of the table, and a full scan might make sense.
> But the cost should continue to decline as I give longer and longer
> strings, up to the length of the field.

> Would this be a reasonable improvement to the optimizer?

It's there already; what did you think was making the difference
between W% and WI% ?

			regards, tom lane
The vast majority of the cost of using an index is due to reading the data
blocks in the table in a random order, rather than sequentially, right?

Rather than making estimates of the selectivity of an index when you search
for 'W%', why not actually start the index lookup and count how many you
get? If you know you want to do a table scan if you have more than, say,
500 rows that match 'W%', you'd only have to read a few index pages to
determine whether or not there are more than 500 rows, right?

Or am I on crack here?

cjs
--
Curt Sampson <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC
Tom,

> > In other words, if I only give one letter, then I might expect
> > to get about 1/10 of the table, and a full scan might make sense.
> > But the cost should continue to decline as I give longer and longer
> > strings, up to the length of the field.
> > Would this be a reasonable improvement to the optimizer?
>
> It's there already; what did you think was making the difference
> between W% and WI% ?

Yes, but the cost doesn't continue to decline if I make the LIKE more and
more restrictive by going from WI% to WIL% to WILL%, etc. The current
approach assumes, perhaps correctly, that with only one letter you might as
well do a full table scan, but with 2 or more letters you might as well use
an index and make a reasonable guess at the cost.

The limitation with this approach is that the optimizer is considering a
number of different options if the join includes a number of tables and
constraints. My query suffers from a one-size-fits-all approach. If the
optimizer had a better guess on cost, it could choose to drive my query
using this index instead of another.

By allowing the cost to decline as the length of the string increases,
we're making the bet that longer strings are more selective and require
fewer random pages to be read, among other improvements. Note that this
would behave badly in situations where all the values in the index begin
with the same 10 characters and the LIKE string is less than 10 characters
long. This is already a problem for the current approach, and I think it
could only be solved by adding statistics for the selectivity of
increasingly longer strings.

BTW, I think that this discussion probably belongs in a different list
(e.g., the one for hacking the optimizer). Since I'm not (yet) planning to
jump into the code, my purpose for raising the question here is to help me
(and others on the list) understand the capabilities and limitations of
indexes and the optimizer so that we can make better use of what we have
currently, and help identify areas for improvement.

Ray

----------------------------------------------------------------------
Ray Ontko                rayo@ontko.com
Phone 1.765.935.4283     Fax 1.765.962.9788
Ray Ontko & Co.          Software Consulting Services     http://www.ontko.com/
Ray Ontko <rayo@ontko.com> writes:
>> It's there already; what did you think was making the difference
>> between W% and WI% ?

> Yes, but the cost doesn't continue to decline if I make the LIKE
> more and more restrictive by going from WI% to WIL% to WILL%, etc.

Yes it does, if you have a large enough table. In most scenarios the
selectivity drops off fast enough with larger strings that you hit the
minimum estimate of 1 row pretty quickly; I suppose that's what's
happening with your case. Here's an example using the 7.2
regression-test database:

-- update stats
regression=# analyze road;
ANALYZE

-- now force planner to think "road" is much larger than it really is,
-- else we can't see the change in estimate beyond WI%
regression=# update pg_class set relpages = relpages * 10000,
regression-#   reltuples = reltuples * 10000 where relname = 'road';
UPDATE 1

regression=# explain select * from only road where name like 'W%';
NOTICE:  QUERY PLAN:

Seq Scan on road  (cost=0.00..1444625.00 rows=764903 width=89)

EXPLAIN
regression=# explain select * from only road where name like 'WI%';
NOTICE:  QUERY PLAN:

Index Scan using rix on road  (cost=0.00..25007.80 rows=8406 width=89)

EXPLAIN
regression=# explain select * from only road where name like 'WIJ%';
NOTICE:  QUERY PLAN:

Index Scan using rix on road  (cost=0.00..277.04 rows=92 width=89)

EXPLAIN
regression=# explain select * from only road where name like 'WIJK%';
NOTICE:  QUERY PLAN:

Index Scan using rix on road  (cost=0.00..5.28 rows=1 width=89)

EXPLAIN
regression=# explain select * from only road where name like 'WIJKL%';
NOTICE:  QUERY PLAN:

Index Scan using rix on road  (cost=0.00..5.23 rows=1 width=89)

EXPLAIN
regression=# explain select * from only road where name like 'WIJKLM%';
NOTICE:  QUERY PLAN:

Index Scan using rix on road  (cost=0.00..5.23 rows=1 width=89)

EXPLAIN
regression=#

As you can see, the estimate drops off by about a factor of 90 per added
character. This is probably too much, but it's not that easy to determine
what the ratio ought to be. The critical code involved in this is
convert_string_to_scalar in backend/utils/adt/selfuncs.c; the ratio per
character is essentially the same as the character range that it induces
from the available values. Feel free to propose a better implementation
if you can think of one.

			regards, tom lane
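[As a quick sanity check on the factor Tom quotes, the per-character ratio
can be read directly off the row estimates above; illustrative arithmetic
only, and the later steps are hidden by the one-row floor:]

    select 764903.0 / 8406 as w_to_wi,    -- ~91
           8406.0 / 92     as wi_to_wij;  -- ~91
    -- from 92 down to 1 and beyond, the estimate is clamped at the
    -- minimum of one row, so the ratio is no longer visible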
Tom, et al,

Hmm. Something is weird here. Watch this.

1) First we demonstrate that we drop to 1 row between 1 and 2 letters.
2) Then we do an analyze and things work the way you suggest.
3) Then we do some more stuff.
4) Things revert to the old way.
5) And even after we re-analyze correctly, things still look broken.

Eek. Do you have an explanation for this behavior?

It appears that "vacuum analyze verbose actor" causes the problem. It
appears that I have to say "vacuum analyze actor" in order to clear out the
ill effects of having said "vacuum analyze verbose actor". Typing "vacuum
verbose analyze actor" doesn't clear things out, but it doesn't produce the
problem.

Ray

P.S. psql (PostgreSQL) 7.2.1

**********
1) first we demonstrate that we drop to 1 row between 1 and 2 letters.
**********

develop=# select count(*) from actor ;
 count
--------
 433902
(1 row)

develop=# explain select * from actor where actor_full_name like 'W%' ;
NOTICE:  QUERY PLAN:

Seq Scan on actor  (cost=0.00..12529.77 rows=3992 width=570)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=570)

EXPLAIN

**********
2) then we do an analyze and things work the way you suggest.
**********

develop=# vacuum analyze verbose actor ;
NOTICE:  --Relation actor--
NOTICE:  Pages 7106: Changed 0, Empty 0; Tup 433902: Vac 0, Keep 0, UnUsed 1443.
	Total CPU 0.23s/0.07u sec elapsed 0.30 sec.
NOTICE:  Analyzing actor
VACUUM
develop=# \h vac
Command:     VACUUM
Description: garbage-collect and optionally analyze a database
Syntax:
VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ table ]
VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] ANALYZE [ table [ (column [, ...] ) ] ]

develop=# vacuum verbose analyze actor ;
NOTICE:  --Relation actor--
NOTICE:  Pages 7106: Changed 0, Empty 0; Tup 433902: Vac 0, Keep 0, UnUsed 1443.
	Total CPU 0.26s/0.05u sec elapsed 0.30 sec.
NOTICE:  Analyzing actor
VACUUM
develop=# explain select * from actor where actor_full_name like 'W%' ;
NOTICE:  QUERY PLAN:

Seq Scan on actor  (cost=0.00..12529.77 rows=7468 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..447.49 rows=112 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WIL%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..9.61 rows=2 width=571)

EXPLAIN

**********
3) then we do some more stuff
**********

develop=# vacuum verbose analyze actor ;
NOTICE:  --Relation actor--
NOTICE:  Pages 7106: Changed 0, Empty 0; Tup 433902: Vac 0, Keep 0, UnUsed 1443.
	Total CPU 0.24s/0.06u sec elapsed 0.30 sec.
NOTICE:  Analyzing actor
VACUUM
develop=# vacuum analyze actor verbose ;
ERROR:  parser: parse error at or near "verbose"
develop=# vacuum verbose analyze actor ;
NOTICE:  --Relation actor--
NOTICE:  Pages 7106: Changed 0, Empty 0; Tup 433902: Vac 0, Keep 0, UnUsed 1443.
	Total CPU 0.22s/0.08u sec elapsed 0.29 sec.
NOTICE:  Analyzing actor
VACUUM
develop=# vacuum analyze verbose actor ;
NOTICE:  --Relation actor--
NOTICE:  Pages 7106: Changed 0, Empty 0; Tup 433902: Vac 0, Keep 0, UnUsed 1443.
	Total CPU 0.24s/0.06u sec elapsed 0.30 sec.
NOTICE:  Analyzing actor
VACUUM

**********
4) Then things revert to the old way.
**********

develop=# explain select * from actor where actor_full_name like 'W%' ;
NOTICE:  QUERY PLAN:

Seq Scan on actor  (cost=0.00..12529.77 rows=6244 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WIL%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WILL%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN

**********
5) And even after we re-analyze correctly, things still look broken
**********

develop=# vacuum verbose analyze actor ;
NOTICE:  --Relation actor--
NOTICE:  Pages 7106: Changed 0, Empty 0; Tup 433902: Vac 0, Keep 0, UnUsed 1443.
	Total CPU 0.30s/0.01u sec elapsed 0.30 sec.
NOTICE:  Analyzing actor
VACUUM
develop=# explain select * from actor where actor_full_name like 'W%' ;
NOTICE:  QUERY PLAN:

Seq Scan on actor  (cost=0.00..12529.77 rows=7130 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WIL%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN
develop=# explain select * from actor where actor_full_name like 'WILL%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN

> Ray Ontko <rayo@ontko.com> writes:
> >> It's there already; what did you think was making the difference
> >> between W% and WI% ?
>
> > Yes, but the cost doesn't continue to decline if I make the LIKE
> > more and more restrictive by going from WI% to WIL% to WILL%, etc.
>
> Yes it does, if you have a large enough table. In most scenarios
> the selectivity drops off fast enough with larger strings that you
> hit the minimum estimate of 1 row pretty quickly; I suppose that's
> what's happening with your case. [...]

----------------------------------------------------------------------
Ray Ontko                rayo@ontko.com
Phone 1.765.935.4283     Fax 1.765.962.9788
Ray Ontko & Co.          Software Consulting Services     http://www.ontko.com/
Tom, et al,

> It appears that "vacuum analyze verbose actor" causes the problem.
> It appears that I have to say "vacuum analyze actor" in order to
> clear out the ill effects of having said "vacuum analyze verbose actor".
> Typing "vacuum verbose analyze actor" doesn't clear things out,
> but it doesn't produce the problem.

Here are two more data points:

1) "analyze verbose actor" causes the problem too. "analyze actor" usually
   clears the problem out.

2) In cases where "analyze actor" doesn't clear out the problem, exiting
   psql and trying again does clear the problem out.

Ray

----------------------------------------------------------------------
Ray Ontko                rayo@ontko.com
Phone 1.765.935.4283     Fax 1.765.962.9788
Ray Ontko & Co.          Software Consulting Services     http://www.ontko.com/
Tom Lane wrote:
> As you can see, the estimate drops off by about a factor of 90 per
> added character. This is probably too much, but it's not that easy
> to determine what the ratio ought to be. The critical code involved
> in this is convert_string_to_scalar in backend/utils/adt/selfuncs.c;
> the ratio per character is essentially the same as the character range
> that it induces from the available values. Feel free to propose a
> better implementation if you can think of one.

FYI, 7.3 will drop off quicker than <=7.2.X:

	* Change FIXED_CHAR_SEL to 0.20 from 0.04 to give better selectivity (Bruce)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> FYI, 7.3 will drop off quicker than <=7.2.X:

Not on this test...

			regards, tom lane
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > FYI, 7.3 will drop off quicker than <=7.2.X:
>
> Not on this test...

Are you sure? I thought FIXED_CHAR_SEL was used for exactly this purpose.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Not on this test...

> Are you sure?

Quite; I get the same behavior from either 7.2 or current.

> I thought FIXED_CHAR_SEL was used for exactly this purpose.

That only applies to characters appearing after the first wildcard.

			regards, tom lane
Ray Ontko <rayo@ontko.com> writes:
>> It appears that "vacuum analyze verbose actor" causes the problem.
>> It appears that I have to say "vacuum analyze actor" in order to
>> clear out the ill effects of having said "vacuum analyze verbose actor".

I really, really doubt that "verbose" has anything to do with it.

What do you get from

	select * from pg_stats where tablename = 'actor' and
	attname = 'actor_full_name';

Do the results change significantly between the "good" state and the
"bad" state? How about the results of

	select relpages, reltuples from pg_class where relname = 'actor';

It would seem that one or another of these statistical items is getting
set weirdly by something you are doing, but I have no idea what exactly
is going wrong...

			regards, tom lane
> Ray Ontko <rayo@ontko.com> writes:
> >> It appears that "vacuum analyze verbose actor" causes the problem.
> >> It appears that I have to say "vacuum analyze actor" in order to
> >> clear out the ill effects of having said "vacuum analyze verbose actor".
>
> I really, really doubt that "verbose" has anything to do with it.
>
> What do you get from
>	select * from pg_stats where tablename = 'actor' and
>	attname = 'actor_full_name';
>
> Do the results change significantly between the "good" state and the
> "bad" state? How about the results of
>	select relpages, reltuples from pg_class where relname = 'actor';
>
> It would seem that one or another of these statistical items is getting
> set weirdly by something you are doing, but I have no idea what exactly
> is going wrong...

Hmm.

1) here's the "bad" stats.
2) here's the "good" stats. Note that the information really is different.
3) here's the results of the relpages, reltuples query. Same whether good
   or bad stats.

Ray

**********
1) here's the "bad" stats.
**********

develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=570)

EXPLAIN
develop=# select * from pg_stats where tablename = 'actor' and
develop-#   attname = 'actor_full_name';

tablename         | actor
attname           | actor_full_name
null_frac         | 0.000333333
avg_width         | 22
n_distinct        | 14657
most_common_vals  | {"INDIANA DEPARTMENT OF REVENUE","AEGIS WOMENS HEALTHCARE","BLOOMINGTON HOSPITAL","MONROE COUNTY TREASURER","PEOPLES STATE BANK","RICHLAND BEAN BLOSSOM CSC","SMITHVILLE TELEPHONE","STATE OF INDIANA","PETTAY, LEE","WOODINGTON COURTS MANAGEMENT"}
most_common_freqs | {0.0813333,0.00366667,0.003,0.00266667,0.00266667,0.00266667,0.00233333,0.00233333,0.002,0.002}
histogram_bounds  | {"(ABEL) CONDER, CRYSTAL","BLOOMINGTON HOUSING AUTHORITY","CORBIN, MARK J","FLEETWOOD, JAMES WILBUR","HAZEL, JEFF W","KIDD, PATTY","MEADOW PARK APARTMENTS","PETERSON, CATHY L","SHADLE, MARY","THRASHER, CHRISTOPHER B","ZYNNCO LLC"}
correlation       | 0.025242
(1 row)

**********
2) here's the "good" stats.
**********

develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..6.01 rows=1 width=571)

EXPLAIN
develop=# analyze actor ;
ANALYZE
develop=# explain select * from actor where actor_full_name like 'WI%' ;
NOTICE:  QUERY PLAN:

Index Scan using actor_full_name on actor  (cost=0.00..433.52 rows=108 width=571)

EXPLAIN
develop=# select * from pg_stats where tablename = 'actor' and
develop-#   attname = 'actor_full_name';

tablename         | actor
attname           | actor_full_name
null_frac         | 0
avg_width         | 22
n_distinct        | 14541
most_common_vals  | {"INDIANA DEPARTMENT OF REVENUE","RICHLAND BEAN BLOSSOM CSC","PETTAY, LEE","STATE OF INDIANA","BAKER DDS, DONALD","BLOOMINGTON HOSPITAL","SMITHVILLE TELEPHONE","AEGIS WOMENS HEALTHCARE","BAKER DDS, LISA","BLOOMINGTON ACCOUNTS SERVICE"}
most_common_freqs | {0.0856667,0.00333333,0.00233333,0.00233333,0.002,0.002,0.002,0.00166667,0.00166667,0.00166667}
histogram_bounds  | {"(FITZPATRICK) STOUT, LISA","BLOOMINGTON HOUSING AUTHORITY","CONKLIN, TONIA A","EWING, CRAIG","HARTENFELD, KATHLEEN A","KELLEY, KIMBERLEY","MDF BUILDERS","PENNINGTON, ADA M","SCISCOE, R L ETAL","THOMPSON, JEANA J","ZOOK, ALISON"}
correlation       | 0.0127368
(1 row)

**********
3) results of the relpages, reltuples query
**********

develop=# select relpages, reltuples from pg_class where relname = 'actor' ;
 relpages | reltuples
----------+-----------
     7106 |    436871
(1 row)

------------------------------------------------------------------------
Ray Ontko                rayo@ontko.com
Phone 1.765.935.4283     Fax 1.765.962.9788
Ray Ontko & Co.          Software Consulting Services     http://www.ontko.com/
Ray Ontko <rayo@ontko.com> writes:
> 1) here's the "bad" stats.
> 2) here's the "good" stats.
> Note that the information really is different.

Yeah. It seems that the critical difference is the null_frac; if I take
your "bad" values and change null_frac to 0 then I get reasonable-looking
answers, but with null_frac = 0.000333333 I indeed get bad answers. I
suspect null_frac is getting double-counted somewhere, and depending on
just how many nulls ANALYZE happens to find, you might get pushed over the
edge from a small selectivity to zero or negative selectivity. Will dig
further.

			regards, tom lane
Tom,

> Yeah. It seems that the critical difference is the null_frac; if I take
> your "bad" values and change null_frac to 0 then I get reasonable-looking
> answers, but with null_frac = 0.000333333 I indeed get bad answers. I
> suspect null_frac is getting double-counted somewhere, and depending on
> just how many nulls ANALYZE happens to find, you might get pushed over the
> edge from a small selectivity to zero or negative selectivity. Will dig
> further.

Does this help?

develop=# select count(*) from actor where actor_full_name is not null ;
 count
--------
 433809
(1 row)

develop=# select count(*) from actor ;
 count
--------
 433902
(1 row)

Ray

----------------------------------------------------------------------
Ray Ontko                rayo@ontko.com
Phone 1.765.935.4283     Fax 1.765.962.9788
Ray Ontko & Co.          Software Consulting Services     http://www.ontko.com/
Ray Ontko <rayo@ontko.com> writes:
> Does this help?

Well, it says that ANALYZE is not doing anything wrong ;-).

The true null_frac given those numbers is 1.0 - 433809.0/433902.0, or
about 0.000214. In the default ANALYZE sample size (3000 rows) the
statistical expectation would be 0.643 NULLs, so most of the time ANALYZE
is going to find 1 NULL, sometimes no NULLs, and rarely more than 1 NULL.
Most of the time you'll get null_frac = 1/3000 = 0.000333333, sometimes 0,
etc.

The stats themselves look fine to me; I think there must be a mathematical
error in what selfuncs.c is doing with them. Still looking to find it...

			regards, tom lane
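[A quick check of the expectation Tom cites, using the counts Ray posted;
illustrative arithmetic only:]

    select 1.0 - 433809.0/433902.0          as true_null_frac,  -- ~0.000214
           3000 * (1.0 - 433809.0/433902.0) as expected_nulls;  -- ~0.643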
Tom,

> The true null_frac given those numbers is 1.0 - 433809.0/433902.0, or
> about 0.000214. In the default ANALYZE sample size (3000 rows) the
> statistical expectation would be 0.643 NULLs, so most of the time ANALYZE
> is going to find 1 NULL, sometimes no NULLs, and rarely more than 1 NULL.
> Most of the time you'll get null_frac = 1/3000 = 0.000333333, sometimes
> 0, etc.
>
> The stats themselves look fine to me; I think there must be a mathematical
> error in what selfuncs.c is doing with them. Still looking to find it...

Well, that explains the non-determinism, I guess. It seems odd that analyze
would give us different stats if we aren't doing any updates to the data.
How does analyze pick the 3000 rows it gets back, and should this vary from
run to run if we aren't changing the table?

Ray

----------------------------------------------------------------------
Ray Ontko                rayo@ontko.com
Phone 1.765.935.4283     Fax 1.765.962.9788
Ray Ontko & Co.          Software Consulting Services     http://www.ontko.com/
> The stats themselves look fine to me; I think there must be a mathematical
> error in what selfuncs.c is doing with them. Still looking to find it...

Oh, I see it: range selectivity double-excludes NULLs. See the comment for
clauselist_selectivity in src/backend/optimizer/path/clausesel.c. The
individual estimates for the two component comparison operators each
exclude nulls, and when we merge them together we get the wrong answer.

Good catch! (Though I'm surprised no one noticed this before; with a larger
null population the error would be much more obvious.)

I'm running out of time today but will look into a fix later.

			regards, tom lane
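[To make the double-counting concrete, a rough worked example using numbers
from earlier in the thread; the exact matching-row count is an assumption
taken from the good-state estimate. A prefix test like LIKE 'WI%' is
expanded into roughly (actor_full_name >= 'WI' AND actor_full_name < 'WJ');
each inequality's selectivity is scaled by (1 - null_frac), and merging them
as s1 + s2 - 1 yields p*(1 - null_frac) - null_frac instead of the intended
p*(1 - null_frac):]

    -- p = fraction of rows matching 'WI%', roughly 112 of 433902
    -- n = null_frac found by ANALYZE, 1/3000
    select (112.0/433902) * (1 - 1.0/3000)            as intended_estimate,
           (112.0/433902) * (1 - 1.0/3000) - 1.0/3000 as buggy_estimate;
    -- the buggy value comes out slightly negative, so the row count is
    -- clamped to its floor of 1, matching the "rows=1" plans seen above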
Ray Ontko <rayo@ontko.com> writes:
> Well, that explains the non-determinism, I guess. It seems odd that
> analyze would give us different stats if we aren't doing any updates to
> the data. How does analyze pick the 3000 rows it gets back, and should
> this vary from run to run if we aren't changing the table?

At random, and yes. See the documentation ...

			regards, tom lane
Awhile back I said:
> Oh, I see it: range selectivity double-excludes NULLs. See the comment
> for clauselist_selectivity in src/backend/optimizer/path/clausesel.c.
> The individual estimates for the two component comparison operators
> each exclude nulls, and when we merge them together we get the wrong
> answer.

Just FYI, I have fixed this for 7.3.

			regards, tom lane