Thread: Deceiding which index to use
<font size="-1"><font face="Tahoma">Hi!<br /><br /> I have two tables with some indices on them:<br /><br /> CREATE TABLEsubscriber<br /> (<br /> id serial NOT NULL,<br /> anumber character varying(32) NOT NULL,<br /> CONSTRAINT subscriber_pkPRIMARY KEY (id)<br /> ) <br /><br /> CREATE INDEX anumber_idx_numeric<br /> ON subscriber<br /> USING btree<br/> (anumber::numeric);<br /><br /> CREATE TABLE output_message_log<br /> (<br /> id serial NOT NULL,<br /> subscriber_id integer NOT NULL,<br /> crd timestamp without time zone NOT NULL DEFAULT now(),<br /> CONSTRAINT output_message_log_pkPRIMARY KEY (id),<br /> CONSTRAINT subscriber_fk FOREIGN KEY (subscriber_id)<br /> REFERENCESsubscriber (id) MATCH SIMPLE<br /> ON UPDATE NO ACTION ON DELETE NO ACTION,<br /> ) <br /><br /> CREATE INDEXcrd_idx<br /> ON output_message_log<br /> USING btree<br /> (crd);<br /><br /> CREATE INDEX subscriber_id_idx<br/> ON output_message_log<br /> USING btree<br /> (subscriber_id);<br /><br /> I would like to runa query like this one:<br /><br /> select l.id<br /> from output_message_log l join subscriber s on l.subscriber_id =s.id <br /> where s.anumber::numeric = 5555555555<br /> order by l.crd desc<br /> limit 41<br /> offset 20<br /><br /> Thething I do not understand is why postgresql wants to use crd_idx:<br /><br /> "Limit (cost=4848.58..14788.18 rows=41width=12) (actual time=7277.115..8583.814 rows=41 loops=1)"<br /> " -> Nested Loop (cost=0.00..1195418.42 rows=4931width=12) (actual time=92.083..8583.713 rows=61 loops=1)"<br /> " -> Index Scan Backward using crd_idxon output_message_log l (cost=0.00..17463.80 rows=388646 width=16) (actual time=0.029..975.095 rows=271447 loops=1)"<br/> " -> Index Scan using subscriber_pk on subscriber s (cost=0.00..3.02 rows=1 width=4) (actual time=0.026..0.026rows=0 loops=271447)"<br /> " Index Cond: ("outer".subscriber_id = s.id)"<br /> " Filter: ((anumber)::numeric = 36308504669::numeric)"<br /> "Total runtime: 8584.016 ms"<br /><br /> I wouldlike postgresql to use </font></font><font size="-1"><font face="Tahoma">subscriber_id_idx which resulst in a far lessexecution time on this database.<br /><br /> I tried to lower random_page_cost, but that didn't help as an index is alreadyused, just not the "good" one.<br /><br /> Could you please comment on this issue and suggest some possible soulutions?<br/><br /> Thanks,<br /><br /> Zizi<br /></font></font><font size="-1"><font face="Tahoma"><br /></font></font>
Mezei Zoltán wrote: > Hi! > > I have two tables with some indices on them: > > CREATE TABLE subscriber > ( > id serial NOT NULL, > anumber character varying(32) NOT NULL, > CONSTRAINT subscriber_pk PRIMARY KEY (id) > ) > > CREATE INDEX anumber_idx_numeric > ON subscriber > USING btree > (anumber::numeric); > I would like to run a query like this one: > > select l.id > from output_message_log l join subscriber s on l.subscriber_id = s.id > where s.anumber::numeric = 5555555555 > order by l.crd desc > limit 41 > offset 20 Q1. Why are you storing a numeric in a varchar? Q2. How many unique values does anumber have? And how many rows in subscriber? Q3. What happens if you create the index on plain (anumber) and then test against '555555555'? -- Richard Huxton Archonet Ltd
<small><font face="Tahoma">Richard Huxton wrote:<br /> > Mezei Zoltán wrote:<br /> > Q1. Why are you storing a numericin a varchar?<br /><br /> Because it's not always numeric info. :/<br /><br /> > Q2. How many unique values doesanumber have? And how many rows in<br /> > subscriber?<br /><br /> About 10k distinct anumbers and 20k rows. Nothingspecial...<br /><br /> > Q3. What happens if you create the index on plain (anumber) and then<br /> > test against'555555555'?<br /><br /> Nothing, everything is the same - the problem lies on the other table's index usage, usingthis index is fine.<br /><br /> Zizi</font><font face="Tahoma"></font><br /></small>
Mezei Zoltán wrote: > Richard Huxton wrote: > > Mezei Zoltán wrote: > > Q1. Why are you storing a numeric in a varchar? > > Because it's not always numeric info. :/ > > > Q2. How many unique values does anumber have? And how many rows in > > subscriber? > > About 10k distinct anumbers and 20k rows. Nothing special... And does the planner know that? SELECT * FROM pg_stats WHERE tablename='subscriber' AND attname='anumber'; It's the n_distinct you're interested in, and perhaps most_common_freqs. > > Q3. What happens if you create the index on plain (anumber) and then > > test against '555555555'? > > Nothing, everything is the same - the problem lies on the other table's index > usage, using this index is fine. The planner has to guess how many matches it will have for subscriber=5555555. Based on that choice, it will either: a. Do the join, then find the highest crd values (sort) b. Scan the crd values backwards and then join It's chosen (b) because it's estimating the numbers of matches incorrectly. I'm wondering whether the system can't see through your function-call (the cast to numeric) to determine how many matches it's going to get for any given value. If the system can't be persuaded into getting its estimates more accurate, it might be worth trying an index on (subscriber_id,crd) and dropping the index on (crd) - if that's reasonable for your query patterns. -- Richard Huxton Archonet Ltd
Richard Huxton wrote: <blockquote cite="mid:45F1775A.8030701@archonet.com" type="cite"></blockquote><p><font size="2">Anddoes the planner know that?<br /> SELECT * FROM pg_stats WHERE tablename='subscriber' AND attname='anumber';<br/> It's the n_distinct you're interested in, and perhaps most_common_freqs.</font><br /><small>n_distinctis -0.359322 and most_common_vals contains about 10 different anumbers (which are corretct), most_common_freqsare between 0.01 and 0.001. What does n_distinct exactly mean? Why is it negative?</small><br /><blockquotecite="mid:45F1775A.8030701@archonet.com" type="cite"><p><font size="2">> Nothing, everything is the same- the problem lies on the other table's index<br /> > usage, using this index is fine.<br /><br /> The planner hasto guess how many matches it will have for<br /> subscriber=5555555. Based on that choice, it will either:<br /> a.Do the join, then find the highest crd values (sort)<br /> b. Scan the crd values backwards and then join<br /> It'schosen (b) because it's estimating the numbers of matches<br /> incorrectly. I'm wondering whether the system can't seethrough your<br /> function-call (the cast to numeric) to determine how many matches it's<br /> going to get for any givenvalue.<br /></font></blockquote><small>It can see through the cast - I have just tried to create the same database omittingthe non-numeric anumbers and the results are the same.</small><br /><blockquote cite="mid:45F1775A.8030701@archonet.com"type="cite"><p><font size="2">If the system can't be persuaded into getting its estimatesmore<br /> accurate, it might be worth trying an index on (subscriber_id,crd) and<br /> dropping the index on (crd)- if that's reasonable for your query patterns.<br /></font></blockquote><small>I'll try that one if the negative n_distinctvalue can be a correct one :-)<br /><br /> Zizi</small><br />
Mezei Zoltán wrote: > Richard Huxton wrote: >> >> And does the planner know that? >> SELECT * FROM pg_stats WHERE tablename='subscriber' AND attname='anumber'; >> It's the n_distinct you're interested in, and perhaps most_common_freqs. >> > n_distinct is -0.359322 and most_common_vals contains about 10 different > anumbers (which are corretct), most_common_freqs are between 0.01 and 0.001. > What does n_distinct exactly mean? Why is it negative? It's saying that it's a ratio, so if you doubled the number of subscribers it would expect that the number of unique anumber's would double too. So you've got about 36% of the rows with unique values - pretty much what you said earlier. That's not bad, since the planner only uses an estimate. OK - so the next place to look is the distribution of values for subscriber_id on the output_message_log. Does that have some subscribers with many rows and lots with hardly any? If so, you might need to increase the stats on that column: ALTER TABLE output_message_log ALTER COLUMN subscriber_id SET STATISTICS <num>; ANALYSE output_message_log (subscriber_id); The <num> defaults to 10, but can be set as high as 1000. You want to try and capture the "big" subscribers. -- Richard Huxton Archonet Ltd
Richard Huxton wrote: <blockquote cite="mid:45F17B70.60001@archonet.com" type="cite"></blockquote><p><font size="2">OK -so the next place to look is the distribution of values for<br /> subscriber_id on the output_message_log. Does that havesome subscribers<br /> with many rows and lots with hardly any?</font><small>Hmm... There are about 1.5k subscriberswith 100-200 messages each - all the other 19k has an average of 8.9 messages, most of them having only 1 message.I think that's exactly the situation you mention...</small><br /><blockquote cite="mid:45F17B70.60001@archonet.com"type="cite"><p><font size="2">If so, you might need to<br /> increase the stats onthat column:<br /><br /> ALTER TABLE output_message_log ALTER COLUMN subscriber_id SET STATISTICS<br /> <num>;<br/> ANALYSE output_message_log (subscriber_id);<br /><br /> The <num> defaults to 10, but can be setas high as 1000. You want to<br /> try and capture the "big" subscribers.<br /></font></blockquote><small>So if I'm correct:this statistics gathering can be fine tuned, and if i set the <num> to 1000 then not only the first 10 subsribers(with most messages) will be stored in pg_stats, but the first 1000? Is 1000 a hard-coded highest-possible-value?I think it would be best to set that to simething like 1800-1900 as I have about that many subsciberswith high message count.<br /><br /> Zizi<br /></small>
Mezei Zoltán wrote: > Richard Huxton wrote: >> >> OK - so the next place to look is the distribution of values for >> subscriber_id on the output_message_log. Does that have some subscribers >> with many rows and lots with hardly any? >> > Hmm... There are about 1.5k subscribers with 100-200 messages each - all the > other 19k has an average of 8.9 messages, most of them having only 1 message. I > think that's exactly the situation you mention... [snip alter table ... set statistics] > So if I'm correct: this statistics gathering can be fine tuned, and if i set the > <num> to 1000 then not only the first 10 subsribers (with most messages) will be > stored in pg_stats, but the first 1000? Is 1000 a hard-coded > highest-possible-value? I think it would be best to set that to simething like > 1800-1900 as I have about that many subscibers with high message count. There is a cost to increasing the stats values, otherwise it'd already be set at 1000. In your case I'm not sure if 100-200 vs 8-9 messages is enough to skew things. Only one way to find out... -- Richard Huxton Archonet Ltd
Mezei Zoltán wrote: > <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> > <html> > <head> > <meta content="text/html;charset=UTF-8" http-equiv="Content-Type"> > </head> > <body bgcolor="#ffffff" text="#000000"> > Richard Huxton wrote: > <blockquote cite="mid:45F17B70.60001@archonet.com" type="cite"> > <meta http-equiv="Content-Type" content="text/html; "> > <meta name="Generator" content="MS Exchange Server version 6.5.7226.0"> > <title>Re: [PERFORM] Deceiding which index to use</title> > <!-- Converted from text/plain format --> Would you mind instructing your mail system to skip converting your text/plain messages to HTML? It's kind of annoying for some of us. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
<small>Richard Huxton wrote:</small><blockquote cite="mid:45F182B7.30507@archonet.com" type="cite"></blockquote><p><fontsize="2">There is a cost to increasing the stats values, otherwise it'd already<br /> beset at 1000. In your case I'm not sure if 100-200 vs 8-9 messages is<br /> enough to skew things. Only one way to findout...<br /></font><small>Well, I tried. The situation is:<br /><br /> - when I look for a subscriber which the plannerthinks to have more messages --> it uses the index on crd<br /> - when I look for a subscriber which the plannerthinks to have less messages --> it uses the index on subscriber_id<br /><br /> I think that's OK, even if it stilldon't do what I really would like to see it do :-)<br /><br /> Thanks for all your help, maybe I know a little bit moreabout poostgresql than before.<br /><br /> Zizi</small><br />
Alvaro Herrera wrote: > > Would you mind instructing your mail system to skip converting your > text/plain messages to HTML? It's kind of annoying for some of us. > PostgreSQL Replication, Consulting, Custom Development, 24x7 support > Sorry about that - is this message OK now? Zizi
Mezei Zoltán wrote: > Alvaro Herrera wrote: >> >> Would you mind instructing your mail system to skip converting your >> text/plain messages to HTML? It's kind of annoying for some of us. >> PostgreSQL Replication, Consulting, Custom Development, 24x7 support >> > Sorry about that - is this message OK now? That's done the trick. -- Richard Huxton Archonet Ltd
Mezei Zoltán wrote: > Richard Huxton wrote: >> >> There is a cost to increasing the stats values, otherwise it'd >> already be set at 1000. In your case I'm not sure if 100-200 vs 8-9 >> messages is enough to skew things. Only one way to find out... >> > Well, I tried. The situation is: > > - when I look for a subscriber which the planner thinks to have more > messages --> it uses the index on crd - when I look for a subscriber > which the planner thinks to have less messages --> it uses the index > on subscriber_id > > I think that's OK, even if it still don't do what I really would like > to see it do :-) Well, it's decision about where the costs cross over will be in the other "cost" settings in postgresql.conf. The thing is, it's not worth tuning for just this one query, because you can easily end up running everything else more slowly. > Thanks for all your help, maybe I know a little bit more about > poostgresql than before. -- Richard Huxton Archonet Ltd