Thread: Deceiding which index to use

Deceiding which index to use

From
Mezei Zoltán
Date:
<font size="-1"><font face="Tahoma">Hi!<br /><br /> I have two tables with some indices on them:<br /><br /> CREATE
TABLEsubscriber<br /> (<br />   id serial NOT NULL,<br />   anumber character varying(32) NOT NULL,<br />   CONSTRAINT
subscriber_pkPRIMARY KEY (id)<br /> ) <br /><br /> CREATE INDEX anumber_idx_numeric<br />   ON subscriber<br />   USING
btree<br/>   (anumber::numeric);<br /><br /> CREATE TABLE output_message_log<br /> (<br />   id serial NOT NULL,<br />
 subscriber_id integer NOT NULL,<br />   crd timestamp without time zone NOT NULL DEFAULT now(),<br />   CONSTRAINT
output_message_log_pkPRIMARY KEY (id),<br />   CONSTRAINT subscriber_fk FOREIGN KEY (subscriber_id)<br />      
REFERENCESsubscriber (id) MATCH SIMPLE<br />       ON UPDATE NO ACTION ON DELETE NO ACTION,<br /> ) <br /><br /> CREATE
INDEXcrd_idx<br />   ON output_message_log<br />   USING btree<br />   (crd);<br /><br /> CREATE INDEX
subscriber_id_idx<br/>   ON output_message_log<br />   USING btree<br />   (subscriber_id);<br /><br /> I would like to
runa query like this one:<br /><br /> select l.id<br /> from output_message_log l join subscriber s on l.subscriber_id
=s.id <br /> where s.anumber::numeric = 5555555555<br /> order by l.crd desc<br /> limit 41<br /> offset 20<br /><br />
Thething I do not understand is why postgresql wants to use crd_idx:<br /><br /> "Limit  (cost=4848.58..14788.18
rows=41width=12) (actual time=7277.115..8583.814 rows=41 loops=1)"<br /> "  ->  Nested Loop  (cost=0.00..1195418.42
rows=4931width=12) (actual time=92.083..8583.713 rows=61 loops=1)"<br /> "        ->  Index Scan Backward using
crd_idxon output_message_log l  (cost=0.00..17463.80 rows=388646 width=16) (actual time=0.029..975.095 rows=271447
loops=1)"<br/> "        ->  Index Scan using subscriber_pk on subscriber s  (cost=0.00..3.02 rows=1 width=4) (actual
time=0.026..0.026rows=0 loops=271447)"<br /> "              Index Cond: ("outer".subscriber_id = s.id)"<br />
"             Filter: ((anumber)::numeric = 36308504669::numeric)"<br /> "Total runtime: 8584.016 ms"<br /><br /> I
wouldlike postgresql to use </font></font><font size="-1"><font face="Tahoma">subscriber_id_idx which resulst in a far
lessexecution time on this database.<br /><br /> I tried to lower random_page_cost, but that didn't help as an index is
alreadyused, just not the "good" one.<br /><br /> Could you please comment on this issue and suggest some possible
soulutions?<br/><br /> Thanks,<br /><br /> Zizi<br /></font></font><font size="-1"><font face="Tahoma"><br
/></font></font>

Re: Deceiding which index to use

From
Richard Huxton
Date:
Mezei Zoltán wrote:
> Hi!
>
> I have two tables with some indices on them:
>
> CREATE TABLE subscriber
> (
>   id serial NOT NULL,
>   anumber character varying(32) NOT NULL,
>   CONSTRAINT subscriber_pk PRIMARY KEY (id)
> )
>
> CREATE INDEX anumber_idx_numeric
>   ON subscriber
>   USING btree
>   (anumber::numeric);

> I would like to run a query like this one:
>
> select l.id
> from output_message_log l join subscriber s on l.subscriber_id = s.id
> where s.anumber::numeric = 5555555555
> order by l.crd desc
> limit 41
> offset 20

Q1. Why are you storing a numeric in a varchar?
Q2. How many unique values does anumber have? And how many rows in
subscriber?
Q3. What happens if you create the index on plain (anumber) and then
test against '555555555'?

--
   Richard Huxton
   Archonet Ltd

Re: Deceiding which index to use

From
Mezei Zoltán
Date:
<small><font face="Tahoma">Richard Huxton wrote:<br /> > Mezei Zoltán wrote:<br /> > Q1. Why are you storing a
numericin a varchar?<br /><br /> Because it's not always numeric info. :/<br /><br /> > Q2. How many unique values
doesanumber have? And how many rows in<br /> > subscriber?<br /><br /> About 10k distinct anumbers and 20k rows.
Nothingspecial...<br /><br /> > Q3. What happens if you create the index on plain (anumber) and then<br /> > test
against'555555555'?<br /><br /> Nothing, everything is the same - the problem lies on the other table's index usage,
usingthis index is fine.<br /><br /> Zizi</font><font face="Tahoma"></font><br /></small> 

Re: Deceiding which index to use

From
Richard Huxton
Date:
Mezei Zoltán wrote:
> Richard Huxton wrote:
>  > Mezei Zoltán wrote:
>  > Q1. Why are you storing a numeric in a varchar?
>
> Because it's not always numeric info. :/
>
>  > Q2. How many unique values does anumber have? And how many rows in
>  > subscriber?
>
> About 10k distinct anumbers and 20k rows. Nothing special...

And does the planner know that?
SELECT * FROM pg_stats WHERE tablename='subscriber' AND attname='anumber';
It's the n_distinct you're interested in, and perhaps most_common_freqs.

>  > Q3. What happens if you create the index on plain (anumber) and then
>  > test against '555555555'?
>
> Nothing, everything is the same - the problem lies on the other table's index
> usage, using this index is fine.

The planner has to guess how many matches it will have for
subscriber=5555555. Based on that choice, it will either:
   a. Do the join, then find the highest crd values (sort)
   b. Scan the crd values backwards and then join
It's chosen (b) because it's estimating the numbers of matches
incorrectly. I'm wondering whether the system can't see through your
function-call (the cast to numeric) to determine how many matches it's
going to get for any given value.

If the system can't be persuaded into getting its estimates more
accurate, it might be worth trying an index on (subscriber_id,crd) and
dropping the index on (crd) - if that's reasonable for your query patterns.

--
   Richard Huxton
   Archonet Ltd

Re: Deceiding which index to use

From
Mezei Zoltán
Date:
Richard Huxton wrote: <blockquote cite="mid:45F1775A.8030701@archonet.com" type="cite"></blockquote><p><font
size="2">Anddoes the planner know that?<br /> SELECT * FROM pg_stats WHERE tablename='subscriber' AND
attname='anumber';<br/> It's the n_distinct you're interested in, and perhaps most_common_freqs.</font><br
/><small>n_distinctis -0.359322 and most_common_vals contains about 10 different anumbers (which are corretct),
most_common_freqsare between 0.01 and 0.001. What does n_distinct exactly mean? Why is it negative?</small><br
/><blockquotecite="mid:45F1775A.8030701@archonet.com" type="cite"><p><font size="2">> Nothing, everything is the
same- the problem lies on the other table's index<br /> > usage, using this index is fine.<br /><br /> The planner
hasto guess how many matches it will have for<br /> subscriber=5555555. Based on that choice, it will either:<br />   
a.Do the join, then find the highest crd values (sort)<br />    b. Scan the crd values backwards and then join<br />
It'schosen (b) because it's estimating the numbers of matches<br /> incorrectly. I'm wondering whether the system can't
seethrough your<br /> function-call (the cast to numeric) to determine how many matches it's<br /> going to get for any
givenvalue.<br /></font></blockquote><small>It can see through the cast - I have just tried to create the same database
omittingthe non-numeric anumbers and the results are the same.</small><br /><blockquote
cite="mid:45F1775A.8030701@archonet.com"type="cite"><p><font size="2">If the system can't be persuaded into getting its
estimatesmore<br /> accurate, it might be worth trying an index on (subscriber_id,crd) and<br /> dropping the index on
(crd)- if that's reasonable for your query patterns.<br /></font></blockquote><small>I'll try that one if the negative
n_distinctvalue can be a correct one :-)<br /><br /> Zizi</small><br /> 

Re: Deceiding which index to use

From
Richard Huxton
Date:
Mezei Zoltán wrote:
> Richard Huxton wrote:
>>
>> And does the planner know that?
>> SELECT * FROM pg_stats WHERE tablename='subscriber' AND attname='anumber';
>> It's the n_distinct you're interested in, and perhaps most_common_freqs.
>>
> n_distinct is -0.359322 and most_common_vals contains about 10 different
> anumbers (which are corretct), most_common_freqs are between 0.01 and 0.001.
> What does n_distinct exactly mean? Why is it negative?

It's saying that it's a ratio, so if you doubled the number of
subscribers it would expect that the number of unique anumber's would
double too. So you've got about 36% of the rows with unique values -
pretty much what you said earlier. That's not bad, since the planner
only uses an estimate.

OK - so the next place to look is the distribution of values for
subscriber_id on the output_message_log. Does that have some subscribers
with many rows and lots with hardly any? If so, you might need to
increase the stats on that column:

ALTER TABLE output_message_log ALTER COLUMN subscriber_id SET STATISTICS
<num>;
ANALYSE output_message_log (subscriber_id);

The <num> defaults to 10, but can be set as high as 1000. You want to
try and capture the "big" subscribers.

--
   Richard Huxton
   Archonet Ltd

Re: Deceiding which index to use

From
Mezei Zoltán
Date:
Richard Huxton wrote: <blockquote cite="mid:45F17B70.60001@archonet.com" type="cite"></blockquote><p><font size="2">OK
-so the next place to look is the distribution of values for<br /> subscriber_id on the output_message_log. Does that
havesome subscribers<br /> with many rows and lots with hardly any?</font><small>Hmm... There are about 1.5k
subscriberswith 100-200 messages each - all the other 19k has an average of 8.9 messages, most of them having only 1
message.I think that's exactly the situation you mention...</small><br /><blockquote
cite="mid:45F17B70.60001@archonet.com"type="cite"><p><font size="2">If so, you might need to<br /> increase the stats
onthat column:<br /><br /> ALTER TABLE output_message_log ALTER COLUMN subscriber_id SET STATISTICS<br />
<num>;<br/> ANALYSE output_message_log (subscriber_id);<br /><br /> The <num> defaults to 10, but can be
setas high as 1000. You want to<br /> try and capture the "big" subscribers.<br /></font></blockquote><small>So if I'm
correct:this statistics gathering can be fine tuned, and if i set the <num> to 1000 then not only the first 10
subsribers(with most messages) will be stored in pg_stats, but the first 1000? Is 1000 a hard-coded
highest-possible-value?I think it would be best to set that to simething like 1800-1900 as I have about that many
subsciberswith high message count.<br /><br /> Zizi<br /></small> 

Re: Deceiding which index to use

From
Richard Huxton
Date:
Mezei Zoltán wrote:
> Richard Huxton wrote:
>>
>> OK - so the next place to look is the distribution of values for
>> subscriber_id on the output_message_log. Does that have some subscribers
>> with many rows and lots with hardly any?
>>
> Hmm... There are about 1.5k subscribers with 100-200 messages each - all the
> other 19k has an average of 8.9 messages, most of them having only 1 message. I
> think that's exactly the situation you mention...

[snip alter table ... set statistics]

> So if I'm correct: this statistics gathering can be fine tuned, and if i set the
> <num> to 1000 then not only the first 10 subsribers (with most messages) will be
> stored in pg_stats, but the first 1000? Is 1000 a hard-coded
> highest-possible-value? I think it would be best to set that to simething like
> 1800-1900 as I have about that many subscibers with high message count.

There is a cost to increasing the stats values, otherwise it'd already
be set at 1000. In your case I'm not sure if 100-200 vs 8-9 messages is
enough to skew things. Only one way to find out...

--
   Richard Huxton
   Archonet Ltd

Re: Deceiding which index to use

From
Alvaro Herrera
Date:
Mezei Zoltán wrote:
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>
>   <meta content="text/html;charset=UTF-8" http-equiv="Content-Type">
> </head>
> <body bgcolor="#ffffff" text="#000000">
> Richard Huxton wrote:
> <blockquote cite="mid:45F17B70.60001@archonet.com" type="cite">
>   <meta http-equiv="Content-Type" content="text/html; ">
>   <meta name="Generator" content="MS Exchange Server version 6.5.7226.0">
>   <title>Re: [PERFORM] Deceiding which index to use</title>
> <!-- Converted from text/plain format -->

Would you mind instructing your mail system to skip converting your
text/plain messages to HTML?  It's kind of annoying for some of us.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Deceiding which index to use

From
Mezei Zoltán
Date:
<small>Richard Huxton wrote:</small><blockquote cite="mid:45F182B7.30507@archonet.com"
type="cite"></blockquote><p><fontsize="2">There is a cost to increasing the stats values, otherwise it'd already<br />
beset at 1000. In your case I'm not sure if 100-200 vs 8-9 messages is<br /> enough to skew things. Only one way to
findout...<br /></font><small>Well, I tried. The situation is:<br /><br /> - when I look for a subscriber which the
plannerthinks to have more messages --> it uses the index on crd<br /> - when I look for a subscriber which the
plannerthinks to have less messages --> it uses the index on subscriber_id<br /><br /> I think that's OK, even if it
stilldon't do what I really would like to see it do :-)<br /><br /> Thanks for all your help, maybe I know a little bit
moreabout poostgresql than before.<br /><br /> Zizi</small><br /> 

Re: Deceiding which index to use

From
Mezei Zoltán
Date:
Alvaro Herrera wrote:
>
> Would you mind instructing your mail system to skip converting your
> text/plain messages to HTML?  It's kind of annoying for some of us.
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>
Sorry about that - is this message OK now?

Zizi

Re: Deceiding which index to use

From
Richard Huxton
Date:
Mezei Zoltán wrote:
> Alvaro Herrera wrote:
>>
>> Would you mind instructing your mail system to skip converting your
>> text/plain messages to HTML?  It's kind of annoying for some of us.
>> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>>
> Sorry about that - is this message OK now?

That's done the trick.

--
   Richard Huxton
   Archonet Ltd

Re: Deceiding which index to use

From
Richard Huxton
Date:
Mezei Zoltán wrote:
> Richard Huxton wrote:
>>
>> There is a cost to increasing the stats values, otherwise it'd
>> already be set at 1000. In your case I'm not sure if 100-200 vs 8-9
>> messages is enough to skew things. Only one way to find out...
>>
> Well, I tried. The situation is:
>
> - when I look for a subscriber which the planner thinks to have more
> messages --> it uses the index on crd - when I look for a subscriber
> which the planner thinks to have less messages --> it uses the index
> on subscriber_id
>
> I think that's OK, even if it still don't do what I really would like
> to see it do :-)

Well, it's decision about where the costs cross over will be in the
other "cost" settings in postgresql.conf. The thing is, it's not worth
tuning for just this one query, because you can easily end up running
everything else more slowly.

> Thanks for all your help, maybe I know a little bit more about
> poostgresql than before.


--
   Richard Huxton
   Archonet Ltd