Re: varchar index joins not working? - Mailing list pgsql-performance

From Adam Gundy
Subject Re: varchar index joins not working?
Date
Msg-id 48038E21.50606@starsilk.net
Whole thread Raw
In response to Re: varchar index joins not working?  (Richard Huxton <dev@archonet.com>)
Responses Re: varchar index joins not working?  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: varchar index joins not working?  (Richard Huxton <dev@archonet.com>)
List pgsql-performance
Richard Huxton wrote:
> Adam Gundy wrote:
>> On Thu, Apr 10, 2008 at 8:52 AM, Adam Gundy <adam@starsilk.net> wrote:
>>> Richard Huxton wrote:
>>>> How many distinct values do you have in groups.groupid and
>>> group_access.group_id?
>>>  for the small database (since it shows the same problem):
>>>
>>>  group_access: 280/268
>>>  groups: 2006/139
>>>
>>>  for the large database:
>>>
>>>  group_access: same
>>>  groups: 1712647/140
>>>
>>>  the groupid key is an MD5 hash, so it should be uniformly distributed.
>>> maybe that throws the stats? but, again, char works, varchar doesn't :-(
>>
>> OK, I'm thinking the varchar/char part is not the issue.
>
> Good, because it's not :-)

hmm. unfortunately it did turn out to be (part) of the issue. I've
discovered that mixing char and varchar in a stored procedure does not
coerce the types, and ends up doing seq scans all the time.

changing something like this:

proc x ( y char(32) )
{
    select * from groups where groupid = y
}

into this:

proc x ( y varchar(32) )
{
    select * from groups where groupid = y
}

and suddenly postgres does index lookups in the stored proc... way faster.

>> I *think* that when I analyze using char instead of varchar, it is
>> recording a stat for the large group, but for some reason with varchar
>> doesn't add a stat for that one.
>>
>> so, the real question is, how do I fix this? I can turn the stats way
>> up to 1000, but that doesn't guarantee that I'll get a stat for the
>> large group :-(
>
> Well, by default it will be tracking the 10 most common values (and how
> often they occur). As you say, this can be increased to 1000 (although
> it obviously takes longer to check 1000 rather than 10).
>
> We can have a look at the stats with something like:
> SELECT * FROM pg_stats WHERE tablename='group_access' AND attname='uid';
> You'll be interested in n_distinct, most_common_vals and most_common_freqs.
>
> However, I think the problem may be that PG doesn't track cross-column
> stats, so it doesn't know that a particular uid implies one or more
> particular groupid values.

I doubt we could get stats stable enough for this. the number of groups
will hopefully be much larger at some point.

it's a shame the index entries can't be used to provide information to
the planner, eg a rough count of the number of entries for a given key
(or subset). it would be nice to be able to create eg a counted btree
when you know you have this kind of data as a hint to the planner.

>> can I turn the statistics off completely for this column? I'm guessing
>> that if I can, that will mean it takes a guess based on the number of
>> distinct values in the groups table, which is still large number of
>> records, possibly enough to trigger the seqscan anyway.
>
> No - can't disable stats. Besides, you want it the other way around -
> index scans for all groups except the largest.

actually, disabling seqscan at the server level gives extremely good
response times. I ended up rewriting a few queries that were scanning
the whole group for no good reason, and bitmap index hashing seems to
take care of things nicely.

queries have gone from 30+ seconds to < 0.1 seconds.

>> does postgres have a way of building a 'counted index' that the
>> planner can use for it's record counts? some way of forcibly
>> maintaining a stat for every group?
>
> No, but let's see what's in pg_stats.

no real help there. either it hits the group being read, and does a good
plan, or it doesn't, and tries to seqscan (unless I disable it). even
forcing stats to 1000 only bandaids the situation, given the number of
groups will eventually exceed that..


Attachment

pgsql-performance by date:

Previous
From: Greg Smith
Date:
Subject: Re: shared_buffers performance
Next
From: Tom Lane
Date:
Subject: Re: varchar index joins not working?