Thread: ANALYZE versus expression indexes with nondefault opckeytype

ANALYZE versus expression indexes with nondefault opckeytype

From
Tom Lane
Date:
I've been poking at the FTS problem example recently submitted by
Artur Dabrowski.  It contains an index declared like so:

CREATE INDEX idx_keywords_ger ON search_tabUSING gin (to_tsvector('german'::regconfig, keywords));

which can be queried like so

select * from search_tab where to_tsvector('german', keywords) @@ to_tsquery('german', 'whatever');

which is all fine and matches suggestions in our documentation.  But I
was dismayed to discover that ANALYZE wasn't storing any statistics
for this expression index, which rather hobbles the planner's ability
to produce useful estimates for the selectivity of such WHERE clauses.

The reason for that turns out to be this bit in analyze.c:
                       /*                        * Can't analyze if the opclass uses a storage type
  * different from the expression result type. We'd get                        * confused because the type shown in
pg_attributefor                        * the index column doesn't match what we are getting                        *
fromthe expression. Perhaps this can be fixed                        * someday, but for now, punt.
 */                       if (exprType(indexkey) !=                           Irel[ind]->rd_att->attrs[i]->atttypid)
                      continue;
 

Since a tsvector index does have a storage type different from tsvector
(in fact, it's shown as TEXT), this code decides to punt and not analyze
the column.  I think it's past time we did something about this.

After a bit of study of the code, it appears to me that it's not too
difficult to fix: we just have to use the expression's result type
rather than the index column's atttypid in the subsequent processing.
ANALYZE never actually looks at the index column contents (indeed
cannot, since our index AMs provide no API for extracting the contents),
so atttypid isn't really all that relevant to it.  However, this'll
imply an API/ABI break for custom typanalyze functions: we'll need to
store the actual type OID in struct VacAttrStats, and use that rather
than looking to attr->atttypid or any of the other type-dependent
fields of the Form_pg_attribute.  So that seems to make it not practical
to back-patch.

I thought of an ugly hack that would avoid an API/ABI break: since
VacAttrStats.attr is a copy of the pg_attribute data, we could scribble
on it, changing atttypid, attlen, attbyval, etc to match the index
expression result type.  This seems pretty grotty, but it would allow
the fix to be back-patched into existing branches.

Comments?  I'm not sure which way to jump here.
        regards, tom lane


Re: ANALYZE versus expression indexes with nondefault opckeytype

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> After a bit of study of the code, it appears to me that it's not too
> difficult to fix: we just have to use the expression's result type
> rather than the index column's atttypid in the subsequent processing.
> ANALYZE never actually looks at the index column contents (indeed
> cannot, since our index AMs provide no API for extracting the contents),

I don't think it'll be an issue, but just in case- is there any reason
this will cause trouble for the possible index-only quals/scans work?

> So that seems to make it not practical to back-patch.

But we could get it into 9.0..

> I thought of an ugly hack that would avoid an API/ABI break: since
> VacAttrStats.attr is a copy of the pg_attribute data, we could scribble
> on it, changing atttypid, attlen, attbyval, etc to match the index
> expression result type.  This seems pretty grotty, but it would allow
> the fix to be back-patched into existing branches.

Yeah, that's rather nasty. :/

> Comments?  I'm not sure which way to jump here.

For my 2c- let's get it fixed for 9.0 cleanly and encourage people who
run into this to upgrade to that once it's released.  Perhaps I've
missed it, but I don't recall seeing many complaints about this.
Thanks,
    Stephen

Re: ANALYZE versus expression indexes with nondefault opckeytype

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> After a bit of study of the code, it appears to me that it's not too
>> difficult to fix: we just have to use the expression's result type
>> rather than the index column's atttypid in the subsequent processing.
>> ANALYZE never actually looks at the index column contents (indeed
>> cannot, since our index AMs provide no API for extracting the contents),

> I don't think it'll be an issue, but just in case- is there any reason
> this will cause trouble for the possible index-only quals/scans work?

Not that I can see.  What ANALYZE and the planner want to do is provide
stats on the behavior of the indexed expression.  The fact that certain
index opclasses don't actually store that value is of no interest.  Even
if it did become of interest later on, we're under no compulsion to not
redefine the contents of pg_statistic when we need to.

>> So that seems to make it not practical to back-patch.

> But we could get it into 9.0..

Hmm.  I was thinking it was a bit late for 9.0 too, since it'd
invalidate any testing anyone's done of their custom typanalyze
functions.  However, since we're still only in beta, maybe that's
acceptable.

>> Comments?  I'm not sure which way to jump here.

> For my 2c- let's get it fixed for 9.0 cleanly and encourage people who
> run into this to upgrade to that once it's released.  Perhaps I've
> missed it, but I don't recall seeing many complaints about this.

It's only been very recently that we had any useful stats capability
that could apply in such situations (in fact I think we still haven't
actually shipped a non-bogus version of tsvector typanalyze :-().
So I'm not sure anyone would have realized they were missing something.
But it would be nice to have this working in 8.4, and I'll be quite
unhappy if we don't have it in 9.0.

If people feel it's not too late for an ABI break in 9.0 then I'm
willing to compromise on backpatching the "clean" fix.  I'm not entirely
convinced it's that much cleaner than the hack solution though.
        regards, tom lane


Re: ANALYZE versus expression indexes with nondefault opckeytype

From
Robert Haas
Date:
On Jul 31, 2010, at 11:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's only been very recently that we had any useful stats capability
> that could apply in such situations (in fact I think we still haven't
> actually shipped a non-bogus version of tsvector typanalyze :-().
> So I'm not sure anyone would have realized they were missing something.
> But it would be nice to have this working in 8.4, and I'll be quite
> unhappy if we don't have it in 9.0.
>
> If people feel it's not too late for an ABI break in 9.0 then I'm
> willing to compromise on backpatching the "clean" fix.  I'm not entirely
> convinced it's that much cleaner than the hack solution though.

I think this whole discussion is starting with the wrong premise. This is not a bug fix; therefore, it's 9.1 material.
Ifanyone else had suggested slipping this into 9.0, let alone 8.4, we would have punted it in a heartbeat. 

...Robert



Re: ANALYZE versus expression indexes with nondefault opckeytype

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think this whole discussion is starting with the wrong premise. This
> is not a bug fix; therefore, it's 9.1 material.

Failing to store stats isn't a bug?
        regards, tom lane


Re: ANALYZE versus expression indexes with nondefault opckeytype

From
Robert Haas
Date:
On Sat, Jul 31, 2010 at 12:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think this whole discussion is starting with the wrong premise. This
>> is not a bug fix; therefore, it's 9.1 material.
>
> Failing to store stats isn't a bug?

Well, it kind of sounds more like you're removing a known limitation
than fixing a bug.  It's not as if the behavior fails to match the
comment.  I'm pretty hesitant to see us making any changes to 9.0 that
aren't necessary to fix existing bugs or new regressions.  What I want
to do with 9.0 is get it stable and ship it.  I'm not really terribly
concerned about the possibility of an ABI break even at this late
date, but I *am* concerned about the possibility either of (1)
unforeseen consequences necessitating further patching or (2) getting
distracted from the business of getting the release out the door.
We've been in feature freeze for more than five months, so I think
it's certainly time try to reduce to an absolute minimum the number of
changes that "need" to be made before release.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company