Re: Odd problem with performance in duplicate database - Mailing list pgsql-performance

From Tom Lane
Subject Re: Odd problem with performance in duplicate database
Date
Msg-id 27573.1060648162@sss.pgh.pa.us
Whole thread Raw
In response to Re: Odd problem with performance in duplicate database  ("Peter Darley" <pdarley@kinesis-cem.com>)
Responses Re: Odd problem with performance in duplicate database  (Josh Berkus <josh@agliodbs.com>)
Re: Odd problem with performance in duplicate database  (Josh Berkus <josh@agliodbs.com>)
List pgsql-performance
Josh Berkus <josh@agliodbs.com> writes:
> Tom,
>> Okay, here's our problem:
>>
>> live DB: tgroup_id  n_distinct = -1
>>
>> test DN: tgroup_id  n_distinct = 11
>>
>> The former estimate actually means that it thinks tgroup_id is a unique
>> column, whereas the latter says there are only 11 distinct values in the
>> column.  I assume the former is much nearer to the truth (how many rows
>> in cases, and how many distinct tgroup_id values)?

> The real case is that there are 113 distinct tgroup_ids, which cover
> about 10% of the population of cases.  The other 90% is NULL.  The
> average tgroup_id is shared between 4.7 cases.

> So this seems like sampling error.

Partly.  The numbers suggest that in ANALYZE's default sample of 3000
rows, it's only finding about a dozen non-null tgroup_ids (yielding the
0.996 null_frac value); and that in one case all dozen are different and
in the other case there are two duplicates.  It would help if you
boosted the stats target for this column by a factor of 10.  (You can
do that in 7.2, btw --- IIRC the only problem is that a pg_dump won't
show that you did so.)

But the other part of the problem is that in 7.2, the join selectivity
estimator is way off when you are joining a unique column (like the pkey
on the other side) to a column with a very large fraction of nulls.
We only discovered this recently; it's fixed as of 7.3.3:

2003-04-15 01:18  tgl

    * src/backend/utils/adt/selfuncs.c (REL7_3_STABLE): eqjoinsel's
    logic for case where MCV lists are not present should account for
    NULLs; in hindsight this is obvious since the code for the
    MCV-lists case would reduce to this when there are zero entries in
    both lists.  Per example from Alec Mitchell.

Possibly you could backpatch that into 7.2, although I'd think an update
to 7.3.4 would be a more profitable use of time.

            regards, tom lane

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Odd problem with performance in duplicate database
Next
From: "Christopher Kings-Lynne"
Date:
Subject: Re: Perfomance Tuning