Home > mailing lists

Re: Odd problem with performance in duplicate database - Mailing list pgsql-performance

From	Tom Lane
Subject	Re: Odd problem with performance in duplicate database
Date	August 12, 2003 00:30:06
Msg-id	27573.1060648162@sss.pgh.pa.us Whole thread Raw
In response to	Re: Odd problem with performance in duplicate database ("Peter Darley" <pdarley@kinesis-cem.com>)
Responses	Re: Odd problem with performance in duplicate database Re: Odd problem with performance in duplicate database
List	pgsql-performance

Tree view

Josh Berkus <josh@agliodbs.com> writes:
> Tom,
>> Okay, here's our problem:
>>
>> live DB: tgroup_id  n_distinct = -1
>>
>> test DN: tgroup_id  n_distinct = 11
>>
>> The former estimate actually means that it thinks tgroup_id is a unique
>> column, whereas the latter says there are only 11 distinct values in the
>> column.  I assume the former is much nearer to the truth (how many rows
>> in cases, and how many distinct tgroup_id values)?

> The real case is that there are 113 distinct tgroup_ids, which cover
> about 10% of the population of cases.  The other 90% is NULL.  The
> average tgroup_id is shared between 4.7 cases.

> So this seems like sampling error.

Partly.  The numbers suggest that in ANALYZE's default sample of 3000
rows, it's only finding about a dozen non-null tgroup_ids (yielding the
0.996 null_frac value); and that in one case all dozen are different and
in the other case there are two duplicates.  It would help if you
boosted the stats target for this column by a factor of 10.  (You can
do that in 7.2, btw --- IIRC the only problem is that a pg_dump won't
show that you did so.)

But the other part of the problem is that in 7.2, the join selectivity
estimator is way off when you are joining a unique column (like the pkey
on the other side) to a column with a very large fraction of nulls.
We only discovered this recently; it's fixed as of 7.3.3:

2003-04-15 01:18  tgl

    * src/backend/utils/adt/selfuncs.c (REL7_3_STABLE): eqjoinsel's
    logic for case where MCV lists are not present should account for
    NULLs; in hindsight this is obvious since the code for the
    MCV-lists case would reduce to this when there are zero entries in
    both lists.  Per example from Alec Mitchell.

Possibly you could backpatch that into 7.2, although I'd think an update
to 7.3.4 would be a more profitable use of time.

            regards, tom lane

pgsql-performance by date:

From: Tom Lane
Date: 12 August 2003, 00:14:52
Subject: Re: Odd problem with performance in duplicate database

From: "Christopher Kings-Lynne"
Date: 12 August 2003, 00:44:06
Subject: Re: Perfomance Tuning

Re: Odd problem with performance in duplicate database - Mailing list pgsql-performance

Previous

Next