Re: [GENERAL] how to get accurate values in pg_statistic - Mailing list pgsql-performance

From Bruce Momjian
Subject Re: [GENERAL] how to get accurate values in pg_statistic
Date
Msg-id 200309071622.h87GMt909846@candle.pha.pa.us
In response to Re: [GENERAL] how to get accurate values in pg_statistic  (Mary Edie Meredith <maryedie@osdl.org>)
List pgsql-performance
I have learned you can use:

    SET seed = 0;

to force identical statistics every time you run ANALYZE.
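
Presumably this works by seeding the random number generator that ANALYZE
uses for row sampling; assuming that is the case, a minimal sketch (the
table name is only a placeholder):

    -- seed the session RNG so ANALYZE draws the same sample every time
    SELECT setseed(0);        -- equivalent to SET seed = 0;
    ANALYZE lineitem;         -- 'lineitem' is just an example table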

---------------------------------------------------------------------------

Mary Edie Meredith wrote:
> I certainly don't claim that it is appropriate to force customers into a
> full analysis, particularly if random sampling versus a full scan of the
> data reveals little to no performance difference in the plans.  Being
> able to sample accurately is _very_ nice for large tables.
>
> For our testing purposes, however, consistent results are extremely
> important.  We have observed that a small difference in one plan for one
> of the 22 queries can cause a difference in the DBT-3 results.  If that
> happens, a small performance difference between runs on two Linux kernels
> may appear to be due to the kernels, when in fact it is due to the plan
> change.
>
> We know that the plans are _exactly_ the same if the data in the
> pg_statistic table is the same from run to run (all other things being
> equal).  So what we need is identical optimizer statistics
> (pg_statistic) for the same table data on each run.
>
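(A quick way to verify that, if it helps: snapshot the human-readable
pg_stats view after each ANALYZE and diff the files between runs.  The
table name and output path below are only placeholders.)

    -- capture per-column statistics for later comparison across runs
    \o /tmp/stats_run1.txt
    SELECT attname, null_frac, n_distinct, most_common_vals, histogram_bounds
      FROM pg_stats
     WHERE tablename = 'lineitem'
     ORDER BY attname;
    \o
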
> I feel certain that the pg_statistic table will be identical from run
> to run if ANALYZE looks at every row.  Hence our hope to find a way to
> force that.
>
> We did runs overnight.  We can confirm that VACUUM FULL ANALYZE does
> not produce the same pg_statistic contents from run to run.  With the
> default default_statistics_target (10), the plans also differ.
>
> We ran additional tests with default_statistics_target set to 1000 (the
> maximum, I believe).  The plans are the same over the different runs,
> but the pg_statistic table still contains different values.  The
> performance results of the runs are consistent (we would expect this
> with the same plans), and the resulting metrics are similar to the best
> plans we see using the default histogram size (good news).
>
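(For reference, the setting under discussion can be raised per session
before re-gathering statistics; a minimal sketch:)

    -- raise the histogram/MCV target for this session, then re-ANALYZE
    SET default_statistics_target = 1000;
    ANALYZE;
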
> However, we worry that one day the cost will change enough for whatever
> reason to cause a plan change, especially for a larger database scale
> factor (database size/row size).
>
> I know we appear to be an isolated case, but customers also do testing
> and may have the same consistency issues we have.  I can also imagine
> cases where customers want to guarantee that plans stay the same
> (between replicated sites, for example).  If two developers are
> analyzing changes to the optimizer, don't you want the costs used for
> testing on their two systems to be identical for comparison purposes?
>
> Anyway, IMHO an option for an ANALYZE FULL (i.e. "sampling" all rows)
> would be valuable.  Any other ideas for how to force this without a
> code change are very welcome.
>
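(Short of a code change, the closest existing knob is the per-column
statistics target, which enlarges the sample ANALYZE takes (on the order
of a few hundred rows per target bucket) but still never guarantees a
full scan.  Table and column names below are only placeholders.)

    -- keep up to 1000 histogram bins / most-common values for this column;
    -- ANALYZE samples proportionally more rows as a side effect
    ALTER TABLE lineitem ALTER COLUMN l_shipdate SET STATISTICS 1000;
    ANALYZE lineitem;
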
> Thanks for your info!
>
>
>
> On Thu, 2003-09-04 at 16:16, Tom Lane wrote:
> > Mary Edie Meredith <maryedie@osdl.org> writes:
> > > Stephan Szabo kindly responded to our earlier queries suggesting we look
> > > at default_statistics_target and ALTER TABLE ALTER COLUMN SET
> > > STATISTICS.
> >
> > > These determine the number of bins in the histogram for a given column.
> > > But for a large number of rows (for example 6 million) the maximum value
> > > (1000) does not guarantee that ANALYZE will do a full scan of the table.
> > > We do not see a way to guarantee the same statistics run to run without
> > > forcing ANALYZE to examine every row of every table.
> >
> > Do you actually still have a problem with the plans changing when the
> > stats target is above 100 or so?  I think the notion of "force ANALYZE
> > to do a full scan" is inherently wrongheaded ... it certainly would not
> > produce numbers that have anything to do with ordinary practice.
> >
> > If you have data statistics that are so bizarre that the planner still
> > gets things wrong with a target of 1000, then I'd like to know more
> > about why.
> >
> >             regards, tom lane
> >
> --
> Mary Edie Meredith <maryedie@osdl.org>
> Open Source Development Lab
>
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
