Logarithmic data frequency distributions and the query planner - Mailing list pgsql-performance

From Jerry Gamache
Subject Logarithmic data frequency distributions and the query planner
Msg-id 4C34E998.3020000@idilia.com
Responses Re: Logarithmic data frequency distributions and the query planner  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
On 8.1, I have a very interesting database where the distribution of
some values in a multi-million-row table is logarithmic (i.e. the most
frequent value is an order of magnitude more frequent than the next
ones). If I analyze the table, the statistics become extremely skewed
towards the most frequent values, and this prevents the planner from
producing good plans for queries that do not target these entries.
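
For illustration, the skew recorded by ANALYZE can be inspected through
the pg_stats view. This is only a sketch: the table name "events" and
the column name "category" are hypothetical stand-ins for the real
schema, which is not shown in this message.

```sql
-- Hypothetical names: table "events", column "category".
-- Refresh the statistics, then look at what the planner will use.
ANALYZE events;

SELECT attname,
       n_distinct,          -- estimated number of distinct values
       most_common_vals,    -- values the sample found most often
       most_common_freqs    -- their estimated fraction of all rows
FROM pg_stats
WHERE tablename = 'events'
  AND attname = 'category';
```

With a distribution like the one described, most_common_freqs would be
expected to be dominated by its first entry.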

In a recent case, the planner estimated that the number of returned rows
would be ~13% of the table size and, from this bad assumption, generated
a very slow, conservative plan that took days to execute. If I set the
statistics target to zero for that table, the planner uses a hardcoded
ratio (seems like 0.5%) for the number of returned rows, and this helps
generate a plan that executes in 3 minutes (still sub-optimal, but not
as bad).
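
A minimal sketch of the zero-statistics workaround, again with the
hypothetical names "events" and "category" and a placeholder value
'rare'. It assumes no earlier statistics for the column are left behind
in pg_statistic. The 10-million-row table size in the comments is an
invented figure used only to make the two ratios concrete.

```sql
-- Hypothetical names: table "events", column "category".
-- Stop collecting statistics for the skewed column.
ALTER TABLE events ALTER COLUMN category SET STATISTICS 0;
ANALYZE events;

-- Compare the planner's "rows=" estimate with the real row count.
EXPLAIN SELECT * FROM events WHERE category = 'rare';
SELECT count(*) FROM events WHERE category = 'rare';

-- Illustrative arithmetic for an assumed 10,000,000-row table:
--   13%  estimate -> 10,000,000 * 0.13  = 1,300,000 rows
--   0.5% estimate -> 10,000,000 * 0.005 =    50,000 rows
```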

Generating partial indexes for the less frequent cases helps, but this
solution is not flexible enough for our needs as the number of complex
queries grows. We are mostly left with pre-generating a lot of temporary
tables whenever the planner overestimates the number of rows generated
by a subquery (query execution was trimmed from 3 minutes to 30 seconds
using this technique) or using the OFFSET 0 tweak, but it would be nice
if the planner could handle this on its own.
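
The three workarounds above might look roughly like the following. This
is a sketch under stated assumptions, not the actual queries: the tables
"events" and "details", their columns, and the values 'common' and
'rare' are all hypothetical.

```sql
-- Hypothetical schema: events(id, category), details(event_id, ...).

-- 1. Partial index that leaves out the dominant value, so only the
--    less frequent cases are indexed.
CREATE INDEX events_rare_category_idx
    ON events (category)
    WHERE category <> 'common';

-- 2. Materialize the subquery into a temporary table and analyze it,
--    so the planner sees its real size before joining.
CREATE TEMP TABLE rare_events AS
    SELECT id, category
    FROM events
    WHERE category = 'rare';
ANALYZE rare_events;

SELECT d.*
FROM rare_events r
JOIN details d ON d.event_id = r.id;

-- 3. OFFSET 0 tweak: the no-op OFFSET keeps the planner from
--    flattening the subquery into the outer query.
SELECT d.*
FROM (SELECT id
      FROM events
      WHERE category = 'rare'
      OFFSET 0) AS r
JOIN details d ON d.event_id = r.id;
```

Of the three, only the temporary table gives the planner a measured row
count for the intermediate result. The other two influence which plans
it can choose without correcting the estimate itself.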

Am I missing something obvious? Setting the statistics for this table to
zero seems awkward even if it gives good results.
Jerry.


