Re: Yet another abort-early plan disaster on 9.3 - Mailing list pgsql-performance

From Craig James
Subject Re: Yet another abort-early plan disaster on 9.3
Msg-id CAFwQ8rd0C47ZkCGONPa1UfJg2nJzyi-7kAQ6-jjapV+GU-Mo0Q@mail.gmail.com
In response to Re: Yet another abort-early plan disaster on 9.3  ("Tomas Vondra" <tv@fuzzy.cz>)
Responses Re: Yet another abort-early plan disaster on 9.3
List pgsql-performance
On Fri, Oct 10, 2014 at 5:10 AM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> I've gone looking for papers on this topic, but from what I read this
>> isn't so. To get any noticeable improvement you need to read 10-50% of
>> the table, and that's effectively the same as reading the entire table
>> -- and it still had pretty poor results. All the research I could find
>> went into how to analyze the whole table while using a reasonable
>> amount of scratch space, and how to do it incrementally.
>
> I think it's really difficult to discuss estimation without some basic
> agreement on what the goals are. Naturally, we can't get a perfect
> estimator with small samples (especially when the sample size is fixed
> and doesn't scale with the table). But maybe we can improve the
> estimates without scanning most of the table?
>
> FWIW, I've been playing with the adaptive estimator described in [1],
> and the results look really interesting, IMHO. So far I've been testing
> it on synthetic datasets outside the database, but I plan to use it in
> place of our estimator and do some more tests.

We've solved this problem using an external (non-Postgres) dynamically optimizing index. In addition to the "early abort," we also require an efficient "late start", the equivalent of "offset 100 limit 10". It's a common problem for web sites that let users page through data with just a tiny amount of state information (a cookie).

Our index is for chemical structures. Chemicals are indexed on chemical fragments. A search typically starts with 50-200 indexed "columns" (chemical fragments). The query is always flat, "A and B and ... and Z". The indexed fragments are both correlated (the existence of one strongly raises the chances of another) and anti-correlated (certain combinations are very rare).

The dynamic optimizer watches the performance of each index in real time. It promotes highly selective indexes and demotes or removes redundant indexes. In a typical query, the initial 50-200 indexes are reduced to 5-10 indexes within the first 100-200 rows examined. The remaining indexes have little correlation yet retain most of the selectivity. (One critical factor with a dynamic optimizer is that the data must be randomized before it's presented to the optimizer. Databases tend to have clusters of similar data. If the optimizer starts in such a cluster, it will optimize poorly.)
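A minimal sketch of the idea, in Python (hypothetical names and thresholds, not our actual code): shuffle the candidate rows, measure each index's observed pass rate over an initial window, and demote any index that accepts nearly everything, since it adds cost without selectivity.

```python
import random

def prune_indexes(rows, indexes, window=200, redundancy_cutoff=0.9):
    """Keep only indexes (predicates) that stay selective over a sample.

    rows: iterable of candidate rows; indexes: dict of name -> predicate.
    Hypothetical sketch, not the actual chemical-search implementation.
    """
    rows = list(rows)
    random.shuffle(rows)   # randomize first: clustered data misleads the optimizer
    sample = rows[:window]
    # Fraction of sampled rows each predicate accepts (its observed pass rate).
    pass_rate = {name: sum(pred(r) for r in sample) / len(sample)
                 for name, pred in indexes.items()}
    # Demote indexes that accept almost every row -- they no longer discriminate.
    return {name: pred for name, pred in indexes.items()
            if pass_rate[name] < redundancy_cutoff}
```

The real system does this continuously as rows stream past rather than once up front, but the pruning criterion is the same: keep the few predicates that still reject most candidates.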

Our query is simple (a flat AND) compared to what Postgres has to handle. Even so, a dynamic optimizer is the only effective solution.

Static planners simply can't handle the "early abort" condition, even with good statistics. Many have pointed out that data are "lumpy" rather than well distributed. A more subtle problem is that you can have evenly distributed data, but badly distributed correlations. "Agnes" and "Bob" may be names that are distributed well in a real-estate database, but it might happen that all of the information about homes whose owners' names are "Agnes" and "Bob" occurs at the very end of all of your data because they just got married and bought a house.

The end result is that even with perfect statistics on each column, you're still screwed. The combinatorial explosion of possible correlations between indexes is intractable.

Craig
