Re: index problems (again) - Mailing list pgsql-general

From Peter J. Holzer
Subject Re: index problems (again)
Date
Msg-id 20160312220021.GB20873@rorschach.hjp.at
Whole thread Raw
In response to Re: index problems (again)  (Geoff Winkless <pgsqladmin@geoff.dj>)
Responses Re: index problems (again)  (Geoff Winkless <pgsqladmin@geoff.dj>)
List pgsql-general
On 2016-03-12 21:00:04 +0000, Geoff Winkless wrote:
> On 12 March 2016 at 18:43, Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> > The question is what can be done to improve the situation.
> >
> > Tom thinks that correlation statistics would help. That seems plausible
> > to me.
[...]
> > You claim that no statistics are needed.
>
> Well that's a bit confrontational.

Sorry. Didn't want to sound confrontational. I was just repeating points
made by Tom and you previously in this thread to establish a baseline.

> > That may or may not be true: You haven't proposed an alternate method
> > yet.
>
> You could make an assumption that perfect distribution isn't true:
> that actually the distribution is within a certain _deviation_ of that
> perfect distribution. It wouldn't have to have been very much to make
> the index-only scan win here and would still keep the planner from
> choosing less optimal queries most of the time (and where it did end
> up making the "wrong" choice it's not going to be far off anyway).
>
> But I'm making assumptions here, I'm aware of that. Chances are that
> actually most people's data _does_ fit into this perfect distribution
> set. Is there any research that shows that real-world data usually
> does?

I don't think most people's data is perfectly distributed. But as you
say most data is probably within some deviation of being perfectly
distributed and as long as that deviation isn't too big it doesn't
matter.

But there are certainly some common examples of highly correlated
columns. Having a serial id and a date as in your case is probably quite
common. Another example might be a surrogate primary key which is
computed from some other fields (e.g. a timeseries code starting with a
country code, or a social security number starting with the birth date,
...). That's probably not that uncommon either.

So, I agree with you. This is a problem and it should be fixed. I'm just
sceptical that it can be done with a simple cost adjustment.


> As Jeff points out I'd have a much larger win in this instance by
> someone spending the time implementing skip index scans rather than
> messing with the planner :)

Yeah. I think I have some code which could benefit from this, too. I'll
have to try that trick from the wiki.

        hp

--
   _  | Peter J. Holzer    | I want to forget all about both belts and
|_|_) |                    | suspenders; instead, I want to buy pants
| |   | hjp@hjp.at         | that actually fit.
__/   | http://www.hjp.at/ |   -- http://noncombatant.org/

Attachment

pgsql-general by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: Q: extract database name from directory dump
Next
From: "Peter J. Holzer"
Date:
Subject: Re: Unable to match same value in field.