Re: LIMIT causes SEQSCAN in subselect - Mailing list pgsql-performance

From Mike Rylander
Subject Re: LIMIT causes SEQSCAN in subselect
Date
Msg-id b918cf3d04121106375a955ce8@mail.gmail.com
Whole thread Raw
In response to Re: LIMIT causes SEQSCAN in subselect  (Josh Berkus <josh@agliodbs.com>)
List pgsql-performance
On Fri, 10 Dec 2004 21:40:18 -0800, Josh Berkus <josh@agliodbs.com> wrote:
> Mike,
> The fact that the estimator knows that the LIMIT is pointless because there
> are less rows in the subselect than the LIMIT will return is not something we
> want to count on; sometimes the estimator has innaccurate information.  The
> UNIQUE index makes this more certain, except that I'm not sure that the
> planner distinguishes between actual UNIQUE indexes and columns which are
> estimated unique (per the pg_stats).   And I think you can see in your case
> that there's quite a difference between a column we're CERTAIN is unique,
> versus a column we THINK is unique.

Absolutely.  At first I was going to ask if perhaps using the stats to
discard the LIMIT would be possible, but since the stats are only
guidelines I dropped that.  The stats are just so tempting!

>
> > I realize this is a rather specialized case and not really great form.
>
> Exactly.   You've grasped the main issue: that this has not been optimized
> because it's bizarre and not very sensible query writing.   Someday we'll get
> around to optimizing the really wierd queries, but there's still a lot of
> work to be done on the common ones (like count(*) ...).

Absolutely.  And if I can help out with the common cases to gain some
Karmic currency I will. ;)  After thinking about it some more, I don't
think those queries we really all that wacky though.  The problem with
the example is that the generated query is very simple, and real-world
queries that would be used in the subselect  would be much more
complex, and row estimation would be untrustworthy without a UNIQUE
index.

>
> Keep in mind that the only reason we support LIMIT inside subqueries in the
> first place is a workaround to slow aggregates, and a way to do RANK.  It's
> certainly not SQL-standard.
>

No it's not, but then nobody ever accused the authors of the SQL spec
of being omniscient...  I' cant think of another way to get, say, a
'top 10' list from a subselect, or use a paging iterator (LIMIT ..
OFFSET ..) as the seed for an outer query.  Well, other than an SRF of
course.

> > Just a matter of
> > defining result sets independently, and creating a simple wrapper to
> > join them.
>
> Well, if you think so, you know where to submit patches ...
>

Well, I do, but I was talking about it being 'easy' in the middleware.
 Just let PG handle optimizing the subselects.

For example, you have a pile of predefined SELECTS that don't know
they are related and are used for simple lookups.  You tell the SQL
generator thingy that it should use two of those, queries A and B,
that they are related on x, and that you want to see the 'id' from A
and the 'value' from B.  Instead of having to preplan every possible
combination of JOINS the SQL generator will toss the preplanned ones
into subselects and join them in the outer query instead of having to
rip them apart and calculate the join syntax.  And yes, I know that
view will take care of most of that for me... :)

Thanks for all your comments.  Pretty much what I expected, but I
thought I'd raise a use case.  I'll just have to give the query
builder more smarts.

> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco
>


--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

pgsql-performance by date:

Previous
From: "Steinar H. Gunderson"
Date:
Subject: Re: [GENERAL] Query is not using index when it should
Next
From: Michael Fuhr
Date:
Subject: Re: [GENERAL] Query is not using index when it should