Thread: plan problem

plan problem

From
Ken Geis
Date:
I am trying to find an efficient way to draw a random sample from a
complex query.  I also want it to be easy to use within my application.

So I've defined a view that encapsulates the query.  The id in the
"driving" table is exposed, and I run a query like:

select * from stats_record_view
  where id in (select id from driver_stats
                order by random()
                limit 30000);

driver_stats.id is unique, the primary key.  The problem I'm having is
that neither the ORDER BY nor the LIMIT change the uniqueness of that
column, but the planner doesn't know that.  It does a HashAggregate to
make sure the results are unique.  It thinks that 200 rows will come out
of that operation, and then 200 rows is small enough that it thinks a
Nested Loop is the best way to proceed from there.

I can post more query plan, but I don't think it would be that very
helpful.  I'm considering just making a sample table and creating an
analogous view around that.  I'd like to be able to keep this as simple
as possible though.


Ken



Re: plan problem

From
Richard Huxton
Date:
On Tuesday 06 April 2004 21:25, Ken Geis wrote:
> I am trying to find an efficient way to draw a random sample from a
> complex query.  I also want it to be easy to use within my application.
>
> So I've defined a view that encapsulates the query.  The id in the
> "driving" table is exposed, and I run a query like:
>
> select * from stats_record_view
>   where id in (select id from driver_stats
>                 order by random()
>                 limit 30000);

How about a join?

SELECT s.*
FROM
stats_record_view s
JOIN
(SELECT id FROM driver_stats ORDER BY random() LIMIT 30000) AS r
ON s.id = r.id;

Or, what about a cursor and fetch forward (or back?) a random number of rows
before each fetch. That's probably not going to be so random though.

Also worth checking the various list archives - this has come up in the past,
but some time ago.

--
  Richard Huxton
  Archonet Ltd

Re: plan problem

From
Ken Geis
Date:
Richard Huxton wrote:
> On Tuesday 06 April 2004 21:25, Ken Geis wrote:
>
>>I am trying to find an efficient way to draw a random sample from a
>>complex query.  I also want it to be easy to use within my application.
>>
>>So I've defined a view that encapsulates the query.  The id in the
>>"driving" table is exposed, and I run a query like:
>>
>>select * from stats_record_view
>>  where id in (select id from driver_stats
>>                order by random()
>>                limit 30000);
>
>
> How about a join?
>
> SELECT s.*
> FROM
> stats_record_view s
> JOIN
> (SELECT id FROM driver_stats ORDER BY random() LIMIT 30000) AS r
> ON s.id = r.id;

Yes, I tried this too after I sent the first mail, and this was somewhat
better.  I ended up adding a random column to the driving table, putting
an index on it, and exposing that column in the view.  Now I can say

SELECT * FROM stats_record_view WHERE random < 0.093;

For my application, it's OK if the same sample is picked time after time
and it may change if data is added.

...
> Also worth checking the various list archives - this has come up in the past,
> but some time ago.

There are some messages in the archives about how to get a random
sample.  I know how to do that, and that's not why I posted my message.
  Are you saying that the planner behavior I spoke of is in the
archives?  I wouldn't know what to search on to find that thread.  Does
anyone think that the planner issue has merit to address?  Can someone
help me figure out what code I would look at?


Ken Geis



Re: plan problem

From
Richard Huxton
Date:
On Wednesday 07 April 2004 10:03, Ken Geis wrote:
> Richard Huxton wrote:
> > On Tuesday 06 April 2004 21:25, Ken Geis wrote:
> >>I am trying to find an efficient way to draw a random sample from a
> >>complex query.  I also want it to be easy to use within my application.
> >>
> >>So I've defined a view that encapsulates the query.  The id in the
> >>"driving" table is exposed, and I run a query like:
> >>
> >>select * from stats_record_view
> >>  where id in (select id from driver_stats
> >>                order by random()
> >>                limit 30000);
> >
> > How about a join?
> >
> > SELECT s.*
> > FROM
> > stats_record_view s
> > JOIN
> > (SELECT id FROM driver_stats ORDER BY random() LIMIT 30000) AS r
> > ON s.id = r.id;
>
> Yes, I tried this too after I sent the first mail, and this was somewhat
> better.  I ended up adding a random column to the driving table, putting
> an index on it, and exposing that column in the view.  Now I can say
>
> SELECT * FROM stats_record_view WHERE random < 0.093;
>
> For my application, it's OK if the same sample is picked time after time
> and it may change if data is added.

Fair enough - that'll certainly do it.

> > Also worth checking the various list archives - this has come up in the
> > past, but some time ago.
>
> There are some messages in the archives about how to get a random
> sample.  I know how to do that, and that's not why I posted my message.
>   Are you saying that the planner behavior I spoke of is in the
> archives?  I wouldn't know what to search on to find that thread.  Does
> anyone think that the planner issue has merit to address?  Can someone
> help me figure out what code I would look at?

I was assuming after getting a random subset they'd see the same problem you
are. If not, probably worth looking at. In which case, an EXPLAIN ANALYZE of
your original query would be good.

--
  Richard Huxton
  Archonet Ltd

Re: plan problem

From
Tom Lane
Date:
Ken Geis <kgeis@speakeasy.org> writes:
> Does anyone think that the planner issue has merit to address?  Can
> someone help me figure out what code I would look at?

The planner doesn't currently attempt to "drill down" into a sub-select-
in-FROM to find statistics about the variables emitted by the sub-select.
So it's just falling back to a default estimate of the number of
distinct values coming out of the sub-select.

The "drilling down" part is not hard; the difficulty comes from trying
to figure out whether and how the stats from the underlying column would
need to be adjusted for the behavior of the sub-select itself.  As an
example, the result of (SELECT DISTINCT foo FROM bar) would usually have
much different stats from the raw bar.foo column.  In your example, the
LIMIT clause potentially affects the stats by reducing the number of
distinct values.

Now in most situations where the sub-select wouldn't change the stats,
there's no issue anyway because the planner will flatten the sub-select
into the main query.  So we really have to figure out the adjustment
part before we can think about doing much here.

            regards, tom lane