Thread: generate_series woes

generate_series woes

From: Harald Fuchs
I think there's something sub-optimal with generate_series.
In the following, "documents" is a table with more than 120000 rows,
vacuumed and analyzed before the queries.
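
A minimal setup sketch, if you want to reproduce this (the real table
has more columns, of course; 125518 is the rowcount the plans below
show):

  CREATE TABLE documents (id int PRIMARY KEY);
  INSERT INTO documents (id) SELECT * FROM generate_series(1, 125518);
  VACUUM ANALYZE documents;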

  EXPLAIN ANALYZE
  SELECT count (d.id), floor (s.val / 5000)
  FROM generate_series (1::INT, 5009) AS s (val)
  LEFT JOIN documents d ON d.id = s.val
  GROUP BY 2
  ORDER BY 2;

This returns:

 Sort  (cost=4231.52..4232.02 rows=200 width=8) (actual time=41.886..41.887 rows=2 loops=1)
   Sort Key: (floor(((s.val / 5000))::double precision))
   Sort Method:  quicksort  Memory: 25kB
   ->  HashAggregate  (cost=4219.88..4223.88 rows=200 width=8) (actual time=41.843..41.846 rows=2 loops=1)
         ->  Nested Loop Left Join  (cost=0.00..4214.88 rows=1000 width=8) (actual time=1.274..38.193 rows=5009 loops=1)
               ->  Function Scan on generate_series s  (cost=0.00..12.50 rows=1000 width=4) (actual time=1.209..3.102 rows=5009 loops=1)
               ->  Index Scan using documents_pkey on documents d  (cost=0.00..4.18 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=5009)
                     Index Cond: (d.id = s.val)
 Total runtime: 42.218 ms

Now let's wrap generate_series into an SQL function:

  CREATE FUNCTION genser (int, int) RETURNS SETOF int AS $$
    SELECT * FROM generate_series ($1, $2) AS g(x);
  $$ LANGUAGE sql;

  EXPLAIN ANALYZE
  SELECT count (d.id), floor (s.val / 5000)
  FROM genser (1::INT, 5009) AS s (val)
  LEFT JOIN documents d ON d.id = s.val
  GROUP BY 2
  ORDER BY 2;

Not surprisingly, this returns essentially the same plan (only the
estimated cost of the function scan changes):

 Sort  (cost=4479.02..4479.52 rows=200 width=8) (actual time=43.606..43.607 rows=2 loops=1)
   Sort Key: (floor(((s.val / 5000))::double precision))
   Sort Method:  quicksort  Memory: 25kB
   ->  HashAggregate  (cost=4467.38..4471.38 rows=200 width=8) (actual time=43.559..43.561 rows=2 loops=1)
         ->  Nested Loop Left Join  (cost=0.00..4462.38 rows=1000 width=8) (actual time=3.564..39.740 rows=5009 loops=1)
               ->  Function Scan on genser s  (cost=0.00..260.00 rows=1000 width=4) (actual time=3.503..5.435 rows=5009 loops=1)
               ->  Index Scan using documents_pkey on documents d  (cost=0.00..4.18 rows=1 width=4) (actual time=0.004..0.005 rows=1 loops=5009)
                     Index Cond: (d.id = s.val)
 Total runtime: 44.047 ms
(9 rows)

But look what happens if we tell PostgreSQL how many rows "genser"
will return:

  CREATE FUNCTION genser (int, int) RETURNS SETOF int AS $$
    SELECT * FROM generate_series ($1, $2) AS g(x);
  $$ LANGUAGE sql ROWS 5009;

  EXPLAIN ANALYZE
  SELECT count (d.id), floor (s.val / 5000)
  FROM genser (1::INT, 5009) AS s (val)
  LEFT JOIN documents d ON d.id = s.val
  GROUP BY 2
  ORDER BY 2;

Now we get a better plan:

 Sort  (cost=15545.54..15546.04 rows=200 width=8) (actual time=27.857..27.859 rows=2 loops=1)
   Sort Key: (floor(((s.val / 5000))::double precision))
   Sort Method:  quicksort  Memory: 25kB
   ->  HashAggregate  (cost=15533.89..15537.89 rows=200 width=8) (actual time=27.817..27.819 rows=2 loops=1)
         ->  Merge Right Join  (cost=1610.15..15508.85 rows=5009 width=8) (actual time=7.714..24.133 rows=5009 loops=1)
               Merge Cond: (d.id = s.val)
               ->  Index Scan using documents_pkey on documents d  (cost=0.00..13472.20 rows=125518 width=4) (actual time=0.045..6.112 rows=5010 loops=1)
               ->  Sort  (cost=1610.15..1622.67 rows=5009 width=4) (actual time=7.651..9.501 rows=5009 loops=1)
                     Sort Key: s.val
                     Sort Method:  quicksort  Memory: 427kB
                     ->  Function Scan on genser s  (cost=0.00..1302.34 rows=5009 width=4) (actual time=3.559..5.262 rows=5009 loops=1)
 Total runtime: 28.445 ms
(12 rows)

Since generate_series is a builtin function, can't it tell how many
rows it will return?

Re: generate_series woes

From: "Merlin Moncure"
On Mon, Apr 14, 2008 at 5:21 AM, Harald Fuchs <hari.fuchs@googlemail.com> wrote:
> I think there's something sub-optimal with generate_series.
>  In the following, "documents" is a table with more than 120000 rows,
>  vacuumed and analyzed before the queries.

everything is working exactly as intended.  while it's obvious to you
that the generate_series function returns a particular number of rows
based on your supplied inputs, it's not (yet) obvious to the planner.
your genser function supplies the hint the planner needs and it
adjusts the plan.  for most set-returning functions (particularly
non-immutable ones), it's not so easy to determine the # of rows from
the input parameters anyway.
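
as an aside, you can see where the 1000-row guess comes from: it's the
prorows column in pg_proc (the default for set returning functions),
which the ROWS clause of CREATE FUNCTION simply overrides.  a quick
check (genser here being the version declared with ROWS 5009):

  SELECT proname, prorows FROM pg_proc
  WHERE proname IN ('generate_series', 'genser');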

merlin

Re: generate_series woes

From: Harald Fuchs
In article <b42b73150804150715r83cad1doa166230ec509f0d@mail.gmail.com>,
"Merlin Moncure" <mmoncure@gmail.com> writes:

> On Mon, Apr 14, 2008 at 5:21 AM, Harald Fuchs <hari.fuchs@googlemail.com> wrote:
>> I think there's something sub-optimal with generate_series.
>> In the following, "documents" is a table with more than 120000 rows,
>> vacuumed and analyzed before the queries.

> everything is working exactly as intended.  while it's obvious to you
> that the generate_series function returns a particular number of rows
> based on your supplied inputs, it's not (yet) obvious to the planner.

Which was exactly my point.  Since generate_series is a builtin
function, the planner could theoretically know the number of rows
returned, thus choosing a better plan.

OTOH, the difference between theory and reality is in theory smaller
than in reality.

> your genser function supplies the hint the planner needs and it
> adjusts the plan.  for most set-returning functions (particularly
> non-immutable ones), it's not so easy to determine the # of rows from
> the input parameters anyway.

Yes, of course.  I used "genser" just to show that there is a better plan.

Re: generate_series woes

From: hubert depesz lubaczewski
On Mon, Apr 14, 2008 at 11:21:58AM +0200, Harald Fuchs wrote:
> I think there's something sub-optimal with generate_series.
> In the following, "documents" is a table with more than 120000 rows,
> vacuumed and analyzed before the queries.
> Since generate_series is a builtin function, can't it tell how many
> rows it will return?

i think it would be better not to limit such functionality to builtin
functions. it would be much nicer to have the ability to hint the
planner about a function's rowcount *in* the sql.

something like:

select i from generate_series(1,10) {hint:10} as i;

i'm not proposing syntax. i'm suggesting the functionality.

depesz

--
quicksil1er: "postgres is excellent, but like any DB it requires a
highly paid DBA.  here's my CV!" :)
http://www.depesz.com/ - blog dla ciebie (i moje CV)

Re: generate_series woes

From: Volkan YAZICI
hubert depesz lubaczewski <depesz@depesz.com> writes:
> i think it would be better not to limit such functionality to builtin
> functions. it would be much nicer to have the ability to hint the
> planner about a function's rowcount *in* the sql.
>
> something like:
>
> select i from generate_series(1,10) {hint:10} as i;
>
> i'm not proposing syntax. i'm suggesting the functionality.

I'm strongly opposed to such non-SQL-compliant solutions. I'd
appreciate it if hackers could solve the problem internally, without
touching the SQL syntax.


Regards.

Re: generate_series woes

From: hubert depesz lubaczewski
On Wed, Apr 16, 2008 at 03:37:22PM +0300, Volkan YAZICI wrote:
> I'm strongly opposed to such non-SQL-compliant solutions. I'd
> appreciate it if hackers could solve the problem internally, without
> touching the SQL syntax.

for generate_series - sure. but i have functions which change (in a
known way) the number of returned rows based on their arguments.
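
for example something like this (just a sketch, made-up names) - no
constant ROWS value can be right for all calls, because the rowcount
follows directly from the argument:

  CREATE FUNCTION first_n_docs(n int) RETURNS SETOF documents AS $$
    SELECT * FROM documents ORDER BY id LIMIT $1;
  $$ LANGUAGE sql STABLE ROWS 1000;  -- one-size-fits-all guess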

depesz

--
quicksil1er: "postgres is excellent, but like any DB it requires a
highly paid DBA.  here's my CV!" :)
http://www.depesz.com/ - blog dla ciebie (i moje CV)

Re: generate_series woes

From: "Merlin Moncure"
On Wed, Apr 16, 2008 at 8:37 AM, Volkan YAZICI <yazicivo@ttmail.com> wrote:
> hubert depesz lubaczewski <depesz@depesz.com> writes:
>  > i think it would be better not to limit such functionality to builtin
>  > functions. it would be much nicer to have the ability to hint the
>  > planner about a function's rowcount *in* the sql.
>  >
>  > something like:
>  >
>  > select i from generate_series(1,10) {hint:10} as i;
>  >
>  > i'm not proposing syntax. i'm suggesting the functionality.
>
>  I'm strongly opposed to such non-SQL-compliant solutions. I'd
>  appreciate it if hackers could solve the problem internally, without
>  touching the SQL syntax.

maybe -- just an idle thought -- the 'ROWS' clause of create function
could be expanded to take a simple expression based on the input
parameters.
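
something like this, say (hypothetical syntax, not accepted by any
current version -- ROWS only takes a numeric constant today):

  CREATE FUNCTION genser (int, int) RETURNS SETOF int AS $$
    SELECT * FROM generate_series ($1, $2) AS g(x);
  $$ LANGUAGE sql ROWS ($2 - $1 + 1);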

merlin

Re: generate_series woes

From: Sam Mason
On Wed, Apr 16, 2008 at 09:01:10AM -0400, Merlin Moncure wrote:
> On Wed, Apr 16, 2008 at 8:37 AM, Volkan YAZICI <yazicivo@ttmail.com> wrote:
> > hubert depesz lubaczewski <depesz@depesz.com> writes:
> > > select i from generate_series(1,10) {hint:10} as i;
> > >
> > > i'm not proposing syntax. i'm suggesting the functionality.
> >
> > I'm strongly opposed to such non-SQL-compliant solutions. I'd
> > appreciate it if hackers could solve the problem internally, without
> > touching the SQL syntax.
>
> maybe -- just an idle thought -- the 'ROWS' clause of create function
> could be expanded to take a simple expression based on the input
> parameters.

In computer science terms, I think you mean that you want something
known as dependent types.  Full dependent types are probably overkill
for PG, but some very limited form would be a fun project.  The idea,
in general, is to move code into the typesystem that is run during
typechecking and used to prove that your code is doing the right thing.
Within PG, this seems to naturally (from my naive viewpoint) extend to
the case of providing hints (proofs would be the normal use) to the
planner about what's going to happen when the code is run.

This seems to imply that types couldn't be stored as OIDs any more
(you'd be creating and destroying lots while planning the query) so
would probably change the structure of the code rather significantly.


  Sam