Thread: Re: [pgsql-hackers-win32] Poor Performance for large queries in functions

Re: [pgsql-hackers-win32] Poor Performance for large queries in functions

From: John Meinel

Tom Lane wrote:
> John Meinel <john@johnmeinel.com> writes:
>
>>... However, if I try to
>>bundle this query up into a server side function, it runs very slow (10
>>seconds). I'm trying to figure out why, but since I can't run EXPLAIN
>>ANALYZE inside a function, I don't really know what else to do.
>
>
> A parameterized query inside a function is basically the same as a
> PREPARE'd query with parameters at the SQL level.  So you can
> investigate what's happening here with
>
>     PREPARE foo(int) AS
>         SELECT * FROM object WHERE id in (
>             SELECT id FROM data_t WHERE project_id = $1
>         UNION SELECT ... ;
>
>     EXPLAIN ANALYZE EXECUTE foo(48542);
>
> I'm not sure where the problem is either, so please do send along the
> results.
>
>             regards, tom lane
>
> PS: pgsql-performance would be a more appropriate venue for this
> discussion.

Well, I think I tracked the problem down to the fact that the column
does not have a "not null" constraint on it. Here is a demonstration.
I have 3 tables: tobjects, tdata, and tproject. tdata just links
between tobjects and tproject, but isn't required to link to tproject.
The real data has more columns, but this is enough to show the problem.

jfmeinel=> \d tobjects
    Table "public.tobjects"
  Column |  Type   | Modifiers
--------+---------+-----------
  id     | integer | not null
Indexes:
     "tobjects_pkey" primary key, btree (id)

jfmeinel=> \d tproject
    Table "public.tproject"
  Column |  Type   | Modifiers
--------+---------+-----------
  id     | integer | not null
Indexes:
     "tproject_pkey" primary key, btree (id)

jfmeinel=> \d tdata
        Table "public.tdata"
    Column   |  Type   | Modifiers
------------+---------+-----------
  id         | integer | not null
  project_id | integer |
Indexes:
     "tdata_pkey" primary key, btree (id)
     "tdata_project_id_idx" btree (project_id)
Foreign-key constraints:
     "tdata_id_fkey" FOREIGN KEY (id) REFERENCES tobjects(id) ON UPDATE
CASCADE ON DELETE CASCADE
     "tdata_project_id_fkey" FOREIGN KEY (project_id) REFERENCES
        tproject(id) ON UPDATE CASCADE ON DELETE SET DEFAULT

jfmeinel=> select count(*) from tdata;
  count
--------
  545768

jfmeinel=> select count(*) - count(project_id) from tdata;
  ?column?
----------
       240

So tdata(project_id) is almost completely populated: of the 545,768
entries, only 240 are null.


jfmeinel=> prepare myget(int) as select id from tdata
jfmeinel->     where project_id = $1;
PREPARE

jfmeinel=> explain analyze execute myget(30000);
                                                  QUERY PLAN
--------------------------------------------------------------------
  Seq Scan on tdata  (cost=0.00..9773.10 rows=181923 width=4)
    (actual time=1047.000..1047.000 rows=0 loops=1)
    Filter: (project_id = $1)
  Total runtime: 1047.000 ms

jfmeinel=> explain analyze select id from tdata where project_id = 30000;
                                                          QUERY PLAN

-------------------------------------------------------------------------
  Index Scan using tdata_project_id_idx on tdata  (cost=0.00..4.20
rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=1)
    Index Cond: (project_id = 30000)
  Total runtime: 0.000 ms

So notice that when doing the actual select it is able to do the index
query. But for some reason with a prepared statement, it is not able to
do it.

Any ideas?

Since I only have the integers now, I can send the data to someone if
they care to investigate it. It comes to 2.2M as a .tar.bz2, so
obviously I'm not going to spam the list.

If I rewrite myget as:

prepare myget(int) as select id from tdata where project_id = 30000;

it does the right thing again. So it's something about how a variable
interacts with an indexed column with null values.

Note: I tried creating a script that generates dummy data to show this
problem, but failed (it always performed the query correctly).

But this test data definitely shows the problem. And yes, I've vacuum
analyzed all over the place.

John
=:->

PS> I tested this on PostgreSQL 7.4.3, and it did not demonstrate this
problem. I am using PostgreSQL 8.0.0beta2 (probably -dev1).

Re: [pgsql-hackers-win32] Poor Performance for large queries

From: Richard Huxton

John Meinel wrote:
>
> So notice that when doing the actual select it is able to do the index
> query. But for some reason with a prepared statement, it is not able to
> do it.
>
> Any ideas?

In the index-using example, PG knows the value you are comparing to. So,
it can make a better estimate of how many rows will be returned. With
the prepared/compiled version it has to come up with a plan that makes
sense for any value.

If you look back at the explain output you'll see PG is guessing 181,923
rows will match with the prepared query but only 1 for the second query.
If in fact you returned that many rows, you wouldn't want to use the
index - it would mean fetching values twice.

The only work-around if you are using plpgsql functions is to use
EXECUTE to make sure your queries are planned for each value provided.
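
For instance, something along these lines (a sketch against John's
tdata table; the function name is made up, and FOR ... IN EXECUTE
splices the value into the query text, so each call is planned with
the actual value):

CREATE OR REPLACE FUNCTION get_ids(int) RETURNS SETOF integer AS $$
DECLARE
    rec record;
BEGIN
    -- the query string is built per call, so the planner sees the
    -- real project_id and can choose the index scan for rare values
    FOR rec IN EXECUTE 'SELECT id FROM tdata WHERE project_id = '
                       || quote_literal($1)
    LOOP
        RETURN NEXT rec.id;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

Calling SELECT * FROM get_ids(30000); then gets a fresh plan for that
value every time, at the cost of re-planning on each call.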

--
   Richard Huxton
   Archonet Ltd

Re: [pgsql-hackers-win32] Poor Performance for large queries

From: John Meinel

Richard Huxton wrote:
> John Meinel wrote:
>
>>
>> So notice that when doing the actual select it is able to do the index
>> query. But for some reason with a prepared statement, it is not able
>> to do it.
>>
>> Any ideas?
>
>
> In the index-using example, PG knows the value you are comparing to. So,
> it can make a better estimate of how many rows will be returned. With
> the prepared/compiled version it has to come up with a plan that makes
> sense for any value.
>
> If you look back at the explain output you'll see PG is guessing 181,923
> rows will match with the prepared query but only 1 for the second query.
> If in fact you returned that many rows, you wouldn't want to use the
> index - it would mean fetching values twice.
>
> The only work-around if you are using plpgsql functions is to use
> EXECUTE to make sure your queries are planned for each value provided.
>
I suppose that makes sense. If the number were small (< 100), there
would probably be a lot of matches, because the tproject table is all
small integers.

But a large number probably doesn't exist in that table at all.

Thanks for the heads up.

John
=:->

Re: [pgsql-hackers-win32] Poor Performance for large queries in functions

From: Tom Lane

[ enlarging on Richard's response a bit ]

John Meinel <john@johnmeinel.com> writes:
> jfmeinel=> explain analyze execute myget(30000);
>                                                   QUERY PLAN
> --------------------------------------------------------------------
>   Seq Scan on tdata  (cost=0.00..9773.10 rows=181923 width=4)
>     (actual time=1047.000..1047.000 rows=0 loops=1)
>     Filter: (project_id = $1)
>   Total runtime: 1047.000 ms

> jfmeinel=> explain analyze select id from tdata where project_id = 30000;
>                                                           QUERY PLAN

> -------------------------------------------------------------------------
>   Index Scan using tdata_project_id_idx on tdata  (cost=0.00..4.20
> rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=1)
>     Index Cond: (project_id = 30000)
>   Total runtime: 0.000 ms

> So notice that when doing the actual select it is able to do the index
> query. But for some reason with a prepared statement, it is not able to
> do it.

This isn't a "can't do it" situation, it's a "doesn't want to do it"
situation, and it's got nothing whatever to do with null or not null.
The issue is the estimated row count, which in the first case is so high
as to make the seqscan approach look cheaper.  So the real question here
is what are the statistics on the column that are making the planner
guess such a large number when it has no specific information about the
compared-to value.  Do you have one extremely common value in the column?
Have you done an ANALYZE recently on the table, and if so can you show
us the pg_stats row for the column?
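
(For anyone following along, something like this pulls the relevant
row; pg_stats is the standard statistics view and these are its stock
columns:)

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'tdata' AND attname = 'project_id';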

            regards, tom lane

Re: [pgsql-hackers-win32] Poor Performance for large queries

From: John Meinel

Tom Lane wrote:
> [ enlarging on Richard's response a bit ]
>
> John Meinel <john@johnmeinel.com> writes:
>
>>jfmeinel=> explain analyze execute myget(30000);
>>                                                  QUERY PLAN
>>--------------------------------------------------------------------
>>  Seq Scan on tdata  (cost=0.00..9773.10 rows=181923 width=4)
>>    (actual time=1047.000..1047.000 rows=0 loops=1)
>>    Filter: (project_id = $1)
>>  Total runtime: 1047.000 ms
>
>
>>jfmeinel=> explain analyze select id from tdata where project_id = 30000;
>>                                                          QUERY PLAN
>
>
>>-------------------------------------------------------------------------
>>  Index Scan using tdata_project_id_idx on tdata  (cost=0.00..4.20
>>rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=1)
>>    Index Cond: (project_id = 30000)
>>  Total runtime: 0.000 ms
>
>
>>So notice that when doing the actual select it is able to do the index
>>query. But for some reason with a prepared statement, it is not able to
>>do it.
>
>
> This isn't a "can't do it" situation, it's a "doesn't want to do it"
> situation, and it's got nothing whatever to do with null or not null.
> The issue is the estimated row count, which in the first case is so high
> as to make the seqscan approach look cheaper.  So the real question here
> is what are the statistics on the column that are making the planner
> guess such a large number when it has no specific information about the
> compared-to value.  Do you have one extremely common value in the column?
> Have you done an ANALYZE recently on the table, and if so can you show
> us the pg_stats row for the column?
>
>             regards, tom lane
>

The answer is "yes" that particular column has very common numbers in
it. Project id is a number from 1->21. I ended up modifying my query
such that I do the bulk of the work in a regular UNION SELECT so that
all that can be optimized, and then I later do another query for this
row in an 'EXECUTE ...' so that unless I'm actually requesting a small
number, the query planner can notice that it can do an indexed query.
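
In outline the reworked function looks something like this (the names
are made up and it is heavily simplified from the real thing; the
EXECUTE part is the same trick Richard suggested):

CREATE OR REPLACE FUNCTION get_project_objects(int)
RETURNS SETOF integer AS $$
DECLARE
    rec record;
BEGIN
    -- bulk of the work: static SQL that doesn't filter on the skewed
    -- project_id column, so the generic plan is fine
    FOR rec IN
        SELECT d.id FROM tdata d, tobjects o WHERE d.id = o.id
    LOOP
        RETURN NEXT rec.id;
    END LOOP;
    -- the skewed lookup is planned per call: a rare project_id gets
    -- the index scan, a common one (like 18) gets the seqscan
    FOR rec IN EXECUTE
        'SELECT id FROM tdata WHERE project_id = ' || quote_literal($1)
    LOOP
        RETURN NEXT rec.id;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;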

I'm pretty sure this just avoids the worst-case scenario. It is true
that if I use the number 18, it will return 500,000 rows, and fetching
those through an index lookup would be very bad. But typically I'm
querying numbers in a very different range, so with the actual value
in hand the planner can tell it is not likely to find that number.

Thanks for pointing out what the query planner was thinking; with that
I was able to work around it.

John
=:->
