Thread: Simple join in PostgreSQL takes too long

Simple join in PostgreSQL takes too long

From
Vitaly Belman
Date:
Hello pgsql-performance,

  I discussed this subject for some time on DevShed and didn't
  achieve much in the way of results. I wonder if any of you can help
  out:

  http://forums.devshed.com/t136202/s.html

Regards,
 Vitaly Belman

 ICQ: 1912453
 AIM: VitalyB1984
 MSN: tmdagent@hotmail.com
 Yahoo!: VitalyBe


Re: Simple join in PostgreSQL takes too long

From
"Atesz"
Date:
Hi,

You can try some variations:

SELECT
  book_id
FROM
  bookgenres, genre_children
WHERE
   bookgenres.genre_id = genre_children.genre_child_id AND
   genre_children.genre_id = 1
GROUP BY book_id
LIMIT 10

The next variation pushes the child-genre lookup into a subselect. Note
that it needs IN rather than =, because a genre (genre 1 here) can have
many child genres in 'genre_children'.

SELECT
  book_id
FROM
  bookgenres
WHERE
   bookgenres.genre_id IN (SELECT genre_child_id FROM genre_children
                           WHERE genre_id = 1)
GROUP BY book_id
LIMIT 10

You may need some indexes. Try these with EXPLAIN!

CREATE INDEX bookgenres_genre_id_book_id ON bookgenres(genre_id, book_id);
-- or
CREATE INDEX bookgenres_book_id_genre_id ON bookgenres(book_id, genre_id);

CREATE INDEX genre_children_genre_id ON genre_children(genre_id);
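
For example, to check which plan the first variation gets:

EXPLAIN ANALYZE
SELECT book_id
FROM bookgenres, genre_children
WHERE bookgenres.genre_id = genre_children.genre_child_id AND
      genre_children.genre_id = 1
GROUP BY book_id
LIMIT 10;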

Regards, Antal Attila



Re: Simple join in PostgreSQL takes too long

From
Nick Barr
Date:
Vitaly Belman wrote:
> Hello pgsql-performance,
>
>   I discussed this subject for some time on DevShed and didn't
>   achieve much in the way of results. I wonder if any of you can help
>   out:
>
>   http://forums.devshed.com/t136202/s.html
>

So cutting and pasting:

----- SCHEMA -----
CREATE TABLE bv_bookgenres (
     book_id INT NOT NULL,
     genre_id INT NOT NULL
);
CREATE TABLE bv_genre_children (
     genre_id INT,
     genre_child_id INT
);
-------------------

----- QUERY -----
select DISTINCT
       book_id
from
   bookgenres,
   genre_children
WHERE
      bookgenres.genre_id = genre_children.genre_child_id AND
      genre_children.genre_id = 1
LIMIT 10
-----------------

----- EXPLAIN ANALYZE -----
QUERY PLAN
Limit  (cost=6503.51..6503.70 rows=10 width=4) (actual time=703.000..703.000 rows=10 loops=1)
   ->  Unique  (cost=6503.51..6738.20 rows=12210 width=4) (actual time=703.000..703.000 rows=10 loops=1)
         ->  Sort  (cost=6503.51..6620.85 rows=46937 width=4) (actual time=703.000..703.000 rows=24 loops=1)
               Sort Key: bv_bookgenres.book_id
               ->  Merge Join  (cost=582.45..2861.57 rows=46937 width=4) (actual time=46.000..501.000 rows=45082 loops=1)
                     Merge Cond: ("outer".genre_id = "inner".genre_child_id)
                     ->  Index Scan using genre_id on bv_bookgenres  (cost=0.00..1462.84 rows=45082 width=8) (actual time=0.000..158.000 rows=45082 loops=1)
                     ->  Sort  (cost=582.45..598.09 rows=6256 width=2) (actual time=46.000..77.000 rows=49815 loops=1)
                           Sort Key: bv_genre_children.genre_child_id
                           ->  Index Scan using genre_id2 on bv_genre_children  (cost=0.00..187.98 rows=6256 width=2) (actual time=0.000..31.000 rows=6379 loops=1)
                                 Index Cond: (genre_id = 1)
Total runtime: 703.000 ms
-------------------------------

----- CONF SETTINGS -----
shared_buffers = 1000        # min 16, at least max_connections*2, 8KB each
sort_mem = 10000
#work_mem = 1024        # min 64, size in KB
#maintenance_work_mem = 16384    # min 1024, size in KB
#max_stack_depth = 2048        # min 100, size in KB
-------------------------

Have you VACUUM ANALYZEd recently? If not, do that and then rerun the
EXPLAIN ANALYZE.

You might want to bump shared_buffers. You have 512MB RAM, right? You
probably want to bump shared_buffers to 10000, restart PG, then run a
VACUUM ANALYZE. Then rerun the EXPLAIN ANALYZE.
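
Concretely, something like this (a sketch; 10000 buffers is roughly
80MB at 8KB per buffer):

# in postgresql.conf
shared_buffers = 10000        # was 1000

-- then, after restarting PostgreSQL:
VACUUM ANALYZE;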

If that doesn't help, try doing a

ALTER TABLE bv_genre_children ALTER COLUMN genre_child_id SET STATISTICS 100;

followed by a:

VACUUM ANALYZE bv_genre_children;

You might also want to tweak the effective_cache_size parameter in
postgresql.conf, but I am unsure how this would work on Windows. Does
Windows have a kernel disk cache, anyone?




HTH

Nick






Re: Simple join in PostgreSQL takes too long

From
Rod Taylor
Date:
On Tue, 2004-04-27 at 17:27, Vitaly Belman wrote:
> Hello pgsql-performance,
>
>   I discussed this subject for some time on DevShed and didn't
>   achieve much in the way of results. I wonder if any of you can help
>   out:
>
>   http://forums.devshed.com/t136202/s.html

You're taking the wrong approach. Rather than using a select query to
ensure that the book_id is distinct, add a constraint to the table so
that it is guaranteed:

        CREATE UNIQUE INDEX bv_bookgenres_unq ON bv_bookgenres(book_id, genre_id);

Now you can do a simple join (drop the DISTINCT keyword) and achieve the
same results.

The point is that a book cannot be of a certain genre more than once.

Without the DISTINCT, this should take a few milliseconds to execute.
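
As a sketch of that suggestion (though see Manfred's follow-up below:
with a genre hierarchy the DISTINCT turns out to be needed after all):

CREATE UNIQUE INDEX bv_bookgenres_unq ON bv_bookgenres(book_id, genre_id);

SELECT book_id
FROM bv_bookgenres, bv_genre_children
WHERE bv_bookgenres.genre_id = bv_genre_children.genre_child_id AND
      bv_genre_children.genre_id = 1
LIMIT 10;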



Re: Simple join in PostgreSQL takes too long

From
Josh Berkus
Date:
Vitaly,

I'm afraid that your helper on DevShed is right; 7.5 for Windows is still in
development, and we haven't even *started* checking it for performance yet.

Since the Merge Join is taking 90% of your query time, I might suggest
increasing shared_buffers and sort_mem to see if that helps.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Simple join in PostgreSQL takes too long

From
Manfred Koizar
Date:
On Tue, 27 Apr 2004 18:01:34 -0400, Rod Taylor <pg@rbt.ca> wrote:
>On Tue, 2004-04-27 at 17:27, Vitaly Belman wrote:
>> Hello pgsql-performance,
>>
>>   I discussed this subject for some time on DevShed and didn't
>>   achieve much in the way of results. I wonder if any of you can help
>>   out:
>>
>>   http://forums.devshed.com/t136202/s.html

>The point is that a book cannot be of a certain genre more than once.

Rod, he has a hierarchy of genres.  Genre 1 has 6379 child genres and a
book can be in more than one of these.

Vitaly, though LIMIT makes this look like a small query, DISTINCT
requires the whole result set to be retrieved.  0.7 seconds doesn't look
so bad for several thousand rows.  Did you try with other genre_ids?

Maybe a merge join is not the best choice.  Set enable_mergejoin to
false and see whether you get a (hopefully faster) hash join, assuming
that sort_mem is large enough to keep the hash table in memory.
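
For example (a sketch, using the table names from Nick's post):

SET enable_mergejoin TO off;

EXPLAIN ANALYZE
SELECT DISTINCT book_id
FROM bv_bookgenres, bv_genre_children
WHERE bv_bookgenres.genre_id = bv_genre_children.genre_child_id AND
      bv_genre_children.genre_id = 1
LIMIT 10;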

If you send me your table contents I'll try it on Linux.

Servus
 Manfred

Re: Simple join in PostgreSQL takes too long

From
Rod Taylor
Date:
> Rod, he has a hierarchy of genres.  Genre 1 has 6379 child genres and a
> book can be in more than one of these.

     bookgenres.genre_id = genre_children.genre_child_id AND
     genre_children.genre_id = 1

I see, sorry. I didn't notice the genre_child_id in the WHERE clause;
at first glance I read them all as genre_id.

When I run into this I usually create a third table, managed by triggers,
that relates the book to all genre entries. Inserts take a little
longer, but the selects can still be very quick.
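
A minimal sketch of that approach (the bv_book_genre_flat table, its
trigger, and their names are illustrative, not from the thread):

-- flat side table relating each book directly to every parent genre
CREATE TABLE bv_book_genre_flat (
    book_id  INT NOT NULL,
    genre_id INT NOT NULL,
    UNIQUE (genre_id, book_id)
);

CREATE OR REPLACE FUNCTION flatten_book_genres() RETURNS trigger AS '
BEGIN
    -- relate the new (book, child genre) row to every parent genre,
    -- skipping pairs that are already present
    INSERT INTO bv_book_genre_flat (book_id, genre_id)
    SELECT NEW.book_id, g.genre_id
      FROM bv_genre_children AS g
     WHERE g.genre_child_id = NEW.genre_id
       AND NOT EXISTS (SELECT 1 FROM bv_book_genre_flat AS f
                        WHERE f.book_id = NEW.book_id
                          AND f.genre_id = g.genre_id);
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER bv_bookgenres_flatten
    AFTER INSERT ON bv_bookgenres
    FOR EACH ROW EXECUTE PROCEDURE flatten_book_genres();

-- the select can then walk the (genre_id, book_id) index directly:
SELECT book_id FROM bv_book_genre_flat WHERE genre_id = 1 LIMIT 10;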

The plpgsql below forces the kind of algorithm we wish the planner could
choose. It should be fairly quick regardless of the dataset.


CREATE OR REPLACE FUNCTION book_results(numeric) RETURNS SETOF numeric
AS
'
DECLARE
  v_genre ALIAS FOR $1;
  v_limit integer = 10;       -- stop after this many distinct books
  t_rows RECORD;
  v_transmitted integer = 0;  -- distinct book_ids returned so far

  -- book_ids already sent back; must start empty, or the book_id it
  -- was seeded with would never be returned
  v_transmitted_values numeric[] = ''{}'';

BEGIN
  FOR t_rows IN SELECT book_id
                  FROM bv_bookgenres AS b
                  JOIN bv_genre_children AS g
                    ON (b.genre_id = g.genre_child_id)
                 WHERE g.genre_id = v_genre
  LOOP
    -- If this is a new value, transmit it to the end user
    IF NOT t_rows.book_id = ANY(v_transmitted_values) THEN
      v_transmitted_values := array_append(v_transmitted_values,
                                           t_rows.book_id::numeric);
      v_transmitted := v_transmitted + 1;
      RETURN NEXT t_rows.book_id;
    END IF;

    EXIT WHEN v_transmitted >= v_limit;
  END LOOP;

  RETURN;
END;
' LANGUAGE plpgsql;

EXPLAIN ANALYZE SELECT * FROM book_results(1);
SELECT * FROM book_results(1);


Re: Simple join in PostgreSQL takes too long

From
Manfred Koizar
Date:
On Wed, 28 Apr 2004 08:23:35 -0400, Rod Taylor <pg@rbt.ca> wrote:
>The plpgsql below forces the kind of algorithm we wish the planner could
>choose. It should be fairly quick regardless of the dataset.

That reminds me of hash aggregation.  So here's another idea for Vitaly:

    SELECT book_id
      FROM ...
     WHERE ...
     GROUP BY book_id
     LIMIT ...

Servus
 Manfred

Re: Simple join in PostgreSQL takes too long

From
Rod Taylor
Date:
On Thu, 2004-04-29 at 13:13, Manfred Koizar wrote:
> On Wed, 28 Apr 2004 08:23:35 -0400, Rod Taylor <pg@rbt.ca> wrote:
> >The plpgsql below forces the kind of algorithm we wish the planner could
> >choose. It should be fairly quick regardless of the dataset.
>
> That reminds me of hash aggregation.  So here's another idea for Vitaly:

The reason for the function is that the sort routines (hash aggregation
included) will not stop in mid-sort, although I believe that feature is
on the TODO list.

I believe Vitaly will achieve 10ms or less query times using that
function.



Re: Simple join in PostgreSQL takes too long

From
Manfred Koizar
Date:
On Thu, 29 Apr 2004 13:36:47 -0400, Rod Taylor <pg@rbt.ca> wrote:
>The reason for the function is that the sort routines (hash aggregation
>included) will not stop in mid-sort

Good point.

Servus
 Manfred

Re: Simple join in PostgreSQL takes too long

From
Vitaly Belman
Date:
Hello Manfred,

I thank everyone for helping me on this - so many tips.

I am in the middle of going through them all; so far, disabling
enable_mergejoin has really helped.

Also, I agree that the design might be flawed (I could use triggers
and the like), but for now I am just comparing how my project
will run on PostgreSQL (considering migration from MySQL).

I'll be reporting back on how the other stuff helped.

Regards,
 Vitaly Belman

 ICQ: 1912453
 AIM: VitalyB1984
 MSN: tmdagent@hotmail.com
 Yahoo!: VitalyBe

Wednesday, April 28, 2004, 11:24:41 AM, you wrote:

MK> On Tue, 27 Apr 2004 18:01:34 -0400, Rod Taylor <pg@rbt.ca> wrote:
>>On Tue, 2004-04-27 at 17:27, Vitaly Belman wrote:
>>> Hello pgsql-performance,
>>>
>>>   I discussed this subject for some time on DevShed and didn't
>>>   achieve much in the way of results. I wonder if any of you can help
>>>   out:
>>>
>>>   http://forums.devshed.com/t136202/s.html

>>The point is that a book cannot be of a certain genre more than once.

MK> Rod, he has a hierarchy of genres.  Genre 1 has 6379 child genres and a
MK> book can be in more than one of these.

MK> Vitaly, though LIMIT makes this look like a small query, DISTINCT
MK> requires the whole result set to be retrieved.  0.7 seconds doesn't look
MK> so bad for several thousand rows.  Did you try with other genre_ids?

MK> Maybe a merge join is not the best choice.  Set enable_mergejoin to
MK> false and see whether you get a (hopefully faster) hash join, assuming
MK> that sort_mem is large enough to keep the hash table in memory.

MK> If you send me your table contents I'll try it on Linux.

MK> Servus
MK>  Manfred


Re: Simple join in PostgreSQL takes too long

From
Josh Berkus
Date:
Vitaly,

> I am in the middle of going through them all; so far, disabling
> enable_mergejoin has really helped.

In that case, your random_page_cost is probably too low. Check the ratio of
per-tuple times on index vs. seqscan seeks.
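
For example (a sketch; 4 is the default, so raise it back toward that
if it was lowered):

SET random_page_cost = 4;
-- then re-run the EXPLAIN ANALYZE from earlier in the thread and
-- compare the per-tuple times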

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco