Thread: Simple join in PostgreSQL takes too long
Hello pgsql-performance,

I discussed the whole subject for some time on DevShed but didn't achieve much in the way of results. I wonder if any of you guys can help out: http://forums.devshed.com/t136202/s.html

Regards,
Vitaly Belman

ICQ: 1912453
AIM: VitalyB1984
MSN: tmdagent@hotmail.com
Yahoo!: VitalyBe
Hi,

You can try a variation:

SELECT book_id
FROM bookgenres, genre_children
WHERE bookgenres.genre_id = genre_children.genre_child_id
  AND genre_children.genre_id = 1
GROUP BY book_id
LIMIT 10;

The next one works if 'genre_child_id' is UNIQUE in the 'genre_children' table:

SELECT book_id
FROM bookgenres
WHERE bookgenres.genre_id = (SELECT genre_child_id FROM genre_children WHERE genre_id = 1)
GROUP BY book_id
LIMIT 10;

You may also need an index. Try these with EXPLAIN!

CREATE INDEX bookgenres_genre_id_book_id ON bookgenres(genre_id, book_id);
or
CREATE INDEX bookgenres_book_id_genre_id ON bookgenres(book_id, genre_id);

CREATE INDEX genre_children_genre_id ON genre_children(genre_id);

Regards,
Antal Attila
Vitaly Belman wrote:
> Hello pgsql-performance,
>
> I discussed the whole subject for some time in DevShed and didn't
> achieve much (as for results). I wonder if any of you guys can help
> out:
>
> http://forums.devshed.com/t136202/s.html

So cutting and pasting:

----- SCHEMA -----
CREATE TABLE bv_bookgenres (
    book_id  INT NOT NULL,
    genre_id INT NOT NULL
);

CREATE TABLE bv_genre_children (
    genre_id       INT,
    genre_child_id INT
);
-------------------

----- QUERY -----
SELECT DISTINCT book_id
FROM bookgenres, genre_children
WHERE bookgenres.genre_id = genre_children.genre_child_id
  AND genre_children.genre_id = 1
LIMIT 10;
-----------------

----- EXPLAIN ANALYZE -----
QUERY PLAN
Limit  (cost=6503.51..6503.70 rows=10 width=4) (actual time=703.000..703.000 rows=10 loops=1)
  ->  Unique  (cost=6503.51..6738.20 rows=12210 width=4) (actual time=703.000..703.000 rows=10 loops=1)
        ->  Sort  (cost=6503.51..6620.85 rows=46937 width=4) (actual time=703.000..703.000 rows=24 loops=1)
              Sort Key: bv_bookgenres.book_id
              ->  Merge Join  (cost=582.45..2861.57 rows=46937 width=4) (actual time=46.000..501.000 rows=45082 loops=1)
                    Merge Cond: ("outer".genre_id = "inner".genre_child_id)
                    ->  Index Scan using genre_id on bv_bookgenres  (cost=0.00..1462.84 rows=45082 width=8) (actual time=0.000..158.000 rows=45082 loops=1)
                    ->  Sort  (cost=582.45..598.09 rows=6256 width=2) (actual time=46.000..77.000 rows=49815 loops=1)
                          Sort Key: bv_genre_children.genre_child_id
                          ->  Index Scan using genre_id2 on bv_genre_children  (cost=0.00..187.98 rows=6256 width=2) (actual time=0.000..31.000 rows=6379 loops=1)
                                Index Cond: (genre_id = 1)
Total runtime: 703.000 ms
-------------------------------

----- CONF SETTINGS -----
shared_buffers = 1000           # min 16, at least max_connections*2, 8KB each
sort_mem = 10000
#work_mem = 1024                # min 64, size in KB
#maintenance_work_mem = 16384   # min 1024, size in KB
#max_stack_depth = 2048         # min 100, size in KB
-------------------------

Have you VACUUM ANALYZEd recently? If not, do that and then rerun the EXPLAIN ANALYZE.

You might want to bump shared_buffers. You have 512MB RAM, right? You probably want to raise shared_buffers to 10000, restart PG, then run a VACUUM ANALYZE and rerun the EXPLAIN ANALYZE.

If that doesn't help, try:

ALTER TABLE bv_genre_children ALTER COLUMN genre_child_id SET STATISTICS 100;

followed by:

VACUUM ANALYZE bv_genre_children;

You might also want to tweak the effective_cache_size parameter in postgresql.conf, but I am unsure how this works on Windows. Does Windows have a kernel disk cache, anyone?

HTH
Nick
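For concreteness, the suggested postgresql.conf changes might look like this (the effective_cache_size value is only illustrative, assuming the ~512MB RAM mentioned above):

shared_buffers = 10000          # ~80MB at 8KB per buffer, up from 1000
sort_mem = 10000                # per-sort memory in KB (unchanged)
effective_cache_size = 40000    # in 8KB pages, ~320MB: a rough guess at the OS disk cache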
On Tue, 2004-04-27 at 17:27, Vitaly Belman wrote:
> Hello pgsql-performance,
>
> I discussed the whole subject for some time in DevShed and didn't
> achieve much (as for results). I wonder if any of you guys can help
> out:
>
> http://forums.devshed.com/t136202/s.html

You're taking the wrong approach. Rather than using a SELECT query to ensure that the book_id is distinct, add a constraint to the table so that it is guaranteed:

CREATE UNIQUE INDEX bv_bookgenres_unq ON bv_bookgenres(book_id, genre_id);

Now you can do a simple join (drop the DISTINCT keyword) and achieve the same results. The point is that a book cannot be of a certain genre more than once. Without the DISTINCT, this should take a matter of a few milliseconds to execute.
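A minimal sketch of the rewritten query under that assumption (the aliases are illustrative, not from the thread):

-- Same join as the original query, minus DISTINCT; relies on the unique
-- index above to keep (book_id, genre_id) pairs from repeating
SELECT book_id
  FROM bv_bookgenres b
  JOIN bv_genre_children g ON b.genre_id = g.genre_child_id
 WHERE g.genre_id = 1
 LIMIT 10;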
Vitaly,

I'm afraid that your helper on DevShed is right; 7.5 for Windows is still in development, and we've not even *started* to check it for performance yet.

Since the Merge Join is taking 90% of your query time, I suggest increasing shared_buffers and sort_mem to see if that helps.

--
-Josh Berkus
Aglio Database Solutions
San Francisco
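Of the two, sort_mem can be tried per session without a restart; a quick sketch (the value is just an example to experiment with):

SET sort_mem = 20000;  -- per-sort memory in KB, for this session only
-- shared_buffers cannot be changed per session; it needs a postgresql.conf
-- edit and a server restart, as Nick noted above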
On Tue, 27 Apr 2004 18:01:34 -0400, Rod Taylor <pg@rbt.ca> wrote:
>On Tue, 2004-04-27 at 17:27, Vitaly Belman wrote:
>> Hello pgsql-performance,
>>
>> I discussed the whole subject for some time in DevShed and didn't
>> achieve much (as for results). I wonder if any of you guys can help
>> out:
>>
>> http://forums.devshed.com/t136202/s.html
>The point is that a book cannot be of a certain genre more than once.

Rod, he has a hierarchy of genres. Genre 1 has 6379 child genres, and a book can be in more than one of these.

Vitaly, though LIMIT makes this look like a small query, DISTINCT requires the whole result set to be retrieved. 0.7 seconds doesn't look so bad for several thousand rows. Did you try with other genre_ids?

Maybe a merge join is not the best choice. Set enable_mergejoin to false and see whether you get a (hopefully faster) hash join, assuming that sort_mem is large enough to keep the hash table in memory.

If you send me your table contents I'll try it on Linux.

Servus
Manfred
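For reference, the experiment looks like this, using the original query from upthread (session-local only; nothing in the config changes):

SET enable_mergejoin TO false;  -- steer the planner away from merge joins

EXPLAIN ANALYZE
SELECT DISTINCT book_id
  FROM bv_bookgenres, bv_genre_children
 WHERE bv_bookgenres.genre_id = bv_genre_children.genre_child_id
   AND bv_genre_children.genre_id = 1
 LIMIT 10;

RESET enable_mergejoin;         -- put the planner back to normal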
> Rod, he has a hierarchy of genres. Genre 1 has 6379 child genres and a
> book can be in more than one of these.

    bookgenres.genre_id = genre_children.genre_child_id AND
    genre_children.genre_id = 1

I see, sorry. I didn't notice the genre_child_id in the WHERE clause. At first glance I read them all as genre_id.

When I run into this I usually create a third table, managed by triggers, that relates the book directly to all genre entries. Inserts take a little longer, but the selects can still be very quick.

The plpgsql below forces the kind of algorithm we wish the planner could choose. It should be fairly quick regardless of the dataset.

CREATE OR REPLACE FUNCTION book_results(numeric) RETURNS SETOF numeric AS '
DECLARE
    v_genre              ALIAS FOR $1;
    v_limit              integer   := 10;
    t_rows               RECORD;
    v_transmitted        integer   := 0;
    v_transmitted_values numeric[] := ''{}'';  -- start empty; seeding with
                                               -- ARRAY[1] would wrongly skip
                                               -- book_id 1
BEGIN
    FOR t_rows IN SELECT book_id
                    FROM bv_bookgenres AS b
                    JOIN bv_genre_children AS g ON (b.genre_id = g.genre_child_id)
                   WHERE g.genre_id = v_genre
    LOOP
        -- If this is a new value, transmit it to the end user
        IF NOT t_rows.book_id = ANY(v_transmitted_values) THEN
            v_transmitted_values := array_append(v_transmitted_values, t_rows.book_id);
            v_transmitted := v_transmitted + 1;
            RETURN NEXT t_rows.book_id;
        END IF;

        EXIT WHEN v_transmitted >= v_limit;
    END LOOP;

    RETURN;
END;
' LANGUAGE plpgsql;

EXPLAIN ANALYZE SELECT * FROM book_results(1);
SELECT * FROM book_results(1);
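The trigger-maintained third table mentioned above might look something like this. It is an insert-only sketch (delete/update handling omitted), and the table, function, and trigger names are hypothetical, not from the thread:

-- Hypothetical flattened lookup table: one row per (top-level genre, book)
CREATE TABLE bv_book_flat_genres (
    book_id  INT NOT NULL,
    genre_id INT NOT NULL
);
CREATE INDEX bv_book_flat_genres_idx ON bv_book_flat_genres(genre_id, book_id);

CREATE OR REPLACE FUNCTION flatten_book_genre() RETURNS trigger AS '
BEGIN
    -- map the new (book, child genre) row to every parent genre,
    -- skipping pairs that are already present
    INSERT INTO bv_book_flat_genres (book_id, genre_id)
    SELECT NEW.book_id, gc.genre_id
      FROM bv_genre_children gc
     WHERE gc.genre_child_id = NEW.genre_id
       AND NOT EXISTS (SELECT 1
                         FROM bv_book_flat_genres f
                        WHERE f.book_id = NEW.book_id
                          AND f.genre_id = gc.genre_id);
    RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER bv_bookgenres_flatten AFTER INSERT ON bv_bookgenres
    FOR EACH ROW EXECUTE PROCEDURE flatten_book_genre();

-- With that in place, the select needs no DISTINCT at all:
SELECT book_id FROM bv_book_flat_genres WHERE genre_id = 1 LIMIT 10;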
On Wed, 28 Apr 2004 08:23:35 -0400, Rod Taylor <pg@rbt.ca> wrote:
>The plpgsql below forces the kind of algorithm we wish the planner could
>choose. It should be fairly quick regardless of the dataset.

That reminds me of hash aggregation. So here's another idea for Vitaly:

SELECT book_id
FROM ...
WHERE ...
GROUP BY book_id
LIMIT ...

Servus
Manfred
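Spelled out against the tables from upthread (essentially the rewrite Antal suggested earlier), with a check on which plan gets chosen:

EXPLAIN ANALYZE
SELECT book_id
  FROM bv_bookgenres b, bv_genre_children g
 WHERE b.genre_id = g.genre_child_id
   AND g.genre_id = 1
 GROUP BY book_id
 LIMIT 10;
-- a HashAggregate node in the plan means the hash path was taken; a
-- Sort/Group pair instead may mean sort_mem is too small for the hash table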
On Thu, 2004-04-29 at 13:13, Manfred Koizar wrote:
> On Wed, 28 Apr 2004 08:23:35 -0400, Rod Taylor <pg@rbt.ca> wrote:
> >The plpgsql below forces the kind of algorithm we wish the planner could
> >choose. It should be fairly quick regardless of the dataset.
>
> That reminds me of hash aggregation. So here's another idea for Vitaly:

The reason for the function is that the sort routines (hash aggregation included) will not stop in mid-sort, although I believe that feature is on the TODO list. I believe Vitaly will achieve query times of 10ms or less using that function.
On Thu, 29 Apr 2004 13:36:47 -0400, Rod Taylor <pg@rbt.ca> wrote:
>The reason for the function is that the sort routines (hash aggregation
>included) will not stop in mid-sort

Good point.

Servus
Manfred
Hello Manfred,

I thank everyone for helping me on this - so many tips. I am in the middle of going through them all; so far, disabling enable_mergejoin has really helped.

Also, I agree that the design might be flawed (I could use triggers and the like), but for now I am just comparing how my project runs on PostgreSQL (considering migration from MySQL).

I'll be reporting back on how the other suggestions helped.

Regards,
Vitaly Belman

ICQ: 1912453
AIM: VitalyB1984
MSN: tmdagent@hotmail.com
Yahoo!: VitalyBe
Vitaly,

> I am in the middle of going through them all; so far, disabling
> enable_mergejoin has really helped.

In that case, your random_page_cost is probably too low. Check the ratio of per-tuple times on index scans vs. seqscans.

--
-Josh Berkus
Aglio Database Solutions
San Francisco
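A session-level way to test that theory (4 is the stock default; the value here is just a starting point, not a recommendation):

-- If random_page_cost was lowered in postgresql.conf, moving it back up
-- makes index scans look more expensive to the planner and can favor the
-- hash join that proved faster here (session-local experiment)
SET random_page_cost = 4;
-- then rerun the EXPLAIN ANALYZE from upthread and compare the plans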