Thread: PostgreSQL caching
Hello,

I have the following problem:

When I run some query just after starting the postmaster, it takes several seconds to execute (sometimes more than 10); if I rerun it again afterwards, it takes mere milliseconds.

So I guess it has to do with PostgreSQL caching... But how exactly does it work? What does it cache? And how can I control it?

I would like to load selected information into memory before a user runs the query. Can I do it somehow? As PostgreSQL is used in my case to back a website, it doesn't really help if the user has to wait 10 seconds every time he goes to a new page (even if refreshing the page would be really quick, since PostgreSQL has already loaded the data into memory).

P.S. If the query or its EXPLAIN are needed for a better understanding, let me know.

Regards,
Vitaly Belman

ICQ: 1912453
AIM: VitalyB1984
MSN: tmdagent@hotmail.com
Yahoo!: VitalyBe
while you weren't looking, Vitaly Belman wrote:

> So I guess it has to do with PostgreSQL caching... But how exactly
> does it work? What does it cache? And how can I control it?

PostgreSQL uses the operating system's disk cache. You can hint to the postmaster how much memory is available for caching with the effective_cache_size directive in your postgresql.conf. If you're running a *nix OS, you can find this by watching `top` for a while; in the header, there's a "cached" value (or something to that effect). Watching this value, you can determine a rough average and set your effective_cache_size to that rough average, or perhaps slightly less. I'm not sure how to get this value on Windows.

Pgsql uses the OS's disk cache instead of its own cache management because the former is more likely to persist. If the postmaster managed the cache, as soon as the last connection died, the memory allocated for caching would be released, and all the cached data would be lost. Relying instead on the OS to cache data means that, whether or not there's a postmaster, so long as there has been one, there'll be some data cached.

You can "prepopulate" the OS disk cache by periodically running a handful of SELECT queries that pull from your most commonly accessed tables in a background process. (A good way of doing that is simply to run your most commonly executed SELECTs.) Those queries should take the performance hit of fetching from disk, while your regular queries hit the cache.

/rls

--
Rosser Schwarz
Total Card, Inc.
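For concreteness, a minimal sketch of the knob described above; the figure shown is an illustrative assumption, not a recommendation (on 7.x the setting is expressed in 8 KB pages):

-- postgresql.conf (illustrative value only):
--   effective_cache_size = 32768    # 8 KB pages, i.e. roughly 256 MB
-- Then, from psql, verify what the running server is using:
SHOW effective_cache_size;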
Vitaly Belman wrote:
> Hello,
>
> I have the following problem:
>
> When I run some query just after starting the postmaster, it takes
> several seconds to execute (sometimes more than 10); if I rerun it
> again afterwards, it takes mere milliseconds.
>
> So I guess it has to do with PostgreSQL caching... But how exactly
> does it work? What does it cache? And how can I control it?

There are two areas of cache - PostgreSQL's shared buffers and the operating system's disk cache. You can't directly control what data is cached; it just keeps track of recently used data. It sounds like PG isn't being used for a while, so your OS decides to use its cache for webserver files.

> I would like to load selected information into memory before a user
> runs the query. Can I do it somehow? As PostgreSQL is used in my case
> to back a website, it doesn't really help if the user has to wait 10
> seconds every time he goes to a new page (even if refreshing the page
> would be really quick, since PostgreSQL has already loaded the data
> into memory).

If you could "pin" data in the cache it would run quicker, but at the cost of everything else running slower.

Suggested steps:
1. Read the configuration/tuning guide at:
   http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php
2. Post a sample query/explain analyse that runs very slowly when not cached.
3. If needs be, you can write a simple timed script that performs a query, as sketched below. Or, the autovacuum daemon might be what you want.

--
Richard Huxton
Archonet Ltd
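A minimal sketch of the timed script from step 3, assuming the table names that appear later in this thread; how it is scheduled (cron, Windows Task Scheduler) is left open:

-- warm_cache.sql: touch the hot tables so their blocks stay in the cache.
-- Run periodically, e.g.:  psql -f warm_cache.sql yourdb
SELECT count(*) FROM bv_books;        -- a full scan pulls every block of the table
SELECT count(*) FROM bv_bookgenres;   -- through the buffer/OS cache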
Hello Richard and Rosser,

Thank you both for the answers.

I tried creating a semi-cache by running all the queries, and indeed it worked, and I might use such a method in the future if needed. Still, I can't help but feel it isn't exactly the right way to work around this problem. If I do that, I might as well increase the effective_cache_size value as suggested by the config docs.

Also on this subject: previously I was fighting with queries that run poorly even if you run them ten times in a row - they don't seem to be cached at all. Does it cache the query result? If so, any query should run almost immediately the second time. If it doesn't cache the actual result, what does it cache?

If you'll be so kind, though, I'd be glad if you could spot anything to speed up in this query. Here's the query and its plan, taken without any caching:

-------------------------------------------------------------------------------------------------------------
QUERY
-----
SELECT bv_books.*,
       vote_avg,
       vote_count
FROM   bv_bookgenres,
       bv_books
WHERE  bv_books.book_id = bv_bookgenres.book_id AND
       bv_bookgenres.genre_id = 5830
ORDER BY vote_avg DESC LIMIT 10 OFFSET 0;

QUERY PLAN
----------
Limit  (cost=2337.41..2337.43 rows=10 width=76) (actual time=7875.000..7875.000 rows=10 loops=1)
  ->  Sort  (cost=2337.41..2337.94 rows=214 width=76) (actual time=7875.000..7875.000 rows=10 loops=1)
        Sort Key: bv_books.vote_avg
        ->  Nested Loop  (cost=0.00..2329.13 rows=214 width=76) (actual time=16.000..7844.000 rows=1993 loops=1)
              ->  Index Scan using i_bookgenres_genre_id on bv_bookgenres  (cost=0.00..1681.54 rows=214 width=4) (actual time=16.000..3585.000 rows=1993 loops=1)
                    Index Cond: (genre_id = 5830)
              ->  Index Scan using bv_books_pkey on bv_books  (cost=0.00..3.01 rows=1 width=76) (actual time=2.137..2.137 rows=1 loops=1993)
                    Index Cond: (bv_books.book_id = "outer".book_id)
Total runtime: 7875.000 ms
-------------------------------------------------------------------------------------------------------------

Some general information:

bv_books holds 17000 rows.
bv_bookgenres holds 938797 rows.

The WHERE clause (genre_id = 5830) cuts the number of book_ids to around 2000. As far as indexes are concerned, there's an index on all the columns mentioned in the query (as can be seen from the explain), including the vote_avg column.

Thanks and regards,
Vitaly Belman

ICQ: 1912453
AIM: VitalyB1984
MSN: tmdagent@hotmail.com
Yahoo!: VitalyBe
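An aside on the "does it cache the query result?" question above: PostgreSQL caches table and index pages, not query results, so a repeated query is fast only because its pages are already in memory. Modern versions (9.0 and later, so not the server discussed in this thread) can show this per plan node:

-- BUFFERS is a present-day illustration; it did not exist when this thread was written.
-- "shared hit" = page found in PostgreSQL's buffer cache; "read" = fetched from the OS or disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT bv_books.*, vote_avg, vote_count
FROM   bv_bookgenres, bv_books
WHERE  bv_books.book_id = bv_bookgenres.book_id AND
       bv_bookgenres.genre_id = 5830
ORDER BY vote_avg DESC LIMIT 10 OFFSET 0;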
dev@archonet.com (Richard Huxton) writes:
> If you could "pin" data in the cache it would run quicker, but at the
> cost of everything else running slower.
>
> Suggested steps:
> 1. Read the configuration/tuning guide at:
>    http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php
> 2. Post a sample query/explain analyse that runs very slowly when not
>    cached.
> 3. If needs be, you can write a simple timed script that performs a
>    query. Or, the autovacuum daemon might be what you want.

I don't think this case will be anywhere near so simple to resolve.

I have seen this phenomenon occur when a query needs to pull a moderate number of blocks into memory to satisfy a query that involves some moderate number of rows.

Let's say you need 2000 rows, which fit into 400 blocks. The first time the query runs, it needs to pull those 400 blocks off disk, which requires 400 reads of 8K of data. That can easily take a few seconds of I/O. The second time, not only are those blocks cached, they are probably cached in the buffer cache, so that the I/O overhead disappears.

There's very likely no problem with the table statistics; they are leading to the right query plan, which happens to need to do 5 seconds of I/O to pull the data into memory.

What is essentially required is the "prescient caching algorithm," where the postmaster must consult /dev/esp in order to get a prediction of what blocks it may need to refer to in the next sixty seconds.

--
(format nil "~S@~S" "cbbrowne" "cbbrowne.com")
http://cbbrowne.com/info/linuxdistributions.html
"Normally, we don't do people's homework around here, but Venice is a very beautiful city, so I'll make a small exception." --- Robert Redelmeier compromises his principles
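The rough arithmetic behind that estimate, assuming a ~10 ms random-read latency typical of disks of the era:

  400 blocks * 8 KB          =  3.2 MB of data
  400 random reads * ~10 ms  =  ~4 seconds of I/O on a cold cache
  second run: all 400 blocks already cached, so the I/O cost drops to ~0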
> What is essentially required is the "prescient caching algorithm,"
> where the postmaster must consult /dev/esp in order to get a
> prediction of what blocks it may need to refer to in the next sixty
> seconds.

Easy enough. Television does it all the time with live shows. The guy with the buzzer always seems to know what will be said before they say it. All we need is a 5 to 10 second delay...
Rosser Schwarz wrote:
> PostgreSQL uses the operating system's disk cache.

... in addition to its own buffer cache, which is stored in shared memory. You're correct, though, in that the best practice is to keep the PostgreSQL cache small and give more memory to the operating system's disk cache.

> Pgsql uses the OS's disk cache instead of its own cache management
> because the former is more likely to persist. If the postmaster
> managed the cache, as soon as the last connection died, the memory
> allocated for caching would be released, and all the cached data
> would be lost.

No; the cache is stored in shared memory. It wouldn't persist over postmaster restarts (without some scheme of saving and restoring it), but that has nothing to do with why the OS disk cache is usually kept larger than the PG shared buffer cache.

-Neil
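Both layers Neil describes are visible from psql. The PostgreSQL-side cache is sized by shared_buffers (the commented value below is an illustrative 7.x-style setting, not advice), while effective_cache_size merely tells the planner how big the OS cache is likely to be:

-- postgresql.conf (illustrative):
--   shared_buffers = 1000    # 8 KB pages on 7.x, i.e. roughly 8 MB
SHOW shared_buffers;
SHOW effective_cache_size;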
Not knowing a whole lot about the internals of Pg, one thing jumped out at me: each trip to get data from bv_books took 2.137 ms, which came to over 4.2 seconds right there. The problem "seems" to be the 1993 times that the nested loop spins, as almost all of the time is spent there.

Personally, I am amazed that it takes 3.585 seconds to index scan i_bookgenres_genre_id. Is that a composite index? Analyzing the tables may help, as the optimizer appears to mispredict the number of rows returned.

I would be curious to see how it performs with an "IN" clause, which I would suspect would go quite a bit faster. Try the following:

SELECT bv_books.*,
       vote_avg,
       vote_count
FROM   bv_bookgenres,
       bv_books
WHERE  bv_books.book_id IN (
           SELECT book_id
           FROM bv_genres
           WHERE bv_bookgenres.genre_id = 5830
       ) AND
       bv_bookgenres.genre_id = 5830
ORDER BY vote_avg DESC LIMIT 10 OFFSET 0;

In this query, all of the book_id values are pulled at once.

Who knows? If you get statistics on this, please post.

Marty
Hello Marty,

MS> Is that a composite index?

It is a regular btree index. What is a composite index?

MS> Analyzing the tables may help, as the optimizer appears to
MS> mispredict the number of rows returned.

I'll try analyzing, but I highly doubt that it would help. I analyzed once already and haven't changed the data since.

MS> I would be curious to see how it performs with an "IN" clause,
MS> which I would suspect would go quite a bit faster.

Actually it reached 20s before I canceled it... Here's the explain:

QUERY PLAN
----------
Limit  (cost=3561.85..3561.88 rows=10 width=76)
  ->  Sort  (cost=3561.85..3562.39 rows=214 width=76)
        Sort Key: bv_books.vote_avg
        ->  Nested Loop  (cost=1760.75..3553.57 rows=214 width=76)
              ->  Index Scan using i_bookgenres_genre_id on bv_bookgenres  (cost=0.00..1681.54 rows=214 width=0)
                    Index Cond: (genre_id = 5830)
              ->  Materialize  (cost=1760.75..1761.01 rows=26 width=76)
                    ->  Nested Loop  (cost=1682.07..1760.75 rows=26 width=76)
                          ->  HashAggregate  (cost=1682.07..1682.07 rows=26 width=4)
                                ->  Index Scan using i_bookgenres_genre_id on bv_bookgenres  (cost=0.00..1681.54 rows=214 width=4)
                                      Index Cond: (genre_id = 5830)
                          ->  Index Scan using bv_books_pkey on bv_books  (cost=0.00..3.01 rows=1 width=76)
                                Index Cond: (bv_books.book_id = "outer".book_id)

Thank you for trying, though.

Regards,
Vitaly Belman

ICQ: 1912453
AIM: VitalyB1984
MSN: tmdagent@hotmail.com
Yahoo!: VitalyBe
> Hello Marty,
>
> MS> Is that a composite index?
>
> It is a regular btree index. What is a composite index?

My apologies. A composite index is one that consists of multiple fields (aka a multicolumn index). The reason I ask is that it was spending almost half the time just searching bv_bookgenres, which seemed odd.

I may be speaking out of turn since I am not overly familiar with Pg's quirks and internals. A composite index, or any index of a large field, will lower the number of index items stored per btree node, thereby lowering the branching factor and increasing the tree depth. On tables with many rows, this can result in many more disk accesses for reading the index. An index btree that is 6 levels deep will require at least seven disk accesses (6 for the index, one for the table row) per row retrieved.

Not knowing the structure of the indexes, it's hard to say too much about it. The fact that a 1993 row select from an indexed table took 3.5 seconds caused me to take notice.

> MS> I would be curious to see how it performs with an "IN" clause,
> MS> which I would suspect would go quite a bit faster.
>
> Actually it reached 20s before I canceled it... Here's the explain:

I believe that. The code I posted had a nasty join bug. If my math is right, the query was trying to return 1993*1993, or just under 4 million rows.

I didn't see the table structure, but I assume that the vote_avg and vote_count fields are in bv_bookgenres. If no fields are actually needed from bv_bookgenres, then the query might be constructed in a way that only the index would be read, without loading any row data.

I think that you mentioned this was for a web app. Do you actually have a web page that displays 2000 rows of data?

Good luck,

Marty
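Marty's tree-depth point can be checked directly on modern PostgreSQL with the pageinspect extension; it did not exist in the 7.x era of this thread, so this is strictly a present-day aside:

CREATE EXTENSION pageinspect;
-- 'level' is the btree height above the leaf level; 0 means the root is a leaf (single page).
SELECT level FROM bt_metap('i_bookgenres_genre_id');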
Vitaly Belman wrote:
>
> If you'll be so kind, though, I'd be glad if you could spot anything to
> speed up in this query. Here's the query and its plan, taken without
> any caching:
>
> QUERY
> -----
> SELECT bv_books.*,
>        vote_avg,
>        vote_count
> FROM   bv_bookgenres,
>        bv_books
> WHERE  bv_books.book_id = bv_bookgenres.book_id AND
>        bv_bookgenres.genre_id = 5830
> ORDER BY vote_avg DESC LIMIT 10 OFFSET 0;
>
> QUERY PLAN
> ----------
> Limit  (cost=2337.41..2337.43 rows=10 width=76) (actual time=7875.000..7875.000 rows=10 loops=1)
>   ->  Sort  (cost=2337.41..2337.94 rows=214 width=76) (actual time=7875.000..7875.000 rows=10 loops=1)
>         Sort Key: bv_books.vote_avg
>         ->  Nested Loop  (cost=0.00..2329.13 rows=214 width=76) (actual time=16.000..7844.000 rows=1993 loops=1)
>               ->  Index Scan using i_bookgenres_genre_id on bv_bookgenres  (cost=0.00..1681.54 rows=214 width=4) (actual time=16.000..3585.000 rows=1993 loops=1)
>                     Index Cond: (genre_id = 5830)
>               ->  Index Scan using bv_books_pkey on bv_books  (cost=0.00..3.01 rows=1 width=76) (actual time=2.137..2.137 rows=1 loops=1993)
>                     Index Cond: (bv_books.book_id = "outer".book_id)
> Total runtime: 7875.000 ms

Presuming that vote_avg is a field in the table bv_bookgenres, try a composite index on genre_id and vote_avg and then see if you can use the LIMIT clause to reduce the number of loop iterations from 1993 to 10:

CREATE INDEX test_idx ON bv_bookgenres (genre_id, vote_avg);

The following query tries to force that execution plan and, presuming there is a foreign key relation between bv_books.book_id and bv_bookgenres.book_id, I expect it will give the same results, but be careful with NULLs:

SELECT bv_books.*,
       vote_avg,
       vote_count
FROM   (
           SELECT bg.*
           FROM bv_bookgenres bg
           WHERE bg.genre_id = 5830
           ORDER BY bg.vote_avg DESC
           LIMIT 10
       ) bv_bookgenres,
       bv_books
WHERE  bv_books.book_id = bv_bookgenres.book_id
ORDER BY vote_avg DESC
LIMIT 10;

Jochem

--
I don't get it
immigrants don't work
and steal our jobs
    - Loesje
Hello Jochem and Marty,

I guess I should have posted the table structure before =(:

Table structure + Indexes
-------------------------
CREATE TABLE public.bv_books
(
  book_id serial NOT NULL,
  book_title varchar(255) NOT NULL,
  series_id int4,
  series_index int2,
  annotation_desc_id int4,
  description_desc_id int4,
  book_picture varchar(255) NOT NULL,
  vote_avg float4 NOT NULL,
  vote_count int4 NOT NULL,
  CONSTRAINT bv_books_pkey PRIMARY KEY (book_id)
) WITH OIDS;
CREATE INDEX i_books_vote_avg ON public.bv_books USING btree (vote_avg);
CREATE INDEX i_books_vote_count ON public.bv_books USING btree (vote_count);
-------------------------
CREATE TABLE public.bv_bookgenres
(
  book_id int4 NOT NULL,
  genre_id int4 NOT NULL,
  CONSTRAINT bv_bookgenres_pkey PRIMARY KEY (book_id, genre_id),
  CONSTRAINT fk_bookgenres_book_id FOREIGN KEY (book_id)
    REFERENCES public.bv_books (book_id) ON UPDATE RESTRICT ON DELETE RESTRICT
) WITH OIDS;
CREATE INDEX i_bookgenres_book_id ON public.bv_bookgenres USING btree (book_id);
CREATE INDEX i_bookgenres_genre_id ON public.bv_bookgenres USING btree (genre_id);
-------------------------

MS> I didn't see the table structure, but I assume that the vote_avg and
MS> vote_count fields are in bv_bookgenres. If no fields are actually
MS> needed from bv_bookgenres, then the query might be constructed in a
MS> way that only the index would be read, without loading any row data.

I didn't understand you. vote_avg is stored in bv_books... So yes, the only thing I need from bv_bookgenres is the id of the book, but I can't store this info in bv_books because there is an N-to-N relationship between them - every book can belong to a number of genres... If that's what you meant.

MS> I think that you mentioned this was for a web app. Do you actually
MS> have a web page that displays 2000 rows of data?

Well... It is all "paginated". You can access 2000 items of the data (as there are actually 2000 books in the genre), but you only see 10 items at a time. I mean, probably no one would go over all 2000 books, but I can't just hide them =\.

JvD> Presuming that vote_avg is a field in the table bv_bookgenres,
JvD> try a composite index on genre_id and vote_avg and then see if
JvD> you can use the limit clause to reduce the number of loop
JvD> iterations from 1993 to 10.

I'm afraid your idea doesn't work in my case =\ - as above, vote_avg is in bv_books, not bv_bookgenres. Naturally I could eventually denormalize the data to gain a performance boost if this issue isn't solved in other ways. I'll keep your idea in mind anyway, thanks.

Once again, thanks for your feedback.

Regards,
Vitaly Belman

ICQ: 1912453
AIM: VitalyB1984
MSN: tmdagent@hotmail.com
Yahoo!: VitalyBe
Vitaly,

This looks like there might be some room for performance improvement...

> MS> I didn't see the table structure, but I assume
> MS> that the vote_avg and
> MS> vote_count fields are in bv_bookgenres.
>
> I didn't understand you. vote_avg is stored in bv_books.

Ok. That helps. The confusion (on my end) came from the SELECT clause of the query you provided:

> SELECT bv_books.*,
>        vote_avg,
>        vote_count

All fields from bv_books were selected (bv_books.*) along with vote_avg and vote_count. My assumption was that vote_avg and vote_count were therefore not in bv_books.

At any rate, a query with an IN clause should help quite a bit:

SELECT bv_books.*
FROM   bv_books
WHERE  bv_books.book_id IN (
           SELECT book_id
           FROM bv_genres
           WHERE bv_bookgenres.genre_id = 5830
       )
ORDER BY vote_avg DESC LIMIT 10 OFFSET 0;

Give it a whirl.

Marty
On Tue, 2004-05-25 at 15:53, Vitaly Belman wrote:
>> QUERY PLAN
>> ----------
>> Limit  (cost=2337.41..2337.43 rows=10 width=76) (actual time=7875.000..7875.000 rows=10 loops=1)
>>   ->  Sort  (cost=2337.41..2337.94 rows=214 width=76) (actual time=7875.000..7875.000 rows=10 loops=1)
>>         Sort Key: bv_books.vote_avg
>>         ->  Nested Loop  (cost=0.00..2329.13 rows=214 width=76) (actual time=16.000..7844.000 rows=1993 loops=1)
>>               ->  Index Scan using i_bookgenres_genre_id on bv_bookgenres  (cost=0.00..1681.54 rows=214 width=4) (actual time=16.000..3585.000 rows=1993 loops=1)
>>                     Index Cond: (genre_id = 5830)
>>               ->  Index Scan using bv_books_pkey on bv_books  (cost=0.00..3.01 rows=1 width=76) (actual time=2.137..2.137 rows=1 loops=1993)
>>                     Index Cond: (bv_books.book_id = "outer".book_id)
>> Total runtime: 7875.000 ms

A question and two experiments... what version of postgresql is this?

Try reindexing i_bookgenres_genre_id and capture the explain analyze for that. If it doesn't help, try doing set enable_indexscan = false and capture the explain analyze for that.

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
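Spelled out, Robert's two experiments look like this (enable_indexscan is a session-local planner switch, so it affects nothing but the current connection; the query is the one from earlier in the thread):

REINDEX INDEX i_bookgenres_genre_id;

EXPLAIN ANALYZE                        -- capture this plan after the reindex
SELECT bv_books.*, vote_avg, vote_count
FROM   bv_bookgenres, bv_books
WHERE  bv_books.book_id = bv_bookgenres.book_id AND
       bv_bookgenres.genre_id = 5830
ORDER BY vote_avg DESC LIMIT 10 OFFSET 0;

SET enable_indexscan = false;          -- then rerun the EXPLAIN ANALYZE above
SET enable_indexscan = true;           -- and restore the default afterwards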
Hello Marty, Nick and Robert,

NB> Depending on what version of PG you are running, IN might take a while
NB> to complete. If so, try an EXISTS instead.

RT> A question and two experiments... what version of postgresql is this?

I am using the newer 7.5dev native Windows port. For this reason I don't think that IN will cause any trouble (I read that this issue was resolved in 7.4).

MS> At any rate, a query with an IN clause should help quite a bit:
MS>
MS> SELECT bv_books.*
MS> FROM   bv_books
MS> WHERE  bv_books.book_id IN (
MS>            SELECT book_id
MS>            FROM bv_genres
MS>            WHERE bv_bookgenres.genre_id = 5830
MS>        )
MS> ORDER BY vote_avg DESC LIMIT 10 OFFSET 0;

It looks like it helps a bit (though you meant "FROM bv_bookgenres", right?). I can't tell you how MUCH it helped, though, for two reasons:

1) As soon as I run a query, it is cached in memory, and I can't find a good way to flush it out of there to test again except a full computer reset (shutting the postmaster down doesn't help). If you have a better idea on this, do tell me =\ (reminding again, I am on Windows).

2) I *think* I resolved this issue, at least for most of the genre_ids (I didn't go through them all, but tried a few with different book counts and the results looked quite good). The fault was partly mine: a few weeks ago I increased the statistics target for the genre_id column a bit too much (from 10 to 70). I was unsure how exactly it works (and still am), but it helped for a few genre_ids that had a high book count, yet it also hurt performance for the genres without as many books. I have now lowered the statistics target (to 58) and almost everything looks good now.

Because of that I'll stop working on that query for a while (unless you have some more performance tips on the subject). Big thanks to everyone who helped... and I might bring this issue up again later if it still causes too much trouble.

RT> Try reindexing i_bookgenres_genre_id and capture the explain
RT> analyze for that.

Is that what you meant: "REINDEX INDEX i_bookgenres_genre_id"? It returns no messages whatsoever =\, and I can't EXPLAIN it either.

RT> If it doesn't help, try doing set enable_indexscan = false and
RT> capture the explain analyze for that.
Here it is:

------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN
----------
Limit  (cost=41099.93..41099.96 rows=10 width=76) (actual time=6734.000..6734.000 rows=10 loops=1)
  ->  Sort  (cost=41099.93..41100.45 rows=208 width=76) (actual time=6734.000..6734.000 rows=10 loops=1)
        Sort Key: bv_books.vote_count
        ->  Merge Join  (cost=40229.21..41091.92 rows=208 width=76) (actual time=6078.000..6593.000 rows=1993 loops=1)
              Merge Cond: ("outer".book_id = "inner".book_id)
              ->  Sort  (cost=16817.97..16818.49 rows=208 width=4) (actual time=1062.000..1062.000 rows=1993 loops=1)
                    Sort Key: bv_bookgenres.book_id
                    ->  Seq Scan on bv_bookgenres  (cost=0.00..16809.96 rows=208 width=4) (actual time=0.000..1047.000 rows=1993 loops=1)
                          Filter: (genre_id = 5830)
              ->  Sort  (cost=23411.24..23841.04 rows=171918 width=76) (actual time=5016.000..5189.000 rows=171801 loops=1)
                    Sort Key: bv_books.book_id
                    ->  Seq Scan on bv_books  (cost=0.00..4048.18 rows=171918 width=76) (actual time=0.000..359.000 rows=171918 loops=1)
Total runtime: 6734.000 ms
------------------------------------------------------------------------------------------------------------------------------------------

Regards,
Vitaly Belman

ICQ: 1912453
AIM: VitalyB1984
MSN: tmdagent@hotmail.com
Yahoo!: VitalyBe
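For reference, the per-column statistics target Vitaly has been adjusting is set like this (the column and the values are taken from his description above):

ALTER TABLE bv_bookgenres ALTER COLUMN genre_id SET STATISTICS 70;
ANALYZE bv_bookgenres;   -- the new target only takes effect at the next ANALYZE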
Vitaly,

> I am using the newer 7.5dev native Windows port. For this reason I
> don't think that IN will cause any trouble (I read that this issue was
> resolved in 7.4).

Well, for performance, all bets are off for the dev Windows port. Last I checked, the Win32 team was still working on *stability* and hadn't yet even looked at performance. Not that you can't improve the query, just that it might not fix the problem.

Therefore ... your detailed feedback is appreciated, especially if you can compare stuff to the same database running on a Linux, Unix, or BSD machine.

--
Josh Berkus
Aglio Database Solutions
San Francisco
Hello Josh,

JB> Not that you can't improve the query, just that it might not fix
JB> the problem.

Yes, I'm aware it might be slower than the Linux version, but then, as you said, I can still improve the query (as I did with your help now). But true, if there's something awfully wrong with Win32 port performance, I might be doing some needless work...

JB> Therefore ... your detailed feedback is appreciated, especially if you can
JB> compare stuff to the same database running on a Linux, Unix, or BSD machine.

I can't easily install Linux right now... But I am considering using it through VMware. Do you think it would suffice as a comparison? From what I saw (e.g. http://usuarios.lycos.es/hernandp/articles/vpcvs.html), performance is bad only when it comes to graphics; otherwise it looks pretty good.

Regards,
Vitaly Belman

ICQ: 1912453
AIM: VitalyB1984
MSN: tmdagent@hotmail.com
Yahoo!: VitalyBe
> I can't easily install Linux right now... But I am considering using it
> through VMware. Do you think it would suffice as a comparison?

An interesting alternative that I've been using lately is coLinux (http://colinux.sf.net). It lets you run Linux inside Windows, and compared to VMware I find it remarkably faster and, when idle, less resource intensive. I have VMware, but if I'm only going to use a console-based program, coLinux seems to outperform it.

Note that it may simply be interactive processes that run better, because coLinux has a simpler interface and does not try to emulate the display hardware (therefore no X unless you use VMware). It seems, though, that there is less overhead, and if that's the case, then everything should run faster.

Also note that getting it installed is a little more work than VMware. If you're running it on a workstation that you use for normal day-to-day tasks, though, I think you'll like it, because you can detach the terminal and let it run in the background. When I do that, I often forget it is running because it produces such a low load on the system.

If you are going to give it a try, the one trick I used to get things going was to download the newest beta of WinPcap, and then the networking came up easily. Everything else was a piece of cake.

Matthew Nuzum | Makers of "Elite Content Management System"
www.followers.net | View samples of Elite CMS in action
matt@followers.net | http://www.followers.net/portfolio/