Thread: Odd sorting behaviour
[Apologies if this reaches the list twice -- I sent a copy before subscribing, but it seems to be stuck waiting for listmaster forever, so I subscribed and sent it again.]

Hi,

I'm trying to find out why one of my queries is so slow -- I'm primarily using PostgreSQL 7.2 (Debian stable), but I don't really get much better performance with 7.4 (Debian unstable). My prototype table looks like this:

  CREATE TABLE opinions (
      prodid INTEGER NOT NULL,
      uid INTEGER NOT NULL,
      opinion INTEGER NOT NULL,
      PRIMARY KEY ( prodid, uid )
  );

In addition, there are separate indexes on prodid and uid. I've run VACUUM ANALYZE before all queries, and they are repeatable. (If anybody needs the data, that could be arranged -- it's not secret or anything :-) ) My query looks like this:

  EXPLAIN ANALYZE
    SELECT o3.prodid, SUM(o3.opinion*o12.correlation) AS total_correlation
      FROM opinions o3
     RIGHT JOIN (
           SELECT o2.uid, SUM(o1.opinion*o2.opinion)/SQRT(count(*)+0.0) AS correlation
             FROM opinions o1
             LEFT JOIN opinions o2 ON o1.prodid=o2.prodid
            WHERE o1.uid=1355
            GROUP BY o2.uid
          ) o12 ON o3.uid=o12.uid
      LEFT JOIN (
           SELECT o4.prodid, COUNT(*) AS num_my_comments
             FROM opinions o4
            WHERE o4.uid=1355
            GROUP BY o4.prodid
          ) nmc ON o3.prodid=nmc.prodid
     WHERE nmc.num_my_comments IS NULL
       AND o3.opinion<>0
       AND o12.correlation<>0
     GROUP BY o3.prodid
     ORDER BY total_correlation DESC;

This produces the query plan at http://www.samfundet.no/~sesse/queryplan.txt (the lines were a bit too long to include in an e-mail :-) ). Note that the "o3.opinion<>0 AND o12.correlation<>0" lines are an optimization; I can run the query fine without them and it will produce the same results, but it goes slower both in 7.2 and 7.4.

There are a few oddities here:

- The "subquery scan o12" phase outputs 1186 rows, yet 83792 are sorted. Where do the other ~82000 rows come from? And why would it take ~100ms to sort the rows at all? (In earlier tests, this was _one full second_ but somehow that seems to have improved, yet without really improving the overall query time. shared_buffers is 4096 and sort_mem is 16384, so it should really fit into RAM.)
- Why does it use uid_index for an index scan on the table, when it obviously has no filter on it (since it returns all the rows)? Furthermore, why would this take half a second? (The machine is a 950MHz machine with SCSI disks.)
- Also, the outer sort (the sorting of the 58792 rows from the merge join) is slow. :-)

7.4 isn't really much better: http://www.samfundet.no/~sesse/queryplan74.txt Note that this is run on a machine with almost twice the speed (in terms of CPU speed, at least). The same oddities are mostly present (such as o12 returning 1186 rows, but 58788 rows being sorted), so I really don't understand what's going on here. Any ideas on how to improve this?

/* Steinar */
--
Homepage: http://www.sesse.net/
Thread: Working on huge RAM based datasets

Hi,

I'm really stuck and I wonder if any of you could help.

I have an application which will be sitting on a quite large database (roughly 8-16GB). The nature of the application is such that, on a second by second basis, the working set of the database is likely to be a substantial portion (e.g. between 50 and 70%) of the data -- just imagine an almost stochastic sampling of the data in each table, and you'll get an idea. Potentially quite smelly.

To start with, I thought: no problem. Just configure a DB server with an obscene amount of RAM (e.g. 64GB), and configure PG with a shared buffer cache that is big enough to hold every page of data in the database, plus 10% or whatever to allow for a bit of room, ensuring that there is enough RAM in the box so that all the backend processes can do their thing, all the other services can do their thing, and the swap system on the host remains idle.

Apparently not :( I've read in a number of places now that the PG cache has an optimal size which isn't "as big as you can make it without affecting other stuff on the machine". The suggestion is to let linux take the strain for the lion's share of the caching (using its buffer cache), and just make the PG cache big enough to hold the data it needs for individual queries.

Ignoring for a moment the problem of answering the question 'so how big shall I make the PG cache?', and ignoring the possibility that as the database content changes over the months this answer will need updating from time to time for optimal performance, does anyone have any actual experience with trying to maintain a large, mainly RAM resident database?

What is it about the buffer cache that makes it so unhappy being able to hold everything? I don't want to be seen as a cache hit fascist, but isn't it just better if the data is just *there*, available in the postmaster's address space ready for each backend process to access it, rather than expecting the Linux cache mechanism, optimised as it may be, to have to do the caching? Is it that the PG cache entries are accessed through a 'not particularly optimal for large numbers of tuples' type of strategy? (Optimal though it might be for more modest numbers.)

And on a more general note, with the advent of 64 bit addressing and rising RAM sizes, won't there, with time, be more and more DB applications that would want to capitalise on the potential speed improvements that come with not having to work hard to get the right bits in the right bit of memory all the time?

And finally, am I worrying too much, and actually this problem is common to all databases?

Thanks for reading,
Andy
> What is it about the buffer cache that makes it so unhappy being able to
> hold everything? I don't want to be seen as a cache hit fascist, but isn't
> it just better if the data is just *there*, available in the postmaster's
> address space ready for each backend process to access it, rather than
> expecting the Linux cache mechanism, optimised as it may be, to have to do
> the caching?

Because the PostgreSQL buffer management algorithms are pitiful compared to Linux's. In 7.5, it's improved with the new ARC algorithm, but still - the Linux disk buffer cache will be very fast.

Chris
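For what it's worth, here is a rough sketch of how that advice tends to land in postgresql.conf on a large-RAM box of that era -- the numbers are illustrative assumptions, not values from this thread: keep shared_buffers comparatively modest, let the kernel's buffer cache hold the bulk of the data, and tell the planner about that cache via effective_cache_size.

  # postgresql.conf -- illustrative values only (7.x-era parameter names and units)
  shared_buffers = 50000           # 8kB buffers, i.e. roughly 400MB; modest on purpose
  sort_mem = 16384                 # per-sort memory, in kB
  effective_cache_size = 6000000   # 8kB pages, i.e. roughly 48GB of expected OS cache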
Thanks, Chris.

> > What is it about the buffer cache that makes it so unhappy being able to
> > hold everything? I don't want to be seen as a cache hit fascist, but isn't
> > it just better if the data is just *there*, available in the postmaster's
> > address space ready for each backend process to access it, rather than
> > expecting the Linux cache mechanism, optimised as it may be, to have to do
> > the caching?
>
> Because the PostgreSQL buffer management algorithms are pitiful compared
> to Linux's. In 7.5, it's improved with the new ARC algorithm, but still
> - the Linux disk buffer cache will be very fast.

I've had that reply elsewhere too. Initially, I was afraid that there was a memory copy involved if the OS buffer cache supplied a block of data to PG, but I've learned a lot more about the linux buffer cache, so it now makes more sense to me why it's not a terrible thing to let the OS manage the lion's share of the caching on a high RAM system.

On another thread (not in this mailing list), someone mentioned that there is a class of databases which, rather than caching bits of database file (be it in the OS buffer cache or the postmaster workspace), construct a well indexed memory representation of the entire data in the postmaster workspace (or its equivalent), and this, remaining persistent, allows the DB to service backend queries far quicker than if the postmaster was working with the assumption that most of the data was on disk (even if, in practice, large amounts or perhaps even all of it resides in OS cache).

Though I'm no stranger to data management in general, I'm still on a steep learning curve for databases in general and PG in particular, but I just wondered how big a subject this is in the development group for PG at the moment. After all, we're now seeing the first wave of 'reasonably priced' 64 bit servers supported by a proper 64 bit OS (e.g. linux). HP are selling a 4 Opteron server which can take 256GB of RAM, and that starts at $10000 (ok - they don't give you that much RAM for that price - not yet, anyway!)

This is the future, isn't it? Each year, a higher percentage of DB applications will be able to fit entirely in RAM, and that percentage is going to be quite significant in just a few years. The disk system gets relegated to a data preload on startup and servicing the writes as the server does its stuff.

Regards,
Andy
Quoth andy_ballingall@bigfoot.com ("Andy Ballingall"):
> This is the future, isn't it? Each year, a higher percentage of DB
> applications will be able to fit entirely in RAM, and that percentage is
> going to be quite significant in just a few years. The disk system gets
> relegated to a data preload on startup and servicing the writes as the
> server does its stuff.

Regrettably, this may be something that fits better with MySQL, as it already has an architecture oriented to having different "storage engines" in behind.

There may be merit to the notion of implementing in-memory databases; some assumptions change:

- You might use bitmap indices, although that probably "kills" MVCC;
- You might use T-trees rather than B-trees for indices, although research seems to indicate that B-trees win out if there is a great deal of concurrent access;
- It can become worthwhile to use compression schemes to fit more records into memory that wouldn't be worthwhile if using demand paging.

If you really want to try this, then look at Konstantin Knizhnik's FastDB system: http://www.ispras.ru/~knizhnik/fastdb.html

It assumes that your application will be a monolithic C++ process; if that isn't the case, then performance will probably suffer due to throwing in context switches.

The changes in assumptions are pretty vital ones, that imply you're heading in a fairly different direction than that which PostgreSQL seems to be taking. That's not to say that there isn't merit to building a database system using T-trees and bitmap indices attuned to applications where main-memory storage is key; it's just that the proposal probably should go somewhere else.

--
output = ("cbbrowne" "@" "cbbrowne.com")
http://www3.sympatico.ca/cbbrowne/languages.html
How does the guy who drives the snowplow get to work in the mornings?
Oops - sorry - I confused my numbers. The Opteron machine I had in mind *only* has up to 64GB of RAM (e.g. HP DL585) - here's the data page: http://h18004.www1.hp.com/products/servers/proliantdl585/index.html

Still - with *just* 64GB of RAM, that would comfortably provide for the type of scenario I envisage. Is that still enough for your app?

The 256GB number came from something I read saying that the current crop of 64 bit chips will allow up to 256GB of RAM in principle, so it is just a matter of time before the memory limit shoots up on these simple products.

If you are prepared to pay a bit more, there are already some big memory options on linux. E.g. you can have up to 192GB in an SGI Altix 350: http://www.sgi.com/servers/altix/downloads/altix350_at_a_glance.pdf

Or up to 4 terabytes in its bigger brother, the Altix 3000 - but that's getting a bit esoteric: http://www.sgi.com/servers/altix/ (This won lots of awards recently.)

The nice thing about the two machines above is that they run linux in a single address space NUMA setup, and in theory you can just bolt on more CPUs and more RAM as your needs grow.

Thanks,
Andy

----- Original Message -----
From: "J. Andrew Rogers" <jrogers@neopolitan.com>
To: "Andy Ballingall" <andy_ballingall@bigfoot.com>
Sent: Friday, July 09, 2004 10:40 PM
Subject: Re: [PERFORM] Working on huge RAM based datasets

> On Fri, 2004-07-09 at 02:28, Andy Ballingall wrote:
> > After all, we're now seeing the first wave of 'reasonably priced' 64 bit
> > servers supported by a proper 64 bit OS (e.g. linux). HP are selling a 4
> > Opteron server which can take 256GB of RAM, and that starts at $10000 (ok -
> > they don't give you that much RAM for that price - not yet, anyway!)
>
> Which server is this?! They are selling an Opteron system that can hold
> 256 GB of RAM?
>
> I looked on their site, and couldn't find anything like that. I run
> some MASSIVE memory codes that don't need a lot of CPU, and if such a
> box existed, I'd be very interested.
>
> cheers,
>
> j. andrew rogers
""Andy Ballingall"" <andy_ballingall@bigfoot.com> wrote in message news:011301c46597$15d145c0$0300a8c0@lappy... > On another thread, (not in this mailing list), someone mentioned that there > are a class of databases which, rather than caching bits of database file > (be it in the OS buffer cache or the postmaster workspace), construct a a > well indexed memory representation of the entire data in the postmaster > workspace (or its equivalent), and this, remaining persistent, allows the DB > to service backend queries far quicker than if the postmaster was working > with the assumption that most of the data was on disk (even if, in practice, > large amounts or perhaps even all of it resides in OS cache). As a historical note, System R (grandaddy of all relational dbs) worked this way. And it worked under ridiculous memory constraints by modern standards. Space-conscious MOLAP databases do this, FWIW. Sybase 11 bitmap indexes pretty much amount to this, too. I've built a SQL engine that used bitmap indexes within B-Tree indexes, making it practical to index every field of every table (the purpose of the engine). You can also build special-purpose in-memory representations to test for existence (of a key), when you expect a lot of failures. Google "superimposed coding" e.g. http://www.dbcsoftware.com/dbcnews/NOV94.TXT
Thread: Odd sorting behaviour (continued)

On Thu, Jul 08, 2004 at 12:19:13PM +0200, Steinar H. Gunderson wrote:
> I'm trying to find out why one of my queries is so slow -- I'm primarily
> using PostgreSQL 7.2 (Debian stable), but I don't really get much better
> performance with 7.4 (Debian unstable). My prototype table looks like this:

I hate to nag, but it's been a week with no reply; did anybody look at this? Is there any more information I can supply to make it easier?

/* Steinar */
--
Homepage: http://www.sesse.net/
Steinar,

> - The "subquery scan o12" phase outputs 1186 rows, yet 83792 are sorted. Where
> do the other ~82000 rows come from? And why would it take ~100ms to sort the
> rows at all? (In earlier tests, this was _one full second_ but somehow that
> seems to have improved, yet without really improving the overall query time.

I'm puzzled by the "83792" rows as well. I've a feeling that Explain Analyze is failing to output a step.

> - Why does it use uid_index for an index scan on the table, when it obviously
> has no filter on it (since it returns all the rows)?

In order to support the merge join. It should be a bit faster to do the sort using the index than the actual table. Also, because you pass the <> 0 condition.

> Furthermore, why would
> this take half a second? (The machine is a 950MHz machine with SCSI disks.)

I don't see half a second here.

> - Also, the outer sort (the sorting of the 58792 rows from the merge join)
> is slow. :-)

I don't see a sort after the merge join. Which version are we talking about? I'm looking at the 7.4 version because that outputs more detail.

Most of your time is spent in that merge join. Why don't you try doubling sort_mem temporarily to see how it does? Or even raising shared_buffers?

--
-Josh Berkus
Aglio Database Solutions
San Francisco
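As an aside, a minimal sketch of that kind of session-local experiment (sort_mem is the pre-8.0 name of the setting and is given in kB; shared_buffers, by contrast, can only be changed in postgresql.conf and needs a server restart; the SELECT is a placeholder for the real query):

  SHOW sort_mem;                -- current per-sort memory, in kB
  SET sort_mem = 32768;         -- double it for this session only
  EXPLAIN ANALYZE SELECT ...;   -- re-run the query and compare timings
  RESET sort_mem;               -- back to the configured value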
Josh Berkus <josh@agliodbs.com> writes:
>> - The "subquery scan o12" phase outputs 1186 rows, yet 83792 are sorted. Where
>> do the other ~82000 rows come from?
> I'm puzzled by the "83792" rows as well. I've a feeling that Explain
> Analyze is failing to output a step.

No, it's not missing anything. The number being reported here is the number of rows pulled from the plan node --- but this plan node is on the inside of a merge join, and one of the properties of merge join is that it will do partial rescans of its inner input in the presence of equal keys in the outer input. If you have, say, 10 occurrences of "42" in the outer input, then any "42" rows in the inner input have to be rescanned 10 times. EXPLAIN ANALYZE will count each of them as 10 rows returned by the input node.

The large multiple here (80-to-one overscan) says that you've got a lot of duplicate values in the outer input. This is generally a good situation to *not* use a mergejoin in ;-). We do have some logic in the planner that attempts to estimate the extra cost involved in such rescanning, but I'm not sure how accurate the cost model is.

> Most of your time is spent in that merge join. Why don't you try doubling
> sort_mem temporarily to see how it does? Or even raising shared_buffers?

Raising shared_buffers seems unlikely to help. I do agree with raising sort_mem --- not so much to make the merge faster as to encourage the thing to try a hash join instead.

regards, tom lane
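For anyone who wants to see the effect in isolation, here is an untested sketch (toy table names, not from the thread) that should reproduce it: with duplicate keys on the outer side of a merge join, EXPLAIN ANALYZE reports the inner node's row count multiplied by the number of partial rescans.

  -- toy tables: many duplicate keys on the outer side, one matching row inside
  CREATE TABLE outer_t (k integer);
  CREATE TABLE inner_t (k integer);
  INSERT INTO inner_t VALUES (42);
  INSERT INTO outer_t VALUES (42);
  INSERT INTO outer_t SELECT k FROM outer_t;   -- run three times: 8 copies of 42
  INSERT INTO outer_t SELECT k FROM outer_t;
  INSERT INTO outer_t SELECT k FROM outer_t;
  ANALYZE outer_t; ANALYZE inner_t;

  SET enable_hashjoin = off;   -- nudge the planner toward a merge join
  SET enable_nestloop = off;
  EXPLAIN ANALYZE SELECT * FROM outer_t JOIN inner_t ON outer_t.k = inner_t.k;
  -- if inner_t lands on the inner side of the merge join, its scan/sort node
  -- should report about 8 rows rather than 1, because it is partially rescanned
  -- once per duplicate 42 coming from the outer side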
On Thu, Jul 15, 2004 at 12:52:38AM -0400, Tom Lane wrote:
> No, it's not missing anything. The number being reported here is the
> number of rows pulled from the plan node --- but this plan node is on
> the inside of a merge join, and one of the properties of merge join is
> that it will do partial rescans of its inner input in the presence of
> equal keys in the outer input. If you have, say, 10 occurrences of
> "42" in the outer input, then any "42" rows in the inner input have to
> be rescanned 10 times. EXPLAIN ANALYZE will count each of them as 10
> rows returned by the input node.

OK, that makes sense, although it seems to me as if loops= should have been something larger than 1 if the data was scanned multiple times.

> The large multiple here (80-to-one overscan) says that you've got a lot
> of duplicate values in the outer input. This is generally a good
> situation to *not* use a mergejoin in ;-). We do have some logic in the
> planner that attempts to estimate the extra cost involved in such
> rescanning, but I'm not sure how accurate the cost model is.

Hum, I'm not sure I've got the terminology right here -- the "outer input" in "A LEFT JOIN B" is A, right? But yes, I do have a lot of duplicates; that seems to match my data well.

> Raising shared_buffers seems unlikely to help. I do agree with raising
> sort_mem --- not so much to make the merge faster as to encourage the
> thing to try a hash join instead.

sort_mem is already 16384, which I thought would be plenty -- I tried increasing it to 65536, which made exactly zero difference. :-)

/* Steinar */
--
Homepage: http://www.sesse.net/
Steinar,

> sort_mem is already 16384, which I thought would be plenty -- I tried
> increasing it to 65536 which made exactly zero difference. :-)

Well, then the next step is increasing the statistical sampling on the 3 join columns in that table. Try setting statistics to 500 for each of the 3 cols, analyze, and see if that makes a difference.

--
-Josh Berkus
Aglio Database Solutions
San Francisco
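Spelled out, that suggestion presumably looks something like this (assuming the three columns meant are prodid, uid and opinion from the prototype table; 500 is the value named above):

  ALTER TABLE opinions ALTER COLUMN prodid  SET STATISTICS 500;
  ALTER TABLE opinions ALTER COLUMN uid     SET STATISTICS 500;
  ALTER TABLE opinions ALTER COLUMN opinion SET STATISTICS 500;
  ANALYZE opinions;   -- re-gather statistics at the new sampling target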
On Thu, Jul 15, 2004 at 11:11:33AM -0700, Josh Berkus wrote:
>> sort_mem is already 16384, which I thought would be plenty -- I tried
>> increasing it to 65536 which made exactly zero difference. :-)
> Well, then the next step is increasing the statistical sampling on the 3 join
> columns in that table. Try setting statistics to 500 for each of the 3
> cols, analyze, and see if that makes a difference.

Made no difference on either version (7.2 or 7.4).

BTW, you guys can stop Cc-ing me now; I'm subscribed. :-)

/* Steinar */
--
Homepage: http://www.sesse.net/
On Thu, Jul 15, 2004 at 02:08:54PM +0200, Steinar H. Gunderson wrote:
> sort_mem is already 16384, which I thought would be plenty -- I tried
> increasing it to 65536 which made exactly zero difference. :-)

I've tried some further tweaking, but I'm still unable to force it into doing a hash join -- any ideas how I can find out why it chooses a merge join?

/* Steinar */
--
Homepage: http://www.sesse.net/
Steinar,

> I've tried some further tweaking, but I'm still unable to force it into
> doing a hash join -- any ideas how I can find out why it chooses a merge
> join?

I'm sorry, I can't really give your issue the attention it deserves. At this point, I'd have to get a copy of your database, and play around with alternate query structures; and I don't have time. Sorry!

--
Josh Berkus
Aglio Database Solutions
San Francisco
Steinar,

> I've tried some further tweaking, but I'm still unable to force it into
> doing a hash join -- any ideas how I can find out why it chooses a merge
> join?

Actually, quick question -- have you tried setting enable_mergejoin=false to see the plan the system comes up with? Is it in fact faster?

--
Josh Berkus
Aglio Database Solutions
San Francisco
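In case it helps, a sketch of that experiment (session-local; the SELECTs stand in for the real query). Comparing the estimated costs with and without the setting also shows why the planner preferred the merge join: it simply estimated that plan as cheaper.

  SET enable_mergejoin = off;   -- discourage merge joins for this session only
  EXPLAIN SELECT ...;           -- estimated cost of the planner's next-best plan
  EXPLAIN ANALYZE SELECT ...;   -- actual run time of that plan
  RESET enable_mergejoin;
  EXPLAIN SELECT ...;           -- compare against the merge-join plan's estimate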
On Tue, Jul 20, 2004 at 10:06:08AM -0700, Josh Berkus wrote:
> Actually, quick question -- have you tried setting enable_mergejoin=false to
> see the plan the system comes up with? Is it in fact faster?

It is significantly faster -- 1200ms vs. 1900ms (on 7.4, at least). Some of the merge joins are changed to nested loop joins, though, which probably reduces the overall performance, so I guess there's more to gain if I can get it to convert only that merge join to a hash join. The sum and multiplication parts still take 400ms or so, though (is this normal? :-) ), so I guess there's a lower limit. :-)

I could of course post the updated query plan if anybody is interested; let me know. (The data is still available if anybody needs it as well, of course.)

/* Steinar */
--
Homepage: http://www.sesse.net/
> I could of course post the updated query plan if anybody is interested; let
> me know. (The data is still available if anybody needs it as well, of
> course.)

I've taken a look and managed to cut out quite a bit of the time used. You'll need to confirm it's the same results, though (I didn't -- it is the same number of results). Query below.

First off, "DROP INDEX prodid_index;". It doesn't help anything since the primary key is just as usable, but it does take enough space that it causes thrashing in the buffer cache. Any queries based on prodid will use the index for the PRIMARY KEY instead.

Secondly, I had no luck getting the hashjoin, but this probably doesn't matter. I've assumed that the number of users will climb faster than the product set offered, and generated additional data via the below command, run 4 times:

  INSERT INTO opinions
  SELECT prodid, uid + (SELECT max(uid) FROM opinions), opinion
    FROM opinions;

I found that by this point, the hashjoin and mergejoin have essentially the same performance -- in other words, as you grow you'll want the mergejoin eventually, so I wouldn't worry about it too much.

The new query cuts about 1/3rd of the time; forcing the hashjoin gets another 1/3rd, but see the above note:

  SELECT o3.prodid
       , SUM(o3.opinion*o12.correlation) AS total_correlation
    FROM opinions o3

         -- Plain join okay since o12.correlation <> 0
         -- eliminates any NULLs anyway.
         -- Was RIGHT JOIN
    JOIN (SELECT o2.uid
               , SUM(o1.opinion*o2.opinion)/SQRT(count(*)::numeric) AS correlation
            FROM opinions AS o1
            JOIN opinions AS o2 USING (prodid)
           WHERE o1.uid = 1355
           GROUP BY o2.uid
         ) AS o12 USING (uid)

         -- Was old LEFT JOIN
   WHERE o3.prodid NOT IN (SELECT prodid
                             FROM opinions AS o4
                            WHERE uid = 1355)
     AND o3.opinion <> 0
     AND o12.correlation <> 0
   GROUP BY o3.prodid
   ORDER BY total_correlation DESC;
On Tue, Jul 20, 2004 at 10:18:19PM -0400, Rod Taylor wrote:
> I've taken a look and managed to cut out quite a bit of the time used.
> You'll need to confirm it's the same results, though (I didn't -- it is
> the same number of results). Query below.

It looks very much like the same results.

> Secondly, I had no luck getting the hashjoin, but this probably doesn't
> matter. I've assumed that the number of users will climb faster than the
> product set offered, and generated additional data via the below command,
> run 4 times:

Actually, the number of users won't climb that much faster; what will probably increase is the number of opinions.

> I found that by this point, the hashjoin and mergejoin have essentially
> the same performance -- in other words, as you grow you'll want the
> mergejoin eventually, so I wouldn't worry about it too much.

Hm, OK.

> -- Plain join okay since o12.correlation <> 0
> -- eliminates any NULLs anyway.
> -- Was RIGHT JOIN

OK, that makes sense (although I don't really see why it should be faster).

> -- Was old LEFT JOIN
> WHERE o3.prodid NOT IN (SELECT prodid
>                           FROM opinions AS o4
>                          WHERE uid = 1355)

As my server is 7.2 and not 7.4, that obviously won't help much :-) Thanks anyway, though -- we'll upgrade eventually, and it'll help then.

/* Steinar */
--
Homepage: http://www.sesse.net/
On Wed, 2004-07-21 at 06:04, Steinar H. Gunderson wrote:
> On Tue, Jul 20, 2004 at 10:18:19PM -0400, Rod Taylor wrote:
> > I've taken a look and managed to cut out quite a bit of the time used.
> > You'll need to confirm it's the same results, though (I didn't -- it is
> > the same number of results). Query below.
>
> It looks very much like the same results.

Oh.. On my (slow) laptop it cut the time back significantly..

> As my server is 7.2 and not 7.4, that obviously won't help much :-) Thanks
> anyway, though -- we'll upgrade eventually, and it'll help then.

I see. Yeah, avoid NOT IN like the plague on 7.2.
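For what it's worth, the usual workaround on releases before 7.4 is to turn the NOT IN into a correlated NOT EXISTS, which 7.2 can typically answer with a cheap probe of the (prodid, uid) primary key for each row. An untested sketch of that clause against the schema above:

  -- instead of:
  --   WHERE o3.prodid NOT IN (SELECT prodid FROM opinions AS o4 WHERE uid = 1355)
  -- the correlated form below lets 7.2 use the primary key index per outer row:
  WHERE NOT EXISTS (SELECT 1
                      FROM opinions AS o4
                     WHERE o4.uid    = 1355
                       AND o4.prodid = o3.prodid)
    AND o3.opinion <> 0
    AND o12.correlation <> 0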