Thread: reindex/vacuum locking/performance?
I've read some posts that say vacuum doesn't lock, but my experience today indicates the opposite. It seemed that "vacuum full analyze" was locked waiting, and so were other postmaster processes. It appeared to be a deadlock, because all were in "WAITING" state according to ps. I let this go for about half an hour and then killed the vacuum, at which point all other processes completed normally.

The same thing seemed to be happening with reindex on a table. It seems that the reindex locks the table and some other resource, which then causes a deadlock with other active processes.

Another issue seems to be performance. A reindex on some indexes is taking 12 minutes or so. Vacuum seems to be slow, too. Way longer than the time it takes to reimport the entire database (30 mins).

In summary, I suspect that it is better from a UI perspective to bring down the app on Saturday at 3 a.m. and reimport within a fixed time window than to live through reindexing/vacuuming which may deadlock. Am I missing something?

Thanks,
Rob
Rob Nagler <nagler@bivio.biz> writes:
> I've read some posts that says vacuum doesn't lock, but my experience
> today indicates the opposite. It seemed that "vacuum full analyze"
> was locked waiting and so were other postmaster processes.

vacuum full does require exclusive lock, plain vacuum does not.

> It appeared to be deadlock, because all were in "WAITING" state according
> to ps. I let this go for about a 1/2 hour, and then killed the vacuum
> at which point all other processes completed normally.

It's considerably more likely that the vacuum was waiting for an open client transaction (that had a read or write lock on some table) to finish than that there was an undetected deadlock. I suggest looking at your client code. Also, in 7.3 or later you could look at the pg_locks view to work out exactly who has the lock that's blocking vacuum.

> Another issue seems to be performance. A reindex on some indexes is
> taking 12 minutes or so. Vacuum seems to be slow, too. Way longer
> than the time it takes to reimport the entire database (30 mins).

vacuum full is indeed slow. That's why we do not recommend it as a routine maintenance procedure. The better approach is to do plain vacuums often enough that you don't need vacuum full. In pre-7.4 releases you might need periodic reindexes too, depending on whether your usage patterns tickle the index-bloat problem. But it is easily demonstrable that reindexing is cheaper than rebuilding the database.

> In summary, I suspect that it is better from a UI perspective to bring
> down the app on Sat at 3 a.m and reimport with a fixed time period
> than to live through reindexing/vacuuming which may deadlock. Am I
> missing something?

Almost certainly, though you've not provided enough detail to determine what.

			regards, tom lane
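For reference, a minimal sketch of the kind of pg_locks query being described (column names as in the 7.3 view; ungranted requests show up with granted = f next to whoever already holds a lock on the same relation):

    SELECT c.relname, l.pid, l.mode, l.granted
      FROM pg_locks l LEFT JOIN pg_class c ON c.oid = l.relation
     ORDER BY c.relname, l.granted DESC;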
> vacuum full does require exclusive lock, plain vacuum does not.

I think I need full, because there are updates on the table. As I understand it, an update in pg is an insert/delete, so it needs to be garbage collected.

> It's considerably more likely that the vacuum was waiting for an open
> client transaction (that had a read or write lock on some table) to
> finish than that there was an undetected deadlock. I suggest looking at
> your client code. Also, in 7.3 or later you could look at the pg_locks
> view to work out exactly who has the lock that's blocking vacuum.

My client code does a lot. I look at it more often than I'd like to. :-) I don't understand why the client transaction would block if vacuum was waiting. Does vacuum lock the table and then try to get some other "open transaction" resource? Free space? I guess I don't understand what other resources would be required of vacuum. The client transactions are short (< 1s). They don't deadlock normally; only with reindex and vacuum did I see this behavior.

> vacuum full is indeed slow. That's why we do not recommend it as a
> routine maintenance procedure. The better approach is to do plain
> vacuums often enough that you don't need vacuum full.

The description of vacuum full implies that it is required if the db is updated frequently. This db gets about 1 txn a second, possibly more at peak load.

> In pre-7.4 releases you might need periodic reindexes too, depending on
> whether your usage patterns tickle the index-bloat problem.

7.3, and yes, we have date indexes as well as sequences for primary keys.

> But it is easily demonstrable that reindexing is cheaper than rebuilding
> the database.

IOW, vacuum+reindex is faster than dump+restore? I didn't see this; then again, I had this locking problem, so the stats are distorted.

One other question: the reindex seems to lock the table for the entire process as opposed to freeing the lock between index rebuilds. It was hard to see, but it seemed like the clients were locked for the entire "reindex table bla" command.

Sorry for lack of detail, but I didn't expect these issues, so I wasn't keeping track of the system state as closely as I should have. Next time. :-)

Thanks,
Rob
On Fri, Oct 03, 2003 at 15:47:01 -0600, Rob Nagler <nagler@bivio.biz> wrote: > > vacuum full does require exclusive lock, plain vacuum does not. > > I think I need full, because there are updates on the table. As I > understand it, an update in pg is an insert/delete, so it needs > to be garbage collected. Plain vacuum will mark the space used by deleted tuples as reusable. Most of the time this is good enough and you don't need to run vacuum full.
On Fri, 2003-10-03 at 17:47, Rob Nagler wrote: > They don't deadlock normally, > only with reindex and vacuum did I see this behavior. If you can provide a reproducible example of a deadlock induced by REINDEX + VACUUM, that would be interesting. (FWIW, I remember noticing a potential deadlock in the REINDEX code and posting to -hackers about it, but I've never seen it occur in a real-world situation...) > One other question: The reindex seems to lock the table for the entire > process as opposed to freeing the lock between index rebuilds. Yeah, I wouldn't be surprised if there is some room for optimizing the locks that are acquired by REINDEX. -Neil
nagler@bivio.biz (Rob Nagler) writes:
> I've read some posts that says vacuum doesn't lock, but my experience
> today indicates the opposite. It seemed that "vacuum full analyze"
> was locked waiting and so were other postmaster processes. It
> appeared to be deadlock, because all were in "WAITING" state according
> to ps. I let this go for about a 1/2 hour, and then killed the vacuum
> at which point all other processes completed normally.

VACUUM FULL certainly does lock. See the man page:

    INPUTS

    FULL    Selects ``full'' vacuum, which may reclaim more space, but takes
            much longer and exclusively locks the table.

The usual answer is that you probably _didn't_ want to VACUUM FULL. VACUUM ('no full') does NOT block updates.

> The same thing seemed to be happening with reindex on a table. It
> seems that the reindex locks the table and some other resource which
> then causes deadlock with other active processes.

Not surprising either. While the reindex takes place, updates to that table have to be deferred.

> Another issue seems to be performance. A reindex on some indexes is
> taking 12 minutes or so. Vacuum seems to be slow, too. Way longer
> than the time it takes to reimport the entire database (30 mins).

That seems a little surprising.

> In summary, I suspect that it is better from a UI perspective to
> bring down the app on Sat at 3 a.m and reimport with a fixed time
> period than to live through reindexing/vacuuming which may deadlock.
> Am I missing something?

Consider running pg_autovacuum, and thereby do a little bit of vacuuming here and there all the time. It DOESN'T block, so unless your system is really busy, it shouldn't slow things down to a major degree.
--
"cbbrowne","@","libertyrms.info" <http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)
> > vacuum full does require exclusive lock, plain vacuum does not.
>
> I think I need full, because there are updates on the table. As I
> understand it, an update in pg is an insert/delete, so it needs
> to be garbage collected.

Yes and no. You only need a plain VACUUM that is run often enough to recover space as fast as you need to grab it. For heavily updated tables run it often - I run it every 5 minutes on some tables. A VACUUM FULL is only needed if you haven't been running VACUUM often enough in the first place.

> The description of vacuum full implies that is required if the db
> is updated frequently. This db gets about 1 txn a second, possibly
> more at peak load.

Assuming you mean 1 update/insert per second, that is an absolutely _trivial_ load on any reasonable hardware. You can do thousands of updates/second on hardware costing less than $2000. If you vacuum every hour then you will be fine.

> IOW, vacuum+reindex is faster than dump+restore? I didn't see this,
> then again, I had this locking problem, so the stats are distorted.

REINDEX also locks tables like VACUUM FULL. Either is terribly slow, but unless you turn off fsync during the restore it's unlikely to be slower than dump & restore.

Matt
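In practice, "vacuum the hot tables often" can be as simple as a few statements fired from cron or any scheduler (table names here are made up):

    -- Plain VACUUM takes no exclusive lock, so it can run alongside normal traffic.
    VACUUM ANALYZE sessions;      -- heavily updated table: every few minutes
    VACUUM ANALYZE order_items;   -- likewise
    VACUUM ANALYZE;               -- whole database: much less often, e.g. nightly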
Rob, > > I think I need full, because there are updates on the table. As I > > understand it, an update in pg is an insert/delete, so it needs > > to be garbage collected. > > Yes and no. You only need a plain VACUUM that is run often enough to > recover space as fast as you need to grab it. For heavily updated tables > run it often - I run it every 5 minutes on some tables. A VACUUM FULL is > only needed if you haven't been running VACUUM often enough in the first > place. Also, if you find that you need to run VACUUM FULL often, then you need to raise your max_fsm_pages. -- -Josh Berkus Aglio Database Solutions San Francisco
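For what it's worth, those knobs live in postgresql.conf; the numbers below are placeholders only, to be sized to how much free space accumulates between vacuums (changing them requires a postmaster restart):

    # postgresql.conf -- illustrative values, not a recommendation
    max_fsm_relations = 1000     # tables and indexes whose free space is tracked
    max_fsm_pages     = 50000    # total pages with reclaimable space, cluster-wide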
> > In summary, I suspect that it is better from a UI perspective to > > bring down the app on Sat at 3 a.m and reimport with a fixed time > > period than to live through reindexing/vacuuming which may deadlock. > > Am I missing something? > > Consider running pg_autovacuum, and thereby do a little bit of > vacuuming here and there all the time. It DOESN'T block, so unless > your system is really busy, it shouldn't slow things down to a major > degree. My real world experience on a *very* heavily updated OLTP type DB, following advice from this list (thanks guys!), is that there is essentially zero cost to going ahead and vacuuming as often as you feel like it. Go crazy, and speed up your DB! OK, that's on a quad CPU box with goodish IO, so maybe there are issues on very slow boxen, but in a heavy-update environment the advantages seem to easily wipe out the costs. Matt p.s. Sorry to sound like a "Shake'n'Vac" advert.
> Also, if you find that you need to run VACUUM FULL often, then
> you need to raise your max_fsm_pages.

Yes and no. If it's run often enough then the number of tracked pages shouldn't need to be raised, but then again... ...max_fsm_pages should be raised anyway. I'm about to reclaim a Pentium 166 w/ 64MB of RAM from a friend I lent it to _many_ years ago, and I suspect PG would run happily on it as configured by default. Set it to at least 50,000, I say. What do you have to lose? I mean, if pages aren't free then they're not tracked in the FSM, right? Of course, if anyone knows a reason _not_ to raise it then I'm all ears!

Matt
On Fri, 2003-10-03 at 17:34, Christopher Browne wrote: > Not surprising either. While the reindex takes place, updates to that > table have to be deferred. Right, but that's no reason not to let SELECTs proceed, for example. (Whether that would actually be *useful* is another question...) -Neil
> > Also, if you find that you need to run VACUUM FULL often, then > > you need to > > raise your max_fsm_pages. > > Yes and no. If it's run often enough then the number of tracked pages > shouldn't need to be raised, but then again... Oops, sorry, didn't pay attention and missed the mention of FULL. My bad, ignore my OT useless response.
Neil Conway <neilc@samurai.com> writes: > On Fri, 2003-10-03 at 17:34, Christopher Browne wrote: >> Not surprising either. While the reindex takes place, updates to that >> table have to be deferred. > Right, but that's no reason not to let SELECTs proceed, for example. What if said SELECTs are using the index in question? I suspect it is true that REINDEX locks more than it needs to, but we should tread carefully about loosening it. regards, tom lane
On Fri, Oct 03, 2003 at 02:24:42PM -0600, Rob Nagler wrote: > I've read some posts that says vacuum doesn't lock, but my experience > today indicates the opposite. It seemed that "vacuum full analyze" VACUUM doesn't. VACUUM FULL does. -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
On Sat, Oct 04, 2003 at 12:29:55AM +0100, Matt Clark wrote:
> My real world experience on a *very* heavily updated OLTP type DB, following
> advice from this list (thanks guys!), is that there is essentially zero cost
> to going ahead and vacuuming as often as you feel like it. Go crazy, and
> speed up your DB!

That's not quite true. If vacuums start running into each other, you can very easily start eating up all your I/O bandwidth. Even if you've got lots of it.

Also, a vacuum pretty much destroys your shared buffers, so you have to be aware of that trade-off too. Vacuum is not free. It's _way_ cheaper than it used to be, though.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
On Fri, Oct 03, 2003 at 11:49:03PM -0400, Tom Lane wrote: > > What if said SELECTs are using the index in question? That's a good reason to build a new index and, when it's done, drop the old one. It still prevents writes, of course. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
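A sketch of that approach (index and table names are hypothetical). CREATE INDEX still blocks writers to the table while it runs, but readers can keep using the old index in the meantime:

    CREATE INDEX orders_date_idx2 ON orders (order_date);
    DROP INDEX orders_date_idx;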
> On Sat, Oct 04, 2003 at 12:29:55AM +0100, Matt Clark wrote:
> > My real world experience on a *very* heavily updated OLTP type DB,
> > following advice from this list (thanks guys!), is that there is
> > essentially zero cost to going ahead and vacuuming as often as you
> > feel like it. Go crazy, and speed up your DB!
>
> That's not quite true. If vacuums start running into each other, you
> can very easily start eating up all your I/O bandwidth. Even if you
> gots lots of it.

Very true, which is why all my scripts write a lockfile and delete it when they're finished, to prevent that happening. I should have mentioned that.

> Also, a vacuum pretty much destroys your shared buffers, so you have
> to be aware of that trade-off too. Vacuum is not free. It's _way_
> cheaper than it used to be, though.

That's _very_ interesting. I've never been quite clear what's in shared buffers apart from scratch space for currently running transactions. Also the docs imply that vacuum uses its own space for working in. Do you have more info on how it clobbers shared_buffers?

M
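Roughly, the lockfile idea looks like this (paths, database name, and schedule are made up; a sketch, not Matt's actual script):

    #!/bin/sh
    # Skip this run entirely if a previous vacuum is still going.
    # (Not race-free, but good enough when fired from cron.)
    LOCKFILE=/var/tmp/vacuum.lock
    [ -e "$LOCKFILE" ] && exit 0
    touch "$LOCKFILE"
    trap 'rm -f "$LOCKFILE"' 0
    psql -d mydb -c 'VACUUM ANALYZE'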
On Sun, Oct 05, 2003 at 12:14:24PM +0100, Matt Clark wrote:
> more info on how it clobbers shared_buffers?

Vacuum is like a seqscan. It touches everything on a table. So it doesn't clobber them outright, but what's left in them is just the data vacuum touched last. It's unlikely your buffers are big enough to hold your database, unless your database is small. So you'll end up expiring potentially useful data in the buffer.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
> On Sun, Oct 05, 2003 at 12:14:24PM +0100, Matt Clark wrote:
> > more info on how it clobbers shared_buffers?
>
> Vacuum is like a seqscan. It touches everything on a table. So it
> doesn't clobber them outright, but what's left in them is just the
> data vacuum touched last. It's unlikely your buffers are big enough
> to hold your database, unless your database is small. So you'll end
> up expiring potentially useful data in the buffer.

OK, I'm definitely missing something here. I thought that the FSM was there to keep track of potentially free pages, and that all VACUUM did was double check and then write that info out for all to see? The premise being that a VACUUM FULL will walk all pages on disk and do a soft-shoe-shuffle to aggressively recover space, but a simple VACUUM won't (merely confirming pages as available for reuse).

As for buffers, my understanding is that they are *not* meant to be big enough to hold the DB, as PG explicitly leaves caching up to the underlying OS. 'buffers' here meaning shared memory between PG processes, and 'cache' meaning OS cache. 'buffers' only need to be big enough to hold the intermediate calcs and the results for any current transactions?

M
"Matt Clark" <matt@ymogen.net> writes: > OK I'm definitely missing something here. The point is that a big seqscan (either VACUUM or a plain table scan) hits a lot of pages, and thereby tends to fill your cache with pages that aren't actually likely to get hit again soon, perhaps pushing out pages that will be needed again soon. This happens at both the shared-buffer and kernel-disk-cache levels of caching. It would be good to find some way to prevent big seqscans from populating cache, but I don't know of any portable way to tell the OS that we don't want it to cache a page we are reading. regards, tom lane
> The point is that a big seqscan (either VACUUM or a plain table scan)
> hits a lot of pages, and thereby tends to fill your cache with pages
> that aren't actually likely to get hit again soon, perhaps pushing out
> pages that will be needed again soon. This happens at both the
> shared-buffer and kernel-disk-cache levels of caching.

OK, I had thought (wrongly it seems, as usual, but this is how we learn!) that a plain VACUUM did not incur a read of all pages. I still don't understand *why* it does, but I'll take your word for it. Clearly if it distorts the 'normal' balance of pages in any caches, PG's or the OS's, that's a _bad thing_. I am currently in the nice position of having a DB that (just about) fits in RAM, so I pretty much don't care about read performance, but I will have to soon, as it grows beyond 3GB :-( These conversations are invaluable in planning for that dread time...

> It would be good to find some way to prevent big seqscans from
> populating cache, but I don't know of any portable way to tell the OS
> that we don't want it to cache a page we are reading.

Quite. The only natural way would be to read those pages through some special device, but then you might as well do raw disk access from the get-go. Portability vs. Performance, the age-old quandary.

FWIW I and many others stand back in pure amazement at the sheer _quality_ of PostgreSQL.

Rgds,
Matt
"Matt Clark" <matt@ymogen.net> writes: > OK, I had thought (wrongly it seems, as usual, but this is how we learn!) > that a plain VACUUM did not incur a read of all pages. I still don't > understand *why* it does, but I'll take your word for it. Mainly 'cause it doesn't know where the dead tuples are till it's looked. Also, VACUUM is the data collector for the free space map, and so it is also charged with finding out how much free space exists on every page. regards, tom lane
> Mainly 'cause it doesn't know where the dead tuples are till it's > looked. At this point I feel very stupid... > Also, VACUUM is the data collector for the free space map, > and so it is also charged with finding out how much free space exists > on every page. Ah, now I just feel enlightened! That makes perfect sense. I think I had been conflating free pages with free space, without understanding what the difference was. Of course I still don't really understand, but at least I now _know_ I don't. Many thanks Matt
After a long battle with technology,matt@ymogen.net ("Matt Clark"), an earthling, wrote: >> The point is that a big seqscan (either VACUUM or a plain table scan) >> hits a lot of pages, and thereby tends to fill your cache with pages >> that aren't actually likely to get hit again soon, perhaps pushing out >> pages that will be needed again soon. This happens at both the >> shared-buffer and kernel-disk-cache levels of caching. > > OK, I had thought (wrongly it seems, as usual, but this is how we learn!) > that a plain VACUUM did not incur a read of all pages. I still don't > understand *why* it does, but I'll take your word for it. How does it know what to do on any given page if it does not read it in? It has to evaluate whether tuples can be thrown away or not, and that requires looking at the tuples. It may only be looking at a small portion of the page, but that still requires reading each page. No free lunch, unfortunately... -- wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','cbbrowne.com'). http://www3.sympatico.ca/cbbrowne/sgml.html "End users are just test loads for verifying that the system works, kind of like resistors in an electrical circuit." -- Kaz Kylheku in c.o.l.d.s
On Sat, 2003-10-04 at 11:22, Andrew Sullivan wrote:
> Also, a vacuum pretty much destroys your shared buffers, so you have
> to be aware of that trade-off too.

True, although there is no reason that this necessarily needs to be the case (at least, as far as the PostgreSQL shared buffer goes). As has been pointed out numerous times on -hackers and in the literature, using LRU for a DBMS shared buffer cache is far from optimal, and better algorithms have been proposed (e.g. LRU-K, ARC). We could even have the VACUUM command inform the bufmgr that the pages it is in the process of reading in are part of a seqscan, and so are unlikely to be needed in the immediate future.

-Neil
Neil Conway <neilc@samurai.com> writes: > ... We could even have the > VACUUM command inform the bufmgr that the pages it is in the process of > reading in are part of a seqscan, and so are unlikely to be needed in > the immediate future. This would be relatively easy to fix as far as our own buffering is concerned, but the thing that's needed to make it really useful is to prevent caching of seqscan-read pages in the kernel disk buffers. I don't know any portable way to do that :-( regards, tom lane
On Sun, 2003-10-05 at 19:43, Tom Lane wrote: > This would be relatively easy to fix as far as our own buffering is > concerned, but the thing that's needed to make it really useful is > to prevent caching of seqscan-read pages in the kernel disk buffers. True. > I don't know any portable way to do that :-( For the non-portable way of doing this, are you referring to O_DIRECT? Even if it isn't available everywhere, it might be worth considering this at least for the platforms on which it is supported. -Neil
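For illustration only, this is roughly what the O_DIRECT route involves on Linux; it is a sketch under the assumptions above (direct I/O needs block-aligned buffers, offsets, and lengths, and as noted it may also forfeit kernel readahead):

    /* read one aligned block from a file, bypassing the OS page cache */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        ssize_t n;
        int fd = open("datafile", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT requires aligned memory, typically to the fs block size */
        if (posix_memalign(&buf, 4096, 8192) != 0) return 1;

        n = read(fd, buf, 8192);     /* this read is not cached by the kernel */
        printf("read %ld bytes\n", (long) n);
        close(fd);
        free(buf);
        return 0;
    }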
On Sun, Oct 05, 2003 at 07:32:47PM -0400, Neil Conway wrote:
> been pointed out numerous times on -hackers and in the literature, using
> LRU for a DBMS shared buffer cache is far from optimal, and better
> algorithms have been proposed (e.g. LRU-K, ARC). We could even have the
> VACUUM command inform the bufmgr that the pages it is in the process of
> reading in are part of a seqscan, and so are unlikely to be needed in
> the immediate future.

Hey, when that happens, you'll find me first in line to praise the implementor; but until then, it's important that people not get the idea that vacuum is free. It is _way_ improved, and on moderately loaded boxes, it's almost unnoticeable. But under heavy load, you need to be _real_ careful about calling vacuum. I think one of the biggest needs in the AVD is some sort of intelligence about current load on the postmaster, but I haven't the foggiest idea how to give it such intelligence.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Afilias Canada                        Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
On Sun, 5 Oct 2003, Neil Conway wrote:
> > I don't know any portable way to do that :-(
>
> For the non-portable way of doing this, are you referring to O_DIRECT?
>
> Even if it isn't available everywhere, it might be worth considering
> this at least for the platforms on which it is supported.

I strongly agree here, but only if we can prove there is a benefit. I think it would be silly of us not to use it if some OS supported SnazzyFeatureC that was able to speed up PG by a large percentage (hopefully in a rather non-invasive way in the code). But, I do see the problem here with bloat and PG being radically different platform to platform. I suppose we could dictate that at least N os's had to have it.. or perhaps supply it as contrib/ patches.... Something to think about.

I'd be interested in tinkering with this, but I'm more interested at the moment in why (with proof, not anecdotal) Solaris is so much slower than Linux and what we can do about this. We're looking to move a rather large Informix db to PG and ops has reservations about ditching Sun hardware.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
On Mon, Oct 06, 2003 at 08:07:27AM -0400, Jeff wrote: > I strongly agree here only if we can prove there is a benefit. There's plenty of academic work which purports to show that LRU is far from the best choice. Just in principle, it seems obvious that a single-case seqscan-type operation (such as vacuum does) is a good way to lose your cache for no real gain. > I'd be interested in tinkering with this, but I'm more interested at the > moment of why (with proof, not antecdotal) Solaris is so much slower than > Linux and what we cna do about this. We're looking to move a rather large > Informix db to PG and ops has reservations about ditching Sun hardware. Interestingly, we're contemplating ditching Solaris because of the terrible reliability we're getting from the hardware. You can use truss to find some of the problems on Solaris. The open() syscall takes forever when you don't hit the Postgres shared buffers (even if you can be sure the page is in filesystem buffers -- we could demonstrate it on a 1 G database on a machine with 10 G of RAM). I've heard grumblings about spinlocks on Solaris which might explain this problem. I certainly notice that performance gets geometrically worse when you add a few hundred extra connections. A -- ---- Andrew Sullivan 204-4141 Yonge Street Afilias Canada Toronto, Ontario Canada <andrew@libertyrms.info> M2P 2A8 +1 416 646 3304 x110
Jeff wrote:
> I'd be interested in tinkering with this, but I'm more interested at the
> moment in why (with proof, not anecdotal) Solaris is so much slower than
> Linux and what we can do about this. We're looking to move a rather large
> Informix db to PG and ops has reservations about ditching Sun hardware.

Is linux on sparc hardware an option? :-)

Shridhar
On Mon, 6 Oct 2003, Andrew Sullivan wrote:
> There's plenty of academic work which purports to show that LRU is
> far from the best choice. Just in principle, it seems obvious that a
> single-case seqscan-type operation (such as vacuum does) is a good
> way to lose your cache for no real gain.

Logically bypassing caches for a seq scan also makes sense.

> Interestingly, we're contemplating ditching Solaris because of the
> terrible reliability we're getting from the hardware.

The reason ops likes solaris / sun is twofold. 1. we have a pile of big sun machines around. 2. Solaris / Sun is quite a bit more graceful in the event of a hardware failure. We've burned out our fair share of cpu's etc and solaris has been rather graceful about it.

I've started profiling and running tests... currently it is leaning towards the sysv semaphores. I see in src/backend/port/ that pg_sema.c is linked to the sysv implementation. So what I did was create a semaphore set, and then fired off 5 copies of a program that attaches to that semaphore and then locks/unlocks it 1M times.

    2xP2-450, Linux 2.4.18:               1 process: 221680 / sec, 5 processes: 98039 / sec
    4xUltraSparc II-400Mhz, Solaris 2.6:  1 process: 142857 / sec, 5 processes: 23809 / sec

So I'm guessing that is where a LOT of the suck is coming from. What I plan to do next is looking to see if there are other interprocess locking mechanisms on solaris (perhaps pthread_mutex with that inter-process flag or something) to see if I can get those numbers a little closer.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
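For reference, the kind of micro-benchmark being described looks roughly like this (a sketch, not Jeff's actual program; every copy attaches to the same arbitrary key, error checking and cleanup with ipcrm are left out):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    /* many platforms require the caller to define union semun itself */
    union semun { int val; struct semid_ds *buf; unsigned short *array; };

    int main(void)
    {
        long i;
        struct sembuf lock   = { 0, -1, 0 };   /* P: acquire */
        struct sembuf unlock = { 0,  1, 0 };   /* V: release */
        union semun arg;
        int semid = semget(0x50474d54, 1, IPC_CREAT | 0600);

        /* naive re-initialisation by each copy; fine for a throughput test */
        arg.val = 1;
        semctl(semid, 0, SETVAL, arg);

        for (i = 0; i < 1000000; i++) {
            semop(semid, &lock, 1);
            semop(semid, &unlock, 1);
        }
        printf("done\n");
        return 0;
    }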
Jeff <threshar@torgo.978.org> writes: > I've started profiling and running tests... currently it is leaning > towards the sysv semaphores. I see in src/backend/port/ that pg_sema.c is > linked to the sysv implementation. Does Solaris have Posix semaphores? You could try using those instead. regards, tom lane
On Mon, 6 Oct 2003, Tom Lane wrote:
> Does Solaris have Posix semaphores? You could try using those instead.
>
> 			regards, tom lane

Yep. It does. I learned them quick enough (using posix_sema.c as a guide) and found out that at least on Sol 2.6 they are slower than sysv - with 5 processes it went to about 16k lock/unlock a second. I'm going to try to find a box around here I can get sol(8|9) on that has sufficient disk space and see. I'm guessing sun has likely made improvements...

Another odd thing I'm trying to work out is why my profiles come out so radically different on the linux box and the sun box.

Sun:

    31.17   18.90   18.90                              internal_mcount
    19.10   30.48   11.58   8075381   0.00   0.00      _bt_checkkeys
     5.66   33.91    3.43  24375253   0.00   0.00      FunctionCall2
     4.82   36.83    2.92   8073010   0.00   0.00      _bt_step
     3.51   38.96    2.13     14198   0.15   0.15      _read
     2.77   40.64    1.68   8069040   0.00   0.00      varchareq
     2.59   42.21    1.57     28454   0.06   0.23      _bt_next
     2.29   43.60    1.39      1003   1.39   1.40      AtEOXact_Buffers
     1.86   44.73    1.13  16281197   0.00   0.00      pg_detoast_datum
     1.81   45.83    1.10                              _mcount
     1.68   46.85    1.02      2181   0.47   0.47      pglz_decompress

Linux:

    11.14    0.62    0.62      1879   0.00   0.00      pglz_decompress
     6.71    0.99    0.37      1004   0.00   0.00      AtEOXact_Buffers
     3.80    1.20    0.21   1103045   0.00   0.00      AllocSetAlloc
     3.23    1.38    0.18    174871   0.00   0.00      nocachegetattr
     2.92    1.54    0.16   1634957   0.00   0.00      AllocSetFreeIndex
     2.50    1.68    0.14     20303   0.00   0.00      heapgettup
     1.93    1.79    0.11      1003   0.00   0.00      AtEOXact_CatCache
     1.76    1.89    0.10    128442   0.00   0.00      hash_any
     1.72    1.98    0.10     90312   0.00   0.00      FunctionCall3
     1.69    2.08    0.09     50632   0.00   0.00      ExecTargetList
     1.60    2.17    0.09     51647   0.00   0.00      heap_formtuple
     1.55    2.25    0.09    406162   0.00   0.00      newNode
     1.46    2.33    0.08    133044   0.00   0.00      hash_search

It is the same query with slightly different data (The Sun has probably.. 20-40k more rows in the table the query hits). I'll be digging up more info later today.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/
On Mon, 2003-10-06 at 05:15, Andrew Sullivan wrote: > There's plenty of academic work which purports to show that LRU is > far from the best choice. Just in principle, it seems obvious that a > single-case seqscan-type operation (such as vacuum does) is a good > way to lose your cache for no real gain. Traditionally, seqscan type operations are accommodated in LRU type managers by having multiple buffer promotion policies, primarily because it is simple to implement. For example, if you are doing a seqscan, a buffer loaded from disk is never promoted to the top of the LRU. Instead it is only partially promoted (say, halfway for example) toward the top of the buffer list. A page that is already in the buffer is promoted either to the halfway point or top depending on where it was found. There are numerous variations on the idea, some being more clever and complex than others. The point of this being that a pathological or rare sequential scan can never trash more than a certain percentage of the cache, while not significantly impacting the performance of a sequential scan. The primary nuisance is that it slightly increases the API complexity. I'll add that I don't know what PostgreSQL actually does in this regard, but from the thread it appears as though seqscans are handled like the default case. Cheers, -James Rogers jamesr@best.com
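A toy illustration of the partial-promotion idea James describes (this is not PostgreSQL code, and it ignores the already-cached case he mentions): pages read by a seqscan enter the LRU order at the midpoint instead of the head, so one big scan can never displace more than the colder half of the cache.

    #define NBUF 16                 /* toy cache size */

    static int lru[NBUF];           /* lru[0] is the most recently used page */

    /* insert a page at the given position, shifting colder entries toward the
     * tail; whatever falls off the end is the evicted page */
    static void insert_at(int pos, int page)
    {
        int i;
        for (i = NBUF - 1; i > pos; i--)
            lru[i] = lru[i - 1];
        lru[pos] = page;
    }

    /* ordinary reads are promoted to the head; seqscan reads only to the middle */
    void page_read(int page, int part_of_seqscan)
    {
        insert_at(part_of_seqscan ? NBUF / 2 : 0, page);
    }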
On Sun, 2003-10-05 at 19:50, Neil Conway wrote: > On Sun, 2003-10-05 at 19:43, Tom Lane wrote: > > This would be relatively easy to fix as far as our own buffering is > > concerned, but the thing that's needed to make it really useful is > > to prevent caching of seqscan-read pages in the kernel disk buffers. > For the non-portable way of doing this, are you referring to O_DIRECT? I was hoping you'd reply to this, Tom -- you were referring to O_DIRECT, right? (If you were referring to O_DIRECT, I wanted to add that I wouldn't be surprised if using O_DIRECT on many kernels reduces or eliminates any readahead the OS will be doing on the sequential read, so the net result may actually be a loss for a typical seqscan.) -Neil
Neil Conway <neilc@samurai.com> writes: > On Sun, 2003-10-05 at 19:50, Neil Conway wrote: > I was hoping you'd reply to this, Tom -- you were referring to O_DIRECT, > right? Not necessarily --- as you point out, it's not clear that O_DIRECT would help us. What would be way cool is something similar to what James Rogers was talking about: a way to tell the kernel not to promote this page all the way to the top of its LRU list. I'm not sure that *any* Unixen have such an API, let alone one that's common across more than one platform :-( regards, tom lane
--On Monday, October 06, 2003 14:26:10 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Neil Conway <neilc@samurai.com> writes:
>> On Sun, 2003-10-05 at 19:50, Neil Conway wrote:
>> I was hoping you'd reply to this, Tom -- you were referring to O_DIRECT,
>> right?
>
> Not necessarily --- as you point out, it's not clear that O_DIRECT would
> help us. What would be way cool is something similar to what James
> Rogers was talking about: a way to tell the kernel not to promote this
> page all the way to the top of its LRU list. I'm not sure that *any*
> Unixen have such an API, let alone one that's common across more than
> one platform :-(

I think Veritas' VxFS has this as an option/IOCTL. You can read the Veritas doc on my http://www.lerctr.org:8458/ pages under filesystems.

That should work on UnixWare and Solaris sites that have VxFS installed.

VxFS is standard on UW.

LER

--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749
Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > On Sun, 2003-10-05 at 19:50, Neil Conway wrote:
> > I was hoping you'd reply to this, Tom -- you were referring to O_DIRECT,
> > right?
>
> Not necessarily --- as you point out, it's not clear that O_DIRECT would
> help us. What would be way cool is something similar to what James
> Rogers was talking about: a way to tell the kernel not to promote this
> page all the way to the top of its LRU list. I'm not sure that *any*
> Unixen have such an API, let alone one that's common across more than
> one platform :-(

Solaris has "free-behind", which prevents a large kernel sequential scan from blowing out the cache. I only read about it in the Mauro Solaris Internals book, and it seems to be done automatically. I guess most OS's don't do this optimization because they usually don't read files larger than their cache.

I see BSD/OS madvise() has:

        #define MADV_NORMAL     0   /* no further special treatment */
        #define MADV_RANDOM     1   /* expect random page references */
        #define MADV_SEQUENTIAL 2   /* expect sequential references */
        #define MADV_WILLNEED   3   /* will need these pages */
    --> #define MADV_DONTNEED   4   /* don't need these pages */
        #define MADV_SPACEAVAIL 5   /* insure that resources are reserved */

The marked one seems to have the control we need. Of course, the kernel madvise() code has:

        /* Not yet implemented */

Looks like NetBSD implements it, but it also unmaps the page from the address space, which might be more than we want. NetBSD also has:

        #define MADV_FREE       6   /* pages are empty, free them */

which frees the page. I am unclear on its use.

FreeBSD has this comment:

    /*
     * vm_page_dontneed
     *
     *  Cache, deactivate, or do nothing as appropriate. This routine
     *  is typically used by madvise() MADV_DONTNEED.
     *
     *  Generally speaking we want to move the page into the cache so
     *  it gets reused quickly. However, this can result in a silly syndrome
     *  due to the page recycling too quickly. Small objects will not be
     *  fully cached. On the otherhand, if we move the page to the inactive
     *  queue we wind up with a problem whereby very large objects
     *  unnecessarily blow away our inactive and cache queues.
     *
     *  The solution is to move the pages based on a fixed weighting. We
     *  either leave them alone, deactivate them, or move them to the cache,
     *  where moving them to the cache has the highest weighting.
     *  By forcing some pages into other queues we eventually force the
     *  system to balance the queues, potentially recovering other unrelated
     *  space from active. The idea is to not force this to happen too
     *  often.
     */

The Linux comment is:

    /*
     * Application no longer needs these pages. If the pages are dirty,
     * it's OK to just throw them away. The app will be more careful about
     * data it wants to keep. Be sure to free swap resources too. The
     * zap_page_range call sets things up for refill_inactive to actually free
     * these pages later if no one else has touched them in the meantime,
     * although we could add these pages to a global reuse list for
     * refill_inactive to pick up before reclaiming other pages.
     *
     * NB: This interface discards data rather than pushes it out to swap,
     * as some implementations do. This has performance implications for
     * applications like large transactional databases which want to discard
     * pages in anonymous maps after committing to backing store the data
     * that was kept in them. There is no reason to write this data out to
     * the swap area if the application is discarding it.
     *
     * An interface that causes the system to free clean pages and flush
     * dirty pages is already available as msync(MS_INVALIDATE).
     */

It seems mmap is more for controlling the memory mapping of files rather than controlling the cache itself.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
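As a concrete (and heavily platform-dependent) sketch of the madvise() route: this assumes the relation file were mmap'd, which as noted above is a different model from the buffered read() path PostgreSQL actually uses, and MADV_DONTNEED semantics vary between the systems quoted above.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    long scan_file(const char *path)
    {
        struct stat st;
        long sum = 0;
        off_t i;
        char *p;
        int fd = open(path, O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0) return -1;
        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { close(fd); return -1; }

        madvise(p, st.st_size, MADV_SEQUENTIAL);  /* hint: one pass, front to back */
        for (i = 0; i < st.st_size; i++)
            sum += p[i];                          /* stand-in for the real scan */
        madvise(p, st.st_size, MADV_DONTNEED);    /* ask the kernel to drop the pages */

        munmap(p, st.st_size);
        close(fd);
        return sum;
    }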
Stepping out on a limb... (I'm not a disk kernel guy)

I have long thought that as part of a cache descriptor, there should be a process-definable replacement-strategy (RS). Each cache entry would be associated with each process's replacement-strategy variable and the page-replacement algorithm would then take into consideration the desired policy.

Imagine, for simplicity's sake, that each strategy gets its own cache table. When it comes time to replace a page, the system scans the cache tables, picks the most likely page for replacement from each table, then selects the most likely page between all policies. This allows the 99% of apps that can make excellent use of LRU to use LRU among themselves (best for everyone), and the MRU apps (better for databases, and best for everyone too) to only sacrifice the best pages among themselves. Though once you have an MRU process, the final decision on which page to take should use MRU, not LRU.

Of course there are a number of questions: does each RS get its own table, to be managed independently, or can we combine them all into one table? What are the performance implications of the multiple table management?

One day, I'd like to see function pointers and kernel modules used as ways for apps to manage replacement policy.

    fantasyland# insmod MRU.o
    fantasyland# vi postgresql.conf
        { replacement_policy=MRU }
    {meanwhile in some postgres .c file:}
        set_cache_policy(get_cfg_replacement_policy());
    fantasyland# service postmaster restart

Anyone want to throw this at the kernel developers?
> This would be relatively easy to fix as far as our own buffering is > concerned, but the thing that's needed to make it really useful is > to prevent caching of seqscan-read pages in the kernel disk buffers. > I don't know any portable way to do that :-( raw disc ? :-)