Thread: Re: Working on huge RAM based datasets
> What is it about the buffer cache that makes it so unhappy being able to > hold everything? I don't want to be seen as a cache hit fascist, but isn't > it just better if the data is just *there*, available in the postmaster's > address space ready for each backend process to access it, rather than > expecting the Linux cache mechanism, optimised as it may be, to have to do > the caching? The disk cache on most operating systems is optimized. Plus, keeping shared buffers low gives you more room to bump up the sort memory, which will make your big queries run faster. Merlin
>The disk cache on most operating systems is optimized. Plus, keeping shared buffers low gives you more room to bump up the sort memory, which will make your big queries run faster. Thanks merlin, Whether the OS caches the data or PG does, you still want it cached. If your sorting backends gobble up the pages that otherwise would be filled with the database buffers, then your postmaster will crawl, as it'll *really* have to wait for stuff from disk. In my scenario, you'd spec the machine so that there would be plenty of memory for *everything*. On your OS optimisation point, OS caches are, of course, optimised. But people have told me that PG's caching strategy is simply less well optimised, and *that* is the reason for keeping the shared buffer cache down in my scenario. That's a shame in a way, but I understand why it is the way it is - other things have been addressed which speed up operations in different ways. My 'all in RAM' scenario is very rare at the moment, so why waste valuable development resources on developing optimised RAM based data structures to hold the data for quicker query execution when hardly anyone will see the benefit? However - it won't be so rare for too much longer... If I gave you a half a terabyte of RAM and a 4 processor 64 bit machine, I'm sure you could imagine how much quicker databases could run if they were optimised for this sort of platform. Anyway, I'm looking forward to experimenting with stuff the way it works at the moment. Many thanks, Andy
On 7/9/2004 10:16 AM, Merlin Moncure wrote: >> What is it about the buffer cache that makes it so unhappy being able > to >> hold everything? I don't want to be seen as a cache hit fascist, but > isn't >> it just better if the data is just *there*, available in the > postmaster's >> address space ready for each backend process to access it, rather than >> expecting the Linux cache mechanism, optimised as it may be, to have > to do >> the caching? > > The disk cache on most operating systems is optimized. Plus, keeping > shared buffers low gives you more room to bump up the sort memory, which > will make your big queries run faster. Plus, the situation will change dramatically with 7.5 where the disk cache will have less information than the PG shared buffers, which will become sequential scan resistant and will know that a block was pulled in on behalf of vacuum and not because the regular database access pattern required it. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Martha Stewart called it a Good Thing when JanWieck@Yahoo.com (Jan Wieck) wrote: > On 7/9/2004 10:16 AM, Merlin Moncure wrote: >>> What is it about the buffer cache that makes it so unhappy being >>> able to hold everything? I don't want to be seen as a cache hit >>> fascist, but isn't it just better if the data is just *there*, >>> available in the postmaster's address space ready for each backend >>> process to access it, rather than expecting the Linux cache >>> mechanism, optimised as it may be, to have to do the caching? >> The disk cache on most operating systems is optimized. Plus, >> keeping shared buffers low gives you more room to bump up the sort >> memory, which will make your big queries run faster. > Plus, the situation will change dramatically with 7.5 where the disk > cache will have less information than the PG shared buffers, which > will become sequential scan resistant and will know that a block was > pulled in on behalf of vacuum and not because the regular database > access pattern required it. It'll be very curious how this changes things. I _think_ it means that shared buffer usage becomes more efficient both for small and large buffers, since vacuums and seq scans shouldn't "eviscerate" the shared buffers the way they can in earlier versions. What would be most interesting to see is whether this makes it wise to increase shared buffer size. It may be more effective to bump down the cache a little, and bump up sort memory; hard to tell. -- let name="cbbrowne" and tld="ntlug.org" in String.concat "@" [name;tld];; http://cbbrowne.com/info/spreadsheets.html "But life wasn't yes-no, on-off. Life was shades of gray, and rainbows not in the order of the spectrum." -- L. E. Modesitt, Jr., _Adiamante_
Jan wrote: > > The disk cache on most operating systems is optimized. Plus, keeping > > shared buffers low gives you more room to bump up the sort memory, which > > will make your big queries run faster. > > Plus, the situation will change dramatically with 7.5 where the disk > cache will have less information than the PG shared buffers, which will > become sequential scan resistant and will know that a block was pulled > in on behalf of vacuum and not because the regular database access > pattern required it. Hm. In my experience the different between data cached between shared buffers and the O/S is not very much...both are fast. However, I almost always see dramatic performance speedups for bumping up work mem. Are you suggesting that it will be advantageous to bump up shared buffers? Merlin
Andy wrote: > Whether the OS caches the data or PG does, you still want it cached. If > your > sorting backends gobble up the pages that otherwise would be filled with > the > database buffers, then your postmaster will crawl, as it'll *really* have > to > wait for stuff from disk. In my scenario, you'd spec the machine so that > there would be plenty of memory for *everything*. That's the whole point: memory is a limited resource. If pg is crawling, then the problem is simple: you need more memory. The question is: is it postgresql's responsibility to manage that resource? Pg is a data management tool, not a memory management tool. The same 'let's manage everything' argument also frequently gets brought up wrt file i/o, because people assume the o/s sucks at file management. In reality, they are quite good, and through use of the generic interface the administrator is free to choose a file system that best suits the needs of the application. At some point, hard disks will be replaced by solid state memory technologies...do you really want to recode your memory manager when this happens because all your old assumptions are no longer correct? Merlin
> What would be most interesting to see is whether this makes it wise to > increase shared buffer size. It may be more effective to bump down > the cache a little, and bump up sort memory; hard to tell. How do we go about scheduling tests with the OSDL folks? If they could do 10 runs with buffers between 1k and 500k it would help us get a broad view of the situation.
Rond, Chris, > > What would be most interesting to see is whether this makes it wise to > > increase shared buffer size. It may be more effective to bump down > > the cache a little, and bump up sort memory; hard to tell. > > How do we go about scheduling tests with the OSDL folks? If they could > do 10 runs with buffers between 1k and 500k it would help us get a broad > view of the situation. Yes. We'll need to. However, I'd like to wait until we're officially in Beta. I'll be seeing the OSDL folks in person (PostgreSQL+OSDL BOF at Linux World Expo!!) in a couple of weeks. -- Josh Berkus Aglio Database Solutions San Francisco
On 7/12/2004 12:38 PM, Josh Berkus wrote: > Rond, Chris, > >> > What would be most interesting to see is whether this makes it wise to >> > increase shared buffer size. It may be more effective to bump down >> > the cache a little, and bump up sort memory; hard to tell. >> >> How do we go about scheduling tests with the OSDL folks? If they could >> do 10 runs with buffers between 1k and 500k it would help us get a broad >> view of the situation. > > Yes. We'll need to. However, I'd like to wait until we're officially in > Beta. I'll be seeing the OSDL folks in person (PostgreSQL+OSDL BOF at Linux > World Expo!!) in a couple of weeks. > Don't forget to add that ARC needs some time actually to let the algorithm adjust the queue sizes and populate the cache according to the access pattern. You can't start a virgin postmaster and then slam on the accellerator of your test application by launching 500 concurrent clients out of the blue and expect that it starts off airborne. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
Sorry for the late reply - I've been away, and I've had problems posting too :( Merlin, I'd like to come back with a few more points! >That's the whole point: memory is a limited resource. If pg is >crawling, then the problem is simple: you need more memory. My posting only relates to the scenario where RAM is not a limiting factor, a scenario which shall become increasingly common over the next few years, as 64 bit processors and OSs allow the exploitation of ever larger, ever cheaper RAM. Incidentally, If PG is crawling, memory might be the problem...but not necessarily - could be disk bound on writes. > The question is: is it postgresql's responsibility to manage that resource? I think you are confusing the issue of RAM and address space. Any application can acquire a piece of address space for its own use. It is the responsibility of the application to do what it needs with that address space. I'm interested in how PG could do something better in its address space when it knows that it can fit all the data it operates on within that address space. Though the OS is responsible for determining whether that address space is RAM resident or not, in my scenario, this is irrelevant, because there *will* be enough RAM for everything, and the OS will, in that scenario, allow all the address space to become RAM resident. I am not advocating undermining the OS in any way. It would be stupid to make PGSQL take over the running of the hardware. I've learned the hard way that bypassing the OS is just a big pain up the backside! >Pg is a data management tool, not a memory management tool. I'm not criticising PG. PG is actually a 'DISK/MEMORY' data management tool. It manages data which lives on disks, but it can only operate on that data in memory, and goes to some lengths to try to fit bits of disk data in a defined piece of memory, and push them back out again. At the moment, this model assumes that RAM is a scarce resource. The model still 'sort of' works when RAM is actually not scarce, because the OS effectively uses that extra RAM to make IO *appear* to be quicker, and indeed, I've found that a hint has been added to PG to tell it how much the OS is likely to be caching. But the question is this: "If you wrote a DB from scratch with the assumption that *all* the data could fit in the address space allocated by the postmaster, and you were confident that the OS had enough RAM so that you never suffered vmem page misses, couldn't you make things go much faster?" A more pertinent question is: "Could PG be extended to have a flag, which when enabled, told it to operate with the assumption that it could fit all the disk data in RAM, and implement the data organisation optimisations that rely on the persistence of data in address space?" >The same >'let's manage everything' argument also frequently gets brought up wrt >file i/o, because people assume the o/s sucks at file management. Well, I'm not saying this. I have substantial experience with high performance file IO through a filesystem. But if you are interested in high speed IO, naive 'let the OS do everything' approach isn't often good enough. You, the application, have to be aware that the order and timing of IO requests, along with the size of IO block you cause to trigger, have a dramatic impact on the speed with which the data reaches your app, OS or no OS. Most high speed storage still relies on spinning things containing data that can only be accessed in a certain way, and data movement is page boundary sensitive. The OS may hide these details from you, but you, the app writer, have to have an understanding of the underlying reality if you want to optimise performance. I want to stress that at no point am I advocating *not* using the OS. PG should do ALL IO and memory allocation through the OS, otherwise you end up with a platform specific product that is of little use. That given, there is still the opportunity for PG to be able to operate far more efficiently in my high memory scenario. Wouldn't your backend processes like to have the entire database sitting ready in address space (ram resident, of course!), indexes all fully built? No tuple more than a few machine instructions away? Imagine the postmaster isn't having to frantically decide which bits of data to kick out of the workspace in order to keep the backends happy. Imagine the postmaster isn't having to build structures to keep track of the newly read in blocks of data from 'disk' (or OS cache). Imagine that everything was just there... Is this not a compelling scenario? >At some point, hard disks will be replaced by solid state memory >technologies... This is irrelevant to my scenario. The optimisations I crave are to do with getting the entire database in a query-optimised form near to the CPUS - i.e. in fast RAM. (I'd expect solid state disk ram to be much slower than the RAM that sits nearer the CPU). The speed of the persistent storage system (whether spinning platters or some sort of persistent solid state memory) isn't really of direct relevance to my argument. Solid state disks would allow write speeds to improve, which would add to the gains which I am fishing for. So when they come, I'll be happy. Still. Solid state disks aren't really an option right now. Big RAM is. >do you really want to recode your memory manager when >this happens because all your old assumptions are no longer correct? My scenario assumes nothing about how the data is stored, but you are right to flag the problems that arise when original assumptions about hardware become incorrect. For example, PG assumes that RAM is a rare resource, and it assumes the postmaster cannot fit the entire database in a single address space. *These* assumptions are now not correct, following the 64bit address space breakthrough. The availability of 64 bit addressing and huge banks of RAM is of enormous significance to databases, and it is the whole reason for my post. Over the next 5-10 years, an increasing proportion of databases will fit comfortably in RAM resident address space on commodity equipment. My particular nationwide application will fit into RAM now. Hence my interest! So, the question for the people involved in PG is: *how* can PG be improved to make use of this, and reap the very substantial speed gains in this scenario, without breaking the existing usage scenarios of PG in the traditional 'DB > RAM' scenario? The answer isn't "undermine the OS". The answer might be"make the postmaster able to build and operate with persistent, query optimised representations of the disk data". Yes, I guess that might be a lot of work.. But the DB that delivers this performance will be very well placed in the next 5 years, don't you think? Anyway - I look forward to further feedback, and thanks for your comments so far. Regards, Andy
Sorry for the late reply - I've been away. Merlin, I'd like to come back with a few more points! >That's the whole point: memory is a limited resource. If pg is >crawling, then the problem is simple: you need more memory. My posting only relates to the scenario where RAM is not a limiting factor, a scenario which shall become increasingly common over the next few years, as 64 bit processors and OSs allow the exploitation of ever larger, ever cheaper RAM. > The question is: is it postgresql's responsibility to manage that resource? I think you are confusing the issue of RAM and address space. Any application can acquire a piece of address space for its own use. It is the responsibility of the application to do what it needs with that address space. I'm interested in how PG could do something better in its address space when it knows that it can fit all the data it operates on within that address space. Though the OS is responsible for determining whether that address space is RAM resident or not, in my scenario, this is irrelevant, because there *will* be enough RAM for everything, and the OS will, in that scenario, allow all the address space to become RAM resident. I am not advocating undermining the OS in any way. It would be stupid to make PGSQL take over the running of the hardware. >Pg is a data management tool, not a memory management tool. I'm not criticising PG. PG is actually a 'DISK/MEMORY' data management tool. It manages data which lives on disks, but it can only operate on that data in memory, and goes to some lengths to try to fit bits of disk data in a defined piece of memory, and push them back out again. At the moment, this model assumes that RAM is a scarce resource. The model still 'sort of' works when RAM is actually not scarce, because the OS effectively uses that extra RAM to make IO *appear* to be quicker, and indeed, I've found that a hint has been added to PG to tell it how much the OS is likely to be caching. But the question is this: "If you wrote a DB from scratch with the assumption that *all* the data could fit in the address space allocated by the postmaster, and you were confident that the OS had enough RAM so that you never suffered vmem page misses, couldn't you make things go much faster?" A more pertinent question is: "Could PG be extended to have a flag, which when enabled, told it to operate with the assumption that it could fit all the disk data in RAM, and implement the data organisation optimisations that rely on the persistence of data in address space?" >The same >'let's manage everything' argument also frequently gets brought up wrt >file i/o, because people assume the o/s sucks at file management. Well, I'm not saying this. I have substantial experience with high performance file IO through a filesystem. But if you are interested in high speed IO, naive 'let the OS do everything' approach isn't often good enough. You, the application, have to be aware that the order and timing of IO requests, along with the size of IO block you cause to trigger, have a dramatic impact on the speed with which the data reaches your app, OS or no OS. Most high speed storage still relies on spinning things containing data that can only be accessed in a certain way, and data movement is page boundary sensitive. The OS may hide these details from you, but you, the app writer, have to have an understanding of the underlying reality if you want to optimise performance. I want to stress that at no point am I advocating *not* using the OS. PG should do ALL IO and memory allocation through the OS, otherwise you end up with a platform specific product that is of little use. That given, there is still the opportunity for PG to be able to operate far more efficiently in my high memory scenario. Wouldn't your backend processes like to have the entire database sitting ready in address space (ram resident, of course!), indexes all fully built? No tuple more than a few machine instructions away? Imagine the postmaster isn't having to frantically decide which bits of data to kick out of the workspace in order to keep the backends happy. Imagine the postmaster isn't having to build structures to keep track of the newly read in blocks of data from 'disk' (or OS cache). Is this not a compelling scenario? >At some point, hard disks will be replaced by solid state memory >technologies... This is irrelevant to my scenario, though solid state disks would allow write speeds to improve, which would add to the gains which I am fishing for here. >do you really want to recode your memory manager when >this happens because all your old assumptions are no longer correct? My scenario assumes nothing about how the data is stored, but you are right to flag the problems that arise when original assumptions about hardware become incorrect. For example, PG assumes that RAM is a rare resource, and it assumes the postmaster cannot fit the entire database in a single address space. *These* assumptions are now not correct, following the 64bit address space breakthrough. The availability of 64 bit addressing and huge banks of RAM is of enormous significance to databases, and itIt is the whole reason for my post. Over the next 5-10 years, an increasing proportion of databases will fit comfortably in RAM resident address space on commodity equipment. So, the question for the people involved in PG is: *how* can PG be improved to make use of this, and reap the very substantial speed gains in this scenario, without breaking the existing usage scenarios of PG in the traditional 'DB > RAM' scenario? The answer isn't "undermine the OS". The answer is "make the postmaster able to build and operate with persistent, query optimised representations of the disk data". Yes, I guess that might be a lot of work.. But the DB that delivers this performance will be very well placed in the next 5 years, don't you think? Thanks for your comments, Regards, Andy