Thread: Multiple Storage per Tablespace, or Volumes
Hi list,

Here's a proposal for an idea that stole a good part of my night. I'll present the idea first, then two use cases with some rationale and a few details. Please note I won't be able to participate in any development effort associated with this idea, should such a thing happen!

The bare idea is to provide a way to 'attach' multiple storage facilities (say, volumes) to a given tablespace. Each volume may be attached in READ ONLY, READ WRITE or WRITE ONLY mode. You can mix RW and WO volumes in the same tablespace, but can't have RO with any W form, or so I think. It would be pretty handy to be able to add and remove volumes on a live cluster, and this could be a way to implement moving/extending tablespaces.

Use Case A: better read performance while keeping data write reliability

The first application of this multiple-volumes-per-tablespace idea is to keep a tablespace both in RAM (tmpfs or ramfs) and on disk (both RW). PG should then be able to read from both volumes when dealing with read queries, and would have to fwrite()/fsync() both volumes for each write. Of course, write speed will be constrained by the slowest volume, but the quicker one would be able to absorb some amount of read queries in the meantime. It would be neat if PG could account for the volumes' relative write speeds in order to assign weights to each tablespace volume, and have the planner or executor spread read queries among volumes depending on that. For example, if a single query has a plan containing several full scans (of indexes and/or tables) in the same tablespace, those could be done on different volumes.

Use Case B: Synchronous Master/Slave(s) Replication

By using a distributed file system capable of being mounted from several nodes at the same time, we could have a configuration where a master node has ('exports') a WO tablespace volume, and one or more slaves (depending on FS capability) configure a RO tablespace volume. PG then has to be able to cope with a RO volume: the data are not written by PG itself (from the local node's point of view), so some limitations would certainly apply. Will it be possible, for example, to add indexes to the data on slaves? I'd use the solution even without this, though. When the master/slave link is broken, the master can no longer write to the tablespace, as if it were a local disk failure of some sort, so this should prevent nasty desync problems: data is written on all W volumes or data is not written at all.

I realize this proposal is only a first draft of work to be done, and that I won't be able to do much more than draft the idea. This mail is sent to the hackers list in the hope someone there will find it worth considering and polishing... Regards, and thanks for the good work ;) -- Dimitri Fontaine
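To make the shape of the proposal concrete, here is a minimal sketch of what such DDL might look like. None of this syntax exists in PostgreSQL; the VOLUME/MODE clauses and all of the names are invented purely to illustrate the idea.

    -- Hypothetical DDL only; nothing below is real PostgreSQL syntax.
    CREATE TABLESPACE hot_data LOCATION '/mnt/disk1/pg_hot';

    -- Use Case A: mirror the tablespace onto a RAM-backed volume, both read-write.
    ALTER TABLESPACE hot_data ADD VOLUME '/mnt/ramfs/pg_hot' MODE READ WRITE;

    -- Use Case B, master side: also export a write-only volume over a shared filesystem.
    ALTER TABLESPACE hot_data ADD VOLUME '/mnt/dfs/pg_hot' MODE WRITE ONLY;

    -- Use Case B, slave side: the same shared path attached read-only
    -- (per the proposal, a RO volume cannot be mixed with writable ones).
    CREATE TABLESPACE hot_data_replica VOLUME '/mnt/dfs/pg_hot' MODE READ ONLY;

    -- Volumes could be attached and detached on a live cluster.
    ALTER TABLESPACE hot_data DROP VOLUME '/mnt/ramfs/pg_hot';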
On Mon, Feb 19, 2007 at 11:25:41AM +0100, Dimitri Fontaine wrote: > Hi list, > > Here's a proposal of this idea which stole a good part of my night. > I'll present first the idea, then 2 use cases where to read some rational and > few details. Please note I won't be able to participate in any development > effort associated with this idea, may such a thing happen! > > The bare idea is to provide a way to 'attach' multiple storage facilities (say > volumes) to a given tablespace. Each volume may be attached in READ ONLY, > READ WRITE or WRITE ONLY mode. > You can mix RW and WO volumes into the same tablespace, but can't have RO with > any W form, or so I think. Somehow this seems like implementing RAID within postgres, which seems a bit outside of the scope of a DB. > Use Case A: better read performances while keeping data write reliability > > The first application of this multiple volumes per tablespace idea is to keep > a tablespace both into RAM (tmpfs or ramfs) and on disk (both RW). For example, I don't believe there is a restriction against having one member of a RAID array be a RAM disk. > Use Case B: Synchronous Master Slave(s) Replication > > By using a Distributed File System capable of being mounted from several nodes > at the same time, we could have a configuration where a master node has > ('exports') a WO tablespace volume, and one or more slaves (depending on FS > capability) configures a RO tablespace volume. Here you have the problem of row visibility. The data in the table isn't very useful without the clog, and that's not stored in a tablespace... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout <kleptog@svana.org> writes: > Somehow this seems like implementing RAID within postgres, RAID and LVM too. I can't get excited about re-inventing those wheels when perfectly good implementations already exist for us to sit on top of. regards, tom lane
On Monday 19 February 2007 16:33, Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > > Somehow this seems like implementing RAID within postgres, > > RAID and LVM too. I can't get excited about re-inventing those wheels > when perfectly good implementations already exist for us to sit on top of. I thought moving some knowledge about data availability into the PostgreSQL code could provide a valuable performance benefit, allowing it to organize reads (for example, parallel table scans/index scans against different volumes) and to obtain data from whichever volume is known to be quicker (or least loaded). You're both saying RAID/LVM implementations provide good enough performance that PG doesn't have to go this way, if I understand correctly. And distributed file systems are enough to cover the replication side, without PG having to deal explicitly with the work involved. Maybe I should have slept after all ;) Thanks for your time and comments, regards, -- Dimitri Fontaine
Dimitri Fontaine <dim@dalibo.com> writes: > You're both saying RAID/LVM implementations provide good enough performances > for PG not having to go this way, if I understand correctly. There's certainly no evidence to suggest that reimplementing them ourselves would be a productive use of our time. regards, tom lane
On Mon, Feb 19, 2007 at 05:10:36PM +0100, Dimitri Fontaine wrote: > > RAID and LVM too. I can't get excited about re-inventing those wheels > > when perfectly good implementations already exist for us to sit on top of. > > I though moving some knowledge about data availability into PostgreSQL code > could provide some valuable performance benefit, allowing to organize reads > (for example parallel tables scan/indexes scan to different volumes) and > obtaining data from 'quicker' known volume (or least used/charged). Well, organising requests to be handled quickly is a function of LVM/RAID, so we don't go there. However, speeding up scans by having multiple requests is an interesting approach, as would perhaps a different random_page_cost for different tablespaces. My point is, don't try to implement the mechanics of LVM/RAID into postgres, instead, work on providing ways for users to take advantage of these mechanisms if they have them. Look at it as if you have got LVM/RAID setup for your ideas, how do you get postgres to take advantage of them? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > > Somehow this seems like implementing RAID within postgres, > > RAID and LVM too. I can't get excited about re-inventing those > wheels when perfectly good implementations already exist for us to > sit on top of. I expect that someone will point out that Windows doesn't support RAID or LVM, and we'll have to reimplement it anyway. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Peter Eisentraut wrote: > Tom Lane wrote: >> Martijn van Oosterhout <kleptog@svana.org> writes: >>> Somehow this seems like implementing RAID within postgres, >> RAID and LVM too. I can't get excited about re-inventing those >> wheels when perfectly good implementations already exist for us to >> sit on top of. > > I expect that someone will point out that Windows doesn't support RAID > or LVM, and we'll have to reimplement it anyway. Windows supports both RAID and LVM. //Magnus
Peter Eisentraut wrote: > Tom Lane wrote: >> Martijn van Oosterhout <kleptog@svana.org> writes: >>> Somehow this seems like implementing RAID within postgres, >> RAID and LVM too. I can't get excited about re-inventing those >> wheels when perfectly good implementations already exist for us to >> sit on top of. > > I expect that someone will point out that Windows doesn't support RAID > or LVM, and we'll have to reimplement it anyway. Windows has supported software RAID just fine since Windows 2000 or so ... Stefan
Magnus Hagander wrote: > Windows supports both RAID and LVM. Oh good, so we've got that on record. :) -- Peter Eisentraut http://developer.postgresql.org/~petere/
Stefan Kaltenbrunner wrote: > Peter Eisentraut wrote: >> Tom Lane wrote: >>> Martijn van Oosterhout <kleptog@svana.org> writes: >>>> Somehow this seems like implementing RAID within postgres, >>> RAID and LVM too. I can't get excited about re-inventing those >>> wheels when perfectly good implementations already exist for us to >>> sit on top of. >> I expect that someone will point out that Windows doesn't support RAID >> or LVM, and we'll have to reimplement it anyway. > > windows supports software raid just fine since Windows 2000 or so ... Longer than that... it supported mirroring and raid 5 in NT4 and possibly even NT3.51 IIRC. Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote: > Martijn van Oosterhout <kleptog@svana.org> writes: > > Somehow this seems like implementing RAID within postgres, > > RAID and LVM too. I can't get excited about re-inventing those wheels > when perfectly good implementations already exist for us to sit on top of. Ok, warning, this is a "you know what would be sweet" moment. What would be nice is to be able to detach one of the volumes, and know the span of the data in there without being able to access the data. The problem that a lot of warehouse operators have is something like this: "We know we have all this data, but we don't know what we will want to do with it later. So keep it all. I'll get back to you when I want to know something." It'd be nice to be able to load up all that data once, and then shunt it off into (say) read-only media. If one could then run a query that would tell one which spans of data are candidates for the search, you could bring back online (onto reasonably fast storage, for instance) just the volumes you need to read. A -- Andrew Sullivan | ajs@crankycanuck.ca Users never remark, "Wow, this software may be buggy and hard to use, but at least there is a lot of code underneath." --Damien Katz
Andrew Sullivan wrote: > On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote: >> Martijn van Oosterhout <kleptog@svana.org> writes: >>> Somehow this seems like implementing RAID within postgres, >> RAID and LVM too. I can't get excited about re-inventing those wheels >> when perfectly good implementations already exist for us to sit on top of. > > Ok, warning, this is a "you know what would be sweet" moment. The dreaded words from a developer's mouth to every manager in the world. Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
On Mon, 19 Feb 2007, Joshua D. Drake wrote: > Longer than that... it supported mirroring and raid 5 in NT4 and > possibly even NT3.51 IIRC. Mirroring and RAID 5 go back to Windows NT 3.1 Advanced Server in 1993: http://support.microsoft.com/kb/114779 http://www.byte.com/art/9404/sec8/art7.htm The main source of confusion about current support for this feature is that the desktop/workstation version of Windows don't have it. For Windows XP, you need the XP Professional version to get "dynamic disk" support; it's not in the home edition. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Joshua D. Drake wrote: > Andrew Sullivan wrote: > > On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote: > >> Martijn van Oosterhout <kleptog@svana.org> writes: > >>> Somehow this seems like implementing RAID within postgres, > >> RAID and LVM too. I can't get excited about re-inventing those wheels > >> when perfectly good implementations already exist for us to sit on top of. > > > > Ok, warning, this is a "you know what would be sweet" moment. > > The dreaded words from a developers mouth to every manager in the world. Yea, I just instinctively hit "delete" when I saw that phrase. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Andrew Sullivan wrote: > On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote: > > Martijn van Oosterhout <kleptog@svana.org> writes: > > > Somehow this seems like implementing RAID within postgres, > > > > RAID and LVM too. I can't get excited about re-inventing those wheels > > when perfectly good implementations already exist for us to sit on top of. > > Ok, warning, this is a "you know what would be sweet" moment. > > What would be nice is to be able to detach one of the volumes, and > know the span of the data in there without being able to access the > data. > > The problem that a lot of warehouse operators have is something like > this: "We know we have all this data, but we don't know what we will > want to do with it later. So keep it all. I'll get back to you when > I want to know something." You should be able to do that with tablespaces and VACUUM FREEZE, the point of the latter being that you can take the disk containing the "read only" data offline, and still have the data readable after plugging it back in, no matter how far along the transaction ID counter is. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
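As a concrete sketch of the workflow Alvaro describes (ordinary 8.x statements; the table and tablespace names are made up):

    -- Move a table whose contents will no longer change onto an archive tablespace,
    -- then freeze it so its tuples stay visible no matter how far the XID counter advances.
    CREATE TABLESPACE archive LOCATION '/mnt/slow_disks/pg_archive';
    ALTER TABLE sales_2005 SET TABLESPACE archive;
    VACUUM FREEZE sales_2005;
    -- In principle the disk behind 'archive' could now be unmounted and shelved;
    -- the next message explains why that is not actually safe today.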
On Mon, 2007-02-19 at 17:35 -0300, Alvaro Herrera wrote: > Andrew Sullivan wrote: > > On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote: > > > Martijn van Oosterhout <kleptog@svana.org> writes: > > > > Somehow this seems like implementing RAID within postgres, > > > > > > RAID and LVM too. I can't get excited about re-inventing those wheels > > > when perfectly good implementations already exist for us to sit on top of. > > > > Ok, warning, this is a "you know what would be sweet" moment. > > > > What would be nice is to be able to detach one of the volumes, and > > know the span of the data in there without being able to access the > > data. > > > > The problem that a lot of warehouse operators have is something like > > this: "We know we have all this data, but we don't know what we will > > want to do with it later. So keep it all. I'll get back to you when > > I want to know something." > > You should be able to do that with tablespaces and VACUUM FREEZE, the > point of the latter being that you can take the disk containing the > "read only" data offline, and still have the data readable after > plugging it back in, no matter how far along the transaction ID counter > is. Doesn't work anymore because VACUUM FREEZE doesn't (and can't) take a full table lock, so somebody can always update data after a data block has been frozen. That can lead to putting a table onto read-only media when it still needs vacuuming, which is a great way to break the DB. It also doesn't freeze still visible data, so there's no easy way of doing this. Waiting until the VF is the oldest Xid is prone to deadlock as well. Ideally, we'd have a copy to read-only media whilst freezing, as an atomic operation, with some guarantees that it will actually have frozen *everything*, or fail: ALTER TABLE SET TABLESPACE foo READONLY; Can we agree that as a TODO item? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
On Mon, Feb 19, 2007 at 02:50:34PM -0500, Andrew Sullivan wrote: > On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote: > > Martijn van Oosterhout <kleptog@svana.org> writes: > > > Somehow this seems like implementing RAID within postgres, > > > > RAID and LVM too. I can't get excited about re-inventing those > > wheels when perfectly good implementations already exist for us to > > sit on top of. > > Ok, warning, this is a "you know what would be sweet" moment. > > What would be nice is to be able to detach one of the volumes, and > know the span of the data in there without being able to access the > data. > > The problem that a lot of warehouse operators have is something like > this: "We know we have all this data, but we don't know what we will > want to do with it later. So keep it all. I'll get back to you > when I want to know something." > > It'd be nice to be able to load up all that data once, and then > shunt it off into (say) read-only media. If one could then run a > query that would tell one which spans of data are candidates for the > search, you could bring back online (onto reasonably fast storage, > for instance) just the volumes you need to read. Isn't this one of the big use cases for table partitioning? Cheers, D -- David Fetter <david@fetter.org> http://fetter.org/ phone: +1 415 235 3778 AIM: dfetter666 Skype: davidfetter Remember to vote!
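For reference, a minimal sketch of the inheritance-based partitioning David is referring to, with each partition on its own tablespace (all names are illustrative, and the tablespaces are assumed to already exist):

    -- Parent table plus one child per year; old years live on cheap, shelvable storage.
    CREATE TABLE finance (txn_date date NOT NULL, amount numeric, detail text);

    CREATE TABLE finance_2005 (
        CHECK (txn_date >= DATE '2005-01-01' AND txn_date < DATE '2006-01-01')
    ) INHERITS (finance) TABLESPACE archive_2005;

    CREATE TABLE finance_2007 (
        CHECK (txn_date >= DATE '2007-01-01' AND txn_date < DATE '2008-01-01')
    ) INHERITS (finance) TABLESPACE fast_current;

    -- With constraint exclusion on, a query bounded to 2007 never reads finance_2005,
    -- although PostgreSQL still expects that partition's tablespace to be mounted.
    SET constraint_exclusion = on;
    SELECT sum(amount) FROM finance WHERE txn_date >= DATE '2007-01-01';

That last caveat is the gap discussed in the next message: the planner can skip the old partitions, but there is no supported way to take their storage offline.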
I have a WIP patch that adds the main detail I have found I need to properly tune checkpoint and background writer activity. I think it's almost ready to submit (you can see the current patch against 8.2 at http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt ) after making it a bit more human-readable. But I've realized that along with that, I need some guidance with regard to what log level is appropriate for this information.

An example works better than explaining what the patch does:

2007-02-19 21:53:24.602 EST - DEBUG: checkpoint required (wrote checkpoint_segments)
2007-02-19 21:53:24.685 EST - DEBUG: checkpoint starting
2007-02-19 21:53:24.705 EST - DEBUG: checkpoint flushing buffer pool
2007-02-19 21:53:24.985 EST - DEBUG: checkpoint database fsync starting
2007-02-19 21:53:42.725 EST - DEBUG: checkpoint database fsync complete
2007-02-19 21:53:42.726 EST - DEBUG: checkpoint buffer flush dirty=8034 write=279956 us sync=17739974 us

Remember that "Load distributed checkpoint" discussion back in December? You can see exactly how bad the problem is on your system with this log style (this is from a pgbench run where it's positively awful: it really does take over 17 seconds for the fsync to execute, and there are clients that are hung the whole time waiting for it).

I also instrumented the background writer. You get messages like this:

2007-02-19 21:58:54.328 EST - DEBUG: BGWriter Scan All - Written = 5/5 Unscanned = 23/54

This shows that we wrote the maximum number of pages we were allowed to write (5 of 5) while failing to scan almost half (23) of the buffers we meant to look at (54). By taking a look at this output while the system is under load, I found I was able to do bgwriter optimization that used to take me days of frustrating testing in hours. I've been waiting for a good guide to bgwriter tuning since 8.1 came out. Once you have this, combined with knowing how many buffers were dirty at checkpoint time because the bgwriter didn't get to them in time (the number you want to minimize), the tuning guide practically writes itself.

So my question is: what log level should all this go at? Right now, I have the background writer stuff adjusting its level dynamically based on what happened; it logs at DEBUG2 if it hits the write limit (which should be a rare event once you're tuned properly), DEBUG3 for writes that scanned everything they were supposed to, and DEBUG4 if it scanned but didn't find anything to write. The source of checkpoint information logs at DEBUG1, the fsync/write info at DEBUG2.

I'd like to move some of these up. On my system, I even have many of the checkpoint messages logged at INFO (the source of the checkpoint and the total write time line). It's a bit chatty, but when you get some weird system pause issue it makes it easy to figure out whether checkpoints were to blame. Given how useful I feel some of these messages are to system tuning, and to explaining what currently appears as inexplicable pauses, I don't want them buried at DEBUG levels where people are unlikely to ever see them (I think some people may be concerned about turning on anything labeled DEBUG at all). I am aware that I am too deep into this to have an unbiased opinion at this point though, which is why I ask for feedback on how to proceed here. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Monday 19 February 2007 15:08, Bruce Momjian wrote: > Joshua D. Drake wrote: > > Andrew Sullivan wrote: > > > On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote: > > >> Martijn van Oosterhout <kleptog@svana.org> writes: > > >>> Somehow this seems like implementing RAID within postgres, > > >> > > >> RAID and LVM too. I can't get excited about re-inventing those wheels > > >> when perfectly good implementations already exist for us to sit on top > > >> of. > > > > > > Ok, warning, this is a "you know what would be sweet" moment. > > > > The dreaded words from a developers mouth to every manager in the world. > > Yea, I just instinctively hit "delete" when I saw that phrase. Too bad... I know Oracle can do what he wants... possibly other DB systems as well. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Monday 19 February 2007 11:27, Martijn van Oosterhout wrote: > On Mon, Feb 19, 2007 at 05:10:36PM +0100, Dimitri Fontaine wrote: > > > RAID and LVM too. I can't get excited about re-inventing those wheels > > > when perfectly good implementations already exist for us to sit on top > > > of. > > > > I though moving some knowledge about data availability into PostgreSQL > > code could provide some valuable performance benefit, allowing to > > organize reads (for example parallel tables scan/indexes scan to > > different volumes) and obtaining data from 'quicker' known volume (or > > least used/charged). > > Well, organising requests to be handled quickly is a function of > LVM/RAID, so we don't go there. However, speeding up scans by having > multiple requests is an interesting approach, as would perhaps a > different random_page_cost for different tablespaces. > On one of my systems I have 1 tablespace for read data (99-1), 1 for read mostly data (90-10), and 1 for write mostly (40-60). The breakdown is based on a combination of the underlying hardware and usage patterns of the tables involved. I suspect that isn't that uncommon really. I've often thought that being able to set guc variables to a specific tablespace (like you can do for users) would allow for a lot of flexibility in tuning queries that go across different tablespaces. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
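Per-tablespace planner settings along the lines Robert describes did not exist at the time, so the statements below are purely illustrative of what such knobs might look like:

    -- Illustrative only: hypothetical per-tablespace cost settings.
    ALTER TABLESPACE read_mostly SET (random_page_cost = 2.0);   -- fast striped disks
    ALTER TABLESPACE write_mostly SET (random_page_cost = 4.0);  -- slower storage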
On Mon, Feb 19, 2007 at 07:10:52PM -0800, David Fetter wrote: > > Isn't this one of the big use cases for table partitioning? Sure, but you can't detach that data in the meantime, AFAIK. Maybe I've missed something. If I have 10 years of finance data, and I have to keep it all online all the time, my electricity costs alone are outrageous. If I could know _where_ the data is without being able to get it until I've actually made it available, then my query tools could be smart enough to detect "table partitioned; this data is offline"; I could roll back to my savepoint, return a partial result set to the user, and tell it "call back in 24 hours for your full report". Yes, I know, hands waving in the air. But I already said I was having a "you know what would be sweet" moment. A -- Andrew Sullivan | ajs@crankycanuck.ca This work was visionary and imaginative, and goes to show that visionary and imaginative work need not end up well. --Dennis Ritchie
On Monday 19 February 2007 22:59, Greg Smith wrote: > I have a WIP patch that adds the main detail I have found I need to > properly tune checkpoint and background writer activity. I think it's > almost ready to submit (you can see the current patch against 8.2 at > http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt ) > after making it a bit more human-readable. But I've realized that along > with that, I need some guidance in regards to what log level is > appropriate for this information. > > An example works better than explaining what the patch does: > > 2007-02-19 21:53:24.602 EST - DEBUG: checkpoint required (wrote > checkpoint_segments) > 2007-02-19 21:53:24.685 EST - DEBUG: checkpoint starting > 2007-02-19 21:53:24.705 EST - DEBUG: checkpoint flushing buffer pool > 2007-02-19 21:53:24.985 EST - DEBUG: checkpoint database fsync starting > 2007-02-19 21:53:42.725 EST - DEBUG: checkpoint database fsync complete > 2007-02-19 21:53:42.726 EST - DEBUG: checkpoint buffer flush dirty=8034 > write=279956 us sync=17739974 us > > Remember that "Load distributed checkpoint" discussion back in December? > You can see exactly how bad the problem is on your system with this log > style (this is from a pgbench run where it's postively awful--it really > does take over 17 seconds for the fsync to execute, and there are clients > that are hung the whole time waiting for it). > > I also instrumented the background writer. You get messages like this: > > 2007-02-19 21:58:54.328 EST - DEBUG: BGWriter Scan All - Written = 5/5 > Unscanned = 23/54 > > This shows that we wrote (5) the maximum pages we were allowed to write > (5) while failing to scan almost half (23) of the buffers we meant to look > at (54). By taking a look at this output while the system is under load, > I found I was able to do bgwriter optimization that used to take me days > of frustrating testing in hours. I've been waiting for a good guide to > bgwriter tuning since 8.1 came out. Once you have this, combined with > knowing how many buffers were dirty at checkpoint time because the > bgwriter didn't get to them in time (the number you want to minimize), the > tuning guide practically writes itself. > > So my question is...what log level should all this go at? Right now, I > have the background writer stuff adjusting its level dynamically based on > what happened; it logs at DEBUG2 if it hits the write limit (which should > be a rare event once you're tuned properly), DEBUG3 for writes that > scanned everything they were supposed to, and DEBUG4 if it scanned but > didn't find anything to write. The source of checkpoint information logs > at DEBUG1, the fsync/write info at DEBUG2. > > I'd like to move some of these up. On my system, I even have many of the > checkpoint messages logged at INFO (the source of the checkpoint and the > total write time line). It's a bit chatty, but when you get some weird > system pause issue it makes it easy to figure out if checkpoints were to > blame. Given how useful I feel some of these messages are to system > tuning, and to explaining what currently appears as inexplicable pauses, I > don't want them to be buried at DEBUG levels where people are unlikely to > ever see them (I think some people may be concerned about turning on > things labeled DEBUG at all). I am aware that I am too deep into this to > have an unbiased opinion at this point though, which is why I ask for > feedback on how to proceed here. 
My impression of this is that DBAs would typically want to run this for a short period of time to get their systems tuned and then it pretty much becomes chatter. Can you come up with an idea of what information DBAs need to know? Would it be better to hide all of this logging behind a GUC that can be turned on or off (log_bgwriter_activity)? Maybe you could just reinstrument it as DTrace probes so a separate stand-alone process could pull the information as needed? :-) -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
On Mon, Feb 19, 2007 at 10:59:38PM -0500, Greg Smith wrote: > I have a WIP patch that adds the main detail I have found I need to > properly tune checkpoint and background writer activity. I think it's > almost ready to submit (you can see the current patch against 8.2 at > http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt ) > after making it a bit more human-readable. But I've realized that along > with that, I need some guidance in regards to what log level is > appropriate for this information. It would also be extremely useful to make checkpoint stats visible somewhere in the database (presumably via the existing stats mechanism). The log output is great for doing initial tuning, but you'd want to also be able to monitor things, which would be impractical via logging. I'm thinking just tracking how many pages had to be flushed during a checkpoint would be a good start. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Thu, 22 Feb 2007, Jim C. Nasby wrote: > It would also be extremely useful to make checkpoint stats visible > somewhere in the database (presumably via the existing stats > mechanism)... I'm thinking just tracking how many pages had to be > flushed during a checkpoint would be a good start. I'm in the middle of testing an updated version of the patch; once I nail down exactly what needs to be logged I plan to open a discussion on which of those things would be best served by pg_stats instead of a log. I decided specifically to aim for the logs for the checkpoint data because if you're in a situation where you are inserting fast enough that the checkpoints are spaced closely together, you'd end up having to poll pg_stats all the time to make sure you catch them all, which becomes even less efficient than just logging the data. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wed, 21 Feb 2007, Robert Treat wrote: > My impression of this is that DBA's would typically want to run this for a > short period of time to get thier systems tuned and then it pretty much > becomes chatter. Can you come up with an idea of what information DBA's need > to know? I am structuring the log levels so that you can see all the events that are likely sources of extreme latency by using log level DEBUG1. That level isn't too chatty, and I've decided I can leave the server there forever if need be. I've actually come up with a much better interface to the background writer logging than the one I originally outlined, which I'll have ready for feedback in a few more days. Now it dumps a nice summary report every time it completes a scan, rather than logging lower-level info. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Mar 5, 2007, at 8:34 PM, Greg Smith wrote: > On Thu, 22 Feb 2007, Jim C. Nasby wrote: > >> It would also be extremely useful to make checkpoint stats visible >> somewhere in the database (presumably via the existing stats >> mechanism)... I'm thinking just tracking how many pages had to be >> flushed during a checkpoint would be a good start. > > I'm in the middle of testing an updated version of the patch, once > I nail down exactly what needs to be logged I'd planned to open a > discussion on which of those things would be best served by > pg_stats instead of a log. > > I decided specifically to aim for the logs instead for the > checkpoint data because if you're in a situation where are > inserting fast enough that the checkpoints are spaced closely > together, you'd end up having to poll pg_stats all the time for > make sure you catch them all, which becomes even less efficient > than just logging the data. It's always good to be able to log stuff for detailed troubleshooting, so that's a good place to start. The flipside is that it's much easier to machine-parse a table rather than trying to scrape the logs. And I don't think we'll generally care about each individual checkpoint; rather we'll want to look at things like 'checkpoints/hour' and 'checkpoint written pages/hour'. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
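A sketch of the kind of monitoring query Jim has in mind, assuming a cumulative stats view roughly like the one below (the view and its column names are hypothetical); sampling it periodically and diffing the counters yields checkpoints/hour and written pages/hour without scraping any logs:

    -- Hypothetical view: cumulative counters maintained by the stats collector.
    SELECT checkpoints_timed,
           checkpoints_requested,
           buffers_written_by_checkpoints,
           buffers_written_by_bgwriter
    FROM pg_stat_checkpoints;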
On Tue, 6 Mar 2007, Jim Nasby wrote: > The flipside is that it's much easier to machine-parse a table rather > than trying to scrape the logs. Now you might realize why I've been so vocal on the SQL log export implementation details. > And I don't think we'll generally care about each individual checkpoint; > rather we'll want to look at things like 'checkpoints/hour' and > 'checkpoint written pages/hour'. After a few months of staring at this data, I've found averages like that misleading. The real problem areas correlate with the peak pages written at any one checkpoint. Lowering that value is really the end-game for optimizing the background writer, and the peaks are what will nail you with a nasty fsync pause at checkpoint time. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> wrote: > After a few months of staring at this data, I've found averages like that > misleading. The real problem areas correlate with the peak pages written > at any one checkpoint. Lowering that value is really the end-game for > optimizing the background writer, and the peaks are what will nail you > with a nasty fsync pause at checkpoint time. If you already have the technical knowledge to derive the best settings from the activity logs, could you write it down in the code? I'm hoping for some kind of automatic control feature in the bgwriter if its best configuration varies by usage or activity. BTW, I'm planning two changes in the bgwriter. [Load distributed checkpoint] http://archives.postgresql.org/pgsql-patches/2007-02/msg00522.php [Automatic adjustment of bgwriter_lru_maxpages] http://archives.postgresql.org/pgsql-patches/2007-03/msg00092.php I have some results showing that if we have plenty of time for checkpoints, bgwriter_all_maxpages is not such an important parameter, because it is effectively adjusted to "shared_buffers / duration of checkpoint". Also, my recommended bgwriter_lru_maxpages is the "average number of recycled buffers per cycle", which is hard to tune manually. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Mar 6, 2007, at 10:11 PM, ITAGAKI Takahiro wrote: > I have some results that if we have plenty of time for checkpoints, > bgwriter_all_maxpages is not a so important parameter because it is > adjusted to "shared_buffers / duration of checkpoint". > Also, my recommended bgwriter_lru_maxpages is "average number of > recycled buffers per cycle", that is hardly able to tune manually. What do you mean by "number of recycled buffers per cycle"? -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
On Wed, 7 Mar 2007, ITAGAKI Takahiro wrote: > Also, my recommended bgwriter_lru_maxpages is "average number of > recycled buffers per cycle", that is hardly able to tune manually. This is completely dependent on what percentage of your buffer cache is pinned. If your load is something like the standard pgbench, the LRU writer will rarely find anything useful to write, so this entire line of thinking won't work. The proper behavior for heavily pinned data is to turn off the LRU writer altogether so there's more time to run the all scan. The job I'm trying to take on here is not to presume I can solve these problems myself yet. I've instead recognized that people need usefully organized information in order to even move in that direction, and that information is not even close to being available right now. What my latest work-in-progress patches do is summarize each scan of the buffer pool with information about how much was written by each of the two writers, along with noting what percentage of the pool was pinned data. I'm trying to get that one ready to submit this week. Those three values suggest some powerful techniques for tuning, but that's not quite good enough to allow auto-tuning. It also needs a feel for how much time is left before the next checkpoint. What really needs to go along with all this is a sort of progress bar that estimates how far we are from the next checkpoint based on both a) the timeout, and b) how many segments have been written. The timeout one is easy to work with that way (from what I read of your code, you've worked that angle). The part I had trouble with was getting the WAL writers to communicate a progress report on how many segments they filled back to the bgwriter. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Jim Nasby <decibel@decibel.org> wrote: > > Also, my recommended bgwriter_lru_maxpages is "average number of > > recycled buffers per cycle", that is hardly able to tune manually. > > What do you mean by 'number of recycled buffers per cycle"? There is the following description in the documentation:

| * bgwriter_lru_percent (floating point)
|   To reduce the probability that server processes will need to issue their
|   own writes, the background writer tries to write buffers that are likely
|   to be recycled soon.
| * bgwriter_lru_maxpages (integer)
|   In each round, no more than this many buffers will be written as a
|   result of scanning soon-to-be-recycled buffers.

The number of recycled buffers per round is the same as the number of StrategyGetBuffer() calls per round. I think that number is suitable for bgwriter_lru_maxpages if we want the bgwriter to write almost all of the dirty pages. I proposed changing the semantics of bgwriter_lru_maxpages. It used to mean "maximum writes per round", but the new meaning is "reserved buffers for recycle". Non-dirty unused buffers will be counted toward it. I'm measuring the difference in performance between manual and automatic tuning, especially when adding some jobs before the writes. I'll let you know when I have the results. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Greg Smith <gsmith@gregsmith.com> wrote: > > Also, my recommended bgwriter_lru_maxpages is "average number of > > recycled buffers per cycle", that is hardly able to tune manually. > > This is completely dependent on what percentage of your buffer cache is > pinned. Don't you mean usage_count? In my understanding, each backend pins two or so buffers at once, so the percentage of pinned buffers should be low. > If your load is something like the standard pgbench, the LRU > writer will rarely find anything useful to write, so this entire line of > thinking won't work. The proper behavior for heavily pinned data is to > turn off the LRU writer altogether so there's more time to run the all > scan. I think you are pointing out a problem with the buffer management algorithm itself, not just the bgwriter. Even if you turn off the LRU writer, each backend pays the same cost to find recyclable buffers every time it allocates a buffer. Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote: > In my understanding, each backend pins two or so buffers at once. So > percentage of pinned buffers should be low. With the pgbench workload, a substantial percentage of the buffer cache ends up pinned. From staring at the buffer cache using contrib/pg_buffercache, I believe most of that consists of the index blocks for the records being updated in the accounts table.

I just posted a new version of the patch I asked for feedback on at the beginning of this thread; the latest one is at http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt
I've been adjusting it to monitor the same data I think you need to refine your patch. I believe the approach you're taking includes some assumptions that seem perfectly reasonable, but that my testing doesn't agree with. There's nothing like measuring something to settle what's really going on, though, so that's what I've been focusing on. I'd love to get some feedback on whether other people can replicate the kind of things I'm seeing.

The new code generates statistics about exactly what the background writer scan found during each round. If there's been substantial write activity, it prints a log line when it recycles back to the beginning of the all scan, to help characterize what the buffer pool looked like during that scan from the perspective of the bgwriter. Here's some sample log output from my underpowered laptop while running pgbench:

bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB (31.9%)
...
checkpoint required (wrote checkpoint_segments)
checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms

Here 69% of the buffer cache contained dirty data, and 49% of the cache was both pinned and dirty. During that same time period, the LRU writer also wrote out a fair amount of data, operating on the 20% of the cache that was dirty but not pinned. On my production server, where the background writer is turned way up to reduce checkpoint times, these numbers are even more extreme; almost everything that's dirty is also pinned during pgbench, and the LRU is lucky to find anything it can write as a result.

That patch is against the 8.2 codebase; now that I'm almost done I'm planning to move it to HEAD soon (where it will conflict considerably with your patch). If you have an 8.2 configuration you can test with my patch applied, set log_min_messages = debug2, try it out, and see what you get when running pgbench for a while. I think you'll discover some interesting and unexpected things. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
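For anyone who wants to reproduce the pg_buffercache inspection mentioned above, a query along these lines shows where the dirty buffers are concentrated (column names follow the contrib module of that era; adjust for your version):

    -- Count dirty buffers per relation in the current database.
    SELECT c.relname, count(*) AS dirty_buffers
    FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = c.relfilenode
    WHERE b.isdirty
      AND b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
    GROUP BY c.relname
    ORDER BY dirty_buffers DESC
    LIMIT 10;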
Greg Smith <gsmith@gregsmith.com> writes: > With the pgbench workload, a substantial percentage of the buffer cache > ends up pinned. [ raised eyebrow... ] Prove that. AFAIK it's impossible for the pgbench queries to pin more than about three or four buffers per backend concurrently. regards, tom lane
Greg Smith <gsmith@gregsmith.com> wrote: > > In my understanding, each backend pins two or so buffers at once. So > > percentage of pinned buffers should be low. > > With the pgbench workload, a substantial percentage of the buffer cache > ends up pinned. > http://westnet.com/~gsmith/content/postgresql/new-patch-checkpoint.txt > bgwriter scan all writes=16.6 MB (69.3%) pinned=11.7 MB (48.8%) LRU=7.7 MB (31.9%) > ... > checkpoint required (wrote checkpoint_segments) > checkpoint buffers dirty=19.4 MB (80.8%) write=188.9 ms sync=4918.1 ms > > Here 69% of the buffer cache contained dirty data, and 49% of the cache > was both pinned and dirty. No. "Pinned" means bufHdr->refcount > 0, and you don't distinguish pinned from recently-used (bufHdr->usage_count > 0) buffers in your patch:

! if (bufHdr->refcount != 0 || bufHdr->usage_count != 0) {
!     if (skip_pinned)
!     {
!         UnlockBufHdr(bufHdr);
!         return BUF_PINNED;
!     }
!     buffer_write_type=BUF_WRITTEN_AND_PINNED;

Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote: > "Pinned" means bufHdr->refcount > 0 and you don't distinguish pinned or > recently-used (bufHdr->usage_count > 0) buffers in your patch. Thank you, I will revise the terminology used accordingly. I was using "pinned" as a shortcut for "will be ignored by skip_pinned" which was sloppy of me. As I said, I was trying to show how the buffer cache looks from the perspective of the background writer, and therefore lumping them together because that's how SyncOneBuffer views them. A buffer cache full of either type will be largely ignored by the LRU writer, and that's what I've been finding when running insert/update heavy workloads like pgbench. If I might suggest a terminology change to avoid this confusion in the future, I'd like to rename the SyncOneBuffer "skip_pinned" parameter to something like "skip_active", which is closer to the real behavior. I know Oracle refers to these as "hot" and "cold" LRU entries. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Mar 8, 2007, at 11:51 PM, Greg Smith wrote: > almost everything that's dirty is also pinned during pgbench, and > the LRU is lucky to find anything it can write as a result I'm wondering if pg_bench is a good test of this stuff. ISTM it's unrealistically write-heavy, which is going to tend to not only put a lot of dirty buffers into the pool, but also keep them pinned enough that you can't write them. Perhaps you should either modify pg_bench to do a lot more selects out of the various tables or look towards a different benchmark. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
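pgbench does already have a select-only mode (the -S switch). If a custom mix is wanted instead, a script file passed with -f is one way to add more reads; the meta-commands and the 'accounts' table name below follow the pgbench shipped with 8.x, so treat the details as version-dependent:

    \set naccounts 100000 * :scale
    \setrandom aid 1 :naccounts
    SELECT abalance FROM accounts WHERE aid = :aid;

If your pgbench accepts several -f files at once, mixing a script like this with the default transaction shifts the read/write ratio without touching the pgbench source.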
On Mar 9, 2007, at 7:57 AM, Greg Smith wrote: > On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote: > >> "Pinned" means bufHdr->refcount > 0 and you don't distinguish >> pinned or recently-used (bufHdr->usage_count > 0) buffers in your >> patch. > > Thank you, I will revise the terminology used accordingly. I was > using "pinned" as a shortcut for "will be ignored by skip_pinned" > which was sloppy of me. As I said, I was trying to show how the > buffer cache looks from the perspective of the background writer, > and therefore lumping them together because that's how > SyncOneBuffer views them. A buffer cache full of either type will > be largely ignored by the LRU writer, and that's what I've been > finding when running insert/update heavy workloads like pgbench. > > If I might suggest a terminology change to avoid this confusion in > the future, I'd like to rename the SyncOneBuffer "skip_pinned" > parameter to something like "skip_active", which is closer to the > real behavior. I know Oracle refers to these as "hot" and "cold" > LRU entries. Well, AIUI, whether the buffer is actually pinned or not is almost inconsequential (other than if a buffer *is* pinned then its usage count is about to become > 0, so there's no reason to consider writing it). What that parameter really does is control whether you're going to follow the LRU semantics or not... -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim Nasby <decibel@decibel.org> writes: > On Mar 8, 2007, at 11:51 PM, Greg Smith wrote: >> almost everything that's dirty is also pinned during pgbench, and >> the LRU is lucky to find anything it can write as a result > I'm wondering if pg_bench is a good test of this stuff. On reflection I think that Greg's result is probably unsurprising, and furthermore does not indicate that anything is wrong. What it shows (now that we got past the terminology) is that only about half of the buffer pool is subject to replacement during any given clock sweep. For low-usage pages that's about what you'd expect: a page is sucked in on demand (using a buffer returned by the clock sweep), and when we're done with it it'll have usage_count = 1. If it's not touched again then when the clock sweep returns to it it'll be decremented to usage_count 0, and on the next visit it'll be recycled for use as something else. Thus for low-usage pages you'd fully expect that about half of the buffer population has usage_count 1 and the rest has usage count 0; which is strikingly close to Greg's measurement that 48.8% of the population has usage_count 0. What this seems to tell us is that pgbench's footprint of heavily used pages (those able to achieve usage_counts above 1) is very small. Which is probably right unless you've used a very large scale factor. I'd be interested to know what scale factor and shared_buffers setting led to the above measurement. It strikes me that the patch would be more useful if it produced a histogram of the observed usage_counts, rather than merely the count for usage_count = 0. regards, tom lane
On Fri, 9 Mar 2007, Jim Nasby wrote: > I'm wondering if pg_bench is a good test of this stuff. ISTM it's > unrealistically write-heavy, which is going to tend to not only put a > lot of dirty buffers into the pool, but also keep them pinned enough > that you can't write them. Whether it's "unrealistically" write-heavy kind of depends on what your real app is. The standard pgbench is a bit weird because it does so many updates to tiny tables, which adds a level of locking contention that doesn't really reflect many real-world situations. But the no-branch mode (update/select to accounts, insert into history) isn't too dissimilar from some insert-heavy logging applications I've seen. The main reason I brought this all up was because Itagaki seemed to be using pgbench for some of his performance tests. I just wanted to point out that the LRU background writer specifically tends to be very underutilized when using pgbench. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, 9 Mar 2007, Tom Lane wrote: > I'd be interested to know what scale factor and shared_buffers setting > led to the above measurement. That was just a trivial example with 1 client, scale=10 (~160MB database), and shared_buffers=24MB. Where things really get interesting with pgbench is on a system with enough horsepower+clients to dirty the whole buffer cache well before a checkpoint. I regularly see >75% of the cache dirty and blocked from LRU writes with pgbench's slightly pathological workload in that situation. You're correct that these results aren't particularly surprising or indicative of a problem to be corrected. But they do shed some light on what pgbench is and isn't appropriate for testing. > It strikes me that the patch would be more useful if it produced a > histogram of the observed usage_counts, rather than merely the count > for usage_count = 0. I'll start working in that direction. With the feedback everyone has given me on how few of the buffers are truly "pinned" via the correct usage of the term, I'm going to revisit the usage details and revise that section. I'm happy with how I'm reporting the checkpoint details now, still some work left to do on the bgwriter activity. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, 9 Mar 2007, Tom Lane wrote: > It strikes me that the patch would be more useful if it produced a > histogram of the observed usage_counts Don't have something worth releasing yet, but I did code a first rev of this today. The results are quite instructive and well worth looking at.

The main server I work on has shared_buffers=60000 for pgbench testing and the background writers turned way up (in hopes of reducing checkpoint times; that's a whole other topic). I run pgbench with s=100 (~1.6GB database). Here's what I get as statistics on the buffer pool after a scan when the server is "happy", from a run with 20 clients:

writes=38.3MB (8.2%) pinned+used=38.3MB (8.2%)
dirty buffer usage count histogram: 0=0.1% 1=0.3% 2=26% 3=17% 4=21% 5+=36%

Basically, everything that's left dirty is also heavily used; as I noted before, when I inspect with pg_buffercache these are mostly index blocks. I note that I should probably generate a second histogram that compares the clean data. The all scan is writing pages out as soon as they dirty, and the LRU background writer is lucky if it can find anything to do. (Note that I don't rely on the LRU writer more because that causes a significant lull in writes after a checkpoint. By the time it gets going again it's harder to catch up, and delaying PostgreSQL writes also aggravates issues with the way Linux caches writes.)

What I found really interesting was comparing a "sad" section where performance breaks down. This is from a minute later:

writes=441.6MB (94.2%) pinned+used=356.2MB (76.0%)
dirty buffer usage count histogram: 0=18.7% 1=26.4% 2=31% 3=11% 4=9% 5+=4%

Note how the whole buffer distribution has shifted toward lower usage counts. The breakdown seems to involve evicting the index blocks to make room for the recently dirtied data when it can't be written out fast enough. As the index blocks go poof, things slow further, and into the vicious circle you go. Eventually you can end up blocked on a combination of buffer evictions and disk seeks for uncached data that are fighting with the writes.

The bgwriter change this suggested to me is defining a triage behavior where the all scan switches to acting like the LRU one:
- Each sample period, note what % of the usage_count=0 records are dirty
- If that number is above a tunable threshold, switch to only writing usage_count=0 records until it isn't anymore

On my system a first guess for that tunable would be 2-5%, based on what values I see on either side of the sad periods. No doubt some systems would set that much higher; I haven't tested my system at home yet to have a guideline for a more typical PC.

As for why this behavior matters so much to me, I actually have a prototype auto-tuning background writer design that was hung up on this particular problem. It notes how often you write out max_pages, uses that to model the average percentage of the buffer cache you're traversing each pass, then runs a couple of weighted-average feedback loops to aim for a target seconds/sweep. The problem was that it went berserk when the whole buffer cache was dirty (I hope someone appreciates that I've been calling this "Felix Unger on crack mode" in my notes). I needed some way to prioritize which buffers to concentrate on when that happens, and so far the above has been a good first-cut way to help with that. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes: > Here's what I get as statistics on the buffer pool after a > scan when the server is "happy", from a run with 20 clients: > writes=38.3MB (8.2%) pinned+used=38.3MB (8.2%) > dirty buffer usage count histogram: > 0=0.1% 1=0.3% 2=26% 3=17% 4=21% 5+=36% Interesting, but not real meaningful by itself --- what's the situation for non-dirty buffers? And how many are there of each? It might also be interesting to know exactly how many buffers were pinned at the time the scan passed over them. In theory it should be a small fraction, but maybe it isn't ... regards, tom lane
On Mon, 12 Mar 2007, Tom Lane wrote: > It might also be interesting to know exactly how many buffers were > pinned at the time the scan passed over them. In theory it should be a > small fraction, but maybe it isn't ... It is; the theory holds for all the tests I tried today. The actual pinned buffers were so few (typically a fraction of the clients) that I reverted to just lumping them in with the recently used ones. To better reflect the vast majority of what it's interacting with, in my patch I renamed the SyncOneBuffer "skip_pinned" to "skip_recently_used". It seems natural that something currently pinned would also be considered recently used, the current naming I didn't find so obvious. I'm also now collecting clean vs. dirty usage histogram counts as well since you suggested it. Nothing exciting to report there so far, may note something interesting after I collect more data. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Is this patch ready? --------------------------------------------------------------------------- Greg Smith wrote: > I have a WIP patch that adds the main detail I have found I need to > properly tune checkpoint and background writer activity. I think it's > almost ready to submit (you can see the current patch against 8.2 at > http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt ) > after making it a bit more human-readable. But I've realized that along > with that, I need some guidance in regards to what log level is > appropriate for this information. > > An example works better than explaining what the patch does: > > 2007-02-19 21:53:24.602 EST - DEBUG: checkpoint required (wrote > checkpoint_segments) > 2007-02-19 21:53:24.685 EST - DEBUG: checkpoint starting > 2007-02-19 21:53:24.705 EST - DEBUG: checkpoint flushing buffer pool > 2007-02-19 21:53:24.985 EST - DEBUG: checkpoint database fsync starting > 2007-02-19 21:53:42.725 EST - DEBUG: checkpoint database fsync complete > 2007-02-19 21:53:42.726 EST - DEBUG: checkpoint buffer flush dirty=8034 > write=279956 us sync=17739974 us > > Remember that "Load distributed checkpoint" discussion back in December? > You can see exactly how bad the problem is on your system with this log > style (this is from a pgbench run where it's postively awful--it really > does take over 17 seconds for the fsync to execute, and there are clients > that are hung the whole time waiting for it). > > I also instrumented the background writer. You get messages like this: > > 2007-02-19 21:58:54.328 EST - DEBUG: BGWriter Scan All - Written = 5/5 > Unscanned = 23/54 > > This shows that we wrote (5) the maximum pages we were allowed to write > (5) while failing to scan almost half (23) of the buffers we meant to look > at (54). By taking a look at this output while the system is under load, > I found I was able to do bgwriter optimization that used to take me days > of frustrating testing in hours. I've been waiting for a good guide to > bgwriter tuning since 8.1 came out. Once you have this, combined with > knowing how many buffers were dirty at checkpoint time because the > bgwriter didn't get to them in time (the number you want to minimize), the > tuning guide practically writes itself. > > So my question is...what log level should all this go at? Right now, I > have the background writer stuff adjusting its level dynamically based on > what happened; it logs at DEBUG2 if it hits the write limit (which should > be a rare event once you're tuned properly), DEBUG3 for writes that > scanned everything they were supposed to, and DEBUG4 if it scanned but > didn't find anything to write. The source of checkpoint information logs > at DEBUG1, the fsync/write info at DEBUG2. > > I'd like to move some of these up. On my system, I even have many of the > checkpoint messages logged at INFO (the source of the checkpoint and the > total write time line). It's a bit chatty, but when you get some weird > system pause issue it makes it easy to figure out if checkpoints were to > blame. Given how useful I feel some of these messages are to system > tuning, and to explaining what currently appears as inexplicable pauses, I > don't want them to be buried at DEBUG levels where people are unlikely to > ever see them (I think some people may be concerned about turning on > things labeled DEBUG at all). I am aware that I am too deep into this to > have an unbiased opinion at this point though, which is why I ask for > feedback on how to proceed here. 
> > -- > * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
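The dirty/write/sync line in that log excerpt comes from timing the two phases of the checkpoint separately. A rough sketch of that instrumentation, assuming BufferSync() is extended to return the number of dirty buffers it wrote (it returns void in stock 8.2), might look like the following; it mirrors the quoted log format but is not the actual patch:

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/smgr.h"
    #include <sys/time.h>

    /* Hypothetical timing wrapper around the checkpoint's buffer-flush and
     * fsync phases, emitting the same style of lines shown above. */
    static void
    CheckpointFlushWithTiming(void)
    {
        struct timeval start, after_write, after_sync;
        long    write_us, sync_us;
        int     dirty_written;

        elog(DEBUG1, "checkpoint flushing buffer pool");
        gettimeofday(&start, NULL);

        dirty_written = BufferSync();   /* assumed to return the count written */

        gettimeofday(&after_write, NULL);
        elog(DEBUG2, "checkpoint database fsync starting");

        smgrsync();                     /* force everything written to disk */

        gettimeofday(&after_sync, NULL);
        elog(DEBUG2, "checkpoint database fsync complete");

        write_us = (after_write.tv_sec - start.tv_sec) * 1000000L +
                   (after_write.tv_usec - start.tv_usec);
        sync_us  = (after_sync.tv_sec - after_write.tv_sec) * 1000000L +
                   (after_sync.tv_usec - after_write.tv_usec);

        elog(DEBUG2, "checkpoint buffer flush dirty=%d write=%ld us sync=%ld us",
             dirty_written, write_us, sync_us);
    }

Splitting the measurement this way is what makes the 17-second fsync in the quoted pgbench run show up as its own problem, rather than being folded into the total checkpoint time.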
Would not at least some of these numbers be better presented through the stats collector, so they can be easily monitored? That goes along the lines of my way, way, way away-from-finished attempt earlier; perhaps a combination of these two patches? //Magnus Bruce Momjian wrote: > Is this patch ready? > > --------------------------------------------------------------------------- > > Greg Smith wrote: > > I have a WIP patch that adds the main detail I have found I need to > > properly tune checkpoint and background writer activity. [...]
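The "BGWriter Scan All" line quoted earlier, and the dynamic DEBUG2/DEBUG3/DEBUG4 scheme Greg describes for it, amount to the kind of per-round accounting sketched below; these are also exactly the counters a stats-collector view would want to accumulate. This would sit conceptually inside bufmgr.c, and the variable names and the next_buf_to_scan() helper are illustrative, not taken from either patch:

    #include "postgres.h"

    static int next_buf_to_scan(void);  /* hypothetical: next buffer id in the sweep */

    /* Sketch of one background writer "scan all" round with the accounting
     * and dynamic log level described above; illustrative only. */
    static void
    BgWriterScanAllRound(int scan_target, int max_writes)
    {
        int     buffers_scanned = 0;
        int     buffers_written = 0;
        int     log_level;

        while (buffers_scanned < scan_target && buffers_written < max_writes)
        {
            /* skip_recently_used = true: leave pinned and hot buffers alone */
            if (SyncOneBuffer(next_buf_to_scan(), true))
                buffers_written++;
            buffers_scanned++;
        }

        /* Hitting the write limit before finishing the scan is the case a
         * well-tuned system should rarely see, so it logs loudest. */
        if (buffers_written >= max_writes && buffers_scanned < scan_target)
            log_level = DEBUG2;
        else if (buffers_written > 0)
            log_level = DEBUG3;
        else
            log_level = DEBUG4;

        elog(log_level, "BGWriter Scan All - Written = %d/%d Unscanned = %d/%d",
             buffers_written, max_writes,
             scan_target - buffers_scanned, scan_target);
    }

With the example numbers above, the round stopped after 31 of the 54 target buffers because it had already written its 5-buffer allowance, leaving 23 unscanned.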
Magnus Hagander wrote: > Would not at least some of these numbers be better presented through the > stats collector, so they can be easily monitored? > > That goes along the lines of my way, way, way away-from-finished attempt > earlier; perhaps a combination of these two patches? Yes.
-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Tue, 27 Mar 2007, Magnus Hagander wrote: > Would not at least some of these numbers be better presented through the > stats collector, so they can be easily monitored? > That goes along the lines of my way, way, way away-from-finished attempt > earlier; perhaps a combination of these two patches? When I saw your patch recently, I thought to myself "hmmm, the data collected here sure looks familiar"--you even made some of the exact same code changes I did. I've been bogged down recently chasing a performance issue that, come to find out, was mainly caused by the "high CPU usage for stats collector" bug. That caused the background writer to slow to a crawl under heavy load, which is why I was having all these checkpoint and writer issues that got me monitoring that code in the first place. With that seemingly resolved, I have a slightly new plan now. Next I want to take the data I've been collecting in my patch, bundle the most important parts of it into messages sent to the stats collector, the way it was suggested you rewrite your patch, then submit the result. I have the log output down and a really good idea of what data should be collected, but as this would be my first time adding stats I'd certainly love some help with that. Once that monitoring infrastructure is in place, I then planned to merge Itagaki's "Load distributed checkpoint" patch (it touches a lot of the same code) and test that out under heavy load. I think it gives a much better context to evaluate that patch in if, rather than measuring just its gross results, you can say something like "with the patch in place the average fsync time on my system dropped from 3 seconds to 1.2 seconds when writing out more than 100MB at checkpoint time". That's the direct cause of the biggest problem in that area of code, so why not stare right at it rather than measure it indirectly? -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
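Bundling those counters into stats-collector messages could look roughly like the sketch below. The message struct, its field list, the message-type constant, and the report function are assumptions for illustration; they follow the general PgStat_Msg* conventions of the backend rather than any committed interface, and pgstat_setheader()/pgstat_send() are internal to pgstat.c in the real source tree:

    #include "postgres.h"
    #include "pgstat.h"
    #include <string.h>

    /* Hypothetical message carrying the bgwriter/checkpoint counters discussed
     * in this thread; field names are illustrative, not a committed API. */
    typedef struct BgWriterStatsMsg
    {
        PgStat_MsgHdr   m_hdr;
        PgStat_Counter  m_buffers_checkpoint;   /* dirty buffers written at checkpoint */
        PgStat_Counter  m_buffers_clean;        /* buffers written by bgwriter scans */
        PgStat_Counter  m_maxwritten_clean;     /* rounds that hit the write limit */
        PgStat_Counter  m_checkpoint_write_us;  /* cumulative write-phase time */
        PgStat_Counter  m_checkpoint_sync_us;   /* cumulative fsync-phase time */
    } BgWriterStatsMsg;

    /* Counters accumulated locally in the bgwriter process between reports. */
    static BgWriterStatsMsg PendingBgWriterStats;

    /*
     * Flush the locally accumulated counters to the stats collector, so they
     * can be read from a system view instead of grepping DEBUG-level logs.
     * PGSTAT_MTYPE_BGWRITER stands in for a new message type that would have
     * to be added alongside the existing ones.
     */
    static void
    report_bgwriter_stats(void)
    {
        pgstat_setheader(&PendingBgWriterStats.m_hdr, PGSTAT_MTYPE_BGWRITER);
        pgstat_send(&PendingBgWriterStats, sizeof(PendingBgWriterStats));
        memset(&PendingBgWriterStats, 0, sizeof(PendingBgWriterStats));
    }

That is essentially the split Magnus suggests: the detailed per-event log lines stay at DEBUG levels for anyone chasing a specific stall, while the cumulative counters land somewhere a monitoring query can reach them.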