Thread: Separate BLCKSZ for data and logging
Hi all,

I've been wondering if there might be anything to gain by having a separate block size for logging and data. I thought I might try defining DATA_BLCKSZ and LOG_BLCKSZ and see what kind of trouble I get myself into.

I wasn't able to find any previous discussion, but perhaps 'separate BLKSZ' were poor search parameters to use. Any thoughts?

Thanks,
Mark
On 3/16/06, Mark Wong <markw@osdl.org> wrote:
> I've been wondering if there might be anything to gain by having a
> separate block size for logging and data. I thought I might try
> defining DATA_BLCKSZ and LOG_BLCKSZ and see what kind of trouble I get
> myself into.

If you're going to try it out, here's a starting point based on the block sizes used by Oracle:

512 bytes on Linux, Solaris, AIX, Windows
1K on HP-UX and Tru64
2K on SCO
4K on MVS

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324
On Thu, 2006-03-16 at 08:21 -0800, Mark Wong wrote:
> I've been wondering if there might be anything to gain by having a
> separate block size for logging and data. I thought I might try
> defining DATA_BLCKSZ and LOG_BLCKSZ and see what kind of trouble I get
> myself into.
>
> I wasn't able to find any previous discussion but perhaps 'separate
> BLKSZ' were poor parameters to use. Any thoughts?

I see your thinking.... presumably a performance tuning thought?

Overall, the two things are fairly separate, apart from the fact that we do currently log whole data blocks straight to the log. Usually just one, but possibly two or three. So I have a feeling that things would become less efficient if you did this, not more.

But it's a good line of thought and I'll have a look at that.

Best Regards,
Simon Riggs
On Thu, 16 Mar 2006 19:37:07 +0000 Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2006-03-16 at 08:21 -0800, Mark Wong wrote:
> > I've been wondering if there might be anything to gain by having a
> > separate block size for logging and data. I thought I might try
> > defining DATA_BLCKSZ and LOG_BLCKSZ and see what kind of trouble I get
> > myself into.
> >
> > I wasn't able to find any previous discussion but perhaps 'separate
> > BLKSZ' were poor parameters to use. Any thoughts?
>
> I see your thinking.... presumably a performance tuning thought?

Yeah. :)

> Overall, the two things are fairly separate, apart from the fact that we
> do currently log whole data blocks straight to the log. Usually just
> one, but possibly two or three. So I have a feeling that things would
> become less efficient if you did this, not more.

I was hoping that in the case where 2 or more data blocks are written to the log that they could be written once within a single larger log block. The log block size must be larger than the data block size, of course.

Thanks,
Mark
Simon Riggs <simon@2ndquadrant.com> writes:
> Overall, the two things are fairly separate, apart from the fact that we
> do currently log whole data blocks straight to the log. Usually just
> one, but possibly two or three. So I have a feeling that things would
> become less efficient if you did this, not more.
>
> But it's a good line of thought and I'll have a look at that.

I too think reducing the size of WAL blocks might be a win, because we currently always write whole blocks, and so a series of small transactions will be rewriting the same 8K block multiple times. If the filesystem's native block size is less than 8K, matching that size should theoretically make things faster. Whether it makes enough difference to be worth the trouble is another question ...

regards, tom lane
On Thu, 2006-03-16 at 12:22 -0800, Mark Wong wrote:
> I was hoping that in the case where 2 or more data blocks are written to
> the log that they could be written once within a single larger log block.
> The log block size must be larger than the data block size, of course.

I think Tom's right... the OS blocksize is smaller than BLCKSZ, so reducing the size might help with a very high transaction load when commits are required very frequently. At checkpoint it sounds like we might benefit from a large WAL blocksize because of all the additional blocks written, but we often write more than one block at a time anyway, and that still translates to multiple OS blocks whichever way you cut it, so I'm not convinced yet.

On Thu, 2006-03-16 at 15:21 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Overall, the two things are fairly separate, apart from the fact that we
> > do currently log whole data blocks straight to the log. Usually just
> > one, but possibly two or three. So I have a feeling that things would
> > become less efficient if you did this, not more.
> >
> > But it's a good line of thought and I'll have a look at that.
>
> I too think reducing the size of WAL blocks might be a win, because
> we currently always write whole blocks, and so a series of small
> transactions will be rewriting the same 8K block multiple times.
> If the filesystem's native block size is less than 8K, matching that
> size should theoretically make things faster.

Might it be possible to do this: when committing, if the current WAL page is less than half-full, wait for a single spin-lock cycle and then do the write? (With the spin-lock, I mean on a single CPU we wait zero, on a multi-CPU we wait a while.) This is effectively a modification of the group commit idea, but not to wait every time - only when it is write-efficient to do so. (And we'd make that optional, too.) We could then ditch the remnant of the group-commit code.

Best Regards,
Simon Riggs
On Thu, 16 Mar 2006 20:51:54 +0000 Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2006-03-16 at 12:22 -0800, Mark Wong wrote:
> > I was hoping that in the case where 2 or more data blocks are written to
> > the log that they could be written once within a single larger log block.
> > The log block size must be larger than the data block size, of course.
>
> I think Tom's right... the OS blocksize is smaller than BLCKSZ, so
> reducing the size might help with a very high transaction load when
> commits are required very frequently. At checkpoint it sounds like we
> might benefit from a large WAL blocksize because of all the additional
> blocks written, but we often write more than one block at a time anyway,
> and that still translates to multiple OS blocks whichever way you cut
> it, so I'm not convinced yet.
>
> On Thu, 2006-03-16 at 15:21 -0500, Tom Lane wrote:
> > I too think reducing the size of WAL blocks might be a win, because
> > we currently always write whole blocks, and so a series of small
> > transactions will be rewriting the same 8K block multiple times.
> > If the filesystem's native block size is less than 8K, matching that
> > size should theoretically make things faster.
>
> Might it be possible to do this: when committing, if the current WAL
> page is less than half-full, wait for a single spin-lock cycle and then
> do the write? (With the spin-lock, I mean on a single CPU we wait zero,
> on a multi-CPU we wait a while.) This is effectively a modification of
> the group commit idea, but not to wait every time - only when it is
> write-efficient to do so. (And we'd make that optional, too.) We could
> then ditch the remnant of the group-commit code.

Sounds like there is some agreement that this could be an interesting exercise. I'll see what I can do.

Thanks,
Mark
"Simon Riggs" <simon@2ndquadrant.com> wrote:
> I think Tom's right... the OS blocksize is smaller than BLCKSZ, so
> reducing the size might help with a very high transaction load when
> commits are required very frequently. At checkpoint it sounds like we
> might benefit from a large WAL blocksize because of all the additional
> blocks written, but we often write more than one block at a time anyway,
> and that still translates to multiple OS blocks whichever way you cut
> it, so I'm not convinced yet.

As I observed from another database system, they really did something like this. You can see the disk write sequence is something like:

512 512 2048 4096 32768 512 ...

That is, the xlog writes are always aligned to the disk sector size (required by O_DIRECT), and they try to write out as much as possible (but within an upper bound like 32768, I guess). As I understand it, this change would not take too much trouble; maybe a local change in XLogWrite() is enough.

Regards,
Qingqing