Thread: Separate BLCKSZ for data and logging

Separate BLCKSZ for data and logging

From
Mark Wong
Date:
Hi all,

I've been wondering if there might be anything to gain by having a
separate block size for logging and data.  I thought I might try
defining DATA_BLCKSZ and LOG_BLCKSZ and see what kind of trouble I get
myself into.

I wasn't able to find any previous discussion, but perhaps 'separate
BLKSZ' were poor parameters to use.  Any thoughts?

Thanks,
Mark


Re: Separate BLCKSZ for data and logging

From
"Jonah H. Harris"
Date:
On 3/16/06, Mark Wong <markw@osdl.org> wrote:
> I've been wondering if there might be anything to gain by having a
> separate block size for logging and data.  I thought I might try
> defining DATA_BLCKSZ and LOG_BLCKSZ and see what kind of trouble I get
> myself into.

If you're going to try it out, here's a starting point based on the block sizes used by Oracle:

512 bytes on Linux, Solaris, AIX, Windows
1K on HP-UX and Tru64
2K on SCO
4K on MVS

--
Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
732.331.1324

Re: Separate BLCKSZ for data and logging

From
Simon Riggs
Date:
On Thu, 2006-03-16 at 08:21 -0800, Mark Wong wrote:

> I've been wondering if there might be anything to gain by having a
> separate block size for logging and data.  I thought I might try
> defining DATA_BLCKSZ and LOG_BLCKSZ and see what kind of trouble I get
> myself into.
> 
> I wasn't able to find any previous discussion, but perhaps 'separate
> BLKSZ' were poor parameters to use.  Any thoughts?

I see your thinking.... presumably a performance tuning thought?

Overall, the two things are fairly separate, apart from the fact that we
do currently log whole data blocks straight to the log. Usually just
one, but possibly two or three. So I have a feeling that things would
become less efficient if you did this, not more.

But it's a good line of thought and I'll have a look at that.

Best Regards, Simon Riggs



Re: Separate BLCKSZ for data and logging

From
Mark Wong
Date:
On Thu, 16 Mar 2006 19:37:07 +0000
Simon Riggs <simon@2ndquadrant.com> wrote:

> On Thu, 2006-03-16 at 08:21 -0800, Mark Wong wrote:
> 
> > I've been wondering if there might be anything to gain by having a
> > separate block size for logging and data.  I thought I might try
> > defining DATA_BLCKSZ and LOG_BLCKSZ and see what kind of trouble I get
> > myself into.
> > 
> > I wasn't able to find any previous discussion, but perhaps 'separate
> > BLKSZ' were poor parameters to use.  Any thoughts?
> 
> I see your thinking.... presumably a performance tuning thought?

Yeah. :)

> Overall, the two things are fairly separate, apart from the fact that we
> do currently log whole data blocks straight to the log. Usually just
> one, but possibly two or three. So I have a feeling that things would
> become less efficient if you did this, not more.

I was hoping that in the case where 2 or more data blocks are written to
the log that they could be written once within a single larger log block.
The log block size must be larger than the data block size, of course.

Thanks,
Mark


Re: Separate BLCKSZ for data and logging

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Overall, the two things are fairly separate, apart from the fact that we
> do currently log whole data blocks straight to the log. Usually just
> one, but possibly two or three. So I have a feeling that things would
> become less efficient if you did this, not more.

> But it's a good line of thought and I'll have a look at that.

I too think reducing the size of WAL blocks might be a win, because
we currently always write whole blocks, and so a series of small
transactions will be rewriting the same 8K block multiple times.
If the filesystem's native block size is less than 8K, matching that
size should theoretically make things faster.

Whether it makes enough difference to be worth the trouble is another
question ...
        regards, tom lane
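
The rewrite cost described above can be put into rough numbers. The following is a minimal sketch, not PostgreSQL code: the function name `wal_bytes_written` is hypothetical, every commit is assumed to flush each whole WAL block its record touches, and page and record headers are ignored.

```python
def wal_bytes_written(commit_sizes, wal_block_size):
    """Bytes physically written when each commit flushes whole WAL blocks.

    Simplified model (an assumption, not the real WAL format): records are
    appended back to back, each has a positive size, and a commit rewrites
    every block its record touches, including a partially filled block
    that an earlier commit already wrote.
    """
    total = 0
    offset = 0  # logical end of the log, in bytes
    for size in commit_sizes:
        first_block = offset // wal_block_size
        offset += size
        last_block = (offset - 1) // wal_block_size
        total += (last_block - first_block + 1) * wal_block_size
    return total


# Ten small commits of 100 bytes each (1000 bytes of log in total).
small_commits = [100] * 10
bytes_with_8k = wal_bytes_written(small_commits, 8192)
bytes_with_512 = wal_bytes_written(small_commits, 512)
```

In this model the ten commits all land in the same 8K block, so that block is written ten times, while with 512-byte blocks only the blocks actually touched are rewritten. Whether the device and filesystem turn that into a real speedup is a separate question that the model does not answer.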


Re: Separate BLCKSZ for data and logging

From
Simon Riggs
Date:
On Thu, 2006-03-16 at 12:22 -0800, Mark Wong wrote:

> I was hoping that in the case where 2 or more data blocks are written to
> the log that they could be written once within a single larger log block.
> The log block size must be larger than the data block size, of course.

I think Tom's right... the OS blocksize is smaller than BLCKSZ, so
reducing the size might help with a very high transaction load when
commits are required very frequently. At checkpoint it sounds like we
might benefit from a large WAL blocksize because of all the additional
blocks written, but we often write more than one block at a time anyway,
and that still translates to multiple OS blocks whichever way you cut
it, so I'm not convinced yet.

On Thu, 2006-03-16 at 15:21 -0500, Tom Lane wrote: 
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Overall, the two things are fairly separate, apart from the fact that we
> > do currently log whole data blocks straight to the log. Usually just
> > one, but possibly two or three. So I have a feeling that things would
> > become less efficient if you did this, not more.
> 
> > But it's a good line of thought and I'll have a look at that.
> 
> I too think reducing the size of WAL blocks might be a win, because
> we currently always write whole blocks, and so a series of small
> transactions will be rewriting the same 8K block multiple times.
> If the filesystem's native block size is less than 8K, matching that
> size should theoretically make things faster.

Might it be possible to do this: When committing, if the current WAL
page is less than half-full wait for a single spin-lock cycle and then
do the write? (With the spin-lock, I mean on a single CPU we wait zero,
on a multi-CPU we wait a while). This is effectively a modification of
the group commit idea, but not to wait every time - only when it is
write-efficient to do so. (And we'd make that optional, too). We could
then ditch the remnant of the group-commit code.

Best Regards, Simon Riggs
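
The conditional wait proposed above can be sketched as a simple decision rule. This is an illustration under stated assumptions, not PostgreSQL code: `should_delay_commit` and its parameters are hypothetical names, and the length of the wait itself is left out.

```python
def should_delay_commit(bytes_used_in_page, wal_page_size, num_cpus):
    """Decide whether a committing backend should pause before flushing.

    Hypothetical rule taken from the proposal: wait only when the current
    WAL page is less than half full, and never on a single-CPU machine,
    where no other backend can make progress during the wait.
    """
    if num_cpus <= 1:
        return False
    return bytes_used_in_page * 2 < wal_page_size


# A nearly empty 8K page on a 4-CPU machine: worth waiting for company.
delay_when_sparse = should_delay_commit(1000, 8192, 4)
# A page that is already more than half full: write immediately.
delay_when_full = should_delay_commit(5000, 8192, 4)
```

The point of the rule is that the delay is paid only when it is likely to save a rewrite of the same page, rather than on every commit as in plain group commit.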



Re: Separate BLCKSZ for data and logging

From
Mark Wong
Date:
On Thu, 16 Mar 2006 20:51:54 +0000
Simon Riggs <simon@2ndquadrant.com> wrote:

> On Thu, 2006-03-16 at 12:22 -0800, Mark Wong wrote:
> 
> > I was hoping that in the case where 2 or more data blocks are written to
> > the log that they could be written once within a single larger log block.
> > The log block size must be larger than the data block size, of course.
> 
> I think Tom's right... the OS blocksize is smaller than BLCKSZ, so
> reducing the size might help with a very high transaction load when
> commits are required very frequently. At checkpoint it sounds like we
> might benefit from a large WAL blocksize because of all the additional
> blocks written, but we often write more than one block at a time anyway,
> and that still translates to multiple OS blocks whichever way you cut
> it, so I'm not convinced yet.
> 
> On Thu, 2006-03-16 at 15:21 -0500, Tom Lane wrote: 
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > Overall, the two things are fairly separate, apart from the fact that we
> > > do currently log whole data blocks straight to the log. Usually just
> > > one, but possibly two or three. So I have a feeling that things would
> > > become less efficient if you did this, not more.
> > 
> > > But it's a good line of thought and I'll have a look at that.
> > 
> > I too think reducing the size of WAL blocks might be a win, because
> > we currently always write whole blocks, and so a series of small
> > transactions will be rewriting the same 8K block multiple times.
> > If the filesystem's native block size is less than 8K, matching that
> > size should theoretically make things faster.
> 
> Might it be possible to do this: When committing, if the current WAL
> page is less than half-full wait for a single spin-lock cycle and then
> do the write? (With the spin-lock, I mean on a single CPU we wait zero,
> on a multi-CPU we wait a while). This is effectively a modification of
> the group commit idea, but not to wait every time - only when it is
> write-efficient to do so. (And we'd make that optional, too). We could
> then ditch the remnant of the group-commit code.

Sounds like there is some agreement that this could be an interesting
exercise.  I'll see what I can do.

Thanks,
Mark


Re: Separate BLCKSZ for data and logging

From
"Qingqing Zhou"
Date:
"Simon Riggs" <simon@2ndquadrant.com> wrote
>
> I think Tom's right... the OS blocksize is smaller than BLCKSZ, so
> reducing the size might help with a very high transaction load when
> commits are required very frequently. At checkpoint it sounds like we
> might benefit from a large WAL blocksize because of all the additional
> blocks written, but we often write more than one block at a time anyway,
> and that still translates to multiple OS blocks whichever way you cut
> it, so I'm not convinced yet.
>

From what I have observed of other database systems, they do something
like this. The disk write sequence looks something like:
   512   512   2048   4196   32768   512   ...

That is, the xlog write size is always aligned to the disk sector size
(required by O_DIRECT), and each write tries to flush as much as possible
(but within an upper bound like 32768, I guess). As I understand it, this
change would not take too much trouble; a local change in XLogWrite() may
be enough.

Regards,
Qingqing
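
The sector-aligned write pattern described above can be sketched as follows. This is a minimal illustration, not PostgreSQL code: the function name `aligned_write_sizes` is hypothetical, the 512-byte sector and 32768-byte cap are taken from the message as assumed defaults, and the cap is assumed to be a multiple of the sector size.

```python
SECTOR_SIZE = 512   # assumed disk sector size
MAX_WRITE = 32768   # assumed upper bound on a single write


def aligned_write_sizes(pending_bytes, sector_size=SECTOR_SIZE,
                        max_write=MAX_WRITE):
    """Split pending log bytes into sector-aligned write sizes.

    Each write covers as much pending data as the cap allows, and its
    size is rounded up to a whole number of sectors, as direct I/O
    (O_DIRECT) requires. The last write may therefore include padding.
    """
    if pending_bytes < 0:
        raise ValueError("pending_bytes must not be negative")
    sizes = []
    remaining = pending_bytes
    while remaining > 0:
        chunk = min(remaining, max_write)
        sectors = -(-chunk // sector_size)  # ceiling division
        sizes.append(sectors * sector_size)
        remaining -= chunk
    return sizes


# A 100-byte commit record costs one sector, not a whole 8K block.
one_small_commit = aligned_write_sizes(100)
# A large backlog is split at the cap, with the tail padded to a sector.
large_backlog = aligned_write_sizes(40000)
```

Small commits then cost a single sector each, while a backlog that has built up is flushed in large writes, which matches the mix of small and large sizes in the sequence quoted above.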