Thread: Re: Curiosity: what is PostgreSQL doing with data when "nothing" is happening?

Re: Curiosity: what is PostgreSQL doing with data when "nothing" is happening?

From: "Kevin Grittner"
Gavan Schneider wrote:
> On Sunday, November 25, 2012 at 04:27, Kevin Grittner wrote:
>
>> You might want to consider grabbing the source for
>> pg_clearxlogtail from pgfoundry and piping through that followed
>> by gzip in your archive script. An "empty" WAL file tends to be
>> about 16KB that way, rather than (for us) ranging between 4MB
>> and 8MB. That way you can keep the archive_timeout fairly short.

>   240 -rw-------+ 1 postgres _postgres  120852 Nov 26 14:56 0000000100000001000001C0.bz2
>  3096 -rw-------+ 1 postgres _postgres 1583557 Nov 26 15:06 0000000100000001000001C1.bz2
>     8 -rw-------+ 1 postgres _postgres     161 Nov 26 15:26 0000000100000001000001C2.bz2
>     8 -rw-------+ 1 postgres _postgres     158 Nov 26 15:26 0000000100000001000001C3.bz2

> PostgreSQL server idle throughout, and spitting out a WAL file
> for archive every 10 minutes

> Impressive what happens when pg_clearxlogtail gets into the mix
> halfway through! :)

I think I can spot where you changed the script.  :-)

Impressive that bzip2 does two orders of magnitude better than gzip
with this. In my benchmarks I found that piping WAL files through
pg_clearxlogtail followed by gzip was consistently faster than just
gzip -- even when the file was full. I put it down to a difference
in how buffering was handled. I'll have to check the performance of
bzip2 compared to gzip.
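
For anyone wanting to reproduce the setup, the whole thing fits in
archive_command; something along these lines (the archive directory
is illustrative, %p and %f are the usual substitutions):

    archive_timeout = 600    # force a segment switch every 10 minutes
    archive_command = 'pg_clearxlogtail < "%p" | gzip > /wal_archive/%f.gz'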

> Would the universe as we know it be upset if there was a
> postgresql.conf option such as:
>
>  archive_zero_fill_on_segment_switch = on|off # default off
>
> This would achieve the idle compression result much more
> elegantly (I know, it's still a hack) for those who have the
> need, without, as far as I can tell, breaking anything else.

The problem is that this would involve 16MB of writes to the OS for
every WAL-file switch, which might even result in that much actual
disk activity in many environments. The filter I wrote doesn't add
any disk I/O and (as I mentioned above), actually decreased CPU
time, at least in my benchmarks on SLES 10. I guess in environments
where the hit wouldn't be noticed and the WAL files were being
compressed anyway it might be convenient, but the gain over piping
through the extra filter in the archive script is small enough to
make it questionable whether the option would be worth it.

-Kevin


Re: Curiosity: what is PostgreSQL doing with data when "nothing" is happening?

From: Gavan Schneider
On Monday, November 26, 2012 at 00:15, Kevin Grittner wrote:

>Impressive that bzip2 does two orders of magnitude better than gzip
>with this. ...

bzip2 adds a few bytes of overhead for each additional large
block of the original file, so some or all of the difference may
only reflect my 8MB vs the usual 16MB WAL files. And this is not
a real saving until the results are put into a tarball, since
each tiny (or not-so-tiny) file still consumes an inode and at
least one disk block.
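
For example, a nightly batching step along these lines (names and
paths purely illustrative) would fold the day's segments into a
single archive file:

    # illustrative only: batch the day's compressed segments into
    # one tarball; a real script would guard against segments that
    # arrive while it runs
    cd /wal_archive &&
    ls *.bz2 > batch.list &&
    tar cf wal_$(date +%Y%m%d).tar -T batch.list &&
    xargs rm < batch.list && rm batch.list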

The differences that matter are for the partly filled WAL files
holding real data... I didn't test this; rather, I'm going on my
experience that bzip2 distribution archives are consistently
smaller than their gzip alternatives. And bzip2 was available.

>Gavan Schneider:
>>Would the universe as we know it be upset if there was a
>>postgresql.conf option such as:
>>
>> archive_zero_fill_on_segment_switch = on|off # default off
>>
>>This would achieve the idle compression result much more
>>elegantly (I know, it's still a hack) for those who have the
>>need, without, as far as I can tell, breaking anything else.
>
>The problem is that this would involve 16MB of writes to the OS for
>every WAL-file switch, which might even result in that much actual
>disk activity in many environments. The filter I wrote doesn't add
>any disk I/O ...
>
Point taken.


More musings...

Maybe an optimisation that could work for many would be for the
initial block of the WAL file to record how much of the file
holds useful data, and for short WAL files (i.e., trimmed to the
valid data) to be acceptable to postgres on restore/replication.
(Obviously this is not for the cluster's own pool of
working/recycled WAL files.)

During normal operations, while the WAL file is being overwritten,
the proposed flag in the initial file segment would be set to
zero to indicate that all page segments have to be checked on
replay (i.e., existing behavior in crash recovery), but at WAL
switchover the first file segment gets updated with the proposed
flag set to record the extent of valid data that follows. Only
the indicated data would be restored/replicated when the file is
read by postgres. Tools such as pg_clearxlogtail would then only
need to inspect the first part of the WAL file, read the
end_of_valid_data offset, and copy exactly that much to output.
This would save reading/checking the rest of the WAL file and
outputting the padding zeros.
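
With such a flag in place, the archive-side filter reduces to
reading the offset and copying exactly that many bytes. A sketch,
where pg_walvalidlen is a purely hypothetical helper standing in
for whatever would read the proposed field from the first page:

    # hypothetical: pg_walvalidlen does not exist; it stands in
    # for a tool printing the proposed end_of_valid_data offset
    valid=$(pg_walvalidlen "$SEGMENT") &&
    head -c "$valid" "$SEGMENT" | bzip2 > "$DEST".bz2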

People running big installations could benefit as well, since
smaller update files (esp. if compressed) can move around a
network a lot faster in replication environments.

AFAIK the downside would be one extra disk write per WAL
changeover. The worst-case scenario is just the status quo,
i.e., the whole WAL file has to be processed because it has been
filled with data. Note that in this case you may as well leave
the proposed flag in the first file segment as zero (no extra
write needed), since this correctly indicates the whole file has
to be processed.

The upside, at least for those who need to roll out incomplete
WAL files to satisfy timing needs, is that they could work with
files that are only as big as they need to be.

And a postgresql.conf switch could isolate the cost of that
minor extra file update per WAL changeover to those who can
benefit from it.
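
In the same notation as before, the knob might look something like
this (name purely illustrative):

    wal_mark_valid_extent_on_switch = on|off    # default off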

Worth any further thoughts?

Regards
Gavan Schneider