Thread: [HACKERS] Proposal: generic WAL compression

[HACKERS] Proposal: generic WAL compression

From: Oleg Ivanov
Hackers,

A few years ago generic WAL was proposed by Alexander Korotkov
(https://www.postgresql.org/message-id/flat/CAPpHfdsXwZmojm6Dx%2BTJnpYk27kT4o7Ri6X_4OSWcByu1Rm%2BVA%40mail.gmail.com#CAPpHfdsXwZmojm6Dx+TJnpYk27kT4o7Ri6X_4OSWcByu1Rm+VA@mail.gmail.com)
and was committed into PostgreSQL 9.6.
One of the advantages of generic WAL is that it offers modules and extensions a
common interface for safe interaction with WAL. The interface allows a module to
register a page and then modify it; generic WAL then generates, and writes into
the xlog, the algorithm for turning the old version of the page into the new
one. In case of a crash, this algorithm can be replayed during recovery.
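
For reference, a minimal sketch of how an extension drives this interface
(as of 9.6); locking, error handling and the actual page modification are
omitted:

    GenericXLogState *state;
    Page        page;

    state = GenericXLogStart(index);                    /* index: the Relation being modified */
    page = GenericXLogRegisterBuffer(state, buffer, 0); /* buffer is pinned and exclusively locked */

    /* ... modify the returned page image, not the buffer contents directly ... */

    GenericXLogFinish(state);   /* builds the delta record, writes it to the xlog and applies the change */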

The Bloom and RUM indexes use generic WAL. The issue is that the generic
algorithm for turning the old page into the new one is not optimal in terms of
the record size in the xlog. Currently the main idea of the approach is to find
all positions in the page where the new version differs from the original one
and to write these changes into the xlog. This works well if the page was
rewritten completely or if only a few bytes were changed. However, this approach
produces an overly large WAL record when a few bytes are inserted into or
deleted from the page: in that case there are almost no positions where the old
and the new version are equal, so the record size is close to the page size,
although the actual amount of change in the page is small. This is exactly what
often happens in RUM indexes.
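
To illustrate with a toy example (not from the patch): after a single byte is
inserted near the start of a fixed-size page, a purely positional comparison
sees almost every byte as changed.

    const char *oldpage = "ABCDEFGHIJKL";  /* a 12-byte "page" */
    const char *newpage = "XABCDEFGHIJK";  /* one byte inserted at the front, last byte pushed out */
    int         i, differing = 0;

    for (i = 0; i < 12; i++)
        if (oldpage[i] != newpage[i])
            differing++;    /* ends up at 12: every position differs, although the
                             * logical change is a single inserted byte */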

In order to overcome that issue, I would like to propose a patch which provides
the possibility to use another approach to WAL record construction. If the new
approach fails to produce a short enough record, it falls back to the original
one. The main idea of the new approach is not to perform a bytewise comparison
of the pages, but to find the minimal edit distance between the two pages and
the corresponding edit script. In the general case, computing the edit distance
takes O(N^2) time and memory, but there is an algorithm which requires only
O(ND) time and O(D^2) memory, where D is the edit distance between the two
sequences. Also, for a given D the algorithm can show, in the same amount of
time and memory, that the minimal edit distance between the two sequences is
greater than D.
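
For readers unfamiliar with the algorithm, below is a standalone sketch of its
distance-only core (insertions and deletions only); this is merely an
illustration based on my reading of the algorithm, not the code from the patch:

    #include <stdlib.h>

    /*
     * Distance-only core of the O(ND) algorithm (Myers, 1986): returns the
     * minimal number of single-byte insertions and deletions needed to turn
     * a[0..n) into b[0..m), or -1 if that number exceeds max_d.
     */
    static int
    edit_distance(const unsigned char *a, int n,
                  const unsigned char *b, int m, int max_d)
    {
        int    *v = malloc(sizeof(int) * (2 * max_d + 2));
        int     d, k, x, y, result = -1;

        if (v == NULL)
            return -1;

        v[max_d + 1] = 0;               /* seed: the d = 0 step starts at x = 0 */
        for (d = 0; d <= max_d && result < 0; d++)
        {
            for (k = -d; k <= d; k += 2)
            {
                /* step down (insertion) or right (deletion), whichever reaches further */
                if (k == -d || (k != d && v[max_d + k - 1] < v[max_d + k + 1]))
                    x = v[max_d + k + 1];
                else
                    x = v[max_d + k - 1] + 1;
                y = x - k;

                /* follow the diagonal: matching bytes cost nothing */
                while (x < n && y < m && a[x] == b[y])
                {
                    x++;
                    y++;
                }
                v[max_d + k] = x;

                if (x >= n && y >= m)
                {
                    result = d;         /* both sequences fully consumed */
                    break;
                }
            }
        }
        free(v);
        return result;
    }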

The special case of this algorithm which does not consider replace operations
is described in the paper (http://www.xmailserver.org/diff2.pdf). The version of
the algorithm which consumes O(ND) time and O(N) memory is used in the diff
console command, but for our purposes we don't need to increase the time
constant in order to decrease the memory complexity. For RUM indexes the edit
distance is usually small enough (less than 64), so D^2 is not too much to
store.
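
Reusing the edit_distance() sketch above, the fallback mentioned earlier could
look roughly like this (illustrative names; oldPage and newPage are assumed to
point to the two BLCKSZ-sized page images):

    #define MAX_DIFF_DISTANCE 64    /* cap on D; larger deltas are not worth encoding */

    bool    useDiffDelta = true;

    if (edit_distance(oldPage, BLCKSZ, newPage, BLCKSZ, MAX_DIFF_DISTANCE) < 0)
        useDiffDelta = false;       /* fall back to the original bytewise delta */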

The results of the experiments:

+------------------------------+----------------+----------------+----------------+----------------+
|             Name             |  WAL_diff(MB)  |  WAL_orig(MB)  |  Time_diff(s)  |  Time_orig(s)  |
+------------------------------+----------------+----------------+----------------+----------------+
| rum: make installcheck       | 38.9           | 82.5           | 4.37           | 4.16           |
+------------------------------+----------------+----------------+----------------+----------------+
| bloom: make installcheck     | 1.0            | 1.0            | 0.66           | 0.53           |
+------------------------------+----------------+----------------+----------------+----------------+
| test00.sql                   | 20.5           | 51.0           | 1.86           | 1.41           |
+------------------------------+----------------+----------------+----------------+----------------+
| test01.sql                   | 207.1          | 219.7          | 8.06           | 6.89           |
+------------------------------+----------------+----------------+----------------+----------------+

We can see that the patch introduces a small slowdown, but compresses generic
WAL efficiently for the RUM index. I'm also going to run a few more experiments
on this patch with other data.

The patch was tested on Lubuntu 14.04, but should not contain any
platform-specific items. The patch, along with the files and scripts for running
the experiments and performance tests, is attached.

Oleg Ivanov
Postgres Professional
The Russian PostgreSQL Company


Attachment

Re: [HACKERS] Proposal: generic WAL compression

From: Antonin Houska
Oleg Ivanov <o.ivanov@postgrespro.ru> wrote:

> In order to overcome that issue, I would like to propose a patch which
> provides the possibility to use another approach to WAL record
> construction.

After having spent several hours reviewing this patch I dare to send the
following comments:


* writeToPtr() and readFromPtr() are applied to the existing code. I think this
  is a reason for a separate diff, to be applied before the actual patch.

BTW, if you're going to do any refactoring of the existing code, I think the
"Begin" and "Start" words should be used consistently, see "fragmentBegin" vs
"validStart" in computeRegionBaseDelta().


* What's the reason for changing FRAGMENT_HEADER_SIZE ? I see no change in the data layout that writeFragment()
produces.


* alignRegions()

** typo in the prologue: "mismathing".

** "An O(ND) Difference Algorithm and Its Variations" - I think the reference  should contain more information, e.g.
nameof the author(s). I think I even  saw URLs to scientific papers in the PG source but I'm not 100% sure right  now. 

** As for the algorithm itself: Although I didn't have enough patience (and
   time) to follow it in every detail for this first review, I think I can
   imagine what it is about. Yet I think reading would be easier if the concepts
   of alignment and "diff delta" were depicted in the comments, perhaps using
   some sort of "ASCII graphics".

** DiffDeltaOperations enumeration: the individual values should be described.


* computeRegionDiffDelta()

** Does "previous delta" in one of the comments refer to "base delta",  i.e. the delta computation w/o your patch? If
so,the comment should  mention it explicitly. 

** If you use Min() to initialize the for-loop, it'd be more consistent if you  also used Max() when checking the
limits:

for (i = Min(validStart, targetStart); i < Max(validEnd, targetEnd); ++i)

And similarly you can simplify the body of the loop.


* computeDelta()

As the expression

*((bool *) pageData->delta)

is used more than once, I think a separate (bool *) variable should be used.
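
For instance, something along these lines (the variable name is mine):

    bool   *isDiffDelta = (bool *) pageData->delta;

    /* subsequent reads and writes then go through isDiffDelta */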


--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


Re: [HACKERS] Proposal: generic WAL compression

From: Arthur Silva
I wonder if the algorithm could be changed to operate with units of 2 or 4 bytes (instead of 1).
Since the algorithm speed is essentially doubled/quadrupled it could yield a better runtime/compression trade-off.

Does that make sense?

--
Arthur Silva




Re: [HACKERS] Proposal: generic WAL compression

From: Antonin Houska
Antonin Houska <ah@cybertec.at> wrote:

> Oleg Ivanov <o.ivanov@postgrespro.ru> wrote:
>
> > In order to overcome that issue, I would like to propose a patch which
> > provides the possibility to use another approach to WAL record
> > construction.
>
> After having spent several hours reviewing this patch I dare to send the
> following comments:

One more idea:

I think the metadata (ALIGN_GAP) should be stored separately from the actual
data so that you can use memcpy() instead of this loop:

    while (i < j)
    {
        char        c = targetRegionAligned[i++];

        writeToPtr(ptr, c);
    }
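
Something like this, assuming the ALIGN_GAP metadata is written elsewhere
(just a sketch of the idea, using the variables from the loop above):

    memcpy(ptr, targetRegionAligned + i, j - i);
    ptr += j - i;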


--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


Re: [HACKERS] Proposal: generic WAL compression

From: Michael Paquier
On Mon, Nov 20, 2017 at 4:49 PM, Antonin Houska <ah@cybertec.at> wrote:
> One more idea:
>
> I think the metadata (ALIGN_GAP) should be stored separately from the actual
> data so that you can use memcpy() instead of this loop:
>
>     while (i < j)
>     {
>             char                c = targetRegionAligned[i++];
>
>             writeToPtr(ptr, c);
>     }

Moved to next CF per lack of reviews. The patch still applies.

Oleg, I would recommend as well that you do more patch reviews. Let's
not forget that for each patch submitted you should review one patch
of somewhat equal difficulty, and yours is quite specialized.
-- 
Michael


Re: [HACKERS] Proposal: generic WAL compression

From: Stephen Frost
Oleg,

I'm not really sure why this is still in Needs Review as a review was
posted and I don't see any follow-up.  I've changed this to be Waiting
for Author.

* Antonin Houska (ah@cybertec.at) wrote:
> Oleg Ivanov <o.ivanov@postgrespro.ru> wrote:
>
> > In order to overcome that issue, I would like to propose a patch which
> > provides the possibility to use another approach to WAL record
> > construction.
>
> After having spent several hours reviewing this patch I dare to send the
> following comments:

Thanks for the review Antonin!

> * writeToPtr() and readFromPtr() are applied to the existing code. I think
>   this is a reason for a separate diff, to be applied before the actual patch.

I'm not really a fan of using these, for my 2c.  I'm not sure how others
feel, but having these macros which change one of the arguments to the
macro (by advancing the pointer) doesn't result in a net improvement to
readability for me.
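
To make the concern concrete, here is my guess at roughly what these macros
look like (the actual patch may differ); note how the first argument is
silently advanced:

    #define writeToPtr(ptr, x) \
        do { memcpy((ptr), &(x), sizeof(x)); (ptr) += sizeof(x); } while (0)
    #define readFromPtr(ptr, x) \
        do { memcpy(&(x), (ptr), sizeof(x)); (ptr) += sizeof(x); } while (0)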

The other review comments also seem pretty reasonable to me.

Thanks!

Stephen

Attachment

Re: Re: [HACKERS] Proposal: generic WAL compression

From: David Steele
Hi Oleg,

On 1/22/18 4:37 PM, Stephen Frost wrote:
> Oleg,
> 
> I'm not really sure why this is still in Needs Review as a review was
> posted and I don't see any follow-up.  I've changed this to be Waiting
> for Author.
> 
> * Antonin Houska (ah@cybertec.at) wrote:
>> Oleg Ivanov <o.ivanov@postgrespro.ru> wrote:
>>
>>> In order to overcome that issue, I would like to propose a patch which
>>> provides the possibility to use another approach to WAL record
>>> construction.
>>
>> After having spent several hours reviewing this patch I dare to send the
>> following comments:
> 
> Thanks for the review Antonin!
> 
>> * writeToPtr() and readFromPtr() are applied to the existing code. I think
>>   this is a reason for a separate diff, to be applied before the actual patch.
> 
> I'm not really a fan of using these, for my 2c.  I'm not sure how others
> feel, but having these macros which change one of the arguments to the
> macro (by advancing the pointer) doesn't result in a net improvement to
> readability for me.
> 
> The other review comments also seem pretty reasonable to me.

This proposal is still in need of review and hasn't really gotten it in
the last few CFs.

Since the feature does not seem like a good fit for the last CF of PG11
I am marking it Returned with Feedback.

Regards,
-- 
-David
david@pgmasters.net


Re: [HACKERS] Proposal: generic WAL compression

From: Markus Nullmeier
Hi David,

On 07/02/18 18:31, David Steele wrote:

> This proposal is still in need of review and hasn't really gotten it in
> the last few CFs.

Agreed.

> Since the feature does not seem like a good fit for the last CF of PG11
> I am marking it Returned with Feedback.

Is there any chance that it may be still resubmitted / reopened for the
March CF?
Actually right now I am in the process of reviewing it, for it would be
really useful for my "OUZO" project (a fork of the RUM access method).

One general comment I can already make is that enabling compression
should be made optional, which appears to be a small and easy addition
to the generic WAL API.

Best regards,

-- 
Markus Nullmeier            http://www.g-vo.org
German Astrophysical Virtual Observatory (GAVO)


Re: [HACKERS] Proposal: generic WAL compression

From: Stephen Frost
Greetings,

* Markus Nullmeier (dq124@uni-heidelberg.de) wrote:
> On 07/02/18 18:31, David Steele wrote:
>
> > This proposal is still in need of review and hasn't really gotten it in
> > the last few CFs.
>
> Agreed.
>
> > Since the feature does not seem like a good fit for the last CF of PG11
> > I am marking it Returned with Feedback.
>
> Is there any chance that it may be still resubmitted / reopened for the
> March CF?

Of course, you're certainly welcome to resubmit it to a future CF.  As
we're getting close to the end of this release cycle, I tend to think it
would be appropriate to submit it for the post-11 CF rather than the one
in March.

> Actually right now I am in the process of reviewing it, for it would be
> really useful for my "OUZO" project (a fork of the RUM access method).

Glad to hear that you're continuing to work on it.

> One general comment I can already make is that enabling compression
> should be made optional, which appears to be a small and easy addition
> to the generic WAL API.

The question would be: is there a way for us to determine when it should
be enabled or not, or do we have to ask the user to tell us?
Asking the user is something we'd really rather not do unless we absolutely
have to.  It would be much better if we can figure out when it's best to
enable or disable (or use/not use) based on some criteria and do so
automatically.

Thanks!

Stephen

Attachment

Re: [HACKERS] Proposal: generic WAL compression

From: Markus Nullmeier
On 07/02/18 19:42, Stephen Frost wrote:
>> really useful for my "OUZO" project (a fork of the RUM access method).
> 
> Glad to hear that you're continuing to work on it.

Yes, it will be available on Github eventually.

>> One general comment I can already make is that enabling compression
>> should be made optional, which appears to be a small and easy addition
>> to the generic WAL API.
> 
> The question would be- is there a way for us to determine when it should
> be enabled or not enabled, or do we have to ask the user to tell us?

My suggestion would be adding something like

   #define GENERIC_XLOG_MOVE_DETECT_ON   0x0002  /* search for moved parts of page image */
   #define GENERIC_XLOG_MOVE_DETECT_OFF  0x0004  /* do not search for moved parts of page image */

This would be used for the 'flags' argument of GenericXLogRegisterBuffer(),
allowing code using generic WAL (such as custom access methods) to
control this kind of compression on a per buffer basis, if need be.
Then it would be up to the author of any extension relying on generic
WAL to decide if and how the user will be given control of WAL
compression.
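
For example, an access method could then request this kind of compression per
buffer roughly as follows (a sketch; only the proposed
GENERIC_XLOG_MOVE_DETECT_ON flag is new, the rest is the existing API):

    GenericXLogState *state = GenericXLogStart(index);
    Page              page  = GenericXLogRegisterBuffer(state, buffer,
                                                        GENERIC_XLOG_MOVE_DETECT_ON);

    /* ... modify the returned page image ... */

    GenericXLogFinish(state);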

However,

> Asking the user is something we'd really rather not do unless we absolutely
> have to.  It would be much better if we can figure out when it's best to
> enable or disable (or use/not use) based on some criteria and do so
> automatically.

this is a point I did not think of before :-)  Probably a logical choice would
be having "GENERIC_XLOG_MOVE_DETECT" default to true for a 'wal_level' value of
'replica' (the default setting nowadays) or higher, and default to false for
'minimal'. In this way, generic WAL will be automatically compressed when it is
potentially stored for a longer time or consumes network bandwidth.
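
In code, such a default could be derived from the existing backend symbols,
roughly:

    bool    move_detect_default = (wal_level >= WAL_LEVEL_REPLICA);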


-- 
Markus Nullmeier            http://www.g-vo.org
German Astrophysical Virtual Observatory (GAVO)


Re: [HACKERS] Proposal: generic WAL compression

From: Oleg Ivanov
Hello everyone!

Unfortunately, I faced a use case with the RUM index in which my patch produced
an enormously large time overhead (query execution is about 2 or 3 times
slower). Only now have I finally managed to limit this overhead to 20% or even
make it statistically insignificant on my HDD. Nevertheless, an SSD evaluation
is still required.
So I attach the new test script totat_test.sh, which includes the bad RUM use
case, and the new version of the patch.

Thanks everybody for the review, it improved the patch a lot. The majority of
the listed drawbacks have been fixed; the others are discussed below.

On 11/16/2017 08:31 PM, Antonin Houska wrote:
> * writeToPtr() and readFromPtr() are applied to the existing code. I think
>    this is a reason for a separate diff, to be applied before the actual patch.
I removed the refactoring of the existing code; that should indeed be a
separate patch.

> * What's the reason for changing FRAGMENT_HEADER_SIZE ? I see no change in
>    the data layout that writeFragment() produces.
The diff delta has its own header, which consists of one char and one int. To
make this point clearer, I defined DIFF_DELTA_HEADER_SIZE and added
2 * DIFF_DELTA_HEADER_SIZE to MAX_DELTA_SIZE.
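
That is, something along these lines (a sketch; the exact definition in the
patch may differ):

    #define DIFF_DELTA_HEADER_SIZE  (sizeof(char) + sizeof(int))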

> ** "An O(ND) Difference Algorithm and Its Variations" - I think the reference
>     should contain more information, e.g. name of the author(s). I think I even
>     saw URLs to scientific papers in the PG source but I'm not 100% sure right
>     now.
The reference to the paper contains much more information now.

> ** As for the algorithm itself: Although I didn't have enough patience (and
>     time) to follow it in every detail for this first review, I think I can
>     imagine what it is about. Yet I think reading would be easier if the
>     concepts of alignment and "diff delta" were depicted in the comments,
>     perhaps using some sort of "ASCII graphics".
I briefly described the main idea of the dynamic programming algorithm in the
comments of alignRegions. It can be refined further if you think it is still
unclear, but I'm afraid a detailed description would result in just copying the
referenced paper into the comments.

> ** DiffDeltaOperations enumeration: the individual values should be described.
>
>
> * computeRegionDiffDelta()
>
> ** Does "previous delta" in one of the comments refer to "base delta",
>     i.e. the delta computation w/o your patch? If so, the comment should
>     mention it explicitly.
>
> ** If you use Min() to initialize the for-loop, it'd be more consistent if you
>     also used Max() when checking the limits:
>
> for (i = Min(validStart, targetStart); i < Max(validEnd, targetEnd); ++i)
>
> And similarly you can simplify the body of the loop.
I found that this part of the code is unnecessary and deleted it.

On 11/17/2017 03:02 PM, Arthur Silva wrote:
> I wonder if the algorithm could be changed to operate with units of 2 
> or 4 bytes (instead of 1).
> Since the algorithm speed is essentially doubled/quadrupled it could 
> yield a better runtime/compression trade-off.
>
> Does that make sense?
This approach implies that we cannot detect a data shift by one byte, for
example. It is suitable for 2- or 4-byte-aligned data, but does not look
general enough.

On 01/23/2018 12:37 AM, Stephen Frost wrote:
>> * writeToPtr() and readFromPtr() are applied to the existing code. I think
>>    this is a reason for a separate diff, to be applied before the actual patch.
> I'm not really a fan of using these, for my 2c.  I'm not sure how others
> feel, but having these macros which change one of the arguments to the
> macro (by advancing the pointer) doesn't result in a net improvement to
> readability for me.
Sounds reasonable, but the patch uses constructions such as writeToPtr rather
often. How do you feel about using writeToPtr(&ptr, data), which more
explicitly indicates that the first argument is changed?
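
For instance, a sketch of that variant (the call site below is hypothetical):

    #define writeToPtr(ptrp, x) \
        do { memcpy(*(ptrp), &(x), sizeof(x)); *(ptrp) += sizeof(x); } while (0)

    /* the mutation of ptr is now visible at the call site */
    writeToPtr(&ptr, fragmentLength);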

On 02/07/2018 09:02 PM, Markus Nullmeier wrote:
> One general comment I can already make is that enabling compression
> should be made optional, which appears to be a small and easy addition
> to the generic WAL API.
Now I'm working on this. I am considering creating the function
GenericXLogRegisterBufferEx in addition to GenericXLogRegisterBuffer.
GenericXLogRegisterBufferEx will take a structure with parameters for the diff
delta, such as the maximal alignment length, which parts of the page it must be
applied to, etc., and GenericXLogRegisterBuffer will call
GenericXLogRegisterBufferEx with default parameters. This allows using
different compression settings for different indexes. What do you think about
such a solution?

Attachment

Re: [HACKERS] Proposal: generic WAL compression

From: Oleg Ivanov
On 02/07/2018 09:02 PM, Markus Nullmeier wrote:
> One general comment I can already make is that enabling compression
> should be made optional, which appears to be a small and easy addition
> to the generic WAL API.

The new version of the patch is attached.

In order to control generic WAL compression, the new structure
PageXLogCompressParams is introduced. It is passed as an additional parameter
to GenericXLogRegisterBufferEx, so the access method can control its own WAL
compression.
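
On the caller's side this looks roughly as follows (a sketch: the struct and
function names are those described above, while the field name and exact
signature are illustrative):

    GenericXLogState       *state;
    Page                    page;
    PageXLogCompressParams  params;

    params.maxAlignmentLength = 64;     /* hypothetical field: cap on the alignment search */

    state = GenericXLogStart(index);
    page = GenericXLogRegisterBufferEx(state, buffer, 0, &params);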

GenericXLogRegisterBuffer uses default compression settings, which appeared to
be a reasonable tradeoff between performance overhead and compression rate on
RUM. On my HDD, PostgreSQL even works 10% faster for some RUM workloads because
of the reduced amount of generic WAL to be written.

Oleg Ivanov
Postgres Professional
The Russian PostgreSQL Company

Attachment