Thread: Compressed pluggable storage experiments

Compressed pluggable storage experiments

From: Ildar Musin
Hi hackers,

I've been experimenting with the pluggable storage API recently and
would like to share my first experience. First of all, it's great to
have this API, and that the community now has the opportunity to
implement alternative storage engines. A few applications come to
mind, and compressed storage is one of them.

Recently I've been working on a simple append-only compressed storage
[1]. My first idea was to store data in compressed 1MB blocks in a
contiguous file and keep a separate file for block offsets (similar to
Knizhnik's CFS proposal). But then I realized that I wouldn't be able
to use most of Postgres' infrastructure, like WAL-logging, and also
wouldn't be able to implement some of the TableAmRoutine functions
(like bitmap scan or analyze). So I had to adjust the extension to use
standard Postgres 8KB blocks: compressed 1MB blocks are split into
chunks and distributed among 8KB blocks. The current page layout looks
like this:

┌───────────┐
│ metapage  │
└───────────┘
┌───────────┐  ┐
│  block 1  │  │
├────...────┤  │ compressed 1MB block
│  block k  │  │
└───────────┘  ┘
┌───────────┐  ┐
│ block k+1 │  │
├────...────┤  │ another compressed 1MB block
│  block m  │  │
└───────────┘  ┘

Inside the compressed blocks there are regular Postgres heap tuples.

The following is a list of things I stumbled upon while implementing
the storage. Since the API has only just come out, there are not many
examples of pluggable storages, and even fewer available as external
extensions (I managed to find only blackhole_am by Michael Paquier,
which doesn't do much). So I had to figure out many things by myself.
Hopefully some of these issues have a solution that I just can't see.

1. Unlike the FDW API, the pluggable storage API has no routines like
"begin modify table" and "end modify table", and there is no shared
state between insert/update/delete calls. In the context of compressed
storage this means there is no definite moment at which we can
finalize writes (compress, split into chunks, etc.). We can set a
callback at the end of the transaction, but in that case we would have
to keep the latest modifications for every table in memory until the
transaction ends. As for shared state, we could also maintain some
kind of key-value data structure with per-relation shared state, but
that again requires memory. Because of this I have currently
implemented only COPY semantics.
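
Roughly, the callback workaround would look like this (just a sketch;
the cryo_* names are made up, not pg_cryogen's actual code):

```c
#include "postgres.h"
#include "access/xact.h"

/* Hypothetical helper that compresses the buffered tuples for every
 * modified relation and splits the 1MB blocks into 8KB-page chunks. */
extern void cryo_flush_pending_writes(void);

static void
cryo_xact_callback(XactEvent event, void *arg)
{
    /* Until this point all pending modifications have to be kept in
     * backend-local memory, which is exactly the problem above. */
    if (event == XACT_EVENT_PRE_COMMIT)
        cryo_flush_pending_writes();
}

void
_PG_init(void)
{
    RegisterXactCallback(cryo_xact_callback, NULL);
}
```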

2. It looks like I cannot implement custom storage options. E.g., for
compressed storage it makes sense to implement different compression
methods (lz4, zstd, etc.) and corresponding options (like compression
level). But as far as I can see, storage options (like fillfactor) are
hardcoded and not extensible. A possible workaround is to use GUCs,
which would work but is not terribly convenient.
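
For reference, the GUC workaround would look something like this (a
sketch; the option name and range are made up):

```c
#include "postgres.h"
#include "utils/guc.h"

static int cryo_compression_level = 1;  /* hypothetical option */

void
_PG_init(void)
{
    /* Exposed as a custom GUC because reloptions are not extensible;
     * unlike a reloption, this cannot vary per table in ALTER TABLE. */
    DefineCustomIntVariable("pg_cryogen.compression_level",
                            "Compression level used for new blocks.",
                            NULL,
                            &cryo_compression_level,
                            1,           /* boot value */
                            1,           /* min */
                            19,          /* max, e.g. zstd's range */
                            PGC_USERSET,
                            0,
                            NULL, NULL, NULL);
}
```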

3. A somewhat surprising limitation: in order to use bitmap scans, the
maximum number of tuples per page must not exceed 291, due to the
MAX_TUPLES_PER_PAGE macro in tidbitmap.c, which is calculated based on
the 8KB page size. In the case of a 1MB page this restriction feels
really limiting.

4. In order to use WAL-logging, each page must start with a standard
24-byte PageHeaderData, even if it is needless for the storage itself.
Not a big deal though. Another (actually documented) WAL-related
limitation is that only generic WAL can be used within an extension. So
unless inserts are made in bulk, it's going to require a lot of disk
space to accommodate the logs and wide bandwidth for replication.
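
For context, a generic-WAL write looks roughly like this (a sketch;
cryo_write_chunk is a made-up name, and the caller is assumed to hold
an exclusive lock on the buffer):

```c
#include "postgres.h"
#include "access/generic_xlog.h"
#include "storage/bufmgr.h"

static void
cryo_write_chunk(Relation rel, Buffer buffer, char *chunk, int chunk_size)
{
    GenericXLogState *state;
    Page        page;

    state = GenericXLogStart(rel);

    /* returns a working copy of the page image to modify */
    page = GenericXLogRegisterBuffer(state, buffer, 0);

    memcpy(PageGetContents(page), chunk, chunk_size);

    /* emits a WAL record with a delta (or full image) of the page */
    GenericXLogFinish(state);
}
```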

The pg_cryogen extension is still in development, so if other issues
arise I'll post them here. At this point the extension already supports
inserts via COPY, index and bitmap scans, vacuum (only freezing), and
analyze. It uses lz4 compression, and currently I'm working on adding
different compression methods. I'm also willing to work on the
aforementioned issues in the API if the community verifies them as
valid.


[1] https://github.com/adjust/pg_cryogen

Thanks,
Ildar

Re: Compressed pluggable storage experiments

From: Alvaro Herrera
On 2019-Oct-10, Ildar Musin wrote:

> 1. Unlike the FDW API, the pluggable storage API has no routines like
> "begin modify table" and "end modify table", and there is no shared
> state between insert/update/delete calls.

Hmm.  I think adding a begin/end to modifytable is a reasonable thing to
do (it'd be a no-op for heap and zheap I guess).

> 2. It looks like I cannot implement custom storage options. E.g., for
> compressed storage it makes sense to implement different compression
> methods (lz4, zstd, etc.) and corresponding options (like compression
> level). But as far as I can see, storage options (like fillfactor) are
> hardcoded and not extensible. A possible workaround is to use GUCs,
> which would work but is not terribly convenient.

Yeah, the reloptions module is undergoing some changes.  I expect that
there will be a way to extend reloptions from an extension, at the end
of that set of patches.

> 3. A somewhat surprising limitation: in order to use bitmap scans, the
> maximum number of tuples per page must not exceed 291, due to the
> MAX_TUPLES_PER_PAGE macro in tidbitmap.c, which is calculated based on
> the 8KB page size. In the case of a 1MB page this restriction feels
> really limiting.

I suppose this is a hardcoded limit that needs to be fixed by patching
core as we make table AM more pervasive.

> 4. In order to use WAL-logging, each page must start with a standard
> 24-byte PageHeaderData, even if it is needless for the storage itself.
> Not a big deal though. Another (actually documented) WAL-related
> limitation is that only generic WAL can be used within an extension. So
> unless inserts are made in bulk, it's going to require a lot of disk
> space to accommodate the logs and wide bandwidth for replication.

Not sure what to suggest.  Either you should ignore this problem, or
you should fix it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Compressed pluggable storage experiments

From: Andres Freund
Hi,

On 2019-10-17 12:47:47 -0300, Alvaro Herrera wrote:
> On 2019-Oct-10, Ildar Musin wrote:
> 
> > 1. Unlike the FDW API, the pluggable storage API has no routines like
> > "begin modify table" and "end modify table", and there is no shared
> > state between insert/update/delete calls.
> 
> Hmm.  I think adding a begin/end to modifytable is a reasonable thing to
> do (it'd be a no-op for heap and zheap I guess).

I'm fairly strongly against that. Adding two additional "virtual"
function calls for something that's rarely going to be used seems like
too much overhead to me.


> > 2. It looks like I cannot implement custom storage options. E.g., for
> > compressed storage it makes sense to implement different compression
> > methods (lz4, zstd, etc.) and corresponding options (like compression
> > level). But as far as I can see, storage options (like fillfactor) are
> > hardcoded and not extensible. A possible workaround is to use GUCs,
> > which would work but is not terribly convenient.
> 
> Yeah, the reloptions module is undergoing some changes.  I expect that
> there will be a way to extend reloptions from an extension, at the end
> of that set of patches.

Cool.


> > 3. A somewhat surprising limitation: in order to use bitmap scans, the
> > maximum number of tuples per page must not exceed 291, due to the
> > MAX_TUPLES_PER_PAGE macro in tidbitmap.c, which is calculated based on
> > the 8KB page size. In the case of a 1MB page this restriction feels
> > really limiting.
> 
> I suppose this is a hardcoded limit that needs to be fixed by patching
> core as we make table AM more pervasive.

That's not unproblematic - a dynamic limit would make a number of
computations more expensive, and we already spend plenty of CPU cycles
building the TID bitmap. And we'd waste plenty of memory just having
all that space for the worst case.  ISTM that we "just" need to replace
the TID bitmap with some tree-like structure.


> > 4. In order to use WAL-logging, each page must start with a standard
> > 24-byte PageHeaderData, even if it is needless for the storage itself.
> > Not a big deal though. Another (actually documented) WAL-related
> > limitation is that only generic WAL can be used within an extension. So
> > unless inserts are made in bulk, it's going to require a lot of disk
> > space to accommodate the logs and wide bandwidth for replication.
> 
> Not sure what to suggest.  Either you should ignore this problem, or
> you should fix it.

I think if it becomes a problem you should ask for an rmgr ID to use
for your extension, which we encode and then allow setting the relevant
rmgr callbacks for that rmgr ID at startup.  But you should obviously
first develop the WAL logging etc., and make sure it's beneficial over
generic WAL logging for your case.

Greetings,

Andres Freund



Re: Compressed pluggable storage experiments

From: Tomas Vondra
On Fri, Oct 18, 2019 at 03:25:05AM -0700, Andres Freund wrote:
>Hi,
>
>On 2019-10-17 12:47:47 -0300, Alvaro Herrera wrote:
>> On 2019-Oct-10, Ildar Musin wrote:
>>
>> > 1. Unlike the FDW API, the pluggable storage API has no routines like
>> > "begin modify table" and "end modify table", and there is no shared
>> > state between insert/update/delete calls.
>>
>> Hmm.  I think adding a begin/end to modifytable is a reasonable thing to
>> do (it'd be a no-op for heap and zheap I guess).
>
>I'm fairly strongly against that. Adding two additional "virtual"
>function calls for something that's rarely going to be used seems like
>too much overhead to me.
>

That seems a bit strange to me. Sure - if there's an alternative way to
achieve the desired behavior (a clear way to finalize writes etc.),
then cool, let's do that. But forcing people to use inconvenient
workarounds seems like a bad thing to me - having a convenient and
clear API is quite valuable, IMHO.

Let's see if this actually has a measurable overhead first.

>
>> > 2. It looks like I cannot implement custom storage options. E.g., for
>> > compressed storage it makes sense to implement different compression
>> > methods (lz4, zstd, etc.) and corresponding options (like compression
>> > level). But as far as I can see, storage options (like fillfactor) are
>> > hardcoded and not extensible. A possible workaround is to use GUCs,
>> > which would work but is not terribly convenient.
>>
>> Yeah, the reloptions module is undergoing some changes.  I expect that
>> there will be a way to extend reloptions from an extension, at the end
>> of that set of patches.
>
>Cool.
>

Yep.

>
>> > 3. A somewhat surprising limitation: in order to use bitmap scans, the
>> > maximum number of tuples per page must not exceed 291, due to the
>> > MAX_TUPLES_PER_PAGE macro in tidbitmap.c, which is calculated based on
>> > the 8KB page size. In the case of a 1MB page this restriction feels
>> > really limiting.
>>
>> I suppose this is a hardcoded limit that needs to be fixed by patching
>> core as we make table AM more pervasive.
>
>That's not unproblematic - a dynamic limit would make a number of
>computations more expensive, and we already spend plenty of CPU cycles
>building the TID bitmap. And we'd waste plenty of memory just having
>all that space for the worst case.  ISTM that we "just" need to replace
>the TID bitmap with some tree-like structure.
>

I think zedstore has roughly the same problem, and Heikki mentioned
some possible solutions for dealing with it in his pgconf.eu talk (and
it was discussed in the zedstore thread, I think).

>
>> > 4. In order to use WAL-logging, each page must start with a standard
>> > 24-byte PageHeaderData, even if it is needless for the storage itself.
>> > Not a big deal though. Another (actually documented) WAL-related
>> > limitation is that only generic WAL can be used within an extension. So
>> > unless inserts are made in bulk, it's going to require a lot of disk
>> > space to accommodate the logs and wide bandwidth for replication.
>>
>> Not sure what to suggest.  Either you should ignore this problem, or
>> you should fix it.
>
>I think if it becomes a problem you should ask for an rmgr ID to use
>for your extension, which we encode and then allow setting the relevant
>rmgr callbacks for that rmgr ID at startup.  But you should obviously
>first develop the WAL logging etc., and make sure it's beneficial over
>generic WAL logging for your case.
>

AFAIK compressed/columnar engines generally implement two types of
storage - a write-optimized store (WOS) and a read-optimized store
(ROS), where the WOS is mostly just an uncompressed append-only buffer
and the ROS is compressed etc. ISTM the WOS would benefit from more
elaborate WAL logging, but the ROS should be mostly fine with generic
WAL logging.

But yeah, we should test and measure how beneficial that actually is.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Compressed pluggable storage experiments

From: Natarajan R
Hi all, this is a continuation of the above thread.

>> > 4. In order to use WAL-logging, each page must start with a standard
>> > 24-byte PageHeaderData, even if it is needless for the storage itself.
>> > Not a big deal though. Another (actually documented) WAL-related
>> > limitation is that only generic WAL can be used within an extension. So
>> > unless inserts are made in bulk, it's going to require a lot of disk
>> > space to accommodate the logs and wide bandwidth for replication.
>>
>> Not sure what to suggest.  Either you should ignore this problem, or
>> you should fix it.

I am working in an environment similar to the above extension
(pg_cryogen, which experiments with the pluggable storage API), but I
don't have much knowledge of Postgres' logical replication.
Please suggest some approaches to supporting logical replication for a
table with a custom access method that writes generic WAL records.

On Wed, 17 Aug 2022 at 19:04, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: