Thread: Preventing free space from being reused

Preventing free space from being reused

From: Noah Bergbauer
Hello,

I am working on a system to store and analyze JSON-based log data. The idea is to store tuples with a monotonically increasing timestamp (i.e. CURRENT_TIMESTAMP) in a table and run some queries that focus on specific time windows (~one hour). It seems to me that a BRIN index is designed exactly for this type of application.
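For concreteness, the setup looks roughly like this (the table and column names here are only illustrative):

    CREATE TABLE logs (
        ts   timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
        data jsonb NOT NULL
    );

    -- summarize ts per 128-page range (the default pages_per_range)
    CREATE INDEX logs_ts_brin ON logs USING brin (ts);

    -- a typical time-window query
    SELECT * FROM logs
    WHERE ts > CURRENT_TIMESTAMP - INTERVAL '1 hour';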

However, I find that a BRIN index with standard settings (pages_per_range = 128) and a very simple filter (e.g. ts > (CURRENT_TIMESTAMP - INTERVAL '1 hour')) causes ~20x more pages to be scanned than I would expect. A majority of these tuples are then discarded by the index recheck. Inspecting the BRIN pages using the pageinspect extension reveals the problem: if a page is left with some free space that cannot be filled right away (because the tuples being inserted are too large), that hole can be filled at some arbitrary later point in time (sometimes hours later) once a small enough tuple comes along. This substantially reduces the BRIN index's effectiveness.
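For reference, this is roughly how I looked at the range summaries and at the recheck overhead (index and table names again illustrative; in a small BRIN index block 0 is the metapage and block 1 the first revmap page, so the regular summary pages start at block 2):

    -- block range summaries stored in one BRIN page
    SELECT blknum, value
    FROM brin_page_items(get_raw_page('logs_ts_brin', 2), 'logs_ts_brin')
    ORDER BY blknum;

    -- the wasted work shows up as "Rows Removed by Index Recheck"
    -- in the Bitmap Heap Scan node
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*)
    FROM logs
    WHERE ts > CURRENT_TIMESTAMP - INTERVAL '1 hour';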

I confirmed this theory by CLUSTERing the table using a temporary btree index. Suddenly the query performance exactly matched my estimates.
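Roughly what I did (index name illustrative):

    CREATE INDEX logs_ts_btree ON logs (ts);   -- temporary ordering index
    CLUSTER logs USING logs_ts_btree;          -- rewrite the heap in ts order
    DROP INDEX logs_ts_btree;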

The JSON data is often structurally similar, so the table is stored on ZFS with compression. Filling these holes therefore brings no benefit: thanks to the compression they were taking up no space anyway. On the other hand, rewriting old pages in order to fill up holes creates a lot of fragmentation because of ZFS's copy-on-write semantics.

In summary, all of these problems would be solved if there were some way to stop the database system from ever reusing free space.

Bonus question: what's the best TOAST configuration for this use case? Is there any point in out-of-line storage when all tuples are going to be quite large (i.e. > 1 kB)? And is there any benefit in having PostgreSQL compress the data when ZFS runs compression regardless?
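To make the question concrete, the kind of knob I mean is the per-column storage mode (names illustrative):

    -- EXTERNAL = allow out-of-line storage but skip Postgres' own compression;
    -- EXTENDED (the default) compresses first and only then moves data out of line
    ALTER TABLE logs ALTER COLUMN data SET STORAGE EXTERNAL;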

Thank you,
Noah Bergbauer

Re: Preventing free space from being reused

From: Michael Lewis
What version are you using? What is your usage pattern for inserts/updates/deletes? If the JSON data sometimes gets too big and is moved from in-line storage to TOAST, that would be opening up gaps. Or if you are doing deletes. Perhaps adjusting your fillfactor and/or TOAST_TUPLE_TARGET would influence that behavior in the direction you want to go. As best I understand, though, if you are doing deletes you won't be able to prevent those gaps from showing up and being re-used. I don't believe there is any way to make Postgres treat tuples and pages as append-only.
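Both are ordinary per-table storage parameters, e.g. (table name and values purely illustrative):

    -- fillfactor affects only future inserts; toast_tuple_target can go as low as 128
    ALTER TABLE logs SET (fillfactor = 90, toast_tuple_target = 256);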

Re: Preventing free space from being reused

From: Noah Bergbauer
This is on version 12.5. The usage pattern is inserts only, no updates or deletes at all. Hence, fillfactor is set to 100.

It just seems extremely unfortunate in this particular case that Postgres goes to all the trouble of tetris-ing new tuples into existing pages, only to cripple performance in two different ways (BRIN effectiveness and disk fragmentation). If there is no workaround for this problem, then I believe an option to simply disable the free space map on a per-relation basis should be added.

On Fri, Feb 12, 2021 at 4:58 PM Michael Lewis <mlewis@entrata.com> wrote:
> What version are you using? What is your usage pattern for inserts/updates/deletes? If the JSON data sometimes gets too big and is moved from in-line storage to TOAST, that would be opening up gaps. Or if you are doing deletes. Perhaps adjusting your fillfactor and/or TOAST_TUPLE_TARGET would influence that behavior in the direction you want to go. As best I understand, though, if you are doing deletes you won't be able to prevent those gaps from showing up and being re-used. I don't believe there is any way to make Postgres treat tuples and pages as append-only.

Re: Preventing free space from being reused

From: Michael Lewis
If you have no updates or deletes, then I would wonder about setting fillfactor LOWER so that new rows are less likely to find a gap that is acceptable. Depending on how/when you use the JSON, lowering toast_tuple_target may be appropriate to store (nearly?) all of it out of line, making the data stored in the main relation more uniform in size. Are you seeing significant TOAST usage currently?
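One quick way to check (table name illustrative):

    -- pg_table_size includes the TOAST table; pg_relation_size is the main heap only
    SELECT pg_size_pretty(pg_table_size('logs'))    AS table_incl_toast,
           pg_size_pretty(pg_relation_size('logs')) AS main_heap_only;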

Re: Preventing free space from being reused

From: Noah Bergbauer
I'm experimenting with that right now. Setting storage to MAIN appears to be counterproductive, whereas EXTERNAL with toast_tuple_target = 128 is definitely an improvement. In theory, this configuration might eliminate the problem, but due to the toast_tuple_target bug (https://www.postgresql.org/message-id/flat/20190403063759.GF3298%40paquier.xyz) plenty of 1kB tuples are still being stored inline. As a result I'm averaging around 11.5 tuples per page, when it should be >200 (one tuple is 35 bytes when stored out of line). A small test query shows ~7000 tuples removed by index recheck, but based on my statistics only ~1500 would be expected (in the ideal case where tuple disk order matches insertion order).
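The tuples-per-page figure comes from something like this, run after an ANALYZE (table name illustrative; relpages/reltuples are planner estimates):

    SELECT relpages,
           reltuples,
           reltuples / relpages AS avg_tuples_per_page
    FROM pg_class
    WHERE relname = 'logs';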

On the other hand, wouldn't the toast table still run into the disk fragmentation issue? Also, the 4-byte oid may actually become an issue a few months down the road.

I wonder how hard it would be to hack up a table access method that is just a copy of heap where HEAP_INSERT_SKIP_FSM is always set...

On Fri, Feb 12, 2021 at 5:49 PM Michael Lewis <mlewis@entrata.com> wrote:
> If you have no updates or deletes, then I would wonder about setting fillfactor LOWER so that new rows are less likely to find a gap that is acceptable. Depending on how/when you use the JSON, lowering toast_tuple_target may be appropriate to store (nearly?) all of it out of line, making the data stored in the main relation more uniform in size. Are you seeing significant TOAST usage currently?

Re: Preventing free space from being reused

From: Noah Bergbauer
> I wonder how hard it would be to hack up a table access method that is just a copy of heap where HEAP_INSERT_SKIP_FSM is always set...

Update: I think this actually works. It's awful because I just copied the entire builtin heap code into an extension and then renamed a few functions so they don't collide during dynamic linking, but after changing the table's access method to the one from the extension there is no more overlap!

Before:
 {2021-02-12 14:21:24.93711+01 .. 2021-02-12 15:14:28.564695+01}
 {2021-02-12 15:10:22.832827+01 .. 2021-02-12 20:45:34.918054+01}
 {2021-02-12 15:50:50.768549+01 .. 2021-02-12 20:51:20.487791+01}
 {2021-02-12 16:25:01.224667+01 .. 2021-02-12 17:07:31.95343+01}
 {2021-02-12 16:51:30.007498+01 .. 2021-02-12 18:15:42.377372+01}
 {2021-02-12 17:30:17.943716+01 .. 2021-02-12 18:55:00.030094+01}
 {2021-02-12 18:08:39.488203+01 .. 2021-02-12 20:55:00.012211+01}
 {2021-02-12 19:05:35.495433+01 .. 2021-02-12 20:20:00.088014+01}

After:
 {2021-02-12 20:00:32.61514+01 .. 2021-02-12 20:45:23.617548+01}
 {2021-02-12 20:45:23.617548+01 .. 2021-02-12 20:51:05.098479+01}
 {2021-02-12 20:51:05.219331+01 .. 2021-02-12 20:57:56.93961+01}
 {2021-02-12 20:57:57.000953+01 .. 2021-02-12 21:02:10.245561+01}
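For completeness, this is roughly how the switch looks on the SQL side (the extension and access method names below are made up for illustration; on 12.5 there is no ALTER TABLE ... SET ACCESS METHOD, so the existing data has to be copied into a new table that uses the custom AM):

    CREATE EXTENSION appendheap;  -- hypothetical name for the copied heap AM
    CREATE TABLE logs_new (LIKE logs INCLUDING ALL) USING appendheap;
    INSERT INTO logs_new SELECT * FROM logs ORDER BY ts;
    -- then rename/swap the tables once the copy is verified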




On Fri, Feb 12, 2021 at 6:38 PM Noah Bergbauer <noah@statshelix.com> wrote:
> I'm experimenting with that right now. Setting storage to MAIN appears to be counterproductive, whereas EXTERNAL with toast_tuple_target = 128 is definitely an improvement. In theory, this configuration might eliminate the problem, but due to the toast_tuple_target bug (https://www.postgresql.org/message-id/flat/20190403063759.GF3298%40paquier.xyz) plenty of 1kB tuples are still being stored inline. As a result I'm averaging around 11.5 tuples per page, when it should be >200 (one tuple is 35 bytes when stored out of line). A small test query shows ~7000 tuples removed by index recheck, but based on my statistics only ~1500 would be expected (in the ideal case where tuple disk order matches insertion order).
>
> On the other hand, wouldn't the toast table still run into the disk fragmentation issue? Also, the 4-byte oid may actually become an issue a few months down the road.
>
> I wonder how hard it would be to hack up a table access method that is just a copy of heap where HEAP_INSERT_SKIP_FSM is always set...
>
> On Fri, Feb 12, 2021 at 5:49 PM Michael Lewis <mlewis@entrata.com> wrote:
>> If you have no updates or deletes, then I would wonder about setting fillfactor LOWER so that new rows are less likely to find a gap that is acceptable. Depending on how/when you use the JSON, lowering toast_tuple_target may be appropriate to store (nearly?) all of it out of line, making the data stored in the main relation more uniform in size. Are you seeing significant TOAST usage currently?