Re: All Taxi Services need Index Clustered Heap Append - Mailing list pgsql-hackers

From Darafei "Komяpa" Praliaskouski
Subject Re: All Taxi Services need Index Clustered Heap Append
Msg-id CAC8Q8tKKbkhGVJGvnrrq7HGhhBSuFqZL7zOM29Hi0S4BLBvP6Q@mail.gmail.com
In response to Re: All Taxi Services need Index Clustered Heap Append  (Ants Aasma <ants.aasma@eesti.ee>)
List pgsql-hackers


On Mon, Mar 5, 2018 at 19:48, Ants Aasma <ants.aasma@eesti.ee> wrote:
On Mon, Mar 5, 2018 at 2:11 PM, Darafei "Komяpa" Praliaskouski
<me@komzpa.net> wrote:
>> This approach mixes well with hash
>> partitioning. It would be neat indeed if PostgreSQL did something
>> equivalent on its own, and pluggable storage work being done could
>> enable index organized tables that would help. But you probably need
>> something right now.
>
>
> Fixing glaring issues (no vacuum and thus no Index-Only Scan on append-only
> tables, vacuum processing all of the eternity of btree) by 11 will get most
> of the spike-nails out of the microservice code, and we can probably live with
> them until 11 gets to RDS.
>
> I also don't see why a pluggable storage is a must for the clustered write.
> Postgres does have a mechanism for selecting the next page to write tuple
> to, right now it's just looking at FSM - but what if it just peeked at
> existing index that already has enough data to route the tuple to the correct
> page on write?

The mechanism you outlined would likely work for your use case, but it
has many issues that prevent it from being universally useful. Off
the top of my head:

* One extra index descent per insertion (I/O for this is necessary
anyway, but CPU work is duplicated).

This part is going to be needed for any version of such a mechanism.
It would only be enabled for CLUSTERed tables, so most people won't be affected.
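
To be concrete about what "CLUSTERed tables" means here: tables that already have a clustering index recorded in the catalog (pg_index.indisclustered). A rough way to list them, purely for illustration:

-- Tables that opted in via CLUSTER ... USING ...;
-- pg_index.indisclustered marks the chosen clustering index
-- and could gate the new insert behaviour.
SELECT c.relname AS table_name, i.relname AS index_name
FROM pg_index x
JOIN pg_class c ON c.oid = x.indrelid
JOIN pg_class i ON i.oid = x.indexrelid
WHERE x.indisclustered;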
 
* We don't currently track the amount of bloat. A mechanism that does
this needs to be added.

We've got the number of tuples and the number of pages for a relation. We know we're degraded when the number of tuples approaches the number of pages. That seems sufficient for a good-enough cutoff heuristic.
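
Something along these lines, using the statistics we already maintain (just a sketch; the table name is made up and the cutoff would need tuning):

-- reltuples and relpages are the estimates refreshed by VACUUM/ANALYZE;
-- density falling toward one tuple per page means the clustered
-- ordering has degraded badly.
SELECT relname,
       reltuples,
       relpages,
       reltuples / NULLIF(relpages, 0) AS tuples_per_page
FROM pg_class
WHERE relname = 'tracks';  -- hypothetical table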
 
* If table hits the bloat limit there will be a sudden change in
behavior. This is pretty nasty from an operations point of view.

We can omit the bloat limit for a PoC implementation and see if that is enough. It may turn out that there are other ways to calculate sibling pages that handle both ascending and descending insertion.
 
* With your (id,ts) clustering and data coming in mostly ordered by
timestamp, after initial warmup, each page will contain rows from a
single id, but different ids are arbitrarily interleaved. This is
better than current state, but people might want to have an
interleaving step bigger than 8kB to better utilize storage hardware.

8kB is what a btree index offers in clustered fashion, and people are okay with that, aren't they?

When thinking about this I thought: "OK, how can we make the current heap behave like a perfectly-clustered btree, given that we've got a perfectly-clustered btree nearby, but cannot use a split operation since we cannot move a tuple between pages?" This was the best I could come up with for now. Do you have a better idea? :)
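
To make the setup concrete, this is roughly the shape of the table and of the "perfectly-clustered btree nearby" I have in mind (names made up for illustration):

-- A track table kept clustered by (id, ts).
CREATE TABLE tracks (
    id  bigint      NOT NULL,  -- vehicle identifier
    ts  timestamptz NOT NULL,  -- position timestamp
    pos point       NOT NULL
);

CREATE INDEX tracks_id_ts_idx ON tracks (id, ts);

-- Today this only matters for the one-shot CLUSTER command; the idea
-- is that inserts could also peek at this index to pick a heap page
-- that already holds neighbouring (id, ts) keys.
CLUSTER tracks USING tracks_id_ts_idx;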
 
* It seems that with a common (ts) clustering and age of timestamp
coming from an exponential distribution, this will quickly bloat to
threshold and then insert data in a rather arbitrary order. This is
much worse than the default behavior.

Can you please explain how that would happen? I expect such a process to quickly start reusing previous pages and never reach the threshold, as a page with enough space to hold the incoming tuple will be among the pages obtained from the index. Clustering will likely not be perfect, but it will reorder at least part of the tuples.

Please note that every tuple inserted into the heap via the FSM is going to drag the tuples following it into that page, until the limiting mechanism kicks in because there is no space left on that page.
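
For reference, the per-page free space the current insert path consults can be inspected with the pg_freespacemap contrib module (again just a sketch, with a made-up table name):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- Free space per block as recorded in the FSM; this is the only
-- signal the current page-selection logic uses.
SELECT * FROM pg_freespace('tracks') WHERE avail > 0 LIMIT 10;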

Providing an out-of-the-box solution in core PostgreSQL would of
course be best, but 

Let's stop right here. We need an out-of-the-box solution in core; that would be best, period. I'll happily discuss out-of-Postgres solutions on an out-of-Postgres list :)

Darafei Praliaskouski, 
GIS Engineer / Juno Minsk
