All Taxi Services need Index Clustered Heap Append - Mailing list pgsql-hackers

From Darafei "Komяpa" Praliaskouski
Subject All Taxi Services need Index Clustered Heap Append
Date
Msg-id CAC8Q8tLBeAxR+BXWuKK+HP5m8tEVYn270CVrDvKXt=0PkJTY9g@mail.gmail.com
Whole thread Raw
Responses Re: All Taxi Services need Index Clustered Heap Append  (David Rowley <david.rowley@2ndquadrant.com>)
Re: All Taxi Services need Index Clustered Heap Append  (Ants Aasma <ants.aasma@eesti.ee>)
Re: All Taxi Services need Index Clustered Heap Append  (Aleksander Alekseev <a.alekseev@postgrespro.ru>)
Re: All Taxi Services need Index Clustered Heap Append  (Craig Ringer <craig@2ndquadrant.com>)
Re: All Taxi Services need Index Clustered Heap Append  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Re: All Taxi Services need Index Clustered Heap Append  (stalkthetiger <stalkthetiger@gmail.com>)
Re: All Taxi Services need Index Clustered Heap Append  (fedor <kazansecure@ya.ru>)
List pgsql-hackers
Hi,

I work at a ride sharing company and we found a simple scenario that Postgres has a lot to improve at.
After my talk at pgconf.ru Alexander Korotkov encouraged me to share my story and thoughts in -hackers.

Story setting:

 - Amazon RDS (thus vanilla unpatchable postgres), synchronuous_commit=off (we're ok to lose some seconds on crash).

 - Amazon gp2 storage. It's almost like SSD for small spikes, but has IOPS limit if you're doing a lot of operations.

 - Drivers driving around the city and sending their GPS positions each second, for simplicity - 50k of them at a time.

 - An append-only table of shape (id uuid, ts timestamp, geom geometry, heading speed accuracy float, source text).
A btree index on (id, ts).

 - A service that gets the measurements on the network, batches all into a buffer of 1 second (~N=50k rows) and inserting via COPY.

After someone orders and completes a trip, we have the id of the driver and trip time interval coming from another service, and want to get trip's route to calculate the bill. Trip times are from seconds up to 4 hours (N=4*3600=14400 rows, a page typically has 60 rows).

select * from positions where id = :id and ts between :start_ts and :end_ts;

Data for more than 24 hours ago is not needed in this service, thus a storage of 70 gb should be enough. On the safe side we're giving it 500 gb, which on gp2 gives a steady 1500 iops.

In development (on synthetic data) plans use index and look great, so we proceed with Postgres and not MySQL. :)

When deployed to production we figured out that a single query for interval of more than half an hour (1800+ rows) can exhaust all the IOPS.
Data is appended with increasing time field, which effectively ensures no rows from the same driver are ever going to be in the same heap page. A 4 hour long request can degrade system for 10 seconds. gp2 provides max 10000 IOPS, and we need to get 14400 pages then. We need the biggest available gp2 offer just to read 2 megabytes of data in 1.5 seconds. The best io-optimized io1 volumes provide 32000 IOPS, which get us as low as 500 ms.

If the data was perfectly packed, it would be just 240 8k pages and translate to 120 input operations of gp2's 16kb blocks. 

Our options were:

 - partitioning. Not entirely trivial when your id is uuid. To get visible gains, we need to make sure each driver gets their own partition. That would leave us with 50 000(+) tables, and rumors say that in that's what is done in some bigger taxi service, and relcache then eats up all the RAM and system OOMs.

 - CLUSTER table using index. Works perfect on test stand, isn't available as online option.
 
 - Postgres Pro suggested CREATE INDEX .. INCLUDE (on commitfest https://commitfest.postgresql.org/15/1350/). We can't use that as it's not in upstream/amazon Postgres yet.

 - We decided to live with overhead of unnecessary sorting by all fields and keeping a copy of heap and created a btree over all the fields to utilize Index-Only Scans:

  * Testing went well on dump of production database.

  * After we've made indexes on production, we found out that performance is _worse_ than with simpler index.

  * EXPLAIN (BUFFERS) revealed that Visibility Map is never being frozen, as autovacuum ignores append-only never-updated never-deleted table that is only truncated once a day. No way to force autovacuum on such table exists.

  * We created a microservice (hard to find spot for crontab in distributed system!) that periodically agressively runs VACUUM on the table. 
It indeed helped with queries, but.. VACUUM skips all-visible pages in index, but always walks over all the pages of btree, which is even larger than the heap in our case.
There is a patch to repair this behavior on commitfest https://commitfest.postgresql.org/16/952/ - but not yet in upstream/amazon Postgres.

  * We ended up inventing partitioning schema that rotates a table every gigabyte of data, to keep VACUUM run time low. There are a hundreds partitions with indexes on all the fields. Finally the system is stable.

  * EXPLAIN (BUFFERS) shows later that each reading query visits all the indexes of partitioned table and fetches a page from index to know there are 0 rows there. To prune obviously unneeded partitions we decided to add constraint on timestamp after a partition is finalized. Timestamps are sanitized due to mobile network instability are stamped on the client side, so we don't know the bounds in advance. Unfortunately that means we need two seq scans to do it: first one to get min(ts), max(ts), and second one on ALTER TABLE ADD CONSTRAINT. This operation also eats up iops.

We are not very large company but we bump into too many scalability issues on this path already. Searching for solutions on every step shows other people with tables named like "gpsdata" and "traces", so we're not alone with this problem. :)

I gave this all some thought and it looks like it all could have not happened if Postgres was able to cluster heap insertions by (id, ts) index. We're ok with synchronuous_commit=off, so amplified write won't immediately hit disk and can get cooled down in progress. Clustering doesn't require perfect sorting: we need to minimize number of pages fetched, it's ok if the pages are not consecutive on disk.

I see the option for clustered heap writes as follows:

 - on tuple insertion, check if table is clustered by some index;

 - if table is clustered, we start writing not into last used page, but instead go into index and get page numbers for index tuples that are less or equal than current one, up to index page boundary (or some other exit strategy, at least one heap page is needed);

 - if we can fit tuple into that page, let it be written there;

 - if we cannot fit it, consult statistics to see if we have too many empty space in pages. If we have more than 50% space empty, get pages by FSM. Create a new page otherwise.
(looking into FSM as safety measure for people who specifically insert tuples in backward sorted order).

This would not require more space than is currently required to keep vacuumed all-fields index, and let us omit VACUUM passes over relations. A pencil-and-paper simulation shows it as a good thing. Do you have thoughts, or can you help with implementation, or tell why it would be a bad idea and we need to follow our competitor in moving away to other RDBMS? :)

I encourage everyone to help with at least https://commitfest.postgresql.org/16/952/ as no good SQL-level workaround exists for it. After that enabling autovacuum for tables that have only inserted and no deleted tuples would be cool to have, so that we have them marked as Visible.

Hackers, can you help me keep Postgres in the system? :)

Darafei Praliaskouski, 
GIS Engineer / Juno Minsk

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Online enabling of checksums
Next
From: Robert Haas
Date:
Subject: Re: Online enabling of checksums