Thread: All Taxi Services need Index Clustered Heap Append
Hi,
I work at a ride sharing company and we found a simple scenario that Postgres has a lot to improve at.
After my talk at pgconf.ru Alexander Korotkov encouraged me to share my story and thoughts in -hackers.
Story setting:
- Amazon RDS (thus vanilla unpatchable postgres), synchronous_commit=off (we're OK with losing a few seconds of data on crash).
- Amazon gp2 storage. It's almost like SSD for small spikes, but has IOPS limit if you're doing a lot of operations.
- Drivers driving around the city and sending their GPS positions each second, for simplicity - 50k of them at a time.
- An append-only table of shape (id uuid, ts timestamp, geom geometry, heading speed accuracy float, source text).
A btree index on (id, ts).
- A service that receives the measurements over the network, batches everything into a 1-second buffer (~N=50k rows) and inserts it via COPY.
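For concreteness, here is a sketch of the schema as described above (assuming PostGIS for the geometry type; the table name matches the query below):

CREATE EXTENSION IF NOT EXISTS postgis;

CREATE TABLE positions (
    id       uuid,        -- driver id
    ts       timestamp,   -- client-side timestamp, one fix per second
    geom     geometry,    -- GPS position
    heading  float,
    speed    float,
    accuracy float,
    source   text
);

CREATE INDEX positions_id_ts_idx ON positions (id, ts);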
After someone orders and completes a trip, we have the id of the driver and the trip time interval coming from another service, and want to get the trip's route to calculate the bill. Trip times range from seconds up to 4 hours (N = 4*3600 = 14400 rows; a page typically holds 60 rows).
select * from positions where id = :id and ts between :start_ts and :end_ts;
Data older than 24 hours is not needed in this service, so 70 GB of storage should be enough. To be on the safe side we're giving it 500 GB, which on gp2 gives a steady 1500 IOPS.
In development (on synthetic data) plans use index and look great, so we proceed with Postgres and not MySQL. :)
When deployed to production we found that a single query for an interval of more than half an hour (1800+ rows) could exhaust all the IOPS.
Data is appended with an increasing time field, which effectively ensures that no two rows from the same driver ever land in the same heap page. A 4-hour request can degrade the system for 10 seconds. gp2 provides at most 10000 IOPS, and we then need to fetch 14400 pages. We need the biggest available gp2 offering just to read 2 megabytes of data in 1.5 seconds. The best io-optimized io1 volumes provide 32000 IOPS, which gets us down to about 500 ms.
If the data were perfectly packed, it would be just 240 8 kB pages, translating into 120 input operations of gp2's 16 kB blocks.
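Spelled out, the arithmetic is: the worst case is 14400 rows x 1 heap page each = 14400 random reads, which at 10000 IOPS takes ~1.44 s; perfectly packed, it is 14400 rows / 60 rows per page = 240 pages x 8 kB = 1.92 MB, i.e. 120 reads of gp2's 16 kB blocks.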
Our options were:
- partitioning. Not entirely trivial when your id is uuid. To get visible gains, we need to make sure each driver gets their own partition. That would leave us with 50 000(+) tables, and rumors say that's what is done in some bigger taxi service, and relcache then eats up all the RAM and the system OOMs.
- CLUSTER table USING index. Works perfectly on a test stand, but isn't available as an online option.
- Postgres Pro suggested CREATE INDEX .. INCLUDE (on commitfest https://commitfest.postgresql.org/15/1350/). We can't use that as it's not in upstream/amazon Postgres yet.
- We decided to live with the overhead of unnecessary sorting by all fields and of keeping a copy of the heap, and created a btree over all the fields to utilize Index-Only Scans:
* Testing went well on a dump of the production database.
* After we built the indexes on production, we found out that performance was _worse_ than with the simpler index.
* EXPLAIN (BUFFERS) revealed that the Visibility Map is never updated, as autovacuum ignores an append-only, never-updated, never-deleted table that is only truncated once a day. No way to force autovacuum on such a table exists.
* We created a microservice (hard to find a spot for crontab in a distributed system!) that periodically and aggressively runs VACUUM on the table.
It indeed helped with queries, but... VACUUM skips all-visible pages in the heap, yet always walks over all the pages of the btree, which in our case is even larger than the heap.
There is a patch to repair this behavior on commitfest https://commitfest.postgresql.org/16/952/ - but not yet in upstream/amazon Postgres.
* We ended up inventing a partitioning scheme that rotates a table every gigabyte of data, to keep VACUUM run time low. There are hundreds of partitions with indexes on all the fields. Finally the system is stable.
* EXPLAIN (BUFFERS) later showed that each reading query visits all the indexes of the partitioned table and fetches a page from each index only to learn that there are 0 rows there. To prune obviously unneeded partitions, we decided to add a constraint on the timestamp after a partition is finalized. Timestamps, sanitized because of mobile network instability, are stamped on the client side, so we don't know the bounds in advance. Unfortunately that means we need two seq scans to do it: a first one to get min(ts) and max(ts), and a second one for ALTER TABLE ADD CONSTRAINT. This operation also eats up IOPS.
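Sketched in SQL, that finalization step looks roughly like this (partition and constraint names are illustrative; :min_ts and :max_ts come from the first query):

-- first seq scan: learn the actual bounds of the finalized partition
SELECT min(ts), max(ts) FROM positions_p42;

-- second seq scan: ALTER TABLE re-reads the table to validate the constraint
ALTER TABLE positions_p42
    ADD CONSTRAINT positions_p42_ts_check
    CHECK (ts >= :min_ts AND ts <= :max_ts);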
We are not a very large company, but we have bumped into too many scalability issues on this path already. Searching for solutions at every step turns up other people with tables named things like "gpsdata" and "traces", so we're not alone with this problem. :)
I gave this all some thought and it looks like it all could have not happened if Postgres was able to cluster heap insertions by (id, ts) index. We're OK with synchronous_commit=off, so amplified write won't immediately hit disk and can get cooled down in progress. Clustering doesn't require perfect sorting: we need to minimize the number of pages fetched, it's OK if the pages are not consecutive on disk.
I see the option for clustered heap writes as follows:
- on tuple insertion, check if the table is clustered by some index;
- if the table is clustered, we start writing not into the last used page but instead go into the index and get the page numbers for index tuples that are less than or equal to the current one, up to an index page boundary (or some other exit strategy; at least one heap page is needed);
- if we can fit the tuple into such a page, let it be written there;
- if we cannot fit it, consult the statistics to see whether we have too much empty space in the pages. If more than 50% of the space is empty, get pages via the FSM; otherwise create a new page
(looking into the FSM is a safety measure for people who specifically insert tuples in backward-sorted order).
This would not require more space than is currently needed to keep the vacuumed all-fields index, and would let us omit VACUUM passes over relations. A pencil-and-paper simulation shows it to be a good thing. Do you have thoughts, can you help with the implementation, or can you tell us why it would be a bad idea and we need to follow our competitor in moving away to another RDBMS? :)
I encourage everyone to help with at least https://commitfest.postgresql.org/16/952/ as no good SQL-level workaround exists for it. After that, enabling autovacuum for tables that have only inserted and no deleted tuples would be cool to have, so that their pages get marked as visible.
Hackers, can you help me keep Postgres in the system? :)
Darafei Praliaskouski,
GIS Engineer / Juno Minsk
On 3 March 2018 at 05:30, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
> Our options were:
>
> - partitioning. Not entirely trivial when your id is uuid. To get visible
> gains, we need to make sure each driver gets their own partition. That
> would leave us with 50 000(+) tables, and rumors say that's what is done
> in some bigger taxi service, and relcache then eats up all the RAM and
> the system OOMs.

It's a good job someone invented HASH partitioning then.

It would be interesting to hear how your benchmarks go using current
master + the faster partition pruning patchset [1]. Currently, HASH
partitioning does exist in master, just there's no partition pruning
for the non-matching partitions, which is why you need [1].

I think trying with something like 500-1000 partitions might be a good
place to start.

[1] https://www.postgresql.org/message-id/0f96dd16-f5d5-7301-4ddf-858d41a6cbe3@lab.ntt.co.jp

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
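For reference, a minimal sketch of what David suggests, in the PostgreSQL 11-era syntax (the modulus and all names are illustrative):

CREATE TABLE positions (
    id uuid, ts timestamp, geom geometry,
    heading float, speed float, accuracy float, source text
) PARTITION BY HASH (id);

-- one partition per remainder; 512 is an example modulus
CREATE TABLE positions_h0 PARTITION OF positions
    FOR VALUES WITH (MODULUS 512, REMAINDER 0);
-- ... repeated for remainders 1 through 511

CREATE INDEX ON positions (id, ts);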
On Fri, Mar 2, 2018 at 6:30 PM, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
> I gave this all some thought and it looks like it all could have not
> happened if Postgres was able to cluster heap insertions by (id, ts) index.
> We're OK with synchronous_commit=off, so amplified write won't immediately
> hit disk and can get cooled down in progress. Clustering doesn't require
> perfect sorting: we need to minimize the number of pages fetched, it's OK
> if the pages are not consecutive on disk.

Data locality is indeed the key here, specifically for non-cached data.

It is possible to manually implement some approximation of clustering
at the SQL level with current PostgreSQL features. Insert incoming data
into new data partitions, and have a background job swap the input to a
new partition and then insert the data from the previous "new data"
partition into main storage, sorting it by vehicle in the process. If
you do this every few minutes or so, you should be able to tune the
system so that the new partition data isn't even written to disk; you
only have to pay the cost of double WAL for the insertion and the CPU
work to perform the move. This approach mixes well with hash
partitioning. It would be neat indeed if PostgreSQL did something
equivalent on its own, and pluggable storage work being done could
enable index organized tables that would help. But you probably need
something right now.

I guess I don't have to tell you that it looks like your needs have
outgrown what RDS works well with and you are in for a painful move
sooner or later.

Regards,
Ants Aasma
--
+43-670-6056265
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
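A rough SQL sketch of the buffer-swap job Ants describes (all table names assumed):

-- swap the ingest target so COPY keeps landing in a fresh, empty table
BEGIN;
ALTER TABLE positions_incoming RENAME TO positions_flush;
CREATE TABLE positions_incoming (LIKE positions_main INCLUDING ALL);
COMMIT;

-- move the buffered rows into main storage, sorted by vehicle
INSERT INTO positions_main
SELECT * FROM positions_flush ORDER BY id, ts;

DROP TABLE positions_flush;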
On Sat, Mar 3, 2018 at 4:53 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 3 March 2018 at 05:30, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
>> Our options were:
>>
>> - partitioning. Not entirely trivial when your id is uuid. To get visible
>> gains, we need to make sure each driver gets their own partition. That
>> would leave us with 50 000(+) tables, and rumors say that's what is done
>> in some bigger taxi service, and relcache then eats up all the RAM and
>> the system OOMs.
>
> It's a good job someone invented HASH partitioning then.
>
> It would be interesting to hear how your benchmarks go using current
> master + the faster partition pruning patchset [1]. Currently, HASH
> partitioning does exist in master, just there's no partition pruning
> for the non-matching partitions, which is why you need [1].
>
> I think trying with something like 500-1000 partitions might be a good
> place to start.

I don't think that will actually help much. 1000 partitions means each
partition gets data from ~50 vehicles. At 60 tuples per page, each page
in the partitioned table will contain on average 1.2 interesting
tuples. So you still have almost one page read per row.

Regards,
Ants Aasma
--
+43-670-6056265
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
On 4 March 2018 at 23:05, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Sat, Mar 3, 2018 at 4:53 PM, David Rowley
>> It's a good job someone invented HASH partitioning then.
>>
>> It would be interesting to hear how your benchmarks go using current
>> master + the faster partition pruning patchset [1]. Currently, HASH
>> partitioning does exist in master, just there's no partition pruning
>> for the non-matching partitions, which is why you need [1].
>>
>> I think trying with something like 500-1000 partitions might be a good
>> place to start.
>
> I don't think that will actually help much. 1000 partitions means each
> partition gets data from ~50 vehicles. At 60 tuples per page, each page
> in the partitioned table will contain on average 1.2 interesting
> tuples. So you still have almost one page read per row.

hmm, I missed that part about only 60 tuples per page.

It may be worth an experiment with two tables: one to hold the day's
worth of data, and a holding table which stores about 1-2 minutes of
data. Each minute or two the holding table could be flushed like:

WITH del AS (DELETE FROM driver_pos_holding RETURNING *)
INSERT INTO driver_pos SELECT * FROM del ORDER BY id, ts;

then perhaps a manual VACUUM of driver_pos_holding... or leave it up to
auto-vacuum.

Both tables could be inherited by a single parent to allow queries to
return all rows, or be wrapped up in a UNION ALL view, although an
inherited table should provide better plans than the view in some
cases. Using an inherited parent would, however, prevent you from using
partitioning if you ever wanted to partition by ts to make the job of
removing old data easier.

Hopefully having 60-120 seconds of driver data in the holding table
will mean that the tuples for each driver only span 2-3 pages for that
1-2 minute period in the main table. You might then have not much more
than 240 pages to load for a driver after a 4-hour run.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
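A sketch of the two-table arrangement in the UNION ALL view variant (driver_pos and driver_pos_holding as in David's example; the column list is assumed):

CREATE TABLE driver_pos (
    id uuid, ts timestamp, geom geometry,
    heading float, speed float, accuracy float, source text
);
CREATE TABLE driver_pos_holding (LIKE driver_pos INCLUDING ALL);

-- readers see both the settled and the in-flight rows
CREATE VIEW driver_pos_all AS
SELECT * FROM driver_pos
UNION ALL
SELECT * FROM driver_pos_holding;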
Hello Darafei,

> After my talk at pgconf.ru Alexander Korotkov encouraged me to share my
> story and thoughts in -hackers.
> [...]
> - An append-only table of shape (id uuid, ts timestamp, geom geometry,
> heading speed accuracy float, source text).
> A btree index on (id, ts).
> [...]
> select * from positions where id = :id and ts between :start_ts and :end_ts;

Thank you for sharing the detailed description of the problem!

Frankly, considering your requirements, including the one regarding data
durability and consistency, I would advise writing a microservice that
stores data in a non-transactional, append-only, one-file-per-driver-id
fashion. For serializing the data I would advise using something like
Thrift or Protobuf. It would also be a good idea to store a CRC of every
record. It will take something like one workday of any Go/Java developer
and will solve your problem entirely.

Granted, PostgreSQL is a powerful database. But knowing its development
process and current resource limitations (for instance, the number of
people willing to do code review), objectively it will take a few years
before this problem is solved in the vanilla code, plus some time before
AWS deploys it. (Assuming there is a PostgreSQL core developer interested
in this particular task and he or she has time to spare.) It is my
understanding that waiting a few years is not an option in your case.
This is why I'm proposing an alternative solution.

--
Best regards,
Aleksander Alekseev
> This approach mixes well with hash
> partitioning. It would be neat indeed if PostgreSQL did something
> equivalent on its own, and pluggable storage work being done could
> enable index organized tables that would help. But you probably need
> something right now.
Fixing the glaring issues (no vacuum and thus no Index-Only Scan on append-only tables, vacuum processing all of the eternity of btree) by 11 will get most of the spike-nails out of the microservice code, and we can probably live with them until 11 gets to RDS.

I also don't see why pluggable storage is a must for the clustered write. Postgres does have a mechanism for selecting the next page to write a tuple to; right now it just looks at the FSM - but what if it also peeked at an existing index that already has enough data to route the tuple to the correct page on write?
> I guess I don't have to tell you that it looks like your needs have
> outgrown what RDS works well with and you are in for a painful move
> sooner or later.
Painful move where to? If we just run a Postgres instance without RDS, we'll get the pain of setting up Postgres, replication, backups and autofailover, with no visible gain except if we get some private / unaccepted patches applied to it. If we can get these things right upstream, why would we want to switch?

Per my colleagues, MySQL offers clustered indexes, and MySQL is available on RDS without the need for a "painful move": it is doable by writing to two locations for a day and then pointing readers to the new DB. But if we can instead do no move and be sure the issues are gone upstream before we hit the limit of the spike-nails we're running on currently, wouldn't that be better? :)
Darafei Praliaskouski,
GIS Engineer / Juno Minsk
On Mon, Mar 5, 2018 at 2:11 PM, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
>> This approach mixes well with hash
>> partitioning. It would be neat indeed if PostgreSQL did something
>> equivalent on its own, and pluggable storage work being done could
>> enable index organized tables that would help. But you probably need
>> something right now.
>
> Fixing the glaring issues (no vacuum and thus no Index-Only Scan on
> append-only tables, vacuum processing all of the eternity of btree) by 11
> will get most of the spike-nails out of the microservice code, and we can
> probably live with them until 11 gets to RDS.
>
> I also don't see why pluggable storage is a must for the clustered write.
> Postgres does have a mechanism for selecting the next page to write a
> tuple to; right now it just looks at the FSM - but what if it also peeked
> at an existing index that already has enough data to route the tuple to
> the correct page on write?

The mechanism you outlined would likely work for your use case, but it
has many issues that prevent it from being universally useful. From
the top of my head:

* One extra index descent per insertion (I/O for this is necessary
anyway, but CPU work is duplicated).
* We don't currently track the amount of bloat. A mechanism that does
this needs to be added.
* If the table hits the bloat limit there will be a sudden change in
behavior. This is pretty nasty from an operations point of view.
* With your (id, ts) clustering and data coming in mostly ordered by
timestamp, after initial warmup each page will contain rows from a
single id, but different ids are arbitrarily interleaved. This is
better than the current state, but people might want to have an
interleaving step bigger than 8kB to better utilize storage hardware.
* It seems that with a common (ts) clustering and the age of timestamps
coming from an exponential distribution, this will quickly bloat to the
threshold and then insert data in a rather arbitrary order. This is
much worse than the default behavior.

At least in my opinion these problems make it a special-case
optimization that is hard to justify in core. A decent alternative
would be a plugin mechanism for locating free space for a tuple, where
you can write your own extension to find a suitable location for the row.

>> I guess I don't have to tell you that it looks like your needs have
>> outgrown what RDS works well with and you are in for a painful move
>> sooner or later.
>
> Painful move where to? If we just run a Postgres instance without RDS,
> we'll get the pain of setting up Postgres, replication, backups and
> autofailover, with no visible gain except if we get some private /
> unaccepted patches applied to it. If we can get these things right
> upstream, why would we want to switch?

EC2 for example. Mainly because I3 instances and ephemeral storage
provide an order of magnitude or two of performance improvement while
costing less. Being able to run custom extensions and patches if
necessary is a nice bonus. Yes, setting up replication, autofailover
and backups is extra work that you have to weigh against the benefits.
But don't overestimate the effort - there are some pretty nice tools
available that make a proper cluster relatively simple to set up.

> Per my colleagues, MySQL offers clustered indexes, and MySQL is available
> on RDS without the need for a "painful move": it is doable by writing to
> two locations for a day and then pointing readers to the new DB. But if
> we can instead do no move and be sure the issues are gone upstream before
> we hit the limit of the spike-nails we're running on currently, wouldn't
> that be better? :)

The move off of RDS is painful because getting data out of RDS
involves either downtime or building an ad-hoc logical replication
solution. You need to solve that regardless of where you move to.
Providing an out-of-the-box solution in core PostgreSQL would of
course be best, but realistically you will be waiting at least 2 years
to get it on RDS. In the meanwhile either the buffer partition
approach I described, or a buffering microservice in front of
PostgreSQL like Aleksander recommended, should fix data locality for
you. If you weren't running on RDS I would even propose using Redis as
the buffer, with one key per driver and redis_fdw to make the data
accessible from within PostgreSQL.

Regards,
Ants Aasma
--
+43-670-6056265
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
On 3 March 2018 at 00:30, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
> I gave this all some thought and it looks like it all could have not happened if Postgres was able to cluster heap insertions by (id, ts) index. We're OK with synchronous_commit=off, so amplified write won't immediately hit disk and can get cooled down in progress. Clustering doesn't require perfect sorting: we need to minimize the number of pages fetched, it's OK if the pages are not consecutive on disk.
I'm surprised nobody has mentioned BRIN yet.
Ever since BRIN was introduced, I've thought it would be very interesting to use it + the freespace map for coarse-grained tuple routing in heap inserts. Try to find the best-match range with BRIN and look for free space there. I think there was discussion of this early on, so you may want to look up the BRIN threads.
The main challenge probably comes when a range is exhausted. Your writes would start spilling over into new heap pages and get intermixed again. Some support for pre-allocating new ranges would be needed.
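In DDL terms, the index such routing would consult is simply something like the following sketch (the name and pages_per_range are illustrative):

CREATE INDEX positions_brin_idx ON positions
    USING brin (id, ts) WITH (pages_per_range = 32);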
On Mon, 5 Mar 2018 at 19:48, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Mon, Mar 5, 2018 at 2:11 PM, Darafei "Komяpa" Praliaskouski
> <me@komzpa.net> wrote:
>>> This approach mixes well with hash
>>> partitioning. It would be neat indeed if PostgreSQL did something
>>> equivalent on its own, and pluggable storage work being done could
>>> enable index organized tables that would help. But you probably need
>>> something right now.
>>
>> Fixing the glaring issues (no vacuum and thus no Index-Only Scan on
>> append-only tables, vacuum processing all of the eternity of btree) by 11
>> will get most of the spike-nails out of the microservice code, and we can
>> probably live with them until 11 gets to RDS.
>>
>> I also don't see why pluggable storage is a must for the clustered write.
>> Postgres does have a mechanism for selecting the next page to write a
>> tuple to; right now it just looks at the FSM - but what if it also peeked
>> at an existing index that already has enough data to route the tuple to
>> the correct page on write?
>
> The mechanism you outlined would likely work for your use case, but it
> has many issues that prevent it from being universally useful. From
> the top of my head:
>
> * One extra index descent per insertion (I/O for this is necessary
> anyway, but CPU work is duplicated).
This part is going to be needed for any version of such a mechanism.
It is only going to be enabled for CLUSTERed tables, so most people won't be affected.
> * We don't currently track the amount of bloat. A mechanism that does
> this needs to be added.
We've got the number of tuples and the number of pages in the relation. We know we're degraded when the number of tuples approaches the number of pages. That seems enough for a good-enough cutoff heuristic.
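The numbers for that heuristic are already tracked in pg_class; roughly (a sketch):

SELECT relname, reltuples, relpages,
       reltuples / NULLIF(relpages, 0) AS avg_tuples_per_page
FROM pg_class
WHERE relname = 'positions';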
> * If the table hits the bloat limit there will be a sudden change in
> behavior. This is pretty nasty from an operations point of view.
We can omit the bloat limit for a PoC implementation and see if that is enough. It may happen that there are other ways to calculate sibling pages that will handle both ascending and descending insertion.
> * With your (id, ts) clustering and data coming in mostly ordered by
> timestamp, after initial warmup each page will contain rows from a
> single id, but different ids are arbitrarily interleaved. This is
> better than the current state, but people might want to have an
> interleaving step bigger than 8kB to better utilize storage hardware.
8kB is what a btree index offers in clustered fashion, and people are okay with that, aren't they?

When thinking about this I asked myself: "OK, how can we make the current heap behave like a perfectly-clustered btree, given that we've got a perfectly-clustered btree nearby but cannot use the split operation, since we cannot move a tuple between pages?" This was the best I could think of, for now. Do you have a better idea? :)
> * It seems that with a common (ts) clustering and the age of timestamps
> coming from an exponential distribution, this will quickly bloat to the
> threshold and then insert data in a rather arbitrary order. This is
> much worse than the default behavior.
Can you please explain how that would happen? I expect such a process to quickly start reusing previous pages instead of reaching the threshold, as a page that has enough space to hold the newly coming tuple will be among the list of pages acquired from the index. Clustering will likely not be perfect, but it will reorder at least part of the tuples.

Please note that every tuple inserted into the heap via the FSM is going to drag the tuples following it into that page, until the limiting mechanism kicks in because there is no space left on that page.
> Providing an out-of-the-box solution in core PostgreSQL would of
> course be best, but
Let's stop right here. We need an out-of-the-box solution in core; that would be best, period. I'll happily discuss out-of-Postgres solutions on an out-of-Postgres list :)
Darafei Praliaskouski,
GIS Engineer / Juno Minsk
On Tue, 6 Mar 2018 at 04:57, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 3 March 2018 at 00:30, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
>> I gave this all some thought and it looks like it all could have not happened if Postgres was able to cluster heap insertions by (id, ts) index. We're OK with synchronous_commit=off, so amplified write won't immediately hit disk and can get cooled down in progress. Clustering doesn't require perfect sorting: we need to minimize the number of pages fetched, it's OK if the pages are not consecutive on disk.
>
> I'm surprised nobody has mentioned BRIN yet.
>
> Ever since BRIN was introduced, I've thought it would be very interesting to use it + the freespace map for coarse-grained tuple routing in heap inserts. Try to find the best-match range with BRIN and look for free space there. I think there was discussion of this early on, so you may want to look up the BRIN threads.
It can be done using any indexing mechanism that can get you a list of pages containing sibling tuples for the current one. For btree and GiST I see it as inserting the tuple into the index first (or taking a look at the page where the insertion is going to happen) and then looking at the page numbers of the heap tuples referenced from the same index page. Something similar can work for SP-GiST. For BRIN, just the way you say.

I'm not sure how it can be implemented for GIN. Probably by choosing the emptiest page among all the ones referenced by the posting lists the inserted row can participate in, in a cuckoo-hashing manner?
> The main challenge probably comes when a range is exhausted. Your writes would start spilling over into new heap pages and get intermixed again. Some support for pre-allocating new ranges would be needed.
This can potentially be solved, provided there is a mechanism to move tuples between pages that updates all the indexes. Then autovacuum can be taught to move tuples between ranges to make the ranges smaller. I see how we can survive without such a mechanism for exact indexes like GiST or btree, but not for BRIN, at least at the start.
Darafei Praliaskouski,
GIS Engineer / Juno Minsk
On Mar 6, 2018, at 04:57, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 3 March 2018 at 00:30, Darafei "Komяpa" Praliaskouski <me@komzpa.net> wrote:
>> I gave this all some thought and it looks like it all could have not happened if Postgres was able to cluster heap insertions by (id, ts) index. We're OK with synchronous_commit=off, so amplified write won't immediately hit disk and can get cooled down in progress. Clustering doesn't require perfect sorting: we need to minimize the number of pages fetched, it's OK if the pages are not consecutive on disk.
>
> I'm surprised nobody has mentioned BRIN yet.
>
> Ever since BRIN was introduced, I've thought it would be very interesting to use it + the freespace map for coarse-grained tuple routing in heap inserts. Try to find the best-match range with BRIN and look for free space there. I think there was discussion of this early on, so you may want to look up the BRIN threads.
>
> The main challenge probably comes when a range is exhausted. Your writes would start spilling over into new heap pages and get intermixed again. Some support for pre-allocating new ranges would be needed.
The way I see it, the task splits in two:
1. Find eligible pages to insert into.
2. Make sure you don't screw up a new page.
First we want to define how we want to cluster our data; it can be a mix of equality, hash, range and geo operators.
In Darafei's case an appropriate clustering would be: CLUSTER BY (hash(id), ts asc) or (eq(id), range(ts, '1 day interval')).
To support fast search of pages to insert into, there should be a Bloom index on id and a BRIN or btree index on ts
for the first example, and a btree or hash index on id and a btree, BRIN or GiST index on ts for the second.
So, we can bitmap scan all the needed indexes to find pages holding tuples with the already-present cluster keys,
AND those bitmaps, consult the FSM, and proceed to insert into those pages if space allows.
Now, if we have not found pages with the same cluster keys, or the pages are full, we need to find a free page without screwing it up.
To do so, we must estimate the ideal clustering of a page and either insert into it or start a completely new page.
In order not to intermix a page under hash or equality clustering, we should examine the already-present keys and consult the statistics for them.
If the key is in the MCV list, better find a new page; otherwise we need to calculate how many tuples per page we have on average per key
and compare that with the number of tuples on the page for our new key. I.e. if the page we are about to insert into has 30 tuples with 3 clustering keys,
and we conclude that on average we have 20 tuples per key, this means that there would be 30 more tuples with those keys and 20 with our key.
If the page can hold those 80 tuples, we insert there. I believe existing statistics are enough to make this work.
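The MCV check could lean on the existing per-column statistics, e.g. (a sketch):

SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'positions' AND attname = 'id';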
For range-based clustering we should estimate how many tuples there would be on a page and what its interval would be, and check
whether we would break that interval. For example, with plain ordering clustering, we insert the value 130; the page has a min value of 100 and
20 tuples. We estimate that on average a page's max value minus its min value is 100 and its number of tuples is 30, so each tuple extends the
range by about 3. So our page with a min value of 100 and 20 tuples has a projected closing max value of 200; as our value is 130, we proceed to insert.
I don't know if there are enough stats for this already, or if BRIN can help here. But it is doable.
Now for the geo part. This is the least thought through. We definitely need an index to consult the MBR, and we can use GiST penalty functions to decide.
Do you think the above rough design is doable and desirable?
Hello,

Would the following custom solution:
- a pre-loaded table, with rows sorted by id and ts, containing null values for the other columns and enough free space per block to permit updates in place,
- having a (btree or BRIN) index on (id, ts),
- loaded using UPDATEs instead of INSERTs for each new measure,
have a chance to be a good workaround?

Regards,
PAscal
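If I read this correctly, it would look roughly like the following sketch (names are illustrative; a low fillfactor leaves room for HOT, in-place updates):

CREATE TABLE positions_preloaded (
    id uuid, ts timestamp, geom geometry,
    heading float, speed float, accuracy float, source text
) WITH (fillfactor = 50);

-- pre-load one empty row per expected (id, ts) slot, sorted by (id, ts);
-- each incoming measurement then becomes an update instead of an insert:
UPDATE positions_preloaded
   SET geom = :geom, heading = :heading, speed = :speed,
       accuracy = :accuracy, source = :source
 WHERE id = :id AND ts = :ts;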
On 02.03.2018 19:30, Darafei "Komяpa" Praliaskouski wrote:
> Hi, I work at a ride sharing company and we found a simple scenario that Postgres has a lot to improve at.
> [...]
> I gave this all some thought and it looks like it all could have not happened if Postgres was able to cluster heap insertions by (id, ts) index.
> [...]
> Hackers, can you help me keep Postgres in the system? :)
A clustered updatable heap would have a structure very similar to a B-tree: we would need to somehow locate the target page for an insert by key, handle page overflow/underflow (split/merge), ...
So if something looks like a B-tree and behaves like a B-tree, then most likely we should not reinvent the wheel and should just use a B-tree.
From my point of view, a covering index (CREATE INDEX .. INCLUDE, on commitfest https://commitfest.postgresql.org/15/1350/) is what we need in this case.
Maybe some more options are needed to force index-only scans for append-only data. While this patch is not committed yet, it is possible to try to create a standard compound index including all record columns (the problem can be caused by types not having the required comparison operators).
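In SQL terms, the two variants Konstantin mentions (the first requires the INCLUDE patch; names assumed):

-- with the covering-index patch:
CREATE INDEX ON positions (id, ts)
    INCLUDE (geom, heading, speed, accuracy, source);

-- possible today, if every column type has btree comparison operators:
CREATE INDEX ON positions (id, ts, heading, speed, accuracy, source, geom);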
An alternative solution is manual record clustering. It can be achieved by storing several points of the same route in one Postgres record. You can use a Postgres array or the json type.
The approach with json was used by the Platon company solving a similar task (tracking information about vehicle movements).
Although it may seem a little strange to use JSON for statically structured data, Platon was able to reduce storage size several times and increase query execution speed.
Instead of array/json types you can define your own "vector"/"tile" types, or use my extension VOPS. Certainly in all these cases you will have to rewrite queries, and they will become more complicated.
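A sketch of the array-based manual clustering (one packed row per driver per day; names assumed):

CREATE TABLE positions_packed (
    id   uuid,
    day  date,
    ts   timestamp[],   -- parallel arrays, one element per fix
    geom geometry[]
);

-- periodically fold raw rows into packed ones:
INSERT INTO positions_packed
SELECT id, ts::date,
       array_agg(ts ORDER BY ts),
       array_agg(geom ORDER BY ts)
FROM positions
GROUP BY id, ts::date;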
It is actually a very common problem that the order of inserting and traversing data is different. The same thing happens in most trading systems, where data is imported in time-ascending order while it is usually accessed and analyzed grouped by symbols. The most general solution is to maintain several different representations of the data.
These can be, for example, "vertical" and "horizontal" representations. In Vertica you can create an arbitrary number of "projections" where data is sorted in different ways.
In Postgres this can be achieved using indexes or external storage. I hope that the extensible heap API will simplify the development and integration of such alternative storages.
But even right now you can try to do some manual clustering using indexes or compound types.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Have you looked at the TimescaleDB extension for PostgreSQL?
Colleagues, we have an incomprehensible situation: it seems that the statistics for autovacuum/analyze are not passed to the slave. After a manual VACUUM ANALYZE run, everything comes back to life. The essence: there is a frequently changing table; when I query a replica I see a large Heap Fetches: 18623, and the request takes up to 400 msec. As soon as the master runs VACUUM ANALYZE, Heap Fetches drops to 6394 and the request executes in 5-10 msec. I set on the table:

alter table callback SET (autovacuum_analyze_scale_factor = 0.0);
alter table callback SET (autovacuum_analyze_threshold = 500);

I see in the current-activity statistics that autovacuum passes over this table, but it does not help: the visibility data for this table is not updated for the slave.