Thread: UUID v1 optimizations...
Hi all,

Some time ago, I was having trouble with a rather high-load OLTP application (in Java, but that doesn't really matter) that was using v1 UUID's for primary keys, and after some time the bloat of certain indexes went quite high.

So I investigated the PostgreSQL code to see how it is handling UUID's with respect to storage, sorting, and so on, but all I could find was that it basically falls back to the 16-byte value.

After struggling to find a way to optimize things inside the database, I resorted to introducing a hack into the application by not shuffling the timestamp bytes for the UUID's, which makes it look quite serial in terms of byte order. With that, we were able to reduce bloat by orders of magnitude, and VACUUM was finally able to reclaim index space.

So, my question now is: Would it make sense for you to handle these time-based UUID's differently internally? Specifically un-shuffling the timestamp before they are going to storage?

A second question would be whether native support functions could be introduced? Most interesting for me would be:

* timestamp
* version
* variant

Here are some numbers from the tests I have run (against a PostgreSQL 10 server):

1. insert 5,000,000 rows
2. delete 2,500,000 rows

Index bloat:

       idxname        | bloat_ratio | bloat_size | real_size
 ---------------------+-------------+------------+-----------
  uuid_v1_pkey        |        23.6 | 46 MB      | 195 MB
  uuid_serial_pkey    |        50.4 | 76 MB      | 150 MB

Higher ratio for "serial", but still lower total index size. :)

Now, the performance of VACUUM is also very interesting here:

# vacuum (verbose, analyze, freeze) uuid_serial;
INFO:  vacuuming "public.uuid_serial"
INFO:  index "uuid_serial_pkey" now contains 2500001 row versions in 19253 pages
DETAIL:  0 index row versions were removed.
9624 index pages have been deleted, 9624 are currently reusable.
CPU: user: 0.03 s, system: 0.01 s, elapsed: 0.05 s.
INFO:  "uuid_serial": found 0 removable, 2500001 nonremovable row versions in 13515 out of 27028 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 270712
There were 94 unused item pointers.
Skipped 0 pages due to buffer pins, 13513 frozen pages.
0 pages are entirely empty.
CPU: user: 0.37 s, system: 0.16 s, elapsed: 2.83 s.
INFO:  analyzing "public.uuid_serial"
INFO:  "uuid_serial": scanned 27028 of 27028 pages, containing 2500001 live rows and 0 dead rows; 30000 rows in sample, 2500001 estimated total rows
VACUUM
Time: 3969.812 ms (00:03.970)

# vacuum (verbose, analyze, freeze) uuid_v1;
INFO:  vacuuming "public.uuid_v1"
INFO:  scanned index "uuid_v1_pkey" to remove 2499999 row versions
DETAIL:  CPU: user: 1.95 s, system: 0.13 s, elapsed: 5.09 s
INFO:  "uuid_v1": removed 2499999 row versions in 27028 pages
DETAIL:  CPU: user: 0.22 s, system: 0.26 s, elapsed: 3.93 s
INFO:  index "uuid_v1_pkey" now contains 2500001 row versions in 24991 pages
DETAIL:  2499999 index row versions were removed.
12111 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "uuid_v1": found 1791266 removable, 2500001 nonremovable row versions in 27028 out of 27028 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 270716
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 2.90 s, system: 0.71 s, elapsed: 14.54 s.
INFO:  analyzing "public.uuid_v1"
INFO:  "uuid_v1": scanned 27028 of 27028 pages, containing 2500001 live rows and 0 dead rows; 30000 rows in sample, 2500001 estimated total rows
VACUUM
Time: 15702.803 ms (00:15.703)

...almost 4x faster!

Now insert another 20 million:

COPY uuid_serial FROM '...' WITH ( FORMAT text );
COPY 20000000
Time: 76249.142 ms (01:16.249)

COPY uuid_v1 FROM '...' WITH ( FORMAT text );
COPY 20000000
Time: 804291.611 ms (13:24.292)

...more than 10x faster!

...and the resulting bloat (no VACUUM in between):

       idxname        | bloat_ratio | bloat_size | real_size
 ---------------------+-------------+------------+-----------
  uuid_v1_pkey        |        30.5 | 295 MB     | 966 MB
  uuid_serial_pkey    |         0.9 | 6056 kB    | 677 MB

...still 30% savings in space.

Cheers,
Ancoron
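For readers wondering what the "not shuffling the timestamp bytes" hack amounts to: a v1 UUID stores its 60-bit timestamp least-significant part first (time_low, then time_mid, then time_hi), which is why consecutive values look random to a btree. A rough, purely illustrative SQL sketch of the rearrangement follows; the real change was done inside the Java application at generation time, and the function name and exact output layout here are invented:

-- Rearranges a v1 UUID so the timestamp reads most-significant-first:
-- time_hi || time_mid || time_low, then version nibble, clock_seq and node.
CREATE FUNCTION uuid_v1_unshuffle(u uuid) RETURNS uuid
LANGUAGE sql IMMUTABLE STRICT AS $$
  SELECT (substr(h, 16, 3) || substr(h, 10, 4) || substr(h, 1, 1) || '-' ||  -- time_hi + time_mid + time_low[0]
          substr(h, 2, 4)                                         || '-' ||  -- time_low[1..4]
          substr(h, 6, 3) || substr(h, 15, 1)                     || '-' ||  -- time_low[5..7] + version nibble
          substr(h, 20, 4)                                        || '-' ||  -- clock_seq
          substr(h, 25, 12)                                                  -- node
         )::uuid
  FROM (SELECT u::text AS h) AS s
$$;

-- e.g. SELECT uuid_v1_unshuffle('7d9c4835-5378-11ec-0000-000000000000');
--      => '1ec53787-d9c4-8351-0000-000000000000', which sorts by creation time

Since the result is still a plain uuid value, such "serial" v1 UUID's use the existing type, operators and indexes unchanged; only their byte order makes them btree-friendly.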
On 2019-05-25 15:45, Ancoron Luciferis wrote:
> So, my question now is: Would it make sense for you to handle these
> time-based UUID's differently internally? Specifically un-shuffling the
> timestamp before they are going to storage?

It seems unlikely that we would do that, because that would break existing stored UUIDs, and we'd also need to figure out a way to store non-v1 UUIDs under this different scheme. The use case is pretty narrow compared to the enormous effort.

It is well understood that using UUIDs or other somewhat-random keys performs worse than serial-like keys.

Btw., it might be nice to rerun your tests with PostgreSQL 12beta1. The btree storage has gotten some improvements. I don't think it's going to fundamentally solve your problem, but it would be useful feedback.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
> So I investigated the PostgreSQL code to see how it is handling UUID's
> with respect to storage, sorting, and so on, but all I could find was that it
> basically falls back to the 16-byte value.

Yup, they're just blobs to us.

> After struggling to find a way to optimize things inside the database, I
> resorted to introducing a hack into the application by not shuffling the
> timestamp bytes for the UUID's, which makes it look quite serial in
> terms of byte order.

> So, my question now is: Would it make sense for you to handle these
> time-based UUID's differently internally? Specifically un-shuffling the
> timestamp before they are going to storage?

No, because

(1) UUID layout is standardized;

(2) such a change would break on-disk compatibility for existing databases;

(3) particularly for the case of application-generated UUIDs, we do not have enough information to know that this would actually do anything useful;

(4) it in fact *wouldn't* do anything useful, because we'd still have to sort UUIDs in the same order as today, meaning that btree index behavior would remain the same as before.  Plus UUID comparison would get a lot more complicated and slower than it is now.

(5) even if we ignored all that and did it anyway, it would only help for version-1 UUIDs.  The index performance issue would still remain for version-4 (random) UUIDs, which are if anything more common than v1.

FWIW, I don't know what tool you're using to get those "bloat" numbers, but just because somebody calls it bloat doesn't mean that it is. The normal, steady-state load factor for a btree index is generally understood to be about 2/3rds, and that looks to be about what you're getting for the regular-UUID-format index.  The fact that the serially-loaded index has nearly no dead space is because we hack the page split logic to make that happen --- but that is a hack, and it's not without downsides.  It should *not* be taken to be an indication of what you can expect for any other insertion pattern.

The insertion-speed aspect is a real problem, but the core of that problem is that use of any sort of standard-format UUID turns applications that might have had considerable locality of reference into applications that have none.  If you can manage to keep your whole index in RAM that would not hurt too much, but as soon as it doesn't fit you have a problem.  When your app has more or less predictable reference patterns it's best to design a unique key that matches that, instead of expecting that essentially-random keys will work well.

Or in short, hacking up the way your app generates UUIDs is exactly the right way to proceed here.

			regards, tom lane
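For reference, one way to check the actual leaf density Tom mentions (instead of relying on a third-party "bloat" estimation query) is the pgstattuple contrib extension; a quick sketch, using the index name from the tests above:

CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT avg_leaf_density,      -- ~67% is the normal btree steady state referred to here
       leaf_fragmentation,
       leaf_pages,
       deleted_pages
  FROM pgstatindex('uuid_v1_pkey');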
On 25/05/2019 16:19, Peter Eisentraut wrote:
> On 2019-05-25 15:45, Ancoron Luciferis wrote:
>> So, my question now is: Would it make sense for you to handle these
>> time-based UUID's differently internally? Specifically un-shuffling the
>> timestamp before they are going to storage?
>
> It seems unlikely that we would do that, because that would break
> existing stored UUIDs, and we'd also need to figure out a way to store
> non-v1 UUIDs under this different scheme. The use case is pretty narrow
> compared to the enormous effort.

I agree, the backwards compatibility really would be a show stopper here.

About the "enormous effort": I was thinking the simplest "solution" would be to have the version detected at the time when the internal byte array is created from the provided representation, which then could directly provide the unshuffled byte order.

> It is well understood that using UUIDs or other somewhat-random keys
> performs worse than serial-like keys.

Yes, I know. We're using these UUID's for more than just some primary key, because they also tell us the creation time of the entry and which "node" of the highly distributed application generated the entry. It is like an audit log for us without the need for extra columns, and of fixed size, which helps performance also at the application level.

> Btw., it might be nice to rerun your tests with PostgreSQL 12beta1. The
> btree storage has gotten some improvements. I don't think it's going to
> fundamentally solve your problem, but it would be useful feedback.

Thank you for the pointer to 12beta1. I've just read about it and it might help a bit. I'll give it a try, for sure, and report back. I also have to rerun those tests against PG 11 anyway.

Cheers,
Ancoron
On 25/05/2019 16:57, Tom Lane wrote:
> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>> So I investigated the PostgreSQL code to see how it is handling UUID's
>> with respect to storage, sorting, and so on, but all I could find was that it
>> basically falls back to the 16-byte value.
>
> Yup, they're just blobs to us.
>
>> After struggling to find a way to optimize things inside the database, I
>> resorted to introducing a hack into the application by not shuffling the
>> timestamp bytes for the UUID's, which makes it look quite serial in
>> terms of byte order.
>
>> So, my question now is: Would it make sense for you to handle these
>> time-based UUID's differently internally? Specifically un-shuffling the
>> timestamp before they are going to storage?
>
> No, because
>
> (1) UUID layout is standardized;

You mean the presentation at the byte-level is. ;)

> (2) such a change would break on-disk compatibility for existing
> databases;

Yes, that certainly is a show-stopper.

> (3) particularly for the case of application-generated UUIDs, we do
> not have enough information to know that this would actually do anything
> useful;

Well, not only the layout is standardized, but there is also a semantic to it depending on the version. Specifically for version 1, it has:
1. a timestamp
2. a clock sequence
3. a node id

And as PostgreSQL already provides this pretty concrete data type, it could be a natural consequence to also support the semantics of it. E.g. the network types also come with a lot of additional operators and functions. So I don't see a reason not to respect the additional capabilities of a UUID.

For other versions of UUID's, functions like timestamp would certainly not be available (return NULL?), respecting the concrete semantic.

> (4) it in fact *wouldn't* do anything useful, because we'd still have
> to sort UUIDs in the same order as today, meaning that btree index behavior
> would remain the same as before. Plus UUID comparison would get a lot
> more complicated and slower than it is now.

I get your first sentence, but not your second. I know that when changing the internal byte order we'd have to completely re-compute everything on-disk (from table to index data), but why would the sorting in the index have to be the same?

And actually, comparison logic wouldn't need to be changed at all if the byte order is "changed" when the UUID is read in, i.e. when reading the representation into the internal UUID's byte array.

Function:
string_to_uuid(const char *source, pg_uuid_t *uuid);

^^ here I would apply the change. And of course, reverse it for generating the textual representation.

That would slow down writes a bit, but that shouldn't matter because index insertions are sped up even more.

But still, the on-disk change is a show-stopper, I guess.

> (5) even if we ignored all that and did it anyway, it would only help
> for version-1 UUIDs. The index performance issue would still remain for
> version-4 (random) UUIDs, which are if anything more common than v1.

Yes, I am aware that the changes might be of very limited gain. V4 UUID's are usually used for external identifiers. For internal ones, they don't make sense to me (too long, issues with randomness/entropy under high load, ...). ;)

I just recently used these UUID's also together with a function for TimescaleDB auto-partitioning to provide the timestamp for the partitioning logic instead of the need for a separate timestamp column.

This is also one of the reasons why I was asking for native support functions to extract the timestamp. I am apparently not very good at C, so I am currently using Python and/or PL/pgSQL for it, which is pretty slow.

> FWIW, I don't know what tool you're using to get those "bloat" numbers,
> but just because somebody calls it bloat doesn't mean that it is.
> The normal, steady-state load factor for a btree index is generally
> understood to be about 2/3rds, and that looks to be about what
> you're getting for the regular-UUID-format index. The fact that the
> serially-loaded index has nearly no dead space is because we hack the
> page split logic to make that happen --- but that is a hack, and it's
> not without downsides. It should *not* be taken to be an indication
> of what you can expect for any other insertion pattern.

OK, understood. I was actually a bit surprised by those numbers myself, as these "serial" UUID's still only have the timestamp bytes in ascending order; the clock sequence and node are still pretty random (but not inside a single transaction, which might help the hack).

> The insertion-speed aspect is a real problem, but the core of that problem
> is that use of any sort of standard-format UUID turns applications that
> might have had considerable locality of reference into applications that
> have none. If you can manage to keep your whole index in RAM that would
> not hurt too much, but as soon as it doesn't fit you have a problem.
> When your app has more or less predictable reference patterns it's best
> to design a unique key that matches that, instead of expecting that
> essentially-random keys will work well.

The system was configured to have more than enough space for the index and table data to fit into memory, but I am not sure. How can I verify that? An EXPLAIN on the INSERT apparently doesn't include index insertion.

> Or in short, hacking up the way your app generates UUIDs is exactly
> the right way to proceed here.

OK. Glad to hear that.

One last question, though: Would it make sense to create a specialized UUID v1 type (e.g. with an extension) that does the transformation and delegates for all other things to the existing UUID type support?

> 			regards, tom lane
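A rough sketch of what such extraction functions could look like at the SQL level (both function names are invented; the v1 layout puts the version nibble at the start of the time_hi_and_version field and spreads a 60-bit timestamp, counted in 100 ns units since 1582-10-15, across time_low, time_mid and time_hi):

-- Reassembles the 60-bit timestamp (time_hi || time_mid || time_low) and converts
-- it from 100 ns ticks since the Gregorian epoch. Declared IMMUTABLE so it could
-- back an expression index; that is safe here because the added interval has no
-- day/month component.
CREATE FUNCTION uuid_v1_timestamp(u uuid) RETURNS timestamptz
LANGUAGE sql IMMUTABLE STRICT AS $$
  SELECT TIMESTAMPTZ '1582-10-15 00:00:00+00'
         + (t / 10000000) * INTERVAL '1 second'
         + ((t % 10000000) / 10) * INTERVAL '1 microsecond'
  FROM (SELECT ('x' || substr(h, 16, 3) || substr(h, 10, 4) || substr(h, 1, 8))::bit(60)::bigint AS t
        FROM (SELECT u::text AS h) AS raw) AS ticks
$$;

-- The version is simply the high nibble of the time_hi_and_version field
-- (character 15 of the text form):
CREATE FUNCTION uuid_version(u uuid) RETURNS int
LANGUAGE sql IMMUTABLE STRICT AS $$
  SELECT ('x' || substr(u::text, 15, 1))::bit(4)::int
$$;

-- e.g.: SELECT uuid_v1_timestamp('7d9c4835-5378-11ec-0000-000000000000');

A C implementation in core (or an extension) would of course be much faster than these SQL-level sketches, which is the point of the request above.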
I am not sure why you want to change the on-disk storage format? If we are talking about indexes, it's more about the comparison function (opclass) that is used in an index.
Am I wrong?
On 25/05/2019 21:00, Vitalii Tymchyshyn wrote:
> I am not sure why you want to change the on-disk storage format? If we
> are talking about indexes, it's more about the comparison function (opclass)
> that is used in an index.
> Am I wrong?

I don't "want" to change the on-disk format of the v1 UUID's, but that seems to be the most efficient approach to me (if anyone would ever want to implement that), as not changing it would mean introducing the unshuffling of the timestamp bytes in a lot of other places where it would decrease the performance instead of increasing it, not to speak of VACUUM, REINDEX, CLUSTER, ....

One of my use-cases would also benefit a lot here when trying to find entries that have been created within a given time range. To do that (without an additional timestamp column), I'd end up with a full index scan to determine the rows to read, as there is no correlation between those (timestamp and value sort order). For example, I would like to query the DB with:

WHERE id >= '7d9c4835-5378-11ec-0000-000000000000'
AND id < '5cdcac78-5a9a-11ec-0000-000000000000'

...which is impossible atm. I have written myself a helper function that extracts the timestamp, but that is executed (of course) for each and every entry while filtering, or increases (if used as an index expression) the write RTT by quite a lot (if you're dealing with peak-load style applications like we have):

WHERE time_part_uuidv1(id) >= '2021-12-02 14:02:31.021778+00'
AND time_part_uuidv1(id) < '2021-12-11 15:52:37.107212+00'

So, although possible, it is not really an option for an application like we have (and we don't just want to throw hardware at it).

I don't know if all that makes any sense to you or if I'm just missing a piece of the puzzle inside the PostgreSQL code base. If I do, please tell me. :)

Cheers,
Ancoron
Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
> On 25/05/2019 16:57, Tom Lane wrote:
>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>> to sort UUIDs in the same order as today, meaning that btree index behavior
>> would remain the same as before. Plus UUID comparison would get a lot
>> more complicated and slower than it is now.

> I get your first sentence, but not your second. I know that when
> changing the internal byte order we'd have to completely re-compute
> everything on-disk (from table to index data), but why would the sorting
> in the index have to be the same?

Because we aren't going to change the existing sort order of UUIDs. We have no idea what applications might be dependent on that.

As Vitalii correctly pointed out, your beef is not with the physical storage of UUIDs anyway: you just wish they'd sort differently, since that is what determines the behavior of a btree index.  But we aren't going to change the sort ordering because that's an even bigger compatibility break than changing the physical storage; it'd affect application-visible semantics.

What you might want to think about is creating a function that maps UUIDs into an ordering that makes sense to you, and then creating a unique index over that function instead of the raw UUIDs.  That would give the results you want without having to negotiate with the rest of the world about whether it's okay to change the semantics of type uuid.

			regards, tom lane
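Concretely, that suggestion might look something like the following sketch, reusing the hypothetical uuid_v1_unshuffle() from earlier in the thread (the table name comes from the test setup above; the column and index names are made up):

-- uuid_v1_unshuffle() is a pure byte permutation and therefore IMMUTABLE,
-- so it can be used in an index definition.
CREATE UNIQUE INDEX uuid_v1_ts_order_idx ON uuid_v1 (uuid_v1_unshuffle(id));

-- Range scans over the remapped ordering can then use that index, e.g.:
SELECT count(*)
  FROM uuid_v1
 WHERE uuid_v1_unshuffle(id) >= uuid_v1_unshuffle('7d9c4835-5378-11ec-0000-000000000000')
   AND uuid_v1_unshuffle(id) <  uuid_v1_unshuffle('5cdcac78-5a9a-11ec-0000-000000000000');

Plain "id = constant" lookups would still go through the existing primary key index; this functional index only helps queries written against the remapped expression.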
On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote: >Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes: >> On 25/05/2019 16:57, Tom Lane wrote: >>> (4) it in fact *wouldn't* do anything useful, because we'd still have >>> to sort UUIDs in the same order as today, meaning that btree index behavior >>> would remain the same as before. Plus UUID comparison would get a lot >>> more complicated and slower than it is now. > >> I get your first sentence, but not your second. I know that when >> changing the internal byte order we'd have to completed re-compute >> everything on-disk (from table to index data), but why would the sorting >> in the index have to be the same? > >Because we aren't going to change the existing sort order of UUIDs. >We have no idea what applications might be dependent on that. > >As Vitalii correctly pointed out, your beef is not with the physical >storage of UUIDs anyway: you just wish they'd sort differently, since >that is what determines the behavior of a btree index. But we aren't >going to change the sort ordering because that's an even bigger >compatibility break than changing the physical storage; it'd affect >application-visible semantics. > >What you might want to think about is creating a function that maps >UUIDs into an ordering that makes sense to you, and then creating >a unique index over that function instead of the raw UUIDs. That >would give the results you want without having to negotiate with the >rest of the world about whether it's okay to change the semantics >of type uuid. > FWIW that's essentially what I implemented as an extension some time ago. See [1] for a more detailed explanation and some benchmarks. The thing is - it's not really desirable to get perfectly ordered ordering, because that would mean we never get back to older parts of the index (so if you delete data, we'd never fill that space). [1] https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/ regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
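To make the "wrap" idea concrete: the blog post's approach uses a prefix that advances with time (or a sequence) and wraps around, with the rest of the value random. This is not the extension's actual API, just a hypothetical SQL-level illustration of the concept:

-- Prefix: minutes since epoch modulo 2^16 (wraps roughly every 45 days);
-- remainder: random hex. Version/variant bits are ignored for simplicity.
CREATE FUNCTION seq_uuid_sketch() RETURNS uuid
LANGUAGE sql VOLATILE AS $$
  SELECT (lpad(to_hex((extract(epoch FROM clock_timestamp())::bigint / 60) % 65536), 4, '0')
          || substr(md5(random()::text || clock_timestamp()::text), 1, 28))::uuid
$$;

-- Changing the divisor (60) or the modulus (65536) tunes how fast the prefix
-- advances and how often it wraps back to reuse the "old" part of the index.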
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote: >> What you might want to think about is creating a function that maps >> UUIDs into an ordering that makes sense to you, and then creating >> a unique index over that function instead of the raw UUIDs. That >> would give the results you want without having to negotiate with the >> rest of the world about whether it's okay to change the semantics >> of type uuid. > FWIW that's essentially what I implemented as an extension some time > ago. See [1] for a more detailed explanation and some benchmarks. Also, another way to attack this is to create a new set of ordering operators for UUID and an associated non-default btree opclass. Then you declare your index using that opclass and you're done. The key advantage of this way is that the regular UUID equality operator can still be a member of that opclass, meaning that searches of the form "uuidcol = constant" can still use this index, so you don't have to change your queries (at least not for that common case). Look at the interrelationship of the regular text btree operators and the "pattern_ops" btree operators for a precedent. regards, tom lane
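A rough sketch of what such a non-default opclass could look like, again leaning on the hypothetical uuid_v1_unshuffle() from the earlier sketch (SQL-level support functions would be far slower than a C implementation like the pattern_ops precedent, so this only illustrates the shape):

-- Comparison by the unshuffled (timestamp-first) ordering; all names are invented.
CREATE FUNCTION uuid_ts_cmp(a uuid, b uuid) RETURNS integer
LANGUAGE sql IMMUTABLE STRICT AS $$
  SELECT CASE WHEN uuid_v1_unshuffle(a) < uuid_v1_unshuffle(b) THEN -1
              WHEN uuid_v1_unshuffle(a) > uuid_v1_unshuffle(b) THEN 1
              ELSE 0 END
$$;

CREATE FUNCTION uuid_ts_lt(uuid, uuid) RETURNS boolean
  LANGUAGE sql IMMUTABLE STRICT AS $$ SELECT uuid_ts_cmp($1, $2) < 0 $$;
CREATE FUNCTION uuid_ts_le(uuid, uuid) RETURNS boolean
  LANGUAGE sql IMMUTABLE STRICT AS $$ SELECT uuid_ts_cmp($1, $2) <= 0 $$;
CREATE FUNCTION uuid_ts_gt(uuid, uuid) RETURNS boolean
  LANGUAGE sql IMMUTABLE STRICT AS $$ SELECT uuid_ts_cmp($1, $2) > 0 $$;
CREATE FUNCTION uuid_ts_ge(uuid, uuid) RETURNS boolean
  LANGUAGE sql IMMUTABLE STRICT AS $$ SELECT uuid_ts_cmp($1, $2) >= 0 $$;

CREATE OPERATOR <*  (LEFTARG = uuid, RIGHTARG = uuid, PROCEDURE = uuid_ts_lt);
CREATE OPERATOR <=* (LEFTARG = uuid, RIGHTARG = uuid, PROCEDURE = uuid_ts_le);
CREATE OPERATOR >*  (LEFTARG = uuid, RIGHTARG = uuid, PROCEDURE = uuid_ts_gt);
CREATE OPERATOR >=* (LEFTARG = uuid, RIGHTARG = uuid, PROCEDURE = uuid_ts_ge);

-- Non-default opclass; the built-in "=" stays a member, so "uuidcol = constant"
-- can still use an index built with it, as described above.
CREATE OPERATOR CLASS uuid_timestamp_ops FOR TYPE uuid USING btree AS
    OPERATOR 1 <* ,
    OPERATOR 2 <=* ,
    OPERATOR 3 = (uuid, uuid),
    OPERATOR 4 >=* ,
    OPERATOR 5 >* ,
    FUNCTION 1 uuid_ts_cmp(uuid, uuid);

-- CREATE INDEX uuid_v1_tsorder_idx ON uuid_v1 (id uuid_timestamp_ops);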
On 25/05/2019 23:54, Tom Lane wrote:
> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>> On 25/05/2019 16:57, Tom Lane wrote:
>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>> to sort UUIDs in the same order as today, meaning that btree index behavior
>>> would remain the same as before. Plus UUID comparison would get a lot
>>> more complicated and slower than it is now.
>
>> I get your first sentence, but not your second. I know that when
>> changing the internal byte order we'd have to completely re-compute
>> everything on-disk (from table to index data), but why would the sorting
>> in the index have to be the same?
>
> Because we aren't going to change the existing sort order of UUIDs.
> We have no idea what applications might be dependent on that.
>
> As Vitalii correctly pointed out, your beef is not with the physical
> storage of UUIDs anyway: you just wish they'd sort differently, since
> that is what determines the behavior of a btree index. But we aren't
> going to change the sort ordering because that's an even bigger
> compatibility break than changing the physical storage; it'd affect
> application-visible semantics.
>
> What you might want to think about is creating a function that maps
> UUIDs into an ordering that makes sense to you, and then creating
> a unique index over that function instead of the raw UUIDs. That
> would give the results you want without having to negotiate with the
> rest of the world about whether it's okay to change the semantics
> of type uuid.
>
> 			regards, tom lane

I understand. Point taken. I really didn't think about someone depending on the index order of a (pretty random) UUID.

The whole point of me starting this discussion was about performance in multiple areas, but INSERT performance was really becoming an issue for us, apart from the index bloat, which was way higher than just the 30% on several occasions (including out-of-disk-space in the early days), and apart from the fact that the index was regularly dismissed due to it not being in memory. In that sense, just creating additional indexes with functions doesn't really solve the core issues that we had. Not to mention the performance of VACUUM, among other things.

So, even though we currently "solved" a lot of these issues at the application level, we now have UUID's that look like v1 UUID's but in fact will not be readable (in the representation as returned by PostgreSQL) by any other application that doesn't know our specific implementation. This forces us to hack other tools written in other languages that would otherwise understand and handle regular v1 UUID's as well.

I should add that the tests I have made were all running on dedicated SSD's, one for the table data, one for the indexes and one for the WAL. If those were running against the same disks, the difference would probably be much higher during writes.

I'll think about creating an extension to provide a custom data type instead. So nobody would be at risk and anyone would decide explicitly for it with all consequences.

Thank you for your time and precious input. :)

Cheers,
Ancoron
On Sat, May 25, 2019 at 06:38:08PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>> What you might want to think about is creating a function that maps
>>> UUIDs into an ordering that makes sense to you, and then creating
>>> a unique index over that function instead of the raw UUIDs. That
>>> would give the results you want without having to negotiate with the
>>> rest of the world about whether it's okay to change the semantics
>>> of type uuid.
>
>> FWIW that's essentially what I implemented as an extension some time
>> ago. See [1] for a more detailed explanation and some benchmarks.
>
>Also, another way to attack this is to create a new set of ordering
>operators for UUID and an associated non-default btree opclass.
>Then you declare your index using that opclass and you're done.
>The key advantage of this way is that the regular UUID equality
>operator can still be a member of that opclass, meaning that searches
>of the form "uuidcol = constant" can still use this index, so you
>don't have to change your queries (at least not for that common case).
>Look at the interrelationship of the regular text btree operators and
>the "pattern_ops" btree operators for a precedent.
>

Perhaps. But it does not allow tuning how often the values "wrap", which I think is a useful capability.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 26/05/2019 00:14, Tomas Vondra wrote:
> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>>> to sort UUIDs in the same order as today, meaning that btree index
>>>> behavior would remain the same as before. Plus UUID comparison would
>>>> get a lot more complicated and slower than it is now.
>>
>>> I get your first sentence, but not your second. I know that when
>>> changing the internal byte order we'd have to completely re-compute
>>> everything on-disk (from table to index data), but why would the sorting
>>> in the index have to be the same?
>>
>> Because we aren't going to change the existing sort order of UUIDs.
>> We have no idea what applications might be dependent on that.
>>
>> As Vitalii correctly pointed out, your beef is not with the physical
>> storage of UUIDs anyway: you just wish they'd sort differently, since
>> that is what determines the behavior of a btree index. But we aren't
>> going to change the sort ordering because that's an even bigger
>> compatibility break than changing the physical storage; it'd affect
>> application-visible semantics.
>>
>> What you might want to think about is creating a function that maps
>> UUIDs into an ordering that makes sense to you, and then creating
>> a unique index over that function instead of the raw UUIDs. That
>> would give the results you want without having to negotiate with the
>> rest of the world about whether it's okay to change the semantics
>> of type uuid.
>>
>
> FWIW that's essentially what I implemented as an extension some time
> ago. See [1] for a more detailed explanation and some benchmarks.

Yes, I've seen that before. Pretty nice work you put together there, and I'll surely have a look at it, but we certainly need the node id in compliance with v1 UUID's, so that's why we've been generating UUID's at the application side from day 1.

> The thing is - it's not really desirable to get perfectly ordered
> ordering, because that would mean we never get back to older parts of
> the index (so if you delete data, we'd never fill that space).

Wouldn't this apply also to any sequential-looking index (e.g. on serial)? The main issue with the UUID's is that they almost instantly consume a big part of the total value space (e.g. first value is '01...' and second is coming as 'f3...'), which I would assume not being very efficient with btrees (space reclaim? - bloat).

One of our major concerns is to keep index size small (VACUUM can't be run every minute) to fit into memory next to a lot of others.

I've experimented with the rollover "prefix" myself but found that it makes the index too big (same or larger index size than standard v1 UUIDs) and VACUUM too slow (almost as slow as a standard v1 UUID), although INSERT performance wasn't that bad; our sequential UUID's were way faster (at least pre-generated and imported with COPY to eliminate any value generation impact).

Cheers,
Ancoron

> [1] https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/
>
> regards
On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote: >On 26/05/2019 00:14, Tomas Vondra wrote: >> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote: >>> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes: >>>> On 25/05/2019 16:57, Tom Lane wrote: >>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have >>>>> to sort UUIDs in the same order as today, meaning that btree index >>>>> behavior >>>>> would remain the same as before. Plus UUID comparison would get a lot >>>>> more complicated and slower than it is now. >>> >>>> I get your first sentence, but not your second. I know that when >>>> changing the internal byte order we'd have to completed re-compute >>>> everything on-disk (from table to index data), but why would the sorting >>>> in the index have to be the same? >>> >>> Because we aren't going to change the existing sort order of UUIDs. >>> We have no idea what applications might be dependent on that. >>> >>> As Vitalii correctly pointed out, your beef is not with the physical >>> storage of UUIDs anyway: you just wish they'd sort differently, since >>> that is what determines the behavior of a btree index. But we aren't >>> going to change the sort ordering because that's an even bigger >>> compatibility break than changing the physical storage; it'd affect >>> application-visible semantics. >>> >>> What you might want to think about is creating a function that maps >>> UUIDs into an ordering that makes sense to you, and then creating >>> a unique index over that function instead of the raw UUIDs. That >>> would give the results you want without having to negotiate with the >>> rest of the world about whether it's okay to change the semantics >>> of type uuid. >>> >> >> FWIW that's essentially what I implemented as an extension some time >> ago. See [1] for a more detailed explanation and some benchmarks. > >Yes, I've seen that before. Pretty nice work you but together there and >I'll surely have a look at it but we certainly need the node id in >compliance with v1 UUID's so that's why we've been generating UUID's at >the application side from day 1. > >> >> The thing is - it's not really desirable to get perfectly ordered >> ordering, because that would mean we never get back to older parts of >> the index (so if you delete data, we'd never fill that space). > >Wouldn't this apply also to any sequential-looking index (e.g. on >serial)? Yes, it does apply to any index on sequential (ordered) data. If you delete data from the "old" part (but not all, so the pages don't get completely empty), that space is lost. It's available for new data, but if we only insert to "new" part of the index, that's useless. > The main issue with the UUID's is that it almost instantly >consumes a big part of the total value space (e.g. first value is >'01...' and second is coming as 'f3...') which I would assume not being >very efficient with btrees (space reclaim? - bloat). > I don't understand what you mean here. Perhaps you misunderstand how btree indexes grow? It's not like we allocate separate pages for different values/prefixes - we insert the data until a page gets full, then it's split in half. There is some dependency on the order in which the values are inserted, but AFAIK random order is generally fine. >One of our major concerns is to keep index size small (VACUUM can't be >run every minute) to fit into memory next to a lot of others. 
> I don't think this has much to do with vacuum - I don't see how it's related to the ordering of generated UUID values. And I don't see where the "can't run vacuum every minute" comes from. >I've experimented with the rollover "prefix" myself but found that it >makes the index too big (same or larger index size than standard v1 >UUIDs) and VACUUM too slow (almost as slow as a standard V1 UUID), >although INSERT performance wasn't that bad, our sequential UUID's where >way faster (at least pre-generated and imported with COPY to eliminate >any value generation impact). > I very much doubt that has anything to do with the prefix. You'll need to share more details about how you did your tests. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I'm not worthy to post here, but a bit of a random thought.
If I've followed the conversation correctly, the reason for a V1 UUID is partly to order and partition rows by a timestamp value, but without the cost of a timestamp column. As I was told as a boy, "Smart numbers aren't."

Is it _absolutely_ the case that you can't afford another column? I don't know the ins and outs of the Postgres row format, but my impression is that it's a fixed size, so you may be able to add the column without splitting rows?

Anyway, even if that's not true and the extra column costs you disk space, is it the index that concerns you? Have you considered a timestamp column, or a numeric column with an epoch offset, and a BRIN index? If you insert data in pretty much chronological order, that might work well for you.
Best of luck, I've enjoyed following the commentary.
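For what that suggestion could look like in practice (the table name comes from the test setup earlier in the thread; the column and index names are made up, and this assumes rows arrive in roughly chronological order, which is what makes BRIN effective):

-- A plain timestamp column plus a BRIN index: tiny on disk, good for time-range
-- scans when the physical row order correlates with the timestamp.
ALTER TABLE uuid_v1 ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();

CREATE INDEX uuid_v1_created_brin ON uuid_v1 USING brin (created_at);

-- Time-range queries can then avoid touching the UUID ordering entirely:
SELECT count(*)
  FROM uuid_v1
 WHERE created_at >= '2021-12-02 14:02:31+00'
   AND created_at <  '2021-12-11 15:52:37+00';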
On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote:
>On 26/05/2019 00:14, Tomas Vondra wrote:
>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>>>> to sort UUIDs in the same order as today, meaning that btree index
>>>>> behavior
>>>>> would remain the same as before. Plus UUID comparison would get a lot
>>>>> more complicated and slower than it is now.
>>>
>>>> I get your first sentence, but not your second. I know that when
>>>> changing the internal byte order we'd have to completed re-compute
>>>> everything on-disk (from table to index data), but why would the sorting
>>>> in the index have to be the same?
>>>
>>> Because we aren't going to change the existing sort order of UUIDs.
>>> We have no idea what applications might be dependent on that.
>>>
>>> As Vitalii correctly pointed out, your beef is not with the physical
>>> storage of UUIDs anyway: you just wish they'd sort differently, since
>>> that is what determines the behavior of a btree index. But we aren't
>>> going to change the sort ordering because that's an even bigger
>>> compatibility break than changing the physical storage; it'd affect
>>> application-visible semantics.
>>>
>>> What you might want to think about is creating a function that maps
>>> UUIDs into an ordering that makes sense to you, and then creating
>>> a unique index over that function instead of the raw UUIDs. That
>>> would give the results you want without having to negotiate with the
>>> rest of the world about whether it's okay to change the semantics
>>> of type uuid.
>>>
>>
>> FWIW that's essentially what I implemented as an extension some time
>> ago. See [1] for a more detailed explanation and some benchmarks.
>
>Yes, I've seen that before. Pretty nice work you put together there and
>I'll surely have a look at it but we certainly need the node id in
>compliance with v1 UUID's so that's why we've been generating UUID's at
>the application side from day 1.
>
>>
>> The thing is - it's not really desirable to get perfectly ordered
>> ordering, because that would mean we never get back to older parts of
>> the index (so if you delete data, we'd never fill that space).
>
>Wouldn't this apply also to any sequential-looking index (e.g. on
>serial)?
Yes, it does apply to any index on sequential (ordered) data. If you
delete data from the "old" part (but not all, so the pages don't get
completely empty), that space is lost. It's available for new data, but
if we only insert to "new" part of the index, that's useless.
> The main issue with the UUID's is that it almost instantly
>consumes a big part of the total value space (e.g. first value is
>'01...' and second is coming as 'f3...') which I would assume not being
>very efficient with btrees (space reclaim? - bloat).
>
I don't understand what you mean here. Perhaps you misunderstand how
btree indexes grow? It's not like we allocate separate pages for
different values/prefixes - we insert the data until a page gets full,
then it's split in half. There is some dependency on the order in which
the values are inserted, but AFAIK random order is generally fine.
>One of our major concerns is to keep index size small (VACUUM can't be
>run every minute) to fit into memory next to a lot of others.
>
I don't think this has much to do with vacuum - I don't see how it's
related to the ordering of generated UUID values. And I don't see where
the "can't run vacuum every minute" comes from.
>I've experimented with the rollover "prefix" myself but found that it
>makes the index too big (same or larger index size than standard v1
>UUIDs) and VACUUM too slow (almost as slow as a standard V1 UUID),
>although INSERT performance wasn't that bad, our sequential UUID's were
>way faster (at least pre-generated and imported with COPY to eliminate
>any value generation impact).
>
I very much doubt that has anything to do with the prefix. You'll need
to share more details about how you did your tests.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
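[A minimal sketch of the functional-index approach suggested above, assuming a hypothetical IMMUTABLE function uuid_v1_to_ordered() that rearranges the v1 timestamp bytes into a time-ordered value:]

-- unique index over the time-ordered mapping; the uuid column itself keeps
-- its normal sort order, only this extra index sorts time-wise
CREATE UNIQUE INDEX uuid_v1_time_ordered_idx ON uuid_v1 (uuid_v1_to_ordered(id));
-- queries have to use the same expression to benefit from the index
SELECT id FROM uuid_v1 ORDER BY uuid_v1_to_ordered(id) DESC LIMIT 100;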
On 26/05/2019 03:09, Tomas Vondra wrote:
> Yes, it does apply to any index on sequential (ordered) data. If you
> delete data from the "old" part (but not all, so the pages don't get
> completely empty), that space is lost. It's available for new data, but
> if we only insert to "new" part of the index, that's useless.

OK, thanks for clearing that up for me.

> I don't understand what you mean here. Perhaps you misunderstand how
> btree indexes grow? It's not like we allocate separate pages for
> different values/prefixes - we insert the data until a page gets full,
> then it's split in half. There is some dependency on the order in which
> the values are inserted, but AFAIK random order is generally fine.

OK, I might not understand the basics of the btree implementation. Sorry
for that.

However, one of our issues with standard v1 UUID's was bloat of the
indexes, although we kept only a few months of data in them. I think this
was due to the pages still containing at least one value and not being
reclaimed by vacuum. It just continued to grow.

Now, as we have this different ever-increasing prefix, we still have some
constant growing, but we see that whenever historic data gets deleted (in
a separate process), space gets reclaimed.

> I don't think this has much to do with vacuum - I don't see how it's
> related to the ordering of generated UUID values. And I don't see where
> the "can't run vacuum every minute" comes from.

OK, numbers (after VACUUM) which I really found strange, using the query
from pgexperts [1]:

 index_name          | bloat_pct | bloat_mb | index_mb | table_mb
---------------------+-----------+----------+----------+----------
 uuid_v1_pkey        |        38 |      363 |  965.742 |  950.172
 uuid_serial_pkey    |        11 |       74 |  676.844 |  950.172
 uuid_serial_8_pkey  |        46 |      519 | 1122.031 |  950.172
 uuid_serial_16_pkey |        39 |      389 |  991.195 |  950.172

...where the "8" and "16" is a "shift" of the timestamp value, implemented
with:

timestamp = (timestamp >>> (60 - shift)) | (timestamp << (shift + 4))

If someone could shed some light on why that is (the huge difference in
index sizes) I'd be happy.

> I very much doubt that has anything to do with the prefix. You'll need
> to share more details about how you did your tests.

OK, I'll see if I can prepare something and publish it.

Refs:
[1] https://github.com/pgexperts/pgx_scripts/blob/master/bloat/index_bloat_check.sql
On 26/05/2019 06:27, Morris de Oryx wrote:
> If I've followed the conversation correctly, the reason for a V1 UUID is
> partly to order and partition rows by a timestamp value, but without the
> cost of a timestamp column.

Exactly. We are using the actual information combined within a v1 UUID in
multiple places and would like to avoid redundancy of information in the
database, as we strive to keep as much of it in memory, and partitioning
as well as timestamp (range) searching and sorting is a pretty common
thing for us.

For us, it's not an absolute "no-go" to have an additional column, but the
semantics of the v1 UUID already guarantee us uniqueness across all
partitions, it is the internal primary key, and it carries additional
information we are using (creation time, node). In addition, the extra
column would need yet another index, which brings our write performance
down again. So, while it would improve reading, we're currently (still)
more concerned about the write performance.

> Have you considered a timestamp column, or a numeric column with an
> epoch offset, and a BRIN index? If your insert data is in pretty much
> chronological order, that might work well for you.

The BRIN index is something I might need to test, though.

> Best of luck, I've enjoyed following the commentary.
On Sun, May 26, 2019 at 7:38 PM Ancoron Luciferis <ancoron.luciferis@googlemail.com> wrote:
The BRIN index is something I might need to test, though.
Yes, check that out, it might give you some ideas. A B-tree (in whatever variant) is inherently a large index type. They're ideal for finding unique values quickly, not ideal for storing redundant values, and pretty decent at finding ranges. A BRIN (Block Range Index), as implemented in Postgres, is good for finding unique values and ranges. But here's the thing: a BRIN index takes some absurdly small % of the space of a B-tree. You have to blink and check again to be sure you've figured it right.
How can a BRIN index be so much smaller? By throwing virtually everything out. A BRIN index doesn't store all of the values in a page, it stores the min/max value and that's it. So it's a probabilistic index (of sorts.) Is the value you're seeking on such-and-such a page? The answer is "No" or "Maybe." That's a fast test on a very cheap data structure.
When the answer is "maybe", the full page has to be loaded and scanned to determine if a specific value is found. So, you have a very small index structure, but have to do more sequential scanning to determine if a record is indexed in that page or not. In the real world, this can work out to be a high-performance structure at very low cost. But for it to work, your records need to be physically ordered (CLUSTER) by the condition in that index. And, going forward, you ought to be inserting in order too (more or less). So, a BRIN index is a great option if you have an insertion pattern that allows it to remain efficient, and if your goal is range searching without a heavy B-tree index to maintain.
I have no clue how BRIN indexes and partitioning interact.
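[To make the size difference concrete, an illustrative comparison sketch, not from the original posts; table and column names are hypothetical:]

-- build both index types on the same timestamp column and compare their sizes
CREATE INDEX events_created_btree ON events USING btree (created_at);
CREATE INDEX events_created_brin  ON events USING brin  (created_at);
SELECT pg_size_pretty(pg_relation_size('events_created_btree')) AS btree_size,
       pg_size_pretty(pg_relation_size('events_created_brin'))  AS brin_size;
-- BRIN stays effective only if the heap is kept (roughly) in time order,
-- e.g. after a one-off physical reordering: CLUSTER events USING events_created_btree;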
On Sun, May 26, 2019 at 02:27:05PM +1000, Morris de Oryx wrote:
>Have you considered a timestamp column, or a numeric column with an epoch
>offset, and a BRIN index? If your insert data is in pretty much
>chronological order, that might work well for you.

No, an extra column is not a solution, because it has no impact on the
index on the UUID column.

One of the problems with indexes on random data is that the entries go to
random parts of the index. In the extreme case, each index insert goes to
a different index page (since the last checkpoint) and therefore has to
write the whole page into the WAL. That's what full-page writes do. This
inflates the amount of WAL, may trigger more frequent checkpoints and (of
course) reduces the cache hit ratio for index pages (because we have to
touch many of them).

The point of generating UUIDs in a more sequential way is to limit this
behavior by "concentrating" the index inserts into a smaller part of the
index. That's why indexes on sequential data (say, generated from a SERIAL
column) perform better.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
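[One rough way to observe this effect is to compare the WAL generated by random-key vs. sequential-key inserts. An illustrative psql sketch only, not from the original posts; it assumes the pgcrypto extension for gen_random_uuid() and superuser rights for CHECKPOINT:]

CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE TABLE t_random (id uuid   PRIMARY KEY);
CREATE TABLE t_seq    (id bigint PRIMARY KEY);

-- measure WAL written by 1M random-UUID inserts (checkpoint first so
-- full-page images are counted from a clean slate)
CHECKPOINT;
SELECT pg_current_wal_lsn() AS lsn0 \gset
INSERT INTO t_random SELECT gen_random_uuid() FROM generate_series(1, 1000000);
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn0')) AS wal_random;

-- same number of rows, but strictly increasing keys
CHECKPOINT;
SELECT pg_current_wal_lsn() AS lsn0 \gset
INSERT INTO t_seq SELECT g FROM generate_series(1, 1000000) g;
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn0')) AS wal_sequential;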
Here's what I was thinking of regarding disk space:
That's the kind of low-level detail I try not to worry about, but like to have in the back of my mind at least a little. If I read it correctly, a fixed-length field is going to be marked as present with a bit, and then inlined in the row storage without any extra cost for an address. So not much in the way of extra overhead. Spending the space on a field reduces the compute needed to constantly extract the same information from the UUID field.
But that particular trade-off is an ancient discussion and judgement call; you know your requirements and constraints better than anyone else. So, I'll leave it at that.
On Sun, May 26, 2019 at 8:37 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
No, an extra column is not a solution, because it has no impact on the
index on the UUID column.
Possibly talking at cross-purposes here. I was homing in on the OP's wish to search and sort by creation order, for which my first (and only) instinct would be to have a timestamp. In fact, the OP wants to work with multiple subcomponents encoded in their magic number, so I'm likely off base entirely. I have a long-standing allergy to concatenated key-like fields as they're opaque, collapse multiple values into a single column (0NF), and inevitably (in my experience) get you into a bind when requirements change.
But everyone's got their own point of view on such judgement calls. I'm not currently dealing with anything where the cost of adding a few small, fixed-type columns would give me a moment's hesitation. I'm sure we all like to save space, but when saving space costs you clarity, flexibility, and compute, the "savings" aren't free. So, it's a judgment call. The OP may well have 1B rows and really quite good reasons for worrying about disk-level optimizations.
Hi,

I've finally found some time to redo some tests using PostgreSQL 11.3.

The scenario is the following:

1.) add 10 M rows
2.) add another 30 M rows
3.) delete the first 10 M rows
4.) VACUUM
5.) REINDEX

My goal is to find the most efficient way for UUID values to somewhat
optimize for index page/node reclaim (or least amount of "bloat") without
the need to execute a REINDEX (although we're currently doing that on
partitions where we can be sure that no rows will be added/changed).

I have tested the following:

Table "uuid_v1":           standard V1 UUID
Table "uuid_v1_timestamp": standard V1 UUID with an additional index using
                           a timestamp extract function
Table "uuid_seq":          sequential UUID from Tomas Vondra
Table "uuid_serial":       "serial" UUID from my Java implementation

For the test, I created a little PostgreSQL extension for support
functions using standard version 1 UUID's (mainly to extract the
timestamp):
https://github.com/ancoron/pg-uuid-ext

My C skills really suck, so if someone could review that and tell me if
there are more efficient ways of doing certain things I'd be happy!

And for generating the UUID values I've created another Java project:
https://github.com/ancoron/java-uuid-serial

Now some test results, which are somewhat in line with what I've already
seen in previous tests for PG 10, but with less dramatic performance
differences here:

1.) add 10 M rows

Table: uuid_v1           Time: 33183.639 ms (00:33.184)
Table: uuid_v1_timestamp Time: 49869.763 ms (00:49.870)
Table: uuid_seq          Time: 35794.906 ms (00:35.795)
Table: uuid_serial       Time: 22835.071 ms (00:22.835)

As expected, the table with the additional function index is slowest, but
surprisingly the sequential UUID was not faster than the standard V1 UUID
here. Still, the serial version is the fastest.

Primary key indexes after an ANALYZE:

 table_name        | bloat          | index_mb | table_mb
-------------------+----------------+----------+----------
 uuid_v1           | 128 MiB (32 %) |  395.570 |  422.305
 uuid_v1_timestamp | 128 MiB (32 %) |  395.570 |  422.305
 uuid_seq          | 123 MiB (31 %) |  390.430 |  422.305
 uuid_serial       | 108 MiB (29 %) |  376.023 |  422.305

2.) add another 30 M rows

Table: uuid_v1           Time: 136109.046 ms (02:16.109)
Table: uuid_v1_timestamp Time: 193172.454 ms (03:13.172)
Table: uuid_seq          Time: 124713.530 ms (02:04.714)
Table: uuid_serial       Time:  78362.209 ms (01:18.362)

Now the performance difference gets much more dramatic.

Primary key indexes after an ANALYZE:

 table_name        | bloat          | index_mb | table_mb
-------------------+----------------+----------+----------
 uuid_v1           | 500 MiB (32 %) | 1571.039 | 1689.195
 uuid_v1_timestamp | 500 MiB (32 %) | 1571.039 | 1689.195
 uuid_seq          | 492 MiB (31 %) | 1562.766 | 1689.195
 uuid_serial       | 433 MiB (29 %) | 1504.047 | 1689.195

Still no noticeable difference, but I wonder why it looks like a
fill-factor of 70 instead of the default 90.

3.) delete the first 10 M rows

Implementations differ for the tables due to different capabilities:

DELETE FROM uuid_v1 WHERE id IN (select id from uuid_v1 limit 10000000);
DELETE FROM uuid_v1_timestamp WHERE uuid_v1_timestamp(id) < '2019-03-09T07:58:02.056';
DELETE FROM uuid_seq WHERE id IN (select id from uuid_seq limit 10000000);
DELETE FROM uuid_serial WHERE id < '1e942411-004c-1d50-0000-000000000000';

Table: uuid_v1           Time: 38308.299 ms (00:38.308)
Table: uuid_v1_timestamp Time: 11589.941 ms (00:11.590)
Table: uuid_seq          Time: 37171.331 ms (00:37.171)
Table: uuid_serial       Time: 11694.893 ms (00:11.695)

As expected, using the timestamp index directly, or being able to compare
the UUIDs in a time-wise manner, produces the best results.

4.) VACUUM

Table: uuid_v1           Time: 69740.952 ms (01:09.741)
Table: uuid_v1_timestamp Time: 67347.469 ms (01:07.347)
Table: uuid_seq          Time: 25832.355 ms (00:25.832)
Table: uuid_serial       Time: 12339.531 ms (00:12.340)

This is pretty important to us as it consumes system resources and we
have quite a lot of big tables. So my serial implementation seems to beat
even the sequential one, which was pretty surprising for me to see.

Primary key indexes after an ANALYZE:

 table_name        | bloat          | index_mb | table_mb
-------------------+----------------+----------+----------
 uuid_v1           | 767 MiB (49 %) | 1571.039 | 1689.195
 uuid_v1_timestamp | 768 MiB (49 %) | 1571.039 | 1689.195
 uuid_seq          | 759 MiB (49 %) | 1562.766 | 1689.195
 uuid_serial       | 700 MiB (47 %) | 1504.047 | 1689.195

OK, sadly no reclaim in any of them.

5.) REINDEX

Table: uuid_v1           Time: 21549.860 ms (00:21.550)
Table: uuid_v1_timestamp Time: 27367.817 ms (00:27.368)
Table: uuid_seq          Time: 19142.711 ms (00:19.143)
Table: uuid_serial       Time: 16889.807 ms (00:16.890)

Even in this case it looks as if my implementation is faster than
anything else - which I really don't get.

Overall, as write performance is our major concern at the moment, without
sacrificing read performance, we'll continue using our "serial"
implementation for the time being.

Nevertheless, I am still puzzled by the index organization and how to keep
it within a certain size in cases where you have a rolling window of value
space, to avoid unnecessary disk I/O (even if it is SSD). So I wonder if
there is anything that can be done on the index side?

I might implement a different opclass for the standard UUID to enable
time-wise index sort order. This will naturally be very close to physical
order, but I doubt that this is something I can tell PostgreSQL, or?

Cheers,
Ancoron
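[Regarding the fill-factor observation in step 2: the actual leaf density can be checked directly with the pgstattuple contrib extension. A quick sketch against the test indexes above, for illustration only:]

CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- shows how full the btree leaf pages actually are
SELECT avg_leaf_density, leaf_fragmentation FROM pgstatindex('uuid_v1_pkey');
SELECT avg_leaf_density, leaf_fragmentation FROM pgstatindex('uuid_serial_pkey');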
Please don't top post -- trim your response down so that only
still-relevant text remains.

On Tue, Jun 11, 2019 at 1:27 PM Ancoron Luciferis
<ancoron.luciferis@googlemail.com> wrote:
> Primary key indexes after an ANALYZE:
>  table_name        | bloat          | index_mb | table_mb
> -------------------+----------------+----------+----------
>  uuid_v1           | 767 MiB (49 %) | 1571.039 | 1689.195
>  uuid_v1_timestamp | 768 MiB (49 %) | 1571.039 | 1689.195
>  uuid_seq          | 759 MiB (49 %) | 1562.766 | 1689.195
>  uuid_serial       | 700 MiB (47 %) | 1504.047 | 1689.195
>
> OK, sadly no reclaim in any of them.

I don't know how you got these figures, but most likely they don't take
into account the fact that the FSM for the index has free blocks
available. You'll only notice that if you have additional page splits
that can recycle that space. Or, you could use pg_freespacemap to get
some idea.

> 5.) REINDEX
> Table: uuid_v1           Time: 21549.860 ms (00:21.550)
> Table: uuid_v1_timestamp Time: 27367.817 ms (00:27.368)
> Table: uuid_seq          Time: 19142.711 ms (00:19.143)
> Table: uuid_serial       Time: 16889.807 ms (00:16.890)
>
> Even in this case it looks as if my implementation is faster than
> anything else - which I really don't get.

Sorting already-sorted data is faster. CREATE INDEX is mostly a big sort
operation in the case of B-Tree indexes.

> I might implement a different opclass for the standard UUID to enable
> time-wise index sort order. This will naturally be very close to
> physical order but I doubt that this is something I can tell PostgreSQL, or?

PostgreSQL only knows whether or not your page splits occur in the
rightmost page in the index -- it fills the page differently according to
whether or not that is the case.

--
Peter Geoghegan
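[A minimal sketch of the pg_freespacemap check mentioned above (contrib extension), using one of the test indexes for illustration:]

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
-- for a btree, FSM entries with avail > 0 are whole deleted pages that can
-- be recycled by future page splits
SELECT count(*) AS recyclable_pages
  FROM pg_freespace('uuid_v1_pkey')
 WHERE avail > 0;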
On 08/07/2019 02:26, Peter Geoghegan wrote:
> I don't know how you got these figures, but most likely they don't take
> into account the fact that the FSM for the index has free blocks
> available. You'll only notice that if you have additional page splits
> that can recycle that space. Or, you could use pg_freespacemap to get
> some idea.

Hm, I think I've already read quite a bit about the internals of the PG
b-tree index implementation, but still cannot get to the answer how I
could influence that on my end, as I want to stay compatible with the
standard UUID data storage but need time sorting support.

Anyway, I've made a bit of progress in testing and now have the full tests
executing unattended with the help of a script:
https://github.com/ancoron/pg-uuid-test

I've uploaded one of the test run results here:
https://gist.github.com/ancoron/d5114b0907e8974b6808077e02f8d109

After the first mass deletion, I can now see quite some savings for both,
serial and my new time-sorted index:

 table_name  | bloat           | index_mb | table_mb
-------------+-----------------+----------+----------
 uuid_v1     | 1500 MiB (48 %) | 3106.406 | 3378.383
 uuid_serial |  800 MiB (33 %) | 2406.453 | 3378.383
 uuid_v1_ext |  800 MiB (33 %) | 2406.453 | 3378.383

...but in a second case (DELETE old + INSERT new), the savings are gone
again in both cases:

 table_name  | bloat           | index_mb | table_mb
-------------+-----------------+----------+----------
 uuid_v1     | 1547 MiB (49 %) | 3153.859 | 3378.383
 uuid_serial | 1402 MiB (47 %) | 3008.055 | 3378.383
 uuid_v1_ext | 1403 MiB (47 %) | 3008.055 | 3378.383

So, the question for me would be: Is there any kind of data that plays
optimally with space-savings in a rolling (e.g. last X rows) scenario?

> Sorting already-sorted data is faster. CREATE INDEX is mostly a big
> sort operation in the case of B-Tree indexes.

Understood. This seems to be confirmed by my time-sorted index in the new
tests:

uuid_v1:     27632.660 ms (00:27.633)
uuid_serial: 20519.363 ms (00:20.519) x1.35
uuid_v1_ext: 23846.474 ms (00:23.846) x1.16

> PostgreSQL only knows whether or not your page splits occur in the
> rightmost page in the index -- it fills the page differently according
> to whether or not that is the case.
As I've now implemented the new opclass and the new tests show the
results, I think I can say that the time-sorting behavior, as opposed to
the rather random one, really benefits the overall performance, which is
what I actually care about most.

Cheers,
Ancoron
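[For reference, once the comparison support exists in a C extension, wiring a time-wise ordering into an index is a matter of an operator class. A rough sketch only; the operator names (<*, <=*, >=*, >*) and the support function uuid_timestamp_cmp() are hypothetical placeholders, not the actual extension API:]

-- hypothetical btree operator class that sorts uuid values time-wise;
-- the operators and the comparison function would be supplied by a C extension
CREATE OPERATOR CLASS uuid_timestamp_ops FOR TYPE uuid USING btree AS
    OPERATOR 1 <*  (uuid, uuid),
    OPERATOR 2 <=* (uuid, uuid),
    OPERATOR 3 =   (uuid, uuid),
    OPERATOR 4 >=* (uuid, uuid),
    OPERATOR 5 >*  (uuid, uuid),
    FUNCTION 1 uuid_timestamp_cmp(uuid, uuid);
-- the opclass is then chosen per index, leaving the default uuid ordering untouched
CREATE INDEX uuid_v1_time_idx ON uuid_v1 (id uuid_timestamp_ops);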