Thread: UUID v1 optimizations...

UUID v1 optimizations...

From
Ancoron Luciferis
Date:
Hi all,

Some time ago, I was having trouble with a rather high-load OLTP
application (in Java, but that doesn't really matter) that was using v1
UUID's for primary keys, and after some time the bloat of certain
indexes grew quite high.

So I investigated the PostgreSQL code to see how it handles UUID's
with respect to storage, sorting, and so on, but all I could find was
that it basically falls back to treating them as plain 16-byte values.

After struggling to find a way to optimize things inside the database, I
resorted to introducing a hack into the application: not shuffling the
timestamp bytes for the UUID's, which makes them look quite serial in
terms of byte order.
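
To illustrate what I mean by "not shuffling" (this is just a SQL-level
sketch of the idea; the real hack lives in the Java UUID generator and
the exact layout there may differ):

-- rewrite a standard v1 UUID so that the timestamp hex digits come
-- first, making byte/text order follow creation time
CREATE FUNCTION uuid_v1_to_serial(id uuid) RETURNS uuid AS $$
  SELECT (substr(u, 16, 3) || substr(u, 10, 4) || substr(u, 1, 1) || '-'
       || substr(u, 2, 4) || '-'
       || substr(u, 15, 1) || substr(u, 6, 3) || '-'
       || substr(u, 20, 4) || '-'
       || substr(u, 25, 12))::uuid
    FROM (SELECT id::text) AS t(u);
$$ LANGUAGE sql IMMUTABLE;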

With that, we were able to reduce bloat by magnitudes and finally VACUUM
also was able to reclaim index space.

So, my question now is: Would it make sense for you to handle these
time-based UUID's differently internally? Specifically un-shuffling the
timestamp before they are going to storage?

A second question would be whether native support functions could be
introduced. Most interesting for me would be (a rough sketch of what I
use today follows below the list):
* timestamp
* version
* variant
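
Here is roughly what I do today in plain SQL, just to sketch the
semantics I would like to see supported natively (the helper names are
mine, and a built-in C implementation would obviously be much faster):

-- version: first hex digit of the third group
CREATE FUNCTION uuid_version(id uuid) RETURNS int AS $$
  SELECT ('x' || substr(id::text, 15, 1))::bit(4)::int;
$$ LANGUAGE sql IMMUTABLE;

-- variant: encoded in the top bits of the first hex digit of the fourth group
CREATE FUNCTION uuid_variant(id uuid) RETURNS int AS $$
  SELECT CASE WHEN v < 8  THEN 0   -- NCS backward compatibility
              WHEN v < 12 THEN 1   -- RFC 4122
              WHEN v < 14 THEN 2   -- Microsoft
              ELSE 3 END           -- reserved
    FROM (SELECT ('x' || substr(id::text, 20, 1))::bit(4)::int) AS t(v);
$$ LANGUAGE sql IMMUTABLE;

-- timestamp of a v1 UUID: a 60-bit count of 100 ns intervals since
-- 1582-10-15, shifted here to the Unix epoch
CREATE FUNCTION uuid_v1_timestamp(id uuid) RETURNS timestamptz AS $$
  SELECT to_timestamp(
           ((('x' || '0' || substr(u, 16, 3) || substr(u, 10, 4) || substr(u, 1, 8))::bit(64)::bigint
             - 122192928000000000) / 10000000.0)::double precision)
    FROM (SELECT id::text) AS t(u);
$$ LANGUAGE sql IMMUTABLE;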


Here are some numbers from the tests I have run (against a PostgreSQL 10
server):

1. insert 5,000,000 rows
2. delete 2,500,000 rows
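
A rough sketch of the setup, reconstructed from the index names shown
below (the exact load and delete statements may have differed):

CREATE TABLE uuid_v1     (id uuid PRIMARY KEY);  -- standard v1 UUID's
CREATE TABLE uuid_serial (id uuid PRIMARY KEY);  -- timestamp bytes unshuffled

COPY uuid_v1     FROM '...' WITH ( FORMAT text );  -- 5,000,000 pre-generated values
COPY uuid_serial FROM '...' WITH ( FORMAT text );

DELETE FROM uuid_v1     WHERE id IN (SELECT id FROM uuid_v1     LIMIT 2500000);
DELETE FROM uuid_serial WHERE id IN (SELECT id FROM uuid_serial LIMIT 2500000);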

Index Bloat:
       idxname       | bloat_ratio | bloat_size | real_size
---------------------+-------------+------------+-----------
 uuid_v1_pkey        |        23.6 | 46 MB      | 195 MB
 uuid_serial_pkey    |        50.4 | 76 MB      | 150 MB

Higher ratio for "serial", but still lower total index size. :)

Now, the performance of VACUUM is also very interesting here:

# vacuum (verbose, analyze, freeze) uuid_serial;
INFO:  vacuuming "public.uuid_serial"
INFO:  index "uuid_serial_pkey" now contains 2500001 row versions in
19253 pages
DETAIL:  0 index row versions were removed.
9624 index pages have been deleted, 9624 are currently reusable.
CPU: user: 0.03 s, system: 0.01 s, elapsed: 0.05 s.
INFO:  "uuid_serial": found 0 removable, 2500001 nonremovable row
versions in 13515 out of 27028 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 270712
There were 94 unused item pointers.
Skipped 0 pages due to buffer pins, 13513 frozen pages.
0 pages are entirely empty.
CPU: user: 0.37 s, system: 0.16 s, elapsed: 2.83 s.
INFO:  analyzing "public.uuid_serial"
INFO:  "uuid_serial": scanned 27028 of 27028 pages, containing 2500001
live rows and 0 dead rows; 30000 rows in sample, 2500001 estimated total
rows
VACUUM
Time: 3969.812 ms (00:03.970)

# vacuum (verbose, analyze, freeze) uuid_v1;
INFO:  vacuuming "public.uuid_v1"
INFO:  scanned index "uuid_v1_pkey" to remove 2499999 row versions
DETAIL:  CPU: user: 1.95 s, system: 0.13 s, elapsed: 5.09 s
INFO:  "uuid_v1": removed 2499999 row versions in 27028 pages
DETAIL:  CPU: user: 0.22 s, system: 0.26 s, elapsed: 3.93 s
INFO:  index "uuid_v1_pkey" now contains 2500001 row versions in 24991 pages
DETAIL:  2499999 index row versions were removed.
12111 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "uuid_v1": found 1791266 removable, 2500001 nonremovable row
versions in 27028 out of 27028 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 270716
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 2.90 s, system: 0.71 s, elapsed: 14.54 s.
INFO:  analyzing "public.uuid_v1"
INFO:  "uuid_v1": scanned 27028 of 27028 pages, containing 2500001 live
rows and 0 dead rows; 30000 rows in sample, 2500001 estimated total rows
VACUUM
Time: 15702.803 ms (00:15.703)

...almost 4x faster!

Now insert another 20 million:

COPY uuid_serial FROM '...' WITH ( FORMAT text );
COPY 20000000
Time: 76249.142 ms (01:16.249)

COPY uuid_v1 FROM '...' WITH ( FORMAT text );
COPY 20000000
Time: 804291.611 ms (13:24.292)

...more than 10x faster!

...and the resulting bloat (no VACUUM in between):
       idxname       | bloat_ratio | bloat_size | real_size
---------------------+-------------+------------+-----------
 uuid_v1_pkey        |        30.5 | 295 MB     | 966 MB
 uuid_serial_pkey    |         0.9 | 6056 kB    | 677 MB

...still 30% savings in space.


Cheers,

    Ancoron




Re: UUID v1 optimizations...

From
Peter Eisentraut
Date:
On 2019-05-25 15:45, Ancoron Luciferis wrote:
> So, my question now is: Would it make sense for you to handle these
> time-based UUID's differently internally? Specifically un-shuffling the
> timestamp before they are going to storage?

It seems unlikely that we would do that, because that would break
existing stored UUIDs, and we'd also need to figure out a way to store
non-v1 UUIDs under this different scheme.  The use case is pretty narrow
compared to the enormous effort.

It is well understood that using UUIDs or other somewhat-random keys
performs worse than serial-like keys.

Btw., it might be nice to rerun your tests with PostgreSQL 12beta1.  The
btree storage has gotten some improvements.  I don't think it's going to
fundamentally solve your problem, but it would be useful feedback.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: UUID v1 optimizations...

From
Tom Lane
Date:
Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
> So I investigated the PostgreSQL code to see how it is handling UUID's
> with respect to storage, sorting, aso. but all I could find was that it
> basically falls back to the 16-byte.

Yup, they're just blobs to us.

> After struggling to find a way to optimize things inside the database, I
> reverted to introduce a hack into the application by not shuffling the
> timestamp bytes for the UUID's, which makes it look quite serial in
> terms of byte order.

> So, my question now is: Would it make sense for you to handle these
> time-based UUID's differently internally? Specifically un-shuffling the
> timestamp before they are going to storage?

No, because

(1) UUID layout is standardized;

(2) such a change would break on-disk compatibility for existing
databases;

(3) particularly for the case of application-generated UUIDs, we do
not have enough information to know that this would actually do anything
useful;

(4) it in fact *wouldn't* do anything useful, because we'd still have
to sort UUIDs in the same order as today, meaning that btree index behavior
would remain the same as before.  Plus UUID comparison would get a lot
more complicated and slower than it is now.

(5) even if we ignored all that and did it anyway, it would only help
for version-1 UUIDs.  The index performance issue would still remain for
version-4 (random) UUIDs, which are if anything more common than v1.


FWIW, I don't know what tool you're using to get those "bloat" numbers,
but just because somebody calls it bloat doesn't mean that it is.
The normal, steady-state load factor for a btree index is generally
understood to be about 2/3rds, and that looks to be about what
you're getting for the regular-UUID-format index.  The fact that the
serially-loaded index has nearly no dead space is because we hack the
page split logic to make that happen --- but that is a hack, and it's
not without downsides.  It should *not* be taken to be an indication
of what you can expect for any other insertion pattern.

The insertion-speed aspect is a real problem, but the core of that problem
is that use of any sort of standard-format UUID turns applications that
might have had considerable locality of reference into applications that
have none.  If you can manage to keep your whole index in RAM that would
not hurt too much, but as soon as it doesn't fit you have a problem.
When your app has more or less predictable reference patterns it's best
to design a unique key that matches that, instead of expecting that
essentially-random keys will work well.
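
One way to check how much of an index actually sits in shared buffers
(as opposed to the OS page cache) is the contrib pg_buffercache
extension; a sketch, assuming the default 8 kB block size and using the
index name from upthread:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT pg_size_pretty(count(*) * 8192)                  AS cached_in_shared_buffers,
       pg_size_pretty(pg_relation_size('uuid_v1_pkey')) AS index_size
  FROM pg_buffercache b
 WHERE b.relfilenode = pg_relation_filenode('uuid_v1_pkey')
   AND b.reldatabase = (SELECT oid FROM pg_database
                         WHERE datname = current_database());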

Or in short, hacking up the way your app generates UUIDs is exactly
the right way to proceed here.

            regards, tom lane



Re: UUID v1 optimizations...

From
Ancoron Luciferis
Date:
On 25/05/2019 16:19, Peter Eisentraut wrote:
> On 2019-05-25 15:45, Ancoron Luciferis wrote:
>> So, my question now is: Would it make sense for you to handle these
>> time-based UUID's differently internally? Specifically un-shuffling the
>> timestamp before they are going to storage?
> 
> It seems unlikely that we would do that, because that would break
> existing stored UUIDs, and we'd also need to figure out a way to store
> non-v1 UUIDs under this different scheme.  The use case is pretty narrow
> compared to the enormous effort.

I agree, the backwards compatibility really would be a show stopper here.

Regarding the "enormous effort": I was thinking the simplest "solution"
would be to detect the version at the time the internal byte array is
created from the provided representation, which could then directly
produce the unshuffled byte order.

> 
> It is well understood that using UUIDs or other somewhat-random keys
> perform worse than serial-like keys.

Yes, I know. We're using these UUID's for more than just a primary
key, because they also tell us the creation time of the entry and which
"node" of the highly distributed application generated it. It is like an
audit log for us without the need for extra columns, and of fixed size,
which helps performance at the application level as well.

> 
> Btw., it might be nice to rerun your tests with PostgreSQL 12beta1.  The
> btree storage has gotten some improvements.  I don't think it's going to
> fundamentally solve your problem, but it would be useful feedback.
> 

Thank you for the pointer to 12beta1. I've just read about it and it
might help a bit. I'll give it a try for sure and report back.

I also have to rerun those tests against PG 11 anyway.

Cheers,

    Ancoron



Re: UUID v1 optimizations...

From
Ancoron Luciferis
Date:
On 25/05/2019 16:57, Tom Lane wrote:
> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>> So I investigated the PostgreSQL code to see how it is handling UUID's
>> with respect to storage, sorting, aso. but all I could find was that it
>> basically falls back to the 16-byte.
> 
> Yup, they're just blobs to us.
> 
>> After struggling to find a way to optimize things inside the database, I
>> reverted to introduce a hack into the application by not shuffling the
>> timestamp bytes for the UUID's, which makes it look quite serial in
>> terms of byte order.
> 
>> So, my question now is: Would it make sense for you to handle these
>> time-based UUID's differently internally? Specifically un-shuffling the
>> timestamp before they are going to storage?
> 
> No, because
> 
> (1) UUID layout is standardized;

You mean the presentation at the byte-level is. ;)

> 
> (2) such a change would break on-disk compatibility for existing
> databases;

Yes, that certainly is a show-stopper.

> 
> (3) particularly for the case of application-generated UUIDs, we do
> not have enough information to know that this would actually do anything
> useful;

Well, not only is the layout standardized, but there are also semantics
attached to it depending on the version. Specifically, version 1 has:
1. a timestamp
2. a clock sequence
3. a node id

And as PostgreSQL already provides this pretty concrete data type, it
could be a natural consequence to also support its semantics.

E.g. the network types also come with a lot of additional operators and
functions. So I don't see a reason not to respect the additional
capabilities of a UUID.

For other versions of UUID's, functions like timestamp would certainly
not be applicable (return NULL?), respecting their concrete semantics.

> 
> (4) it in fact *wouldn't* do anything useful, because we'd still have
> to sort UUIDs in the same order as today, meaning that btree index behavior
> would remain the same as before.  Plus UUID comparison would get a lot
> more complicated and slower than it is now.

I get your first sentence, but not your second. I know that when
changing the internal byte order we'd have to completely re-compute
everything on-disk (from table to index data), but why would the sorting
in the index have to stay the same?

And actually, the comparison logic wouldn't need to change at all if
the byte order is "changed" at the point where the textual
representation is parsed into the internal UUID byte array.

Function:
string_to_uuid(const char *source, pg_uuid_t *uuid);

^^ here I would apply the change. And of course, reverse it for
generating the textual representation.

That would slow down parsing a bit, but it shouldn't matter overall
because index insertions would be sped up even more.

But the on-disk change is still a show-stopper, I guess.

> 
> (5) even if we ignored all that and did it anyway, it would only help
> for version-1 UUIDs.  The index performance issue would still remain for
> version-4 (random) UUIDs, which are if anything more common than v1.

Yes, I am aware that the changes might be of very limited gain. V4
UUID's are usually used for external identifiers. For internal ones,
they don't make sense to me (too long, issues with randomness/entropy
under high load, ...). ;)

I also recently used these UUID's together with a function for
TimescaleDB auto-partitioning, to provide the timestamp for the
partitioning logic instead of needing a separate timestamp column.

This is also one of the reasons why I was asking for native support
functions to extract the timestamp. I am apparently not very good at C,
so I am currently using Python and/or PL/pgSQL for it, which is pretty slow.

> 
> 
> FWIW, I don't know what tool you're using to get those "bloat" numbers,
> but just because somebody calls it bloat doesn't mean that it is.
> The normal, steady-state load factor for a btree index is generally
> understood to be about 2/3rds, and that looks to be about what
> you're getting for the regular-UUID-format index.  The fact that the
> serially-loaded index has nearly no dead space is because we hack the
> page split logic to make that happen --- but that is a hack, and it's
> not without downsides.  It should *not* be taken to be an indication
> of what you can expect for any other insertion pattern.

OK, understood. I was actually a bit surprised by those numbers myself,
as these "serial" UUID's still only have the timestamp bytes in
ascending order; the clock sequence and node are still pretty random (but
not inside a single transaction, which might help the hack).

> 
> The insertion-speed aspect is a real problem, but the core of that problem
> is that use of any sort of standard-format UUID turns applications that
> might have had considerable locality of reference into applications that
> have none.  If you can manage to keep your whole index in RAM that would
> not hurt too much, but as soon as it doesn't fit you have a problem.
> When your app has more or less predictable reference patterns it's best
> to design a unique key that matches that, instead of expecting that
> essentially-random keys will work well.

The system was configured to have more than enough space for the index
and table data to fit into memory, but I am not sure it actually did.
How can I verify that? An EXPLAIN on the INSERT apparently doesn't
include index insertion.

> 
> Or in short, hacking up the way your app generates UUIDs is exactly
> the right way to proceed here.

OK. Glad to hear that.

One last question, though:
Would it make sense to create a specialized UUID v1 type (e.g. with an
extension) that does the transformation and delegates everything else
to the existing UUID type support?

> 
>             regards, tom lane
> 




Re: UUID v1 optimizations...

From
Vitalii Tymchyshyn
Date:
I am not sure why you would want to change the on-disk storage format.
If we are talking about indexes, it's more about the comparison function
(opclass) that is used in an index.
Am I wrong?

Sat, May 25, 2019 at 11:21 Ancoron Luciferis <ancoron.luciferis@googlemail.com> wrote:
On 25/05/2019 16:57, Tom Lane wrote:
> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>> So I investigated the PostgreSQL code to see how it is handling UUID's
>> with respect to storage, sorting, aso. but all I could find was that it
>> basically falls back to the 16-byte.
>
> Yup, they're just blobs to us.
>
>> After struggling to find a way to optimize things inside the database, I
>> reverted to introduce a hack into the application by not shuffling the
>> timestamp bytes for the UUID's, which makes it look quite serial in
>> terms of byte order.
>
>> So, my question now is: Would it make sense for you to handle these
>> time-based UUID's differently internally? Specifically un-shuffling the
>> timestamp before they are going to storage?
>
> No, because
>
> (1) UUID layout is standardized;

You mean the presentation at the byte-level is. ;)

>
> (2) such a change would break on-disk compatibility for existing
> databases;

Yes, that certainly is a show-stopper.

>
> (3) particularly for the case of application-generated UUIDs, we do
> not have enough information to know that this would actually do anything
> useful;

Well, not only the layout is standardized, but also there is a semantic
to it depending on the version. Specifically for version 1, it has:
1. a timestamp
2. a clock sequence
3. a node id

Ans as PostgreSQL already provides this pretty concrete data type, it
could be a natural consequence to also support the semantic of it.

E.g. the network types also come with a lot of additional operators and
functions. So I don't see a reason not to respect the additional
capabilities of a UUID.

For other versions of UUID's, functions like timestamp would certainly
not be available (return NULL?), respecting the concrete semantic.

>
> (4) it in fact *wouldn't* do anything useful, because we'd still have
> to sort UUIDs in the same order as today, meaning that btree index behavior
> would remain the same as before.  Plus UUID comparison would get a lot
> more complicated and slower than it is now.

I get your first sentence, but not your second. I know that when
changing the internal byte order we'd have to completed re-compute
everything on-disk (from table to index data), but why would the sorting
in the index have to be the same?

And actually, comparison logic wouldn't need to be changed at all if the
byte order is "changed" when the UUID is read in when reading the
representation into the internal UUID's byte array.

Function:
string_to_uuid(const char *source, pg_uuid_t *uuid);

^^ here I would apply the change. And of course, reverse it for
generating the textual representation.

That would slow down writes a bit, but that shouldn't be the case
because index insertions are speed up even more.

But still, on-disk change is still a show-stopper, I guess.

>
> (5) even if we ignored all that and did it anyway, it would only help
> for version-1 UUIDs.  The index performance issue would still remain for
> version-4 (random) UUIDs, which are if anything more common than v1.

Yes, I am aware that the changes might be of very limited gain. V4
UUID's are usually used for external identifiers. For internal ones,
they don't make sense to me (too long, issues with randomness/enthropie
under high load, ...). ;)

I just recently used these UUID's also together with a function for
TimescaleDB auto-partitioning to provide the timestamp for the
partitioning logic instead of the need for a separate timestamp column.

This is also one of the reasons why I was also asking for native support
functions to extract the timestamp. I am apparently not very good at C
so I am currently using Python and/or PgPLSQL for it, which is pretty slow.

>
>
> FWIW, I don't know what tool you're using to get those "bloat" numbers,
> but just because somebody calls it bloat doesn't mean that it is.
> The normal, steady-state load factor for a btree index is generally
> understood to be about 2/3rds, and that looks to be about what
> you're getting for the regular-UUID-format index.  The fact that the
> serially-loaded index has nearly no dead space is because we hack the
> page split logic to make that happen --- but that is a hack, and it's
> not without downsides.  It should *not* be taken to be an indication
> of what you can expect for any other insertion pattern.

OK, understood. I was actually a bit surprised by those numbers myself
as these "serial" UUID's still only have the timestamp bytes in
ascending order, the clock sequence and node is still pretty random (but
not inside a single transaction, which might help the hack).

>
> The insertion-speed aspect is a real problem, but the core of that problem
> is that use of any sort of standard-format UUID turns applications that
> might have had considerable locality of reference into applications that
> have none.  If you can manage to keep your whole index in RAM that would
> not hurt too much, but as soon as it doesn't fit you have a problem.
> When your app has more or less predictable reference patterns it's best
> to design a unique key that matches that, instead of expecting that
> essentially-random keys will work well.

The system was configured to have more than enough space for the index
and table data to fit into memory, but I am not sure. How can I verify
that? An EXPLAIN on the INSERT apparently doesn't include index insertion.

>
> Or in short, hacking up the way your app generates UUIDs is exactly
> the right way to proceed here.

OK. Glad to hear that.

One last question, though:
Would it make sense to create a specialized UUID v1 type (e.g. with an
extension) that does the transformation and delegates for all other
things to the existing UUID type support?

>
>                       regards, tom lane
>


Re: UUID v1 optimizations...

From
Ancoron Luciferis
Date:
On 25/05/2019 21:00, Vitalii Tymchyshyn wrote:
> I am not sure why do you want to change on-disk storage format? If we
> are talking about indexes, it's more about comparison function (opclass)
> that is used in an index. 
> Am I wrong?

I don't "want" to change the on-disk format of the v1 UUID's but that
seems to be the most efficient approach to me (if anyone would ever want
to implement that) as not changing it would mean to introduce the
unshuffling of the timestamp bytes in a lot of other places where it
would decrease the performance instead of increasing it, not to speak of
VACUUM, REINDEX, CLUSTER, ....

One of my use-cases would also benefit a lot here when trying to find
entries that have been created within a given time range. To do that
(without an additional timestamp column), I'd end up with a full index
scan to determine the rows to read as there is no correlation between
those (timestamp and value sort order).

For example I would like to query the DB with:

WHERE id >= '7d9c4835-5378-11ec-0000-000000000000' AND id <
'5cdcac78-5a9a-11ec-0000-000000000000'

...which is impossible atm. I have written myself a helper function that
extracts the timestamp, but it is (of course) executed for each and
every entry while filtering, or (if used as an index expression)
increases the write RTT by quite a lot (if you're dealing with
peak-load style applications like we do):

WHERE time_part_uuidv1(id) >= '2021-12-02 14:02:31.021778+00' AND
time_part_uuidv1(id) < '2021-12-11 15:52:37.107212+00'
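
As an expression index, that would look roughly like the following
sketch (placeholder table name; time_part_uuidv1() has to be declared
IMMUTABLE to be usable in an index, and the planner will only use the
index when the filter repeats the exact same expression, as above):

CREATE INDEX events_uuid_ts_idx ON events (time_part_uuidv1(id));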

So, although possible, it is not really an option for an application
like ours (and we don't just want to throw hardware at it).

I don't know if all that makes any sense to you or if I'm just missing a
piece of the puzzle inside the PostgreSQL code base. If I am, please
tell me. :)

Cheers,

    Ancoron

> 
> Sat, May 25, 2019 at 11:21 Ancoron Luciferis
> <ancoron.luciferis@googlemail.com
> <mailto:ancoron.luciferis@googlemail.com>> wrote:
> 
>     On 25/05/2019 16:57, Tom Lane wrote:
>     > Ancoron Luciferis <ancoron.luciferis@googlemail.com
>     <mailto:ancoron.luciferis@googlemail.com>> writes:
>     >> So I investigated the PostgreSQL code to see how it is handling
>     UUID's
>     >> with respect to storage, sorting, aso. but all I could find was
>     that it
>     >> basically falls back to the 16-byte.
>     >
>     > Yup, they're just blobs to us.
>     >
>     >> After struggling to find a way to optimize things inside the
>     database, I
>     >> reverted to introduce a hack into the application by not
>     shuffling the
>     >> timestamp bytes for the UUID's, which makes it look quite serial in
>     >> terms of byte order.
>     >
>     >> So, my question now is: Would it make sense for you to handle these
>     >> time-based UUID's differently internally? Specifically
>     un-shuffling the
>     >> timestamp before they are going to storage?
>     >
>     > No, because
>     >
>     > (1) UUID layout is standardized;
> 
>     You mean the presentation at the byte-level is. ;)
> 
>     >
>     > (2) such a change would break on-disk compatibility for existing
>     > databases;
> 
>     Yes, that certainly is a show-stopper.
> 
>     >
>     > (3) particularly for the case of application-generated UUIDs, we do
>     > not have enough information to know that this would actually do
>     anything
>     > useful;
> 
>     Well, not only the layout is standardized, but also there is a semantic
>     to it depending on the version. Specifically for version 1, it has:
>     1. a timestamp
>     2. a clock sequence
>     3. a node id
> 
>     Ans as PostgreSQL already provides this pretty concrete data type, it
>     could be a natural consequence to also support the semantic of it.
> 
>     E.g. the network types also come with a lot of additional operators and
>     functions. So I don't see a reason not to respect the additional
>     capabilities of a UUID.
> 
>     For other versions of UUID's, functions like timestamp would certainly
>     not be available (return NULL?), respecting the concrete semantic.
> 
>     >
>     > (4) it in fact *wouldn't* do anything useful, because we'd still have
>     > to sort UUIDs in the same order as today, meaning that btree index
>     behavior
>     > would remain the same as before.  Plus UUID comparison would get a lot
>     > more complicated and slower than it is now.
> 
>     I get your first sentence, but not your second. I know that when
>     changing the internal byte order we'd have to completed re-compute
>     everything on-disk (from table to index data), but why would the sorting
>     in the index have to be the same?
> 
>     And actually, comparison logic wouldn't need to be changed at all if the
>     byte order is "changed" when the UUID is read in when reading the
>     representation into the internal UUID's byte array.
> 
>     Function:
>     string_to_uuid(const char *source, pg_uuid_t *uuid);
> 
>     ^^ here I would apply the change. And of course, reverse it for
>     generating the textual representation.
> 
>     That would slow down writes a bit, but that shouldn't be the case
>     because index insertions are speed up even more.
> 
>     But still, on-disk change is still a show-stopper, I guess.
> 
>     >
>     > (5) even if we ignored all that and did it anyway, it would only help
>     > for version-1 UUIDs.  The index performance issue would still
>     remain for
>     > version-4 (random) UUIDs, which are if anything more common than v1.
> 
>     Yes, I am aware that the changes might be of very limited gain. V4
>     UUID's are usually used for external identifiers. For internal ones,
>     they don't make sense to me (too long, issues with randomness/enthropie
>     under high load, ...). ;)
> 
>     I just recently used these UUID's also together with a function for
>     TimescaleDB auto-partitioning to provide the timestamp for the
>     partitioning logic instead of the need for a separate timestamp column.
> 
>     This is also one of the reasons why I was also asking for native support
>     functions to extract the timestamp. I am apparently not very good at C
>     so I am currently using Python and/or PgPLSQL for it, which is
>     pretty slow.
> 
>     >
>     >
>     > FWIW, I don't know what tool you're using to get those "bloat"
>     numbers,
>     > but just because somebody calls it bloat doesn't mean that it is.
>     > The normal, steady-state load factor for a btree index is generally
>     > understood to be about 2/3rds, and that looks to be about what
>     > you're getting for the regular-UUID-format index.  The fact that the
>     > serially-loaded index has nearly no dead space is because we hack the
>     > page split logic to make that happen --- but that is a hack, and it's
>     > not without downsides.  It should *not* be taken to be an indication
>     > of what you can expect for any other insertion pattern.
> 
>     OK, understood. I was actually a bit surprised by those numbers myself
>     as these "serial" UUID's still only have the timestamp bytes in
>     ascending order, the clock sequence and node is still pretty random (but
>     not inside a single transaction, which might help the hack).
> 
>     >
>     > The insertion-speed aspect is a real problem, but the core of that
>     problem
>     > is that use of any sort of standard-format UUID turns applications
>     that
>     > might have had considerable locality of reference into
>     applications that
>     > have none.  If you can manage to keep your whole index in RAM that
>     would
>     > not hurt too much, but as soon as it doesn't fit you have a problem.
>     > When your app has more or less predictable reference patterns it's
>     best
>     > to design a unique key that matches that, instead of expecting that
>     > essentially-random keys will work well.
> 
>     The system was configured to have more than enough space for the index
>     and table data to fit into memory, but I am not sure. How can I verify
>     that? An EXPLAIN on the INSERT apparently doesn't include index
>     insertion.
> 
>     >
>     > Or in short, hacking up the way your app generates UUIDs is exactly
>     > the right way to proceed here.
> 
>     OK. Glad to hear that.
> 
>     One last question, though:
>     Would it make sense to create a specialized UUID v1 type (e.g. with an
>     extension) that does the transformation and delegates for all other
>     things to the existing UUID type support?
> 
>     >
>     >                       regards, tom lane
>     >
> 
> 




Re: UUID v1 optimizations...

From
Tom Lane
Date:
Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
> On 25/05/2019 16:57, Tom Lane wrote:
>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>> to sort UUIDs in the same order as today, meaning that btree index behavior
>> would remain the same as before.  Plus UUID comparison would get a lot
>> more complicated and slower than it is now.

> I get your first sentence, but not your second. I know that when
> changing the internal byte order we'd have to completed re-compute
> everything on-disk (from table to index data), but why would the sorting
> in the index have to be the same?

Because we aren't going to change the existing sort order of UUIDs.
We have no idea what applications might be dependent on that.

As Vitalii correctly pointed out, your beef is not with the physical
storage of UUIDs anyway: you just wish they'd sort differently, since
that is what determines the behavior of a btree index.  But we aren't
going to change the sort ordering because that's an even bigger
compatibility break than changing the physical storage; it'd affect
application-visible semantics.

What you might want to think about is creating a function that maps
UUIDs into an ordering that makes sense to you, and then creating
a unique index over that function instead of the raw UUIDs.  That
would give the results you want without having to negotiate with the
rest of the world about whether it's okay to change the semantics
of type uuid.
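
A minimal sketch of that approach, assuming an IMMUTABLE mapping
function such as the uuid_v1_to_serial() example from earlier in the
thread (table and index names are placeholders):

CREATE UNIQUE INDEX events_id_by_time_idx ON events (uuid_v1_to_serial(id));

-- queries have to reference the same expression for this index to be
-- used, e.g.:
--   WHERE uuid_v1_to_serial(id) >= uuid_v1_to_serial('<lower bound>')
--     AND uuid_v1_to_serial(id) <  uuid_v1_to_serial('<upper bound>')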

            regards, tom lane



Re: UUID v1 optimizations...

From
Tomas Vondra
Date:
On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>> On 25/05/2019 16:57, Tom Lane wrote:
>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>> to sort UUIDs in the same order as today, meaning that btree index behavior
>>> would remain the same as before.  Plus UUID comparison would get a lot
>>> more complicated and slower than it is now.
>
>> I get your first sentence, but not your second. I know that when
>> changing the internal byte order we'd have to completed re-compute
>> everything on-disk (from table to index data), but why would the sorting
>> in the index have to be the same?
>
>Because we aren't going to change the existing sort order of UUIDs.
>We have no idea what applications might be dependent on that.
>
>As Vitalii correctly pointed out, your beef is not with the physical
>storage of UUIDs anyway: you just wish they'd sort differently, since
>that is what determines the behavior of a btree index.  But we aren't
>going to change the sort ordering because that's an even bigger
>compatibility break than changing the physical storage; it'd affect
>application-visible semantics.
>
>What you might want to think about is creating a function that maps
>UUIDs into an ordering that makes sense to you, and then creating
>a unique index over that function instead of the raw UUIDs.  That
>would give the results you want without having to negotiate with the
>rest of the world about whether it's okay to change the semantics
>of type uuid.
>

FWIW that's essentially what I implemented as an extension some time
ago. See [1] for a more detailed explanation and some benchmarks.

The thing is - it's not really desirable to get perfectly ordered
ordering, because that would mean we never get back to older parts of
the index (so if you delete data, we'd never fill that space).

[1] https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: UUID v1 optimizations...

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>> What you might want to think about is creating a function that maps
>> UUIDs into an ordering that makes sense to you, and then creating
>> a unique index over that function instead of the raw UUIDs.  That
>> would give the results you want without having to negotiate with the
>> rest of the world about whether it's okay to change the semantics
>> of type uuid.

> FWIW that's essentially what I implemented as an extension some time
> ago. See [1] for a more detailed explanation and some benchmarks.

Also, another way to attack this is to create a new set of ordering
operators for UUID and an associated non-default btree opclass.
Then you declare your index using that opclass and you're done.
The key advantage of this way is that the regular UUID equality
operator can still be a member of that opclass, meaning that searches
of the form "uuidcol = constant" can still use this index, so you
don't have to change your queries (at least not for that common case).
Look at the interrelationship of the regular text btree operators and
the "pattern_ops" btree operators for a precedent.

            regards, tom lane



Re: UUID v1 optimizations...

From
Ancoron Luciferis
Date:
On 25/05/2019 23:54, Tom Lane wrote:
> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>> On 25/05/2019 16:57, Tom Lane wrote:
>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>> to sort UUIDs in the same order as today, meaning that btree index behavior
>>> would remain the same as before.  Plus UUID comparison would get a lot
>>> more complicated and slower than it is now.
> 
>> I get your first sentence, but not your second. I know that when
>> changing the internal byte order we'd have to completed re-compute
>> everything on-disk (from table to index data), but why would the sorting
>> in the index have to be the same?
> 
> Because we aren't going to change the existing sort order of UUIDs.
> We have no idea what applications might be dependent on that.
> 
> As Vitalii correctly pointed out, your beef is not with the physical
> storage of UUIDs anyway: you just wish they'd sort differently, since
> that is what determines the behavior of a btree index.  But we aren't
> going to change the sort ordering because that's an even bigger
> compatibility break than changing the physical storage; it'd affect
> application-visible semantics.
> 
> What you might want to think about is creating a function that maps
> UUIDs into an ordering that makes sense to you, and then creating
> a unique index over that function instead of the raw UUIDs.  That
> would give the results you want without having to negotiate with the
> rest of the world about whether it's okay to change the semantics
> of type uuid.
> 
>             regards, tom lane
> 

I understand. Point taken. I really didn't think about the possibility
that someone could depend on the index order of a (pretty random) UUID.

The whole point of me starting this discussion was performance in
multiple areas. INSERT performance was really becoming an issue for us,
apart from the index bloat, which on several occasions was way higher
than just 30% (including out-of-disk-space in the early days), and apart
from the fact that the index was regularly dismissed due to not being
in memory. In that sense, just creating additional indexes with
functions doesn't really solve the core issues that we had.

Not to mention the performance of VACUUM, among other things.

So, even though we currently "solved" a lot of these issues at the
application level, we now have UUID's that look like v1 UUID's but in
fact will not be readable (in the representation as returned by
PostgreSQL) by any other application that doesn't know our specific
implementation. This forces us to hack other tools written in other
languages that would otherwise understand and handle regular v1 UUID's
as well.

I should add that the tests I have made were all running on dedicated
SSD's, one for the table data, one for the indexes and one for the WAL.
If those were running against the same disks, the difference would
probably be much higher during writes.

I'll think about creating an extension to provide a custom data type
instead. So nobody would be at risk and anyone would decide explicitly
for it with all consequences.

Thank you for your time and precious input. :)


Cheers,

    Ancoron



Re: UUID v1 optimizations...

From
Tomas Vondra
Date:
On Sat, May 25, 2019 at 06:38:08PM -0400, Tom Lane wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>> What you might want to think about is creating a function that maps
>>> UUIDs into an ordering that makes sense to you, and then creating
>>> a unique index over that function instead of the raw UUIDs.  That
>>> would give the results you want without having to negotiate with the
>>> rest of the world about whether it's okay to change the semantics
>>> of type uuid.
>
>> FWIW that's essentially what I implemented as an extension some time
>> ago. See [1] for a more detailed explanation and some benchmarks.
>
>Also, another way to attack this is to create a new set of ordering
>operators for UUID and an associated non-default btree opclass.
>Then you declare your index using that opclass and you're done.
>The key advantage of this way is that the regular UUID equality
>operator can still be a member of that opclass, meaning that searches
>of the form "uuidcol = constant" can still use this index, so you
>don't have to change your queries (at least not for that common case).
>Look at the interrelationship of the regular text btree operators and
>the "pattern_ops" btree operators for a precedent.
>

Perhaps. But it does not allow tuning how often the values "wrap",
which I think is a useful capability.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: UUID v1 optimizations...

From
Ancoron Luciferis
Date:
On 26/05/2019 00:14, Tomas Vondra wrote:
> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>>> to sort UUIDs in the same order as today, meaning that btree index
>>>> behavior
>>>> would remain the same as before.  Plus UUID comparison would get a lot
>>>> more complicated and slower than it is now.
>>
>>> I get your first sentence, but not your second. I know that when
>>> changing the internal byte order we'd have to completed re-compute
>>> everything on-disk (from table to index data), but why would the sorting
>>> in the index have to be the same?
>>
>> Because we aren't going to change the existing sort order of UUIDs.
>> We have no idea what applications might be dependent on that.
>>
>> As Vitalii correctly pointed out, your beef is not with the physical
>> storage of UUIDs anyway: you just wish they'd sort differently, since
>> that is what determines the behavior of a btree index.  But we aren't
>> going to change the sort ordering because that's an even bigger
>> compatibility break than changing the physical storage; it'd affect
>> application-visible semantics.
>>
>> What you might want to think about is creating a function that maps
>> UUIDs into an ordering that makes sense to you, and then creating
>> a unique index over that function instead of the raw UUIDs.  That
>> would give the results you want without having to negotiate with the
>> rest of the world about whether it's okay to change the semantics
>> of type uuid.
>>
> 
> FWIW that's essentially what I implemented as an extension some time
> ago. See [1] for a more detailed explanation and some benchmarks.

Yes, I've seen that before. Pretty nice work you put together there, and
I'll surely have a look at it, but we certainly need the node id in
compliance with v1 UUID's, so that's why we've been generating UUID's at
the application side from day 1.

> 
> The thing is - it's not really desirable to get perfectly ordered
> ordering, because that would mean we never get back to older parts of
> the index (so if you delete data, we'd never fill that space).

Wouldn't this apply also to any sequential-looking index (e.g. on
serial)? The main issue with the UUID's is that they almost instantly
consume a big part of the total value space (e.g. the first value is
'01...' and the second comes as 'f3...'), which I would assume is not
very efficient with btrees (space reclaim? - bloat).

One of our major concerns is to keep index size small (VACUUM can't be
run every minute) to fit into memory next to a lot of others.

I've experimented with the rollover "prefix" myself but found that it
makes the index too big (same or larger index size than standard v1
UUID's) and VACUUM too slow (almost as slow as for a standard v1 UUID),
although INSERT performance wasn't that bad; our sequential UUID's were
way faster (at least pre-generated and imported with COPY to eliminate
any value-generation impact).

Cheers,

    Ancoron

> 
> [1] https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/
> 
> 
> regards
> 




Re: UUID v1 optimizations...

From
Tomas Vondra
Date:
On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote:
>On 26/05/2019 00:14, Tomas Vondra wrote:
>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>>>> to sort UUIDs in the same order as today, meaning that btree index
>>>>> behavior
>>>>> would remain the same as before.  Plus UUID comparison would get a lot
>>>>> more complicated and slower than it is now.
>>>
>>>> I get your first sentence, but not your second. I know that when
>>>> changing the internal byte order we'd have to completed re-compute
>>>> everything on-disk (from table to index data), but why would the sorting
>>>> in the index have to be the same?
>>>
>>> Because we aren't going to change the existing sort order of UUIDs.
>>> We have no idea what applications might be dependent on that.
>>>
>>> As Vitalii correctly pointed out, your beef is not with the physical
>>> storage of UUIDs anyway: you just wish they'd sort differently, since
>>> that is what determines the behavior of a btree index.  But we aren't
>>> going to change the sort ordering because that's an even bigger
>>> compatibility break than changing the physical storage; it'd affect
>>> application-visible semantics.
>>>
>>> What you might want to think about is creating a function that maps
>>> UUIDs into an ordering that makes sense to you, and then creating
>>> a unique index over that function instead of the raw UUIDs.  That
>>> would give the results you want without having to negotiate with the
>>> rest of the world about whether it's okay to change the semantics
>>> of type uuid.
>>>
>>
>> FWIW that's essentially what I implemented as an extension some time
>> ago. See [1] for a more detailed explanation and some benchmarks.
>
>Yes, I've seen that before. Pretty nice work you but together there and
>I'll surely have a look at it but we certainly need the node id in
>compliance with v1 UUID's so that's why we've been generating UUID's at
>the application side from day 1.
>
>>
>> The thing is - it's not really desirable to get perfectly ordered
>> ordering, because that would mean we never get back to older parts of
>> the index (so if you delete data, we'd never fill that space).
>
>Wouldn't this apply also to any sequential-looking index (e.g. on
>serial)?

Yes, it does apply to any index on sequential (ordered) data. If you
delete data from the "old" part (but not all of it, so the pages don't
get completely empty), that space is lost. It's available for new data,
but if we only insert into the "new" part of the index, that's useless.

> The main issue with the UUID's is that it almost instantly
>consumes a big part of the total value space (e.g. first value is
>'01...' and second is coming as 'f3...') which I would assume not being
>very efficient with btrees (space reclaim? - bloat).
>

I don't understand what you mean here. Perhaps you misunderstand how
btree indexes grow? It's not like we allocate separate pages for
different values/prefixes - we insert the data until a page gets full,
then it's split in half. There is some dependency on the order in which
the values are inserted, but AFAIK random order is generally fine.

>One of our major concerns is to keep index size small (VACUUM can't be
>run every minute) to fit into memory next to a lot of others.
>

I don't think this has much to do with vacuum - I don't see how it's
related to the ordering of generated UUID values. And I don't see where
the "can't run vacuum every minute" comes from.

>I've experimented with the rollover "prefix" myself but found that it
>makes the index too big (same or larger index size than standard v1
>UUIDs) and VACUUM too slow (almost as slow as a standard V1 UUID),
>although INSERT performance wasn't that bad, our sequential UUID's where
>way faster (at least pre-generated and imported with COPY to eliminate
>any value generation impact).
>

I very much doubt that has anything to do with the prefix. You'll need
to share more details about how you did your tests.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: UUID v1 optimizations...

From
Morris de Oryx
Date:
I'm not worthy to post here, but a bit of a random thought.

If I've followed the conversation correctly, the reason for a V1 UUID is partly to order and partition rows by a timestamp value, but without the cost of a timestamp column. As I was told as a boy, "Smart numbers aren't." Is it _absolutely_ the case that you can't afford another column? I don't know the ins and outs of the Postgres row format, but my impression is that it's a fixed size, so you may be able to add the column without splitting rows? Anyway, even if that's not true and the extra column costs you disk space, is it the index that concerns you?  Have you considered a timestamp column, or a numeric column with an epoch offset, and a BRIN index? If your insert data is in pretty much chronological order, that might work well for you.
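
To make the idea concrete, a rough sketch of what I mean (the created_at
column name is just illustrative, and I'm reusing the uuid_v1 table name
from your earlier tests):

ALTER TABLE uuid_v1 ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();

CREATE INDEX uuid_v1_created_brin ON uuid_v1 USING brin (created_at);

-- range searches by creation time can then use the BRIN index
SELECT count(*) FROM uuid_v1
 WHERE created_at >= '2019-03-01' AND created_at < '2019-04-01';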

Best of luck, I've enjoyed following the commentary.


On Sun, May 26, 2019 at 11:09 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote:
>On 26/05/2019 00:14, Tomas Vondra wrote:
>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>>>> to sort UUIDs in the same order as today, meaning that btree index
>>>>> behavior
>>>>> would remain the same as before.  Plus UUID comparison would get a lot
>>>>> more complicated and slower than it is now.
>>>
>>>> I get your first sentence, but not your second. I know that when
>>>> changing the internal byte order we'd have to completed re-compute
>>>> everything on-disk (from table to index data), but why would the sorting
>>>> in the index have to be the same?
>>>
>>> Because we aren't going to change the existing sort order of UUIDs.
>>> We have no idea what applications might be dependent on that.
>>>
>>> As Vitalii correctly pointed out, your beef is not with the physical
>>> storage of UUIDs anyway: you just wish they'd sort differently, since
>>> that is what determines the behavior of a btree index.  But we aren't
>>> going to change the sort ordering because that's an even bigger
>>> compatibility break than changing the physical storage; it'd affect
>>> application-visible semantics.
>>>
>>> What you might want to think about is creating a function that maps
>>> UUIDs into an ordering that makes sense to you, and then creating
>>> a unique index over that function instead of the raw UUIDs.  That
>>> would give the results you want without having to negotiate with the
>>> rest of the world about whether it's okay to change the semantics
>>> of type uuid.
>>>
>>
>> FWIW that's essentially what I implemented as an extension some time
>> ago. See [1] for a more detailed explanation and some benchmarks.
>
>Yes, I've seen that before. Pretty nice work you but together there and
>I'll surely have a look at it but we certainly need the node id in
>compliance with v1 UUID's so that's why we've been generating UUID's at
>the application side from day 1.
>
>>
>> The thing is - it's not really desirable to get perfectly ordered
>> ordering, because that would mean we never get back to older parts of
>> the index (so if you delete data, we'd never fill that space).
>
>Wouldn't this apply also to any sequential-looking index (e.g. on
>serial)?

Yes, it does apply to any index on sequential (ordered) data. If you
delete data from the "old" part (but not all, so the pages don't get
completely empty), that space is lost. It's available for new data, but
if we only insert to "new" part of the index, that's useless.

> The main issue with the UUID's is that it almost instantly
>consumes a big part of the total value space (e.g. first value is
>'01...' and second is coming as 'f3...') which I would assume not being
>very efficient with btrees (space reclaim? - bloat).
>

I don't understand what you mean here. Perhaps you misunderstand how
btree indexes grow? It's not like we allocate separate pages for
different values/prefixes - we insert the data until a page gets full,
then it's split in half. There is some dependency on the order in which
the values are inserted, but AFAIK random order is generally fine.

>One of our major concerns is to keep index size small (VACUUM can't be
>run every minute) to fit into memory next to a lot of others.
>

I don't think this has much to do with vacuum - I don't see how it's
related to the ordering of generated UUID values. And I don't see where
the "can't run vacuum every minute" comes from.

>I've experimented with the rollover "prefix" myself but found that it
>makes the index too big (same or larger index size than standard v1
>UUIDs) and VACUUM too slow (almost as slow as a standard V1 UUID),
>although INSERT performance wasn't that bad, our sequential UUID's where
>way faster (at least pre-generated and imported with COPY to eliminate
>any value generation impact).
>

I very much doubt that has anything to do with the prefix. You'll need
to share more details about how you did your tests.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: UUID v1 optimizations...

From
Ancoron Luciferis
Date:
On 26/05/2019 03:09, Tomas Vondra wrote:
> On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote:
>> On 26/05/2019 00:14, Tomas Vondra wrote:
>>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>>> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>>>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>>>>> to sort UUIDs in the same order as today, meaning that btree index
>>>>>> behavior
>>>>>> would remain the same as before.  Plus UUID comparison would get a
>>>>>> lot
>>>>>> more complicated and slower than it is now.
>>>>
>>>>> I get your first sentence, but not your second. I know that when
>>>>> changing the internal byte order we'd have to completed re-compute
>>>>> everything on-disk (from table to index data), but why would the
>>>>> sorting
>>>>> in the index have to be the same?
>>>>
>>>> Because we aren't going to change the existing sort order of UUIDs.
>>>> We have no idea what applications might be dependent on that.
>>>>
>>>> As Vitalii correctly pointed out, your beef is not with the physical
>>>> storage of UUIDs anyway: you just wish they'd sort differently, since
>>>> that is what determines the behavior of a btree index.  But we aren't
>>>> going to change the sort ordering because that's an even bigger
>>>> compatibility break than changing the physical storage; it'd affect
>>>> application-visible semantics.
>>>>
>>>> What you might want to think about is creating a function that maps
>>>> UUIDs into an ordering that makes sense to you, and then creating
>>>> a unique index over that function instead of the raw UUIDs.  That
>>>> would give the results you want without having to negotiate with the
>>>> rest of the world about whether it's okay to change the semantics
>>>> of type uuid.
>>>>
>>>
>>> FWIW that's essentially what I implemented as an extension some time
>>> ago. See [1] for a more detailed explanation and some benchmarks.
>>
>> Yes, I've seen that before. Pretty nice work you but together there and
>> I'll surely have a look at it but we certainly need the node id in
>> compliance with v1 UUID's so that's why we've been generating UUID's at
>> the application side from day 1.
>>
>>>
>>> The thing is - it's not really desirable to get perfectly ordered
>>> ordering, because that would mean we never get back to older parts of
>>> the index (so if you delete data, we'd never fill that space).
>>
>> Wouldn't this apply also to any sequential-looking index (e.g. on
>> serial)?
> 
> Yes, it does apply to any index on sequential (ordered) data. If you
> delete data from the "old" part (but not all, so the pages don't get
> completely empty), that space is lost. It's available for new data, but
> if we only insert to "new" part of the index, that's useless.

OK, thanks for clearing that up for me.

> 
>> The main issue with the UUID's is that it almost instantly
>> consumes a big part of the total value space (e.g. first value is
>> '01...' and second is coming as 'f3...') which I would assume not being
>> very efficient with btrees (space reclaim? - bloat).
>>
> 
> I don't understand what you mean here. Perhaps you misunderstand how
> btree indexes grow? It's not like we allocate separate pages for
> different values/prefixes - we insert the data until a page gets full,
> then it's split in half. There is some dependency on the order in which
> the values are inserted, but AFAIK random order is generally fine.

OK, I might not understand the basics of the btree implementation. Sorry
for that.

However, one of our issues with standard v1 UUID's was bloat of the
indexes, although we kept only a few months of data in them. I think this
was due to the pages still containing at least one value and hence not
being reclaimed by vacuum. It just continued to grow.

Now, as we have this different ever-increasing prefix, we still see some
constant growth, but whenever historic data gets deleted (in a separate
process), space gets reclaimed.


> 
>> One of our major concerns is to keep index size small (VACUUM can't be
>> run every minute) to fit into memory next to a lot of others.
>>
> 
> I don't think this has much to do with vacuum - I don't see how it's
> related to the ordering of generated UUID values. And I don't see where
> the "can't run vacuum every minute" comes from.

OK, here are the numbers (after VACUUM) which I found really strange,
using the query from pgexperts [1]:

     index_name      | bloat_pct | bloat_mb | index_mb | table_mb
---------------------+-----------+----------+----------+----------
 uuid_v1_pkey        |        38 |      363 |  965.742 |  950.172
 uuid_serial_pkey    |        11 |       74 |  676.844 |  950.172
 uuid_serial_8_pkey  |        46 |      519 | 1122.031 |  950.172
 uuid_serial_16_pkey |        39 |      389 |  991.195 |  950.172

...where the "8" and "16" is a "shift" of the timestamp value,
implemented with:

timestamp = (timestamp >>> (60 - shift)) | (timestamp << (shift + 4))

If someone could shed some light on why that is (the huge difference in
index sizes) I'd be happy.

> 
>> I've experimented with the rollover "prefix" myself but found that it
>> makes the index too big (same or larger index size than standard v1
>> UUIDs) and VACUUM too slow (almost as slow as a standard V1 UUID),
>> although INSERT performance wasn't that bad, our sequential UUID's where
>> way faster (at least pre-generated and imported with COPY to eliminate
>> any value generation impact).
>>
> 
> I very much doubt that has anything to do with the prefix. You'll need
> to share more details about how you did your tests.

OK, I'll see if I can prepare something and publish it.

> 
> 
> regards
> 


Refs:
[1]
https://github.com/pgexperts/pgx_scripts/blob/master/bloat/index_bloat_check.sql



Re: UUID v1 optimizations...

From
Ancoron Luciferis
Date:
On 26/05/2019 06:27, Morris de Oryx wrote:
> I'm not worthy to post here, but a bit of a random thought.
> 
> If I've followed the conversation correctly, the reason for a V1 UUID is
> partly to order and partition rows by a timestamp value, but without the
> cost of a timestamp column. As I was told as a boy, "Smart numbers
> aren't." Is it _absolutely_ the case that you can't afford another
> column? I don't know the ins and outs of the Postgres row format, but my
> impression is that it's a fixed size, so you may be able to add the
> column without splitting rows? Anyway, even if that's not true and the
> extra column costs you disk space, is it the index that concerns you? 
> Have you considered a timestamp column, or a numeric column with an
> epoch offset, and a BRIN index? If you insert data is in pretty much
> chronological order, that might work well for you.

Exactly. We are using the actual information contained within a v1 UUID
in multiple places and would like to avoid redundant information in the
database, as we strive to keep as much of it in memory as possible.
Partitioning as well as timestamp (range) searching and sorting are
pretty common things for us.

For us, it's not an absolute "no-go" to have an additional column, but
the v1 UUID already guarantees us uniqueness across all partitions, is
the internal primary key, and carries additional information we are
using (creation time, node).

In addition, the extra column would need yet another index which brings
our write performance down again. So, while it would improve reading,
we're currently (still) more concerned about the write performance.

The BRIN index is something I might need to test, though.

> 
> Best of luck, I've enjoyed following the commentary.
> 
> 
> On Sun, May 26, 2019 at 11:09 AM Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> 
>     On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote:
>     >On 26/05/2019 00:14, Tomas Vondra wrote:
>     >> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>     >>> Ancoron Luciferis <ancoron.luciferis@googlemail.com
>     <mailto:ancoron.luciferis@googlemail.com>> writes:
>     >>>> On 25/05/2019 16:57, Tom Lane wrote:
>     >>>>> (4) it in fact *wouldn't* do anything useful, because we'd
>     still have
>     >>>>> to sort UUIDs in the same order as today, meaning that btree index
>     >>>>> behavior
>     >>>>> would remain the same as before.  Plus UUID comparison would
>     get a lot
>     >>>>> more complicated and slower than it is now.
>     >>>
>     >>>> I get your first sentence, but not your second. I know that when
>     >>>> changing the internal byte order we'd have to completed re-compute
>     >>>> everything on-disk (from table to index data), but why would
>     the sorting
>     >>>> in the index have to be the same?
>     >>>
>     >>> Because we aren't going to change the existing sort order of UUIDs.
>     >>> We have no idea what applications might be dependent on that.
>     >>>
>     >>> As Vitalii correctly pointed out, your beef is not with the physical
>     >>> storage of UUIDs anyway: you just wish they'd sort differently,
>     since
>     >>> that is what determines the behavior of a btree index.  But we
>     aren't
>     >>> going to change the sort ordering because that's an even bigger
>     >>> compatibility break than changing the physical storage; it'd affect
>     >>> application-visible semantics.
>     >>>
>     >>> What you might want to think about is creating a function that maps
>     >>> UUIDs into an ordering that makes sense to you, and then creating
>     >>> a unique index over that function instead of the raw UUIDs.  That
>     >>> would give the results you want without having to negotiate with the
>     >>> rest of the world about whether it's okay to change the semantics
>     >>> of type uuid.
>     >>>
>     >>
>     >> FWIW that's essentially what I implemented as an extension some time
>     >> ago. See [1] for a more detailed explanation and some benchmarks.
>     >
>     >Yes, I've seen that before. Pretty nice work you but together there and
>     >I'll surely have a look at it but we certainly need the node id in
>     >compliance with v1 UUID's so that's why we've been generating UUID's at
>     >the application side from day 1.
>     >
>     >>
>     >> The thing is - it's not really desirable to get perfectly ordered
>     >> ordering, because that would mean we never get back to older parts of
>     >> the index (so if you delete data, we'd never fill that space).
>     >
>     >Wouldn't this apply also to any sequential-looking index (e.g. on
>     >serial)?
> 
>     Yes, it does apply to any index on sequential (ordered) data. If you
>     delete data from the "old" part (but not all, so the pages don't get
>     completely empty), that space is lost. It's available for new data, but
>     if we only insert to "new" part of the index, that's useless.
> 
>     > The main issue with the UUID's is that it almost instantly
>     >consumes a big part of the total value space (e.g. first value is
>     >'01...' and second is coming as 'f3...') which I would assume not being
>     >very efficient with btrees (space reclaim? - bloat).
>     >
> 
>     I don't understand what you mean here. Perhaps you misunderstand how
>     btree indexes grow? It's not like we allocate separate pages for
>     different values/prefixes - we insert the data until a page gets full,
>     then it's split in half. There is some dependency on the order in which
>     the values are inserted, but AFAIK random order is generally fine.
> 
>     >One of our major concerns is to keep index size small (VACUUM can't be
>     >run every minute) to fit into memory next to a lot of others.
>     >
> 
>     I don't think this has much to do with vacuum - I don't see how it's
>     related to the ordering of generated UUID values. And I don't see where
>     the "can't run vacuum every minute" comes from.
> 
>     >I've experimented with the rollover "prefix" myself but found that it
>     >makes the index too big (same or larger index size than standard v1
>     >UUIDs) and VACUUM too slow (almost as slow as a standard V1 UUID),
>     >although INSERT performance wasn't that bad, our sequential UUID's
>     where
>     >way faster (at least pre-generated and imported with COPY to eliminate
>     >any value generation impact).
>     >
> 
>     I very much doubt that has anything to do with the prefix. You'll need
>     to share more details about how you did your tests.
> 
> 
>     regards
> 
>     -- 
>     Tomas Vondra                  http://www.2ndQuadrant.com
>     PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 
> 




Re: UUID v1 optimizations...

From
Morris de Oryx
Date:
On Sun, May 26, 2019 at 7:38 PM Ancoron Luciferis <ancoron.luciferis@googlemail.com> wrote:

The BRIN index is something I might need to test, though.

Yes, check that out, it might give you some ideas. A B-tree (in whatever variant) is inherently a large index type. B-trees are ideal for finding unique values quickly, not ideal for storing redundant values, and pretty decent at finding ranges. A BRIN (Block Range Index), as implemented in Postgres, is good for finding unique values and ranges. But here's the thing: a BRIN index takes some absurdly small % of the space of a B-tree. You have to blink and check again to be sure you've figured it right.

How can a BRIN index be so much smaller? By throwing virtually everything out. A BRIN index doesn't store all of the values in a page, it stores the min/max value and that's it. So it's a probabilistic index (of sorts). Is the value you're seeking on such-and-such a page? The answer is "No" or "Maybe." That's a fast test on a very cheap data structure.

When the answer is "maybe", the full page has to be loaded and scanned to determine if a specific value is found. So, you have a very small index structure, but have to do more sequential scanning to determine if a record is indexed in that page or not. In the real world, this can work out to be a high-performance structure at very low cost. But for it to work, your records need to be physically ordered (CLUSTER) by the condition in that index. And, going forward, you ought to be inserting in order too (more or less). So, a BRIN index is a great option if you have an insertion pattern that allows it to remain efficient, and if your goal is range searching without a heavy B-tree index to maintain.
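
If you want to see the size difference for yourself, something like this
sketch is enough (continuing the illustrative created_at column and BRIN
index from my earlier mail):

-- compare the footprint of the UUID primary key btree and the BRIN index
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_user_indexes
 WHERE relname = 'uuid_v1'
   AND indexrelname IN ('uuid_v1_pkey', 'uuid_v1_created_brin');

-- the summary granularity can be tuned; the default is 128 heap pages per range:
-- CREATE INDEX ... USING brin (created_at) WITH (pages_per_range = 32);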

I have no clue how BRIN indexes and partitioning interact.

Re: UUID v1 optimizations...

From
Tomas Vondra
Date:
On Sun, May 26, 2019 at 02:27:05PM +1000, Morris de Oryx wrote:
>I'm not worthy to post here, but a bit of a random thought.
>
>If I've followed the conversation correctly, the reason for a V1 UUID is
>partly to order and partition rows by a timestamp value, but without the
>cost of a timestamp column. As I was told as a boy, "Smart numbers aren't."
>Is it _absolutely_ the case that you can't afford another column? I don't
>know the ins and outs of the Postgres row format, but my impression is that
>it's a fixed size, so you may be able to add the column without splitting
>rows? Anyway, even if that's not true and the extra column costs you disk
>space, is it the index that concerns you?  Have you considered a timestamp
>column, or a numeric column with an epoch offset, and a BRIN index? If you
>insert data is in pretty much chronological order, that might work well for
>you.
>
>Best of luck, I've enjoyed following the commentary.
>

No, an extra column is not a solution, because it has no impact on the
index on the UUID column. One of the problems with indexes on random
data is that the entries go to random parts of the index. In the extreme
case, each index insert goes to a different index page (one not touched
since the last checkpoint) and therefore has to write the whole page into the WAL.
That's what full-page writes do. This inflates the amount of WAL, may
trigger more frequent checkpoints and (of course) reduces the cache hit
ratio for index pages (because we have to touch many of them).

The point of generating UUIDs in a more sequential way is to limit this
behavior by "concentrating" the index inserts into a smaller part of the
index. That's why indexes on sequential data (say, generated from a
SERIAL column) perform better.
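
A crude way to observe this, if you're interested: note the WAL position
before and after the same insert batch against the random and sequential
tables and compare the difference (sketch, assuming the table names from
earlier in the thread; the LSN shown is just an example value):

-- before the batch
SELECT pg_current_wal_lsn();    -- say it returns 0/3B000028

-- ... run the identical INSERT batch against uuid_v1 resp. uuid_serial ...

-- WAL generated by the batch
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3B000028'::pg_lsn));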


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: UUID v1 optimizations...

From
Morris de Oryx
Date:
Here's what I was thinking of regarding disk space:


That's the kind of low-level detail I try not to worry about, but like to have in the back of my mind at least a little. If I read it correctly, a fixed-length field is going to be marked as present with a bit, and then inlined in the row storage without any extra cost for an address. So not much in the way of extra overhead. Spending the space on a field reduces the compute needed to constantly perform extracts on the UUID field to access the same information.

But that particular trade-off is an ancient discussion and judgement call, you know your requirements and constraints better than anyone else. So, I'll leave it at that.

On Sun, May 26, 2019 at 8:24 PM Morris de Oryx <morrisdeoryx@gmail.com> wrote:
On Sun, May 26, 2019 at 7:38 PM Ancoron Luciferis <ancoron.luciferis@googlemail.com> wrote:

The BRIN index is something I might need to test, though.

Yes, check that out, it might give you some ideas. A B-tree (in whatever variant) is inherently a large index type. They're ideal for finding unique values quickly, not ideal for storing redundant values, and pretty decent at finding ranges. A BRIN (Block Range Index), as implemented in Postgres, is good for finding unique values and and ranges. But here's the thing, a BRIN index takes some absurdly small % of the space of a B-tree. You have to blink and check again to be sure you've figured it right.

How can a BRIN index be so much smaller? By throwing virtually everything out. A BRIN index doesn't store all of the values in a page, it stores the min/max value and that's it. So it's a probabilistic index (of sorts.) Is the value you're seeking on such and so page? The answer is "No" or "Maybe." That's a fast test on a very cheap data structure.

When the answer is "maybe", the full has to be loaded and scanned to determine if a specific value is found. So, you have a very small index structure, but have to do more sequential scanning to determine if a record is indexed in that page or not. In the real world, this can work out to be a high-performance structure at very low cost. But for it to work, your records need to be physically ordered (CLUSTER) by the condition in that index. And, going forward, you ought to be inserting in order too.(More-or-less.) So, a BRIN index is a great option if you have an insertion pattern that allows it to remain efficient, and if you're goal is range searching without a heavy B-tree index to maintain.

I have no clue how BRIN indexes and partitioning interact.

Re: UUID v1 optimizations...

From
Morris de Oryx
Date:


On Sun, May 26, 2019 at 8:37 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

No, an extra column is not a solution, because it has no impact on the
index on the UUID column.

Possibly talking at cross-purposes here. I was homing in on the OP's wish to search and sort by creation order. For which my first (and only) instinct would be to have a timestamp. In fact, the OP wants to work with multiple subcomponents encoded in their magic number, so I'm likely off base entirely. I have a long-standing allergy to concatenated key-like fields as they're opaque, collapse multiple values into a single column (0NF), and inevitably (in my experience) get you into a bind when requirements change.

But everyone's got their own point of view on such judgement calls. I'm not currently dealing with anything where the cost of adding a few small, fixed-type columns would give me a moment's hesitation. I'm sure we all like to save space, but when saving space costs you clarity, flexibility, and compute, the "savings" aren't free. So, it's a judgment call. The OP may well have 1B rows and really quite good reasons for worrying about disk-level optimizations. 

Re: UUID v1 optimizations...

From
Ancoron Luciferis
Date:
Hi,

I've finally found some time to redo some tests using PostgreSQL 11.3.

Scenario is the following:

1.) add 10 M rows
2.) add another 30 M rows
3.) delete the first 10 M rows
4.) VACUUM
5.) REINDEX

My goal is to find the most efficient kind of UUID value to optimize for
index page/node reclaim (or the least amount of "bloat") without the need
to execute a REINDEX (although we're currently doing that on partitions
where we can be sure that no rows will be added/changed).

I have tested the following:

Table "uuid_v1": standard V1 UUID

Table "uuid_v1_timestamp": standard V1 UUID with additional index using
timestamp extract function

Table "uuid_seq": sequential UUID from Thomas Vondra

Table "uuid_serial": "serial" UUID from my Java implementation


For the test, I created a little PostgreSQL extension for support
functions using standard version 1 UUID's (mainly to extract the timestamp):
https://github.com/ancoron/pg-uuid-ext

My C skills really suck, so if someone could review that and tell me if
there are more efficient ways of doing certain things, I'd be happy!

And for generating the UUID values I've created another Java project:
https://github.com/ancoron/java-uuid-serial

Now some test results, which are somewhat in line with what I've already
seen in previous tests for PG 10, but with less dramatic performance
differences here:

1.) add 10 M rows
Table: uuid_v1              Time: 33183.639 ms (00:33.184)
Table: uuid_v1_timestamp    Time: 49869.763 ms (00:49.870)
Table: uuid_seq             Time: 35794.906 ms (00:35.795)
Table: uuid_serial          Time: 22835.071 ms (00:22.835)

As expected, the table with the additional function index is the slowest,
but surprisingly the sequential UUID was not faster than the standard V1
UUID here. Still, the serial version is the fastest.

Primary key indexes after an ANALYZE:
    table_name     |     bloat      | index_mb | table_mb
-------------------+----------------+----------+----------
 uuid_v1           | 128 MiB (32 %) |  395.570 |  422.305
 uuid_v1_timestamp | 128 MiB (32 %) |  395.570 |  422.305
 uuid_seq          | 123 MiB (31 %) |  390.430 |  422.305
 uuid_serial       | 108 MiB (29 %) |  376.023 |  422.305


2.) add another 30 M rows
Table: uuid_v1              Time: 136109.046 ms (02:16.109)
Table: uuid_v1_timestamp    Time: 193172.454 ms (03:13.172)
Table: uuid_seq             Time: 124713.530 ms (02:04.714)
Table: uuid_serial          Time:  78362.209 ms (01:18.362)

Now the performance difference gets much more dramatic.

Primary key indexes after an ANALYZE:
    table_name     |     bloat      | index_mb | table_mb
-------------------+----------------+----------+----------
 uuid_v1           | 500 MiB (32 %) | 1571.039 | 1689.195
 uuid_v1_timestamp | 500 MiB (32 %) | 1571.039 | 1689.195
 uuid_seq          | 492 MiB (31 %) | 1562.766 | 1689.195
 uuid_serial       | 433 MiB (29 %) | 1504.047 | 1689.195

Still no noticeable difference, but I wonder why it looks like a
fillfactor of 70 instead of the default 90.


3.) delete the first 10 M rows

Implementations differ for the tables due to different capabilities:
DELETE FROM uuid_v1 WHERE id IN (select id from uuid_v1 limit 10000000);
DELETE FROM uuid_v1_timestamp WHERE uuid_v1_timestamp(id) <
'2019-03-09T07:58:02.056';
DELETE FROM uuid_seq WHERE id IN (select id from uuid_seq limit 10000000);
DELETE FROM uuid_serial WHERE id < '1e942411-004c-1d50-0000-000000000000';

Table: uuid_v1              Time: 38308.299 ms (00:38.308)
Table: uuid_v1_timestamp    Time: 11589.941 ms (00:11.590)
Table: uuid_seq             Time: 37171.331 ms (00:37.171)
Table: uuid_serial          Time: 11694.893 ms (00:11.695)

As expected, using the timestamp index directly, or being able to compare
the UUIDs in a time-wise manner, produces the best results.


4.) VACUUM
Table: uuid_v1              Time: 69740.952 ms (01:09.741)
Table: uuid_v1_timestamp    Time: 67347.469 ms (01:07.347)
Table: uuid_seq             Time: 25832.355 ms (00:25.832)
Table: uuid_serial          Time: 12339.531 ms (00:12.340)

This is pretty important to us as it consumes system resources and we
have quite a lot of big tables. So my serial implementation seems to
beat even the sequential one which was pretty surprising for me to see.

Primary key indexes after an ANALYZE:
    table_name     |     bloat      | index_mb | table_mb
-------------------+----------------+----------+----------
 uuid_v1           | 767 MiB (49 %) | 1571.039 | 1689.195
 uuid_v1_timestamp | 768 MiB (49 %) | 1571.039 | 1689.195
 uuid_seq          | 759 MiB (49 %) | 1562.766 | 1689.195
 uuid_serial       | 700 MiB (47 %) | 1504.047 | 1689.195

OK, sadly no reclaim in any of them.


5.) REINDEX
Table: uuid_v1              Time: 21549.860 ms (00:21.550)
Table: uuid_v1_timestamp    Time: 27367.817 ms (00:27.368)
Table: uuid_seq             Time: 19142.711 ms (00:19.143)
Table: uuid_serial          Time: 16889.807 ms (00:16.890)

Even in this case it looks as if my implementation is faster than
anything else - which I really don't get.

Overall, as write performance is our major concern at the moment
(without sacrificing read performance), we'll continue using our
"serial" implementation for the time being.

Nevertheless, I am still puzzled by the index organization and how to
keep it within a certain size in cases where you have a rolling window
of value space to avoid unnecessary disk I/O (even if it is SSD).

So I wonder if there is anything that can be done on the index side?

I might implement a different opclass for the standard UUID to enable
time-wise index sort order. This will naturally be very close to physical
order, but I doubt that this is something I can tell PostgreSQL about, or
can I?
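
For comparison, the simpler route suggested earlier in the thread would be
a functional index; here is a rough sketch with a hypothetical helper that
re-orders the hex digits of a v1 UUID into timestamp-first order:

-- 60-bit timestamp: time_hi (without the version nibble), then time_mid, then time_low
CREATE FUNCTION uuid_v1_sortkey(u uuid) RETURNS text
LANGUAGE sql IMMUTABLE STRICT AS $$
    SELECT substr(u::text, 16, 3)   -- high 12 bits of the timestamp
        || substr(u::text, 10, 4)   -- middle 16 bits
        || substr(u::text, 1, 8);   -- low 32 bits
$$;

-- time-wise ordered unique index; the UUID itself keeps it unique
CREATE UNIQUE INDEX uuid_v1_time_sorted ON uuid_v1 (uuid_v1_sortkey(id), id);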

Cheers,

    Ancoron


On 26/05/2019 11:01, Ancoron Luciferis wrote:
> On 26/05/2019 03:09, Tomas Vondra wrote:
>> On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote:
>>> On 26/05/2019 00:14, Tomas Vondra wrote:
>>>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>>>> Ancoron Luciferis <ancoron.luciferis@googlemail.com> writes:
>>>>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>>>>>> to sort UUIDs in the same order as today, meaning that btree index
>>>>>>> behavior
>>>>>>> would remain the same as before.  Plus UUID comparison would get a
>>>>>>> lot
>>>>>>> more complicated and slower than it is now.
>>>>>
>>>>>> I get your first sentence, but not your second. I know that when
>>>>>> changing the internal byte order we'd have to completed re-compute
>>>>>> everything on-disk (from table to index data), but why would the
>>>>>> sorting
>>>>>> in the index have to be the same?
>>>>>
>>>>> Because we aren't going to change the existing sort order of UUIDs.
>>>>> We have no idea what applications might be dependent on that.
>>>>>
>>>>> As Vitalii correctly pointed out, your beef is not with the physical
>>>>> storage of UUIDs anyway: you just wish they'd sort differently, since
>>>>> that is what determines the behavior of a btree index.  But we aren't
>>>>> going to change the sort ordering because that's an even bigger
>>>>> compatibility break than changing the physical storage; it'd affect
>>>>> application-visible semantics.
>>>>>
>>>>> What you might want to think about is creating a function that maps
>>>>> UUIDs into an ordering that makes sense to you, and then creating
>>>>> a unique index over that function instead of the raw UUIDs.  That
>>>>> would give the results you want without having to negotiate with the
>>>>> rest of the world about whether it's okay to change the semantics
>>>>> of type uuid.
>>>>>
>>>>
>>>> FWIW that's essentially what I implemented as an extension some time
>>>> ago. See [1] for a more detailed explanation and some benchmarks.
>>>
>>> Yes, I've seen that before. Pretty nice work you but together there and
>>> I'll surely have a look at it but we certainly need the node id in
>>> compliance with v1 UUID's so that's why we've been generating UUID's at
>>> the application side from day 1.
>>>
>>>>
>>>> The thing is - it's not really desirable to get perfectly ordered
>>>> ordering, because that would mean we never get back to older parts of
>>>> the index (so if you delete data, we'd never fill that space).
>>>
>>> Wouldn't this apply also to any sequential-looking index (e.g. on
>>> serial)?
>>
>> Yes, it does apply to any index on sequential (ordered) data. If you
>> delete data from the "old" part (but not all, so the pages don't get
>> completely empty), that space is lost. It's available for new data, but
>> if we only insert to "new" part of the index, that's useless.
> 
> OK, thanks for clearing that up for me.
> 
>>
>>> The main issue with the UUID's is that it almost instantly
>>> consumes a big part of the total value space (e.g. first value is
>>> '01...' and second is coming as 'f3...') which I would assume not being
>>> very efficient with btrees (space reclaim? - bloat).
>>>
>>
>> I don't understand what you mean here. Perhaps you misunderstand how
>> btree indexes grow? It's not like we allocate separate pages for
>> different values/prefixes - we insert the data until a page gets full,
>> then it's split in half. There is some dependency on the order in which
>> the values are inserted, but AFAIK random order is generally fine.
> 
> OK, I might not understand the basics of the btree implementation. Sorry
> for that.
> 
> However, one of our issues with standard v1 UUID's was bloat of the
> indexes, although we kept only a few months of data in it. I think, this
> was due to the pages still containing at least one value and not
> reclaimed by vacuum. It just continued to grow.
> 
> Now, as we have this different ever-increasing prefix, we still have
> some constant growing but we see that whenever historic data get's
> deleted (in a separate process), space get's reclaimed.
> 
> 
>>
>>> One of our major concerns is to keep index size small (VACUUM can't be
>>> run every minute) to fit into memory next to a lot of others.
>>>
>>
>> I don't think this has much to do with vacuum - I don't see how it's
>> related to the ordering of generated UUID values. And I don't see where
>> the "can't run vacuum every minute" comes from.
> 
> OK, numbers (after VACUUM) which I really found strange using the query
> from pgexperts [1]:
> 
>      index_name      | bloat_pct | bloat_mb | index_mb | table_mb
> ---------------------+-----------+----------+----------+----------
>  uuid_v1_pkey        |        38 |      363 |  965.742 |  950.172
>  uuid_serial_pkey    |        11 |       74 |  676.844 |  950.172
>  uuid_serial_8_pkey  |        46 |      519 | 1122.031 |  950.172
>  uuid_serial_16_pkey |        39 |      389 |  991.195 |  950.172
> 
> ...where the "8" and "16" is a "shift" of the timestamp value,
> implemented with:
> 
> timestamp = (timestamp >>> (60 - shift)) | (timestamp << (shift + 4))
> 
> If someone could shed some light on why that is (the huge difference in
> index sizes) I'd be happy.
> 
>>
>>> I've experimented with the rollover "prefix" myself but found that it
>>> makes the index too big (same or larger index size than standard v1
>>> UUIDs) and VACUUM too slow (almost as slow as a standard V1 UUID),
>>> although INSERT performance wasn't that bad, our sequential UUID's where
>>> way faster (at least pre-generated and imported with COPY to eliminate
>>> any value generation impact).
>>>
>>
>> I very much doubt that has anything to do with the prefix. You'll need
>> to share more details about how you did your tests.
> 
> OK, I'll see if I can prepare something and publish it.
> 
>>
>>
>> regards
>>
> 
> 
> Refs:
> [1]
> https://github.com/pgexperts/pgx_scripts/blob/master/bloat/index_bloat_check.sql
> 




Re: UUID v1 optimizations...

From
Peter Geoghegan
Date:
Please don't top post -- trim your response down so that only
still-relevant text remains.

On Tue, Jun 11, 2019 at 1:27 PM Ancoron Luciferis
<ancoron.luciferis@googlemail.com> wrote:
> Primary key indexes after an ANALYZE:
>     table_name     |     bloat      | index_mb | table_mb
> -------------------+----------------+----------+----------
>  uuid_v1           | 767 MiB (49 %) | 1571.039 | 1689.195
>  uuid_v1_timestamp | 768 MiB (49 %) | 1571.039 | 1689.195
>  uuid_seq          | 759 MiB (49 %) | 1562.766 | 1689.195
>  uuid_serial       | 700 MiB (47 %) | 1504.047 | 1689.195
>
> OK, sadly no reclaim in any of them.

I don't know how you got these figures, but most likely they don't
take into account the fact that the FSM for the index has free blocks
available. You'll only notice that if you have additional page splits
that can recycle that space. Or, you could use pg_freespacemap to get
some idea.
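
For example (sketch, assuming the pg_freespacemap contrib module and the
index name from your tests):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- pages the FSM already tracks as having reusable space in that index
SELECT count(*) AS free_pages,
       pg_size_pretty(sum(avail)::bigint) AS trackable_free_space
  FROM pg_freespace('uuid_v1_pkey')
 WHERE avail > 0;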

> 5.) REINDEX
> Table: uuid_v1              Time: 21549.860 ms (00:21.550)
> Table: uuid_v1_timestamp    Time: 27367.817 ms (00:27.368)
> Table: uuid_seq             Time: 19142.711 ms (00:19.143)
> Table: uuid_serial          Time: 16889.807 ms (00:16.890)
>
> Even in this case it looks as if my implementation is faster than
> anything else - which I really don't get.

Sorting already-sorted data is faster. CREATE INDEX is mostly a big
sort operation in the case of B-Tree indexes.

> I might implement a different opclass for the standard UUID to enable
> time-wise index sort order. This will naturally be very close to
> physical order but I doubt that this is something I can tell PostgreSQL, or?

PostgreSQL only knows whether or not your page splits occur in the
rightmost page in the index -- it fills the page differently according
to whether or not that is the case.

-- 
Peter Geoghegan



Re: UUID v1 optimizations...

From
Ancoron Luciferis
Date:
On 08/07/2019 02:26, Peter Geoghegan wrote:
> Please don't top post -- trim the your response down so that only
> still-relevant text remains.
> 
> On Tue, Jun 11, 2019 at 1:27 PM Ancoron Luciferis
> <ancoron.luciferis@googlemail.com> wrote:
>> Primary key indexes after an ANALYZE:
>>     table_name     |     bloat      | index_mb | table_mb
>> -------------------+----------------+----------+----------
>>  uuid_v1           | 767 MiB (49 %) | 1571.039 | 1689.195
>>  uuid_v1_timestamp | 768 MiB (49 %) | 1571.039 | 1689.195
>>  uuid_seq          | 759 MiB (49 %) | 1562.766 | 1689.195
>>  uuid_serial       | 700 MiB (47 %) | 1504.047 | 1689.195
>>
>> OK, sadly no reclaim in any of them.
> 
> I don't know how you got these figures, but most likely they don't
> take into account the fact that the FSM for the index has free blocks
> available. You'll only notice that if you have additional page splits
> that can recycle that space. Or, you could use pg_freespacemap to get
> some idea.

Hm, I think I've already read quite a bit about the internals of the PG
b-tree index implementation, but I still cannot figure out how I could
influence that on my end, as I want to stay compatible with the standard
UUID data storage but need time-wise sorting support.

Anyway, I've made a bit of progress in testing and now have the full
tests executing unattended with the help of a script:
https://github.com/ancoron/pg-uuid-test

I've uploaded one of the test run results here:
https://gist.github.com/ancoron/d5114b0907e8974b6808077e02f8d109

After the first mass deletion, I can now see quite some savings for both
the serial and my new time-sorted index:
 table_name  |      bloat      | index_mb | table_mb
-------------+-----------------+----------+----------
 uuid_v1     | 1500 MiB (48 %) | 3106.406 | 3378.383
 uuid_serial | 800 MiB (33 %)  | 2406.453 | 3378.383
 uuid_v1_ext | 800 MiB (33 %)  | 2406.453 | 3378.383

...but in a second test (DELETE old + INSERT new), the savings are gone
again in both cases:
 table_name  |      bloat      | index_mb | table_mb
-------------+-----------------+----------+----------
 uuid_v1     | 1547 MiB (49 %) | 3153.859 | 3378.383
 uuid_serial | 1402 MiB (47 %) | 3008.055 | 3378.383
 uuid_v1_ext | 1403 MiB (47 %) | 3008.055 | 3378.383

So, the question for me would be: is there any kind of data that plays
optimally with space savings in a rolling (e.g. last X rows) scenario?

> 
>> 5.) REINDEX
>> Table: uuid_v1              Time: 21549.860 ms (00:21.550)
>> Table: uuid_v1_timestamp    Time: 27367.817 ms (00:27.368)
>> Table: uuid_seq             Time: 19142.711 ms (00:19.143)
>> Table: uuid_serial          Time: 16889.807 ms (00:16.890)
>>
>> Even in this case it looks as if my implementation is faster than
>> anything else - which I really don't get.
> 
> Sorting already-sorted data is faster. CREATE INDEX is mostly a big
> sort operation in the case of B-Tree indexes.

Understood, this seems to be confirmed by my time-sorted index in the
new tests:
uuid_v1:     27632.660 ms (00:27.633)
uuid_serial: 20519.363 ms (00:20.519) x1.35
uuid_v1_ext: 23846.474 ms (00:23.846) x1.16

> 
>> I might implement a different opclass for the standard UUID to enable
>> time-wise index sort order. This will naturally be very close to
>> physical order but I doubt that this is something I can tell PostgreSQL, or?
> 
> PostgreSQL only knows whether or not your page splits occur in the
> rightmost page in the index -- it fills the page differently according
> to whether or not that is the case.
> 

As I've now implemented the new opclass and the new tests show the
results, I think I can say that the time-sorted behavior, as opposed to a
rather random one, really benefits the overall performance, which is what
I actually care about most.


Cheers,

    Ancoron