Re: Sequence vs UUID - Mailing list pgsql-general

From veem v
Subject Re: Sequence vs UUID
Date
Msg-id CAB+=1TWQWNeBeRxWqBGYsw6HQujG984mcCBcm-TTBkuFqbZA4A@mail.gmail.com
Whole thread Raw
In response to Re: Sequence vs UUID  (Dominique Devienne <ddevienne@gmail.com>)
Responses Re: Sequence vs UUID  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
Re: Sequence vs UUID  (Kirk Wolak <wolakk@gmail.com>)
List pgsql-general
Thank you So much all for such valuable feedback. 

As "Julian" was pointing, I also tried to test the INSERT independently(as in below test case) without keeping the "generate_series" in the inline query. But in all the cases sequence is performing better as compared to both UUID V4 and UUID V7. And same with Index access path observed i.e. an index on sequence performs better as compared to an index on UUID column. So i believe , its understood that the performance wise sequence is better as compared to even sequential UUID (like UUID-V7). And it seems that, even the UUID V7 maintains better locality because of its sequential nature, but still it all may tied to the higher space/memory consumption of UUID because of its bigger size as compared to sequence and thus the performance hit.

But as i saw few comments in this discussion regarding the scenarios which will mandate the usage of UUID like "multi-master replication", "sharding", "Need to be able to move data between databases" Etc..So wanted to understand , why cant we use sequences as PK in these scenarios? Say for e.g. in case of multimaster replication we can use sequence someway as below.., 

So wanted to know from experts here, is there really exists any scenario in which UUID really cant be avoided?

Sequence Number = n*d+m+offset. Where n is the sequence order number, d is the dimensions of the multi-master replication, m ranges from 0 to n-1 is the number assigned to each node in the replication, and offset is the number to offset the sequence numbers. 

For a 4-ways multi-master replication where m=4, y is in (0, 1, 2, 3), and offset is 100. 
Node #1 (m=0) :Sequence number = n*4+100
Node #2 (m=1): Sequence number = n*4+101
Node #3 (m=2): Sequence number = n*4+102
Node #4 (m=3): Sequence number = n*4+103

Each sequence will have: 
100, 104, 108,112, 116, 120,...
101, 105, 109, 113, 117, 121,...
102, 106, 110, 114, 118, 122...
103, 107, 111, 115, 119, 123

***************** Test case *************************

CREATE UNLOGGED TABLE test_bigint ( id bigint PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid ( id uuid PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid7 ( id uuid PRIMARY KEY);
CREATE UNLOGGED TABLE test_bigint_1 ( id bigint PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid_1 ( id uuid PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid7_1 ( id uuid PRIMARY KEY);
create sequence myseq cache 32767;

*************** Insert Test***************
INSERT INTO test_bigint select nextval('myseq') from generate_series(1,100000);
INSERT INTO test_uuid select gen_random_uuid() from generate_series(1,100000);
INSERT INTO test_uuid7 select uuid_generate_v7() from generate_series(1,100000);


explain analyze INSERT INTO test_bigint_1 select id from test_bigint;

EXPLAIN
QUERY PLAN
Insert on test_bigint_1 (cost=0.00..1444.18 rows=0 width=0) (actual time=220.689..220.690 rows=0 loops=1)
-> Seq Scan on test_bigint (cost=0.00..1444.18 rows=100118 width=8) (actual time=0.012..17.488 rows=100000 loops=1)
Planning Time: 0.137 ms
Execution Time: 220.714 ms

explain analyze INSERT INTO test_uuid_1 select id from test_uuid;

EXPLAIN
QUERY PLAN
Insert on test_uuid_1 (cost=0.00..1541.85 rows=0 width=0) (actual time=311.949..311.950 rows=0 loops=1)
-> Seq Scan on test_uuid (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.010..20.173 rows=100000 loops=1)
Planning Time: 0.082 ms
Execution Time: 311.973 ms


explain analyze INSERT INTO test_uuid7_1 select id from test_uuid7;

EXPLAIN
QUERY PLAN
Insert on test_uuid7_1 (cost=0.00..1541.85 rows=0 width=0) (actual time=244.531..244.532 rows=0 loops=1)
-> Seq Scan on test_uuid7 (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.011..16.390 rows=100000 loops=1)
Planning Time: 0.084 ms
Execution Time: 244.554 ms

Explain analyze select * from test_bigint where id in (select id from test_bigint limit 100000);

Nested Loop (cost=2692.77..3793.23 rows=50059 width=8) (actual time=53.739..368.110 rows=100000 loops=1)
-> HashAggregate (cost=2692.48..2694.48 rows=200 width=8) (actual time=53.681..93.044 rows=100000 loops=1)
Group Key: test_bigint_1.id
Batches: 5 Memory Usage: 11073kB Disk Usage: 208kB
-> Limit (cost=0.00..1442.48 rows=100000 width=8) (actual time=0.020..18.985 rows=100000 loops=1)
-> Seq Scan on test_bigint test_bigint_1 (cost=0.00..1444.18 rows=100118 width=8) (actual time=0.019..11.330 rows=100000 loops=1)
-> Index Only Scan using test_bigint_pkey on test_bigint (cost=0.29..6.53 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=100000)
Index Cond: (id = test_bigint_1.id)
Heap Fetches: 100000
Planning Time: 0.373 ms
Execution Time: 373.440 ms
EXPLAIN

Explain analyze select * from test_uuid where id in (select id from test_uuid limit 100000);

QUERY PLAN
Nested Loop (cost=2790.96..4006.29 rows=50042 width=16) (actual time=48.251..410.786 rows=100000 loops=1)
-> HashAggregate (cost=2790.54..2792.54 rows=200 width=16) (actual time=48.157..76.176 rows=100000 loops=1)
Group Key: test_uuid_1.id
Batches: 1 Memory Usage: 7969kB
-> Limit (cost=0.00..1540.54 rows=100000 width=16) (actual time=0.018..19.217 rows=100000 loops=1)
-> Seq Scan on test_uuid test_uuid_1 (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.017..11.391 rows=100000 loops=1)
-> Index Only Scan using test_uuid_pkey on test_uuid (cost=0.42..7.20 rows=1 width=16) (actual time=0.003..0.003 rows=1 loops=100000)
Index Cond: (id = test_uuid_1.id)
Heap Fetches: 100000
Planning Time: 0.378 ms
Execution Time: 415.547 ms

Explain analyze select * from test_uuid7 where id in (select id from test_uuid7 limit 100000);

EXPLAIN
QUERY PLAN
Nested Loop (cost=2790.96..3972.96 rows=50042 width=16) (actual time=45.919..401.184 rows=100000 loops=1)
-> HashAggregate (cost=2790.54..2792.54 rows=200 width=16) (actual time=45.806..80.545 rows=100000 loops=1)
Group Key: test_uuid7_1.id
Batches: 1 Memory Usage: 7969kB
-> Limit (cost=0.00..1540.54 rows=100000 width=16) (actual time=0.020..18.573 rows=100000 loops=1)
-> Seq Scan on test_uuid7 test_uuid7_1 (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.018..10.920 rows=100000 loops=1)
-> Index Only Scan using test_uuid7_pkey on test_uuid7 (cost=0.42..6.99 rows=1 width=16) (actual time=0.003..0.003 rows=1 loops=100000)
Index Cond: (id = test_uuid7_1.id)
Heap Fetches: 100000
Planning Time: 0.194 ms
Execution Time: 405.931 ms

On Tue, 7 Feb, 2023, 9:38 pm Dominique Devienne, <ddevienne@gmail.com> wrote:
On Tue, Feb 7, 2023 at 3:47 PM Merlin Moncure <mmoncure@gmail.com> wrote:
On Mon, Feb 6, 2023 at 1:22 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
On 2023-02-06 20:04:39 +0100, Julian Backes wrote:
But UUIDs are random and that plays havoc with locality.

This is really key.  [...] the databases I've seen that are written with the
UUID pattern appear to be written by developers oblivious to this fact.

Well, perhaps these developers are not dealing with temporally clustered data, like commerce related DB,
and more scientific data? In any case, this developer will definitely investigate ULIDs, vs UUIDs, vs Ints (sequences),
based on the info from this thread. I'm aware of fragmentation issues, and cache hit/miss issues etc, in general;
but was probably not sufficiently making the mental connection with UUIDs and PostgreSQL. So thanks everyone.

pgsql-general by date:

Previous
From: "sunyucong@gmail.com"
Date:
Subject: Re: Need help debugging slow logical replication
Next
From: Laurenz Albe
Date:
Subject: Re: How do a user-defined function that returns a table executes a query?