Thread: [Proposal] global sequence implemented by snowflake ID

[Proposal] global sequence implemented by snowflake ID

From
"Hayato Kuroda (Fujitsu)"
Date:
Hi hackers,

I want to discuss a new feature for assigning a snowflake ID[1], which can be
cluster-wide unique numbers. Also, Snowflake ID can be allocated standalone.

# Use case

A typical use case is a multi-master system constructed by logical replication.
This feature allows multi-node system to use GENERATED values. IIUC, this is
desired in another thread [2].

When the postgres is standalone, it is quite often that a sequence is used as
default value of the primary key. However, this cannot be done on the multi-master
system as it is because the value on nodeA might be already used on nodeB.
Logical decoding of sequence partially solves the issue, but not sufficient -
what about the case of asynchronous replication? Managing chucks of values is worse.

# What is the formats of Snowflake ID?

Snowflake ID has a below form:

[1bit - unused] + [41bit millisecond timestamp] + [10bit machine ID] + [12bit local sequence number]

Trivially, the millisecond timestamp represents the time when the number is allocated.
I.e., the time nextval() is called. Using a UNIX time seems an easiest way.

Machine ID can be an arbitrary number, but recommended to be unique in the system.
Duplicated machine ID might trigger a conflict.

## Characteristics of snowflake ID

Snowflake ID can generate a unique numbers standalone. According to the old discussion,
allocating value spaces to each nodes was considered [3], but it must communicating
with other nodes, this brings extra difficulties. (e.g., Which protocol would be used?)

Also, Snowflake IDs are roughly time ordered. As Andres pointed out in the old
discussions [4], large indexes over random values perform worse.
Snowflake can avoid the situation.

Moreover, Snowflake IDs are 64-bit integer, shorter than UUID (128-bit).

# Implementation

There are several approaches for implementing a snowflake ID. For example,

* Implement as contrib module. Features needed for each components of snowflakeID
  have already been implemented in core, so basically it can be.
* Implement as a variant of sequence access method. I found that sequence AM was
  proposed many years ago [5], but it has not been active now. It might be a
  fundamental way but needs a huge works.

Attached patch adds a minimal contrib module which can be used for testing my proposal.
Below shows an usage.

```
-- Create an extension
postgres=# CREATE EXTENSION snowflake_sequence ;
CREATE EXTENSION
-- Create a sequence which generates snowflake IDs
postgres=# SELECT snowflake_sequence.create_sequence('test_sequence');
 create_sequence
-----------------

(1 row)
-- Get next snowflake ID
postgres=# SELECT snowflake_sequence.nextval('test_sequence');
       nextval
---------------------
 3162329056562487297
(1 row)
```

How do you think?

[1]: https://github.com/twitter-archive/snowflake/tree/b3f6a3c6ca8e1b6847baa6ff42bf72201e2c2231
[2]: https://www.postgresql.org/message-id/1b25328f-5f4d-9b75-b3f2-f9d9931d1b9d%40postgresql.org
[3]: https://www.postgresql.org/message-id/CA%2BU5nMLSh4fttA4BhAknpCE-iAWgK%2BBG-_wuJS%3DEAcx7hTYn-Q%40mail.gmail.com
[4]: https://www.postgresql.org/message-id/201210161515.54895.andres%402ndquadrant.com
[5]: https://www.postgresql.org/message-id/flat/CA%2BU5nMLV3ccdzbqCvcedd-HfrE4dUmoFmTBPL_uJ9YjsQbR7iQ%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Attachment

Re: [Proposal] global sequence implemented by snowflake ID

From
Michael Paquier
Date:
On Thu, Nov 23, 2023 at 10:18:59AM +0000, Hayato Kuroda (Fujitsu) wrote:
> * Implement as a variant of sequence access method. I found that sequence AM was
>   proposed many years ago [5], but it has not been active now. It might be a
>   fundamental way but needs a huge works.

Well, that's what I can call a timely proposal.  I've been working
this week on a design for sequence AMs, while considering the cases
that the original thread wanted to handle (spoiler: there are a lot of
pieces in the original patch that are not necessary, other parts are
incorrect like dump/restore), what you are trying to do here, and more
complex scenarios in terms of globally-distributed sequences.  My plan
was to send that next week or the week after, in time for January's
CF.
--
Michael

Attachment

Re: [Proposal] global sequence implemented by snowflake ID

From
Amit Kapila
Date:
On Thu, Nov 23, 2023 at 4:15 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Nov 23, 2023 at 10:18:59AM +0000, Hayato Kuroda (Fujitsu) wrote:
> > * Implement as a variant of sequence access method. I found that sequence AM was
> >   proposed many years ago [5], but it has not been active now. It might be a
> >   fundamental way but needs a huge works.
>
> Well, that's what I can call a timely proposal.  I've been working
> this week on a design for sequence AMs, while considering the cases
> that the original thread wanted to handle (spoiler: there are a lot of
> pieces in the original patch that are not necessary, other parts are
> incorrect like dump/restore), what you are trying to do here, and more
> complex scenarios in terms of globally-distributed sequences.
>

It is interesting to see you want to work towards globally distributed
sequences. I think it would be important to discuss how and what we
want to achieve with sequences w.r.t logical replication and or
active-active configuration. There is a patch [1] for logical
replication of sequences which will primarily achieve the failover
case, i.e. if the publisher goes down and the subscriber takes over
the role, one can re-direct connections to it. Now, if we have global
sequences, one can imagine that even after failover the clients can
still get unique values of sequences. It will be a bit more flexible
to use global sequences, for example, we can use the sequence on both
nodes at the same time which won't be possible with the replication of
sequences as they will become inconsistent. Now, it is also possible
that both serve different use cases and we need both functionalities
but it would be better to have some discussion on the same.

Thoughts?

[1] - https://commitfest.postgresql.org/45/3823/

--
With Regards,
Amit Kapila.



Re: [Proposal] global sequence implemented by snowflake ID

From
Michael Paquier
Date:
On Tue, Nov 28, 2023 at 02:23:44PM +0530, Amit Kapila wrote:
> It is interesting to see you want to work towards globally distributed
> sequences. I think it would be important to discuss how and what we
> want to achieve with sequences w.r.t logical replication and or
> active-active configuration. There is a patch [1] for logical
> replication of sequences which will primarily achieve the failover
> case, i.e. if the publisher goes down and the subscriber takes over
> the role, one can re-direct connections to it. Now, if we have global
> sequences, one can imagine that even after failover the clients can
> still get unique values of sequences. It will be a bit more flexible
> to use global sequences, for example, we can use the sequence on both
> nodes at the same time which won't be possible with the replication of
> sequences as they will become inconsistent. Now, it is also possible
> that both serve different use cases and we need both functionalities
> but it would be better to have some discussion on the same.
>
> Thoughts?
>
> [1] - https://commitfest.postgresql.org/45/3823/

Thanks for pointing this out.  I've read through the patch proposed by
Tomas and both are independent things IMO.  The logical decoding patch
relies on the SEQ_LOG records to find out which last_value/is_called
to transfer, which is something directly depending on the in-core
sequence implementation.  Sequence AMs are concepts that cover much
more ground, leaving it up to the implementor to do what they want
while hiding the activity with a RELKIND_SEQUENCE (generated columns
included).

To put it short, I have the impression that one and the other don't
really conflict, but just cover different ground.  However, I agree
that depending on the sequence AM implementation used in a cluster
(snowflake IDs guarantee unicity with their machine ID), replication
may not be necessary because the sequence implementation may be able
to ensure that no replication is required from the start.
--
Michael

Attachment

Re: [Proposal] global sequence implemented by snowflake ID

From
Nikita Malakhov
Date:
Hi!

I have reviewed the patch in this topic and have a question mentioning the machine ID -
INSERT INTO snowflake_sequence.machine_id
    SELECT round((random() * (0 - 511))::numeric, 0) + 511;

This kind of ID generation does not seem to guarantee from not having the same ID in a pool
of instances, does it?

On Thu, Nov 30, 2023 at 4:18 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Nov 28, 2023 at 02:23:44PM +0530, Amit Kapila wrote:
> It is interesting to see you want to work towards globally distributed
> sequences. I think it would be important to discuss how and what we
> want to achieve with sequences w.r.t logical replication and or
> active-active configuration. There is a patch [1] for logical
> replication of sequences which will primarily achieve the failover
> case, i.e. if the publisher goes down and the subscriber takes over
> the role, one can re-direct connections to it. Now, if we have global
> sequences, one can imagine that even after failover the clients can
> still get unique values of sequences. It will be a bit more flexible
> to use global sequences, for example, we can use the sequence on both
> nodes at the same time which won't be possible with the replication of
> sequences as they will become inconsistent. Now, it is also possible
> that both serve different use cases and we need both functionalities
> but it would be better to have some discussion on the same.
>
> Thoughts?
>
> [1] - https://commitfest.postgresql.org/45/3823/

Thanks for pointing this out.  I've read through the patch proposed by
Tomas and both are independent things IMO.  The logical decoding patch
relies on the SEQ_LOG records to find out which last_value/is_called
to transfer, which is something directly depending on the in-core
sequence implementation.  Sequence AMs are concepts that cover much
more ground, leaving it up to the implementor to do what they want
while hiding the activity with a RELKIND_SEQUENCE (generated columns
included).

To put it short, I have the impression that one and the other don't
really conflict, but just cover different ground.  However, I agree
that depending on the sequence AM implementation used in a cluster
(snowflake IDs guarantee unicity with their machine ID), replication
may not be necessary because the sequence implementation may be able
to ensure that no replication is required from the start.
--
Michael


--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company

RE: [Proposal] global sequence implemented by snowflake ID

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Nikita,

Thanks for reading my patch!

>
I have reviewed the patch in this topic and have a question mentioning the machine ID -
INSERT INTO snowflake_sequence.machine_id
    SELECT round((random() * (0 - 511))::numeric, 0) + 511;

This kind of ID generation does not seem to guarantee from not having the same ID in a pool
of instances, does it?
>

You are right. For now the part is randomly assigned, but it might be duplicated on another instance.
Maybe we should provide a way for setting it manually. Or, we may able to use another way
for determining machine ID.
(system_identifier is too long to use here...)

Note that the contrib module was provided just for the reference. We are now
discussing high-level items, like needs, use-cases and approaches. Could you
please your opinion around here if you have?

The implementation may be completely changed, so I did not change yet. Of course,
your comment is quite helpful so that it will be handled eventually.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: [Proposal] global sequence implemented by snowflake ID

From
Amit Kapila
Date:
On Thu, Nov 30, 2023 at 6:48 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Nov 28, 2023 at 02:23:44PM +0530, Amit Kapila wrote:
> > It is interesting to see you want to work towards globally distributed
> > sequences. I think it would be important to discuss how and what we
> > want to achieve with sequences w.r.t logical replication and or
> > active-active configuration. There is a patch [1] for logical
> > replication of sequences which will primarily achieve the failover
> > case, i.e. if the publisher goes down and the subscriber takes over
> > the role, one can re-direct connections to it. Now, if we have global
> > sequences, one can imagine that even after failover the clients can
> > still get unique values of sequences. It will be a bit more flexible
> > to use global sequences, for example, we can use the sequence on both
> > nodes at the same time which won't be possible with the replication of
> > sequences as they will become inconsistent. Now, it is also possible
> > that both serve different use cases and we need both functionalities
> > but it would be better to have some discussion on the same.
> >
> > Thoughts?
> >
> > [1] - https://commitfest.postgresql.org/45/3823/
>
> Thanks for pointing this out.  I've read through the patch proposed by
> Tomas and both are independent things IMO.  The logical decoding patch
> relies on the SEQ_LOG records to find out which last_value/is_called
> to transfer, which is something directly depending on the in-core
> sequence implementation.  Sequence AMs are concepts that cover much
> more ground, leaving it up to the implementor to do what they want
> while hiding the activity with a RELKIND_SEQUENCE (generated columns
> included).
>

Right, I understand that implementation-wise and or concept-wise they
are different. It is more about the use case, see below.

> To put it short, I have the impression that one and the other don't
> really conflict, but just cover different ground.  However, I agree
> that depending on the sequence AM implementation used in a cluster
> (snowflake IDs guarantee unicity with their machine ID), replication
> may not be necessary because the sequence implementation may be able
> to ensure that no replication is required from the start.
>

This was the key point that I wanted to discuss or hear opinions
about. So, if we wish to have some sort of global sequences then it is
not clear to me what benefits will we get by having replication of
non-global sequences. One thing that comes to mind is replication
covers a subset of use cases (like help in case of failover or
switchover to subscriber) and till the time we have some
implementation of global sequences, it can help users.

--
With Regards,
Amit Kapila.



Re: [Proposal] global sequence implemented by snowflake ID

From
Tomas Vondra
Date:
On 11/30/23 11:56, Amit Kapila wrote:
> On Thu, Nov 30, 2023 at 6:48 AM Michael Paquier <michael@paquier.xyz> wrote:
>>
>> On Tue, Nov 28, 2023 at 02:23:44PM +0530, Amit Kapila wrote:
>>> It is interesting to see you want to work towards globally distributed
>>> sequences. I think it would be important to discuss how and what we
>>> want to achieve with sequences w.r.t logical replication and or
>>> active-active configuration. There is a patch [1] for logical
>>> replication of sequences which will primarily achieve the failover
>>> case, i.e. if the publisher goes down and the subscriber takes over
>>> the role, one can re-direct connections to it. Now, if we have global
>>> sequences, one can imagine that even after failover the clients can
>>> still get unique values of sequences. It will be a bit more flexible
>>> to use global sequences, for example, we can use the sequence on both
>>> nodes at the same time which won't be possible with the replication of
>>> sequences as they will become inconsistent. Now, it is also possible
>>> that both serve different use cases and we need both functionalities
>>> but it would be better to have some discussion on the same.
>>>
>>> Thoughts?
>>>
>>> [1] - https://commitfest.postgresql.org/45/3823/
>>
>> Thanks for pointing this out.  I've read through the patch proposed by
>> Tomas and both are independent things IMO.  The logical decoding patch
>> relies on the SEQ_LOG records to find out which last_value/is_called
>> to transfer, which is something directly depending on the in-core
>> sequence implementation.  Sequence AMs are concepts that cover much
>> more ground, leaving it up to the implementor to do what they want
>> while hiding the activity with a RELKIND_SEQUENCE (generated columns
>> included).
>>
> 
> Right, I understand that implementation-wise and or concept-wise they
> are different. It is more about the use case, see below.
> 
>> To put it short, I have the impression that one and the other don't
>> really conflict, but just cover different ground.  However, I agree
>> that depending on the sequence AM implementation used in a cluster
>> (snowflake IDs guarantee unicity with their machine ID), replication
>> may not be necessary because the sequence implementation may be able
>> to ensure that no replication is required from the start.
>>

I certainly do agree solutions like UUID or SnowflakeID may be a better
choice for distributed systems (especially in active-active case),
because there's no internal state to replicate. That's what I'd use for
such systems, I think.

As for implementation/replication, I haven't checked the code, but I'd
imagine the AM should be able to decide whether something needs to be
replicated (and how) or not. So the traditional sequences would
replicate, and the alternative sequences would not replicate anything.

> 
> This was the key point that I wanted to discuss or hear opinions
> about. So, if we wish to have some sort of global sequences then it is
> not clear to me what benefits will we get by having replication of
> non-global sequences. One thing that comes to mind is replication
> covers a subset of use cases (like help in case of failover or
> switchover to subscriber) and till the time we have some
> implementation of global sequences, it can help users.
> 

What are you going to do about use cases like using logical replication
for upgrade to the next major version? Or applications that prefer (or
have to) use traditional sequences?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Proposal] global sequence implemented by snowflake ID

From
Michael Paquier
Date:
On Thu, Nov 30, 2023 at 12:51:38PM +0100, Tomas Vondra wrote:
> As for implementation/replication, I haven't checked the code, but I'd
> imagine the AM should be able to decide whether something needs to be
> replicated (and how) or not. So the traditional sequences would
> replicate, and the alternative sequences would not replicate anything.

Yep, exactly.  Keeping compatibility for the in-core sequence
computation is very important (including the fact that this stuff uses
pseudo-heap tables for its metadata with the values computed).

>> This was the key point that I wanted to discuss or hear opinions
>> about. So, if we wish to have some sort of global sequences then it is
>> not clear to me what benefits will we get by having replication of
>> non-global sequences. One thing that comes to mind is replication
>> covers a subset of use cases (like help in case of failover or
>> switchover to subscriber) and till the time we have some
>> implementation of global sequences, it can help users.
>
> What are you going to do about use cases like using logical replication
> for upgrade to the next major version? Or applications that prefer (or
> have to) use traditional sequences?

Yeah, and that's why the logical replication of sequence has value.
Giving the possibility for users or application developers to use a
custom computation method may be useful for some applications, but not
others.  The use cases are too much different, so IMO both are useful,
when applied to each user's requirements.
--
Michael

Attachment

Re: [Proposal] global sequence implemented by snowflake ID

From
Amit Kapila
Date:
On Thu, Nov 30, 2023 at 5:21 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 11/30/23 11:56, Amit Kapila wrote:
>
> >
> > This was the key point that I wanted to discuss or hear opinions
> > about. So, if we wish to have some sort of global sequences then it is
> > not clear to me what benefits will we get by having replication of
> > non-global sequences. One thing that comes to mind is replication
> > covers a subset of use cases (like help in case of failover or
> > switchover to subscriber) and till the time we have some
> > implementation of global sequences, it can help users.
> >
>
> What are you going to do about use cases like using logical replication
> for upgrade to the next major version?


As per my understanding, they should work as it is when using a global
sequence. Just for the sake of example, considering we have a
same-name global sequence on both pub and sub now it should work
during and after major version upgrades.

>
> Or applications that prefer (or
> have to) use traditional sequences?
>

I think we have to suggest them to use global sequence for the use
cases where they want those to work with logical replication use
cases. Now, if still users want their existing sequences to work then
we can probably see if there is a way to provide an option via Alter
Sequence to change it to a global sequence.

--
With Regards,
Amit Kapila.



Re: [Proposal] global sequence implemented by snowflake ID

From
Tomas Vondra
Date:
On 12/1/23 07:15, Amit Kapila wrote:
> On Thu, Nov 30, 2023 at 5:21 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 11/30/23 11:56, Amit Kapila wrote:
>>
>>>
>>> This was the key point that I wanted to discuss or hear opinions
>>> about. So, if we wish to have some sort of global sequences then it is
>>> not clear to me what benefits will we get by having replication of
>>> non-global sequences. One thing that comes to mind is replication
>>> covers a subset of use cases (like help in case of failover or
>>> switchover to subscriber) and till the time we have some
>>> implementation of global sequences, it can help users.
>>>
>>
>> What are you going to do about use cases like using logical replication
>> for upgrade to the next major version?
> 
> 
> As per my understanding, they should work as it is when using a global
> sequence. Just for the sake of example, considering we have a
> same-name global sequence on both pub and sub now it should work
> during and after major version upgrades.
> 

Sequential IDs have significant benefits too, it's simply not that these
global sequences are universally superior. For example, with sequential
sequences you often get locality, because recent data have about the
same sequence values. With global sequences that's not really the case,
because they are often based on randomness, which massively limits the
access locality. (Yes, some variants may maintain the ordering, others
don't.)

>>
>> Or applications that prefer (or
>> have to) use traditional sequences?
>>
> 
> I think we have to suggest them to use global sequence for the use
> cases where they want those to work with logical replication use
> cases. Now, if still users want their existing sequences to work then
> we can probably see if there is a way to provide an option via Alter
> Sequence to change it to a global sequence.
> 

I really don't know how that would work e.g. for existing applications
that have already designed the schema long time ago. Or for systems that
use 32-bit sequences - I'm not aware of global sequences that narrow.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company