Thread: [Proposal] global sequence implemented by snowflake ID
Hi hackers, I want to discuss a new feature for assigning a snowflake ID[1], which can be cluster-wide unique numbers. Also, Snowflake ID can be allocated standalone. # Use case A typical use case is a multi-master system constructed by logical replication. This feature allows multi-node system to use GENERATED values. IIUC, this is desired in another thread [2]. When the postgres is standalone, it is quite often that a sequence is used as default value of the primary key. However, this cannot be done on the multi-master system as it is because the value on nodeA might be already used on nodeB. Logical decoding of sequence partially solves the issue, but not sufficient - what about the case of asynchronous replication? Managing chucks of values is worse. # What is the formats of Snowflake ID? Snowflake ID has a below form: [1bit - unused] + [41bit millisecond timestamp] + [10bit machine ID] + [12bit local sequence number] Trivially, the millisecond timestamp represents the time when the number is allocated. I.e., the time nextval() is called. Using a UNIX time seems an easiest way. Machine ID can be an arbitrary number, but recommended to be unique in the system. Duplicated machine ID might trigger a conflict. ## Characteristics of snowflake ID Snowflake ID can generate a unique numbers standalone. According to the old discussion, allocating value spaces to each nodes was considered [3], but it must communicating with other nodes, this brings extra difficulties. (e.g., Which protocol would be used?) Also, Snowflake IDs are roughly time ordered. As Andres pointed out in the old discussions [4], large indexes over random values perform worse. Snowflake can avoid the situation. Moreover, Snowflake IDs are 64-bit integer, shorter than UUID (128-bit). # Implementation There are several approaches for implementing a snowflake ID. For example, * Implement as contrib module. Features needed for each components of snowflakeID have already been implemented in core, so basically it can be. * Implement as a variant of sequence access method. I found that sequence AM was proposed many years ago [5], but it has not been active now. It might be a fundamental way but needs a huge works. Attached patch adds a minimal contrib module which can be used for testing my proposal. Below shows an usage. ``` -- Create an extension postgres=# CREATE EXTENSION snowflake_sequence ; CREATE EXTENSION -- Create a sequence which generates snowflake IDs postgres=# SELECT snowflake_sequence.create_sequence('test_sequence'); create_sequence ----------------- (1 row) -- Get next snowflake ID postgres=# SELECT snowflake_sequence.nextval('test_sequence'); nextval --------------------- 3162329056562487297 (1 row) ``` How do you think? [1]: https://github.com/twitter-archive/snowflake/tree/b3f6a3c6ca8e1b6847baa6ff42bf72201e2c2231 [2]: https://www.postgresql.org/message-id/1b25328f-5f4d-9b75-b3f2-f9d9931d1b9d%40postgresql.org [3]: https://www.postgresql.org/message-id/CA%2BU5nMLSh4fttA4BhAknpCE-iAWgK%2BBG-_wuJS%3DEAcx7hTYn-Q%40mail.gmail.com [4]: https://www.postgresql.org/message-id/201210161515.54895.andres%402ndquadrant.com [5]: https://www.postgresql.org/message-id/flat/CA%2BU5nMLV3ccdzbqCvcedd-HfrE4dUmoFmTBPL_uJ9YjsQbR7iQ%40mail.gmail.com Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
On Thu, Nov 23, 2023 at 10:18:59AM +0000, Hayato Kuroda (Fujitsu) wrote: > * Implement as a variant of sequence access method. I found that sequence AM was > proposed many years ago [5], but it has not been active now. It might be a > fundamental way but needs a huge works. Well, that's what I can call a timely proposal. I've been working this week on a design for sequence AMs, while considering the cases that the original thread wanted to handle (spoiler: there are a lot of pieces in the original patch that are not necessary, other parts are incorrect like dump/restore), what you are trying to do here, and more complex scenarios in terms of globally-distributed sequences. My plan was to send that next week or the week after, in time for January's CF. -- Michael
Attachment
On Thu, Nov 23, 2023 at 4:15 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Nov 23, 2023 at 10:18:59AM +0000, Hayato Kuroda (Fujitsu) wrote: > > * Implement as a variant of sequence access method. I found that sequence AM was > > proposed many years ago [5], but it has not been active now. It might be a > > fundamental way but needs a huge works. > > Well, that's what I can call a timely proposal. I've been working > this week on a design for sequence AMs, while considering the cases > that the original thread wanted to handle (spoiler: there are a lot of > pieces in the original patch that are not necessary, other parts are > incorrect like dump/restore), what you are trying to do here, and more > complex scenarios in terms of globally-distributed sequences. > It is interesting to see you want to work towards globally distributed sequences. I think it would be important to discuss how and what we want to achieve with sequences w.r.t logical replication and or active-active configuration. There is a patch [1] for logical replication of sequences which will primarily achieve the failover case, i.e. if the publisher goes down and the subscriber takes over the role, one can re-direct connections to it. Now, if we have global sequences, one can imagine that even after failover the clients can still get unique values of sequences. It will be a bit more flexible to use global sequences, for example, we can use the sequence on both nodes at the same time which won't be possible with the replication of sequences as they will become inconsistent. Now, it is also possible that both serve different use cases and we need both functionalities but it would be better to have some discussion on the same. Thoughts? [1] - https://commitfest.postgresql.org/45/3823/ -- With Regards, Amit Kapila.
On Tue, Nov 28, 2023 at 02:23:44PM +0530, Amit Kapila wrote: > It is interesting to see you want to work towards globally distributed > sequences. I think it would be important to discuss how and what we > want to achieve with sequences w.r.t logical replication and or > active-active configuration. There is a patch [1] for logical > replication of sequences which will primarily achieve the failover > case, i.e. if the publisher goes down and the subscriber takes over > the role, one can re-direct connections to it. Now, if we have global > sequences, one can imagine that even after failover the clients can > still get unique values of sequences. It will be a bit more flexible > to use global sequences, for example, we can use the sequence on both > nodes at the same time which won't be possible with the replication of > sequences as they will become inconsistent. Now, it is also possible > that both serve different use cases and we need both functionalities > but it would be better to have some discussion on the same. > > Thoughts? > > [1] - https://commitfest.postgresql.org/45/3823/ Thanks for pointing this out. I've read through the patch proposed by Tomas and both are independent things IMO. The logical decoding patch relies on the SEQ_LOG records to find out which last_value/is_called to transfer, which is something directly depending on the in-core sequence implementation. Sequence AMs are concepts that cover much more ground, leaving it up to the implementor to do what they want while hiding the activity with a RELKIND_SEQUENCE (generated columns included). To put it short, I have the impression that one and the other don't really conflict, but just cover different ground. However, I agree that depending on the sequence AM implementation used in a cluster (snowflake IDs guarantee unicity with their machine ID), replication may not be necessary because the sequence implementation may be able to ensure that no replication is required from the start. -- Michael
Attachment
Hi!
I have reviewed the patch in this topic and have a question mentioning the machine ID -
INSERT INTO snowflake_sequence.machine_id
SELECT round((random() * (0 - 511))::numeric, 0) + 511;
SELECT round((random() * (0 - 511))::numeric, 0) + 511;
This kind of ID generation does not seem to guarantee from not having the same ID in a pool
of instances, does it?
On Thu, Nov 30, 2023 at 4:18 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Nov 28, 2023 at 02:23:44PM +0530, Amit Kapila wrote:
> It is interesting to see you want to work towards globally distributed
> sequences. I think it would be important to discuss how and what we
> want to achieve with sequences w.r.t logical replication and or
> active-active configuration. There is a patch [1] for logical
> replication of sequences which will primarily achieve the failover
> case, i.e. if the publisher goes down and the subscriber takes over
> the role, one can re-direct connections to it. Now, if we have global
> sequences, one can imagine that even after failover the clients can
> still get unique values of sequences. It will be a bit more flexible
> to use global sequences, for example, we can use the sequence on both
> nodes at the same time which won't be possible with the replication of
> sequences as they will become inconsistent. Now, it is also possible
> that both serve different use cases and we need both functionalities
> but it would be better to have some discussion on the same.
>
> Thoughts?
>
> [1] - https://commitfest.postgresql.org/45/3823/
Thanks for pointing this out. I've read through the patch proposed by
Tomas and both are independent things IMO. The logical decoding patch
relies on the SEQ_LOG records to find out which last_value/is_called
to transfer, which is something directly depending on the in-core
sequence implementation. Sequence AMs are concepts that cover much
more ground, leaving it up to the implementor to do what they want
while hiding the activity with a RELKIND_SEQUENCE (generated columns
included).
To put it short, I have the impression that one and the other don't
really conflict, but just cover different ground. However, I agree
that depending on the sequence AM implementation used in a cluster
(snowflake IDs guarantee unicity with their machine ID), replication
may not be necessary because the sequence implementation may be able
to ensure that no replication is required from the start.
--
Michael
Dear Nikita, Thanks for reading my patch! > I have reviewed the patch in this topic and have a question mentioning the machine ID - INSERT INTO snowflake_sequence.machine_id SELECT round((random() * (0 - 511))::numeric, 0) + 511; This kind of ID generation does not seem to guarantee from not having the same ID in a pool of instances, does it? > You are right. For now the part is randomly assigned, but it might be duplicated on another instance. Maybe we should provide a way for setting it manually. Or, we may able to use another way for determining machine ID. (system_identifier is too long to use here...) Note that the contrib module was provided just for the reference. We are now discussing high-level items, like needs, use-cases and approaches. Could you please your opinion around here if you have? The implementation may be completely changed, so I did not change yet. Of course, your comment is quite helpful so that it will be handled eventually. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Thu, Nov 30, 2023 at 6:48 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Nov 28, 2023 at 02:23:44PM +0530, Amit Kapila wrote: > > It is interesting to see you want to work towards globally distributed > > sequences. I think it would be important to discuss how and what we > > want to achieve with sequences w.r.t logical replication and or > > active-active configuration. There is a patch [1] for logical > > replication of sequences which will primarily achieve the failover > > case, i.e. if the publisher goes down and the subscriber takes over > > the role, one can re-direct connections to it. Now, if we have global > > sequences, one can imagine that even after failover the clients can > > still get unique values of sequences. It will be a bit more flexible > > to use global sequences, for example, we can use the sequence on both > > nodes at the same time which won't be possible with the replication of > > sequences as they will become inconsistent. Now, it is also possible > > that both serve different use cases and we need both functionalities > > but it would be better to have some discussion on the same. > > > > Thoughts? > > > > [1] - https://commitfest.postgresql.org/45/3823/ > > Thanks for pointing this out. I've read through the patch proposed by > Tomas and both are independent things IMO. The logical decoding patch > relies on the SEQ_LOG records to find out which last_value/is_called > to transfer, which is something directly depending on the in-core > sequence implementation. Sequence AMs are concepts that cover much > more ground, leaving it up to the implementor to do what they want > while hiding the activity with a RELKIND_SEQUENCE (generated columns > included). > Right, I understand that implementation-wise and or concept-wise they are different. It is more about the use case, see below. > To put it short, I have the impression that one and the other don't > really conflict, but just cover different ground. However, I agree > that depending on the sequence AM implementation used in a cluster > (snowflake IDs guarantee unicity with their machine ID), replication > may not be necessary because the sequence implementation may be able > to ensure that no replication is required from the start. > This was the key point that I wanted to discuss or hear opinions about. So, if we wish to have some sort of global sequences then it is not clear to me what benefits will we get by having replication of non-global sequences. One thing that comes to mind is replication covers a subset of use cases (like help in case of failover or switchover to subscriber) and till the time we have some implementation of global sequences, it can help users. -- With Regards, Amit Kapila.
On 11/30/23 11:56, Amit Kapila wrote: > On Thu, Nov 30, 2023 at 6:48 AM Michael Paquier <michael@paquier.xyz> wrote: >> >> On Tue, Nov 28, 2023 at 02:23:44PM +0530, Amit Kapila wrote: >>> It is interesting to see you want to work towards globally distributed >>> sequences. I think it would be important to discuss how and what we >>> want to achieve with sequences w.r.t logical replication and or >>> active-active configuration. There is a patch [1] for logical >>> replication of sequences which will primarily achieve the failover >>> case, i.e. if the publisher goes down and the subscriber takes over >>> the role, one can re-direct connections to it. Now, if we have global >>> sequences, one can imagine that even after failover the clients can >>> still get unique values of sequences. It will be a bit more flexible >>> to use global sequences, for example, we can use the sequence on both >>> nodes at the same time which won't be possible with the replication of >>> sequences as they will become inconsistent. Now, it is also possible >>> that both serve different use cases and we need both functionalities >>> but it would be better to have some discussion on the same. >>> >>> Thoughts? >>> >>> [1] - https://commitfest.postgresql.org/45/3823/ >> >> Thanks for pointing this out. I've read through the patch proposed by >> Tomas and both are independent things IMO. The logical decoding patch >> relies on the SEQ_LOG records to find out which last_value/is_called >> to transfer, which is something directly depending on the in-core >> sequence implementation. Sequence AMs are concepts that cover much >> more ground, leaving it up to the implementor to do what they want >> while hiding the activity with a RELKIND_SEQUENCE (generated columns >> included). >> > > Right, I understand that implementation-wise and or concept-wise they > are different. It is more about the use case, see below. > >> To put it short, I have the impression that one and the other don't >> really conflict, but just cover different ground. However, I agree >> that depending on the sequence AM implementation used in a cluster >> (snowflake IDs guarantee unicity with their machine ID), replication >> may not be necessary because the sequence implementation may be able >> to ensure that no replication is required from the start. >> I certainly do agree solutions like UUID or SnowflakeID may be a better choice for distributed systems (especially in active-active case), because there's no internal state to replicate. That's what I'd use for such systems, I think. As for implementation/replication, I haven't checked the code, but I'd imagine the AM should be able to decide whether something needs to be replicated (and how) or not. So the traditional sequences would replicate, and the alternative sequences would not replicate anything. > > This was the key point that I wanted to discuss or hear opinions > about. So, if we wish to have some sort of global sequences then it is > not clear to me what benefits will we get by having replication of > non-global sequences. One thing that comes to mind is replication > covers a subset of use cases (like help in case of failover or > switchover to subscriber) and till the time we have some > implementation of global sequences, it can help users. > What are you going to do about use cases like using logical replication for upgrade to the next major version? Or applications that prefer (or have to) use traditional sequences? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Nov 30, 2023 at 12:51:38PM +0100, Tomas Vondra wrote: > As for implementation/replication, I haven't checked the code, but I'd > imagine the AM should be able to decide whether something needs to be > replicated (and how) or not. So the traditional sequences would > replicate, and the alternative sequences would not replicate anything. Yep, exactly. Keeping compatibility for the in-core sequence computation is very important (including the fact that this stuff uses pseudo-heap tables for its metadata with the values computed). >> This was the key point that I wanted to discuss or hear opinions >> about. So, if we wish to have some sort of global sequences then it is >> not clear to me what benefits will we get by having replication of >> non-global sequences. One thing that comes to mind is replication >> covers a subset of use cases (like help in case of failover or >> switchover to subscriber) and till the time we have some >> implementation of global sequences, it can help users. > > What are you going to do about use cases like using logical replication > for upgrade to the next major version? Or applications that prefer (or > have to) use traditional sequences? Yeah, and that's why the logical replication of sequence has value. Giving the possibility for users or application developers to use a custom computation method may be useful for some applications, but not others. The use cases are too much different, so IMO both are useful, when applied to each user's requirements. -- Michael
Attachment
On Thu, Nov 30, 2023 at 5:21 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 11/30/23 11:56, Amit Kapila wrote: > > > > > This was the key point that I wanted to discuss or hear opinions > > about. So, if we wish to have some sort of global sequences then it is > > not clear to me what benefits will we get by having replication of > > non-global sequences. One thing that comes to mind is replication > > covers a subset of use cases (like help in case of failover or > > switchover to subscriber) and till the time we have some > > implementation of global sequences, it can help users. > > > > What are you going to do about use cases like using logical replication > for upgrade to the next major version? As per my understanding, they should work as it is when using a global sequence. Just for the sake of example, considering we have a same-name global sequence on both pub and sub now it should work during and after major version upgrades. > > Or applications that prefer (or > have to) use traditional sequences? > I think we have to suggest them to use global sequence for the use cases where they want those to work with logical replication use cases. Now, if still users want their existing sequences to work then we can probably see if there is a way to provide an option via Alter Sequence to change it to a global sequence. -- With Regards, Amit Kapila.
On 12/1/23 07:15, Amit Kapila wrote: > On Thu, Nov 30, 2023 at 5:21 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 11/30/23 11:56, Amit Kapila wrote: >> >>> >>> This was the key point that I wanted to discuss or hear opinions >>> about. So, if we wish to have some sort of global sequences then it is >>> not clear to me what benefits will we get by having replication of >>> non-global sequences. One thing that comes to mind is replication >>> covers a subset of use cases (like help in case of failover or >>> switchover to subscriber) and till the time we have some >>> implementation of global sequences, it can help users. >>> >> >> What are you going to do about use cases like using logical replication >> for upgrade to the next major version? > > > As per my understanding, they should work as it is when using a global > sequence. Just for the sake of example, considering we have a > same-name global sequence on both pub and sub now it should work > during and after major version upgrades. > Sequential IDs have significant benefits too, it's simply not that these global sequences are universally superior. For example, with sequential sequences you often get locality, because recent data have about the same sequence values. With global sequences that's not really the case, because they are often based on randomness, which massively limits the access locality. (Yes, some variants may maintain the ordering, others don't.) >> >> Or applications that prefer (or >> have to) use traditional sequences? >> > > I think we have to suggest them to use global sequence for the use > cases where they want those to work with logical replication use > cases. Now, if still users want their existing sequences to work then > we can probably see if there is a way to provide an option via Alter > Sequence to change it to a global sequence. > I really don't know how that would work e.g. for existing applications that have already designed the schema long time ago. Or for systems that use 32-bit sequences - I'm not aware of global sequences that narrow. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company