Thread: INSERT ... ON CONFLICT DO UPDATE
Hello,

I've just started to read through the postgres-9.5 "what's new" before giving it a try. The "INSERT ... ON CONFLICT DO UPDATE" is particularly attractive to me, but I was wondering why it does not cover a third kind of action that a programmer may need for a PK conflict during INSERT.

In my experience, I most often generate a random value for the PK, with that random value becoming a unique ticket such as a voucher (related to monetary value). For that I use:

  CREATE TABLE vouchers (
      a_voucher bigint PRIMARY KEY DEFAULT (random()*1000000000)::bigint,
      issued date DEFAULT now(),
      .....);

Naturally:

1. The A_VOUCHER range space is always significantly larger than the count of currently issued vouchers, so conflicts are rare.

2. With the current (as of 9.5) implementation I think I can always use "ON CONFLICT DO NOTHING" and retry the INSERT at the application level.

But it would be immensely more comfortable if one could write: "INSERT ... ON CONFLICT (a_voucher) DO RETRY", with the semantics of that statement being:

1. Prepare should check that there is a DEFAULT for the specified "conflict column" (here: "a_voucher"), and fail if there isn't one.

2. Prepare should check that the default is a VOLATILE function, or fail.

3. When all that passes, the prepared INSERT, when executed and hitting a conflict, should be re-attempted with a NEW call to the DEFAULT function of the indicated CONFLICT column(s).

4. And there should be a postgresql.conf parameter limiting the number of retries for a single conflict. As a programmer I know that if I need to retry more than twice, the space is too dense, always; so I need to change the DEFAULT function, not increase the retry count. Having DDL that allows changing the DEFAULT function means it isn't necessary to allow changing the retry count during the database's life - and when the latter lives in the config, it is less prone to typo errors by application authors.

Was the above considered for the "ON CONFLICT" implementation before?

If so, can someone please point me to the criticism it received.

If not: is it unreasonable? Why?

-R
Hello

> I've just started to read through the postgres-9.5 "what's new" before giving it
> a try. The "INSERT ... ON CONFLICT DO UPDATE" is particularly attractive to me,
> but I was wondering why it does not cover a third kind of action that a
> programmer may need for a PK conflict during INSERT.
>
> In my experience, I most often generate a random value for the PK, with that
> random value becoming a unique ticket such as a voucher (related to monetary
> value). For that I use:
>
>   CREATE TABLE vouchers (a_voucher bigint PRIMARY KEY DEFAULT
>     (random()*1000000000)::bigint, issued date DEFAULT now(), .....);
>
> Naturally:
> 1. The A_VOUCHER range space is always significantly larger than the count of
> currently issued vouchers, so conflicts are rare.
> 2. With the current (as of 9.5) implementation I think I can always use "ON
> CONFLICT DO NOTHING" and retry the INSERT at the application level.

An UPSERT is "try an INSERT and if there is a conflict, do nothing or UPDATE some values of the existing record". The scenario that you suggest is not an UPSERT, because what you want is to try a new INSERT, hoping that this one works.

What speaks against using a sequence for the primary key column a_voucher? That would guarantee that you don't have a conflict.

[---------------]
>
> Was the above considered for the "ON CONFLICT" implementation before?
>
> If so, can someone please point me to the criticism it received.
>
> If not: is it unreasonable? Why?

IMHO, as I mentioned, this is not an UPSERT use case, but maybe the implementors of the feature have different arguments. You could implement that in a function instead of the application, if you prefer.

Bye
Charles
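For illustration, here is a minimal sketch of the kind of function Charles suggests, written against the vouchers table from the original post. The function name, the retry limit and the error handling are assumptions for the sketch, not anything agreed in this thread:

  CREATE FUNCTION issue_voucher(max_retries int DEFAULT 3)
  RETURNS bigint AS $$
  DECLARE
      v bigint;
  BEGIN
      FOR i IN 1 .. max_retries LOOP
          -- try one random value; on a duplicate the INSERT inserts nothing
          -- and RETURNING yields no row, so v stays NULL
          INSERT INTO vouchers (a_voucher)
          VALUES ((random() * 1000000000)::bigint)
          ON CONFLICT (a_voucher) DO NOTHING
          RETURNING a_voucher INTO v;
          IF v IS NOT NULL THEN
              RETURN v;   -- success: hand the fresh voucher number back
          END IF;
          -- conflict: loop around and try again with a new random value
      END LOOP;
      RAISE EXCEPTION 'no free voucher found after % attempts', max_retries;
  END;
  $$ LANGUAGE plpgsql;

Calling "SELECT issue_voucher();" then either returns a voucher number that is guaranteed to have been inserted, or raises an error once the retry budget is exhausted - roughly the "DO RETRY" behaviour asked for above, spelled out in the schema rather than in the INSERT statement itself.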
Hi,

On 19.07.2015 09:33, Charles Clavadetscher wrote:
[---------------]
>> 2. With the current (as of 9.5) implementation I think I can always use "ON
>> CONFLICT DO NOTHING" and retry the INSERT at the application level.
>
> An UPSERT is "try an INSERT and if there is a conflict, do nothing or UPDATE
> some values of the existing record". The scenario that you suggest is not an
> UPSERT, because what you want is to try a new INSERT, hoping that this one works.
> What speaks against using a sequence for the primary key column a_voucher?
> That would guarantee that you don't have a conflict.

It has to be random, since it carries a "sort of monetary" value. The vouchers are destined to be one-time authorization tokens; they have to be harder to guess than values drawn from a sequence would be.

[------------]
>>
>> If not: is it unreasonable? Why?
>
> IMHO, as I mentioned, this is not an UPSERT use case, but maybe the implementors
> of the feature have different arguments. You could implement that in a function
> instead of the application, if you prefer.

I'm not particularly fond of using functions instead of tables to access an RDBMS.

And I'm not particularly fond of "workarounds".

But if that usage scenario is not appreciated here, then I guess I have to live with what is available. And the availability of ON CONFLICT is a great improvement anyway.

Thanks,
-R
On 19 July 2015 at 09:11, Rafal Pietrak <rafal@ztk-rp.eu> wrote:
I'm not particularly fond of using functions instead of tables to access an RDBMS.
And I'm not particularly fond of "workarounds".
Use a combination of factors (a sequence ID and the key) for your authorization. So in the extremely unlikely event that the random key isn't unique, it doesn't matter. That's not a workaround, it's good design.
You're asking for a feature that is completely unnecessary and is easily resolved. UPSERT is designed for a situation which is neither.
Geoff
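A minimal sketch of what Geoff describes, with illustrative names only: the sequence is the primary key, and the random code does not even need to be unique, because the application always checks the (id, code) pair together.

  CREATE TABLE vouchers (
      id     bigserial PRIMARY KEY,          -- plain sequence: no conflicts, ever
      code   bigint NOT NULL DEFAULT (random() * 1000000000)::bigint,  -- the hard-to-guess part
      issued date   NOT NULL DEFAULT now()
  );

  -- the application hands out the pair (id, code) and later verifies both:
  --   SELECT * FROM vouchers WHERE id = $1 AND code = $2;
  -- a rare duplicate code is harmless because it belongs to a different id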
Hi,

On 19.07.2015 10:27, Geoff Winkless wrote:
> On 19 July 2015 at 09:11, Rafal Pietrak <rafal@ztk-rp.eu> wrote:
>
>     I'm not particularly fond of using functions instead of tables to access an RDBMS.
>
>     And I'm not particularly fond of "workarounds".
>
> Use a combination of factors (a sequence ID and the key) for your
> authorization. So in the extremely unlikely event that the random key
> isn't unique, it doesn't matter. That's not a workaround, it's good design.

I'm sorry. At this point I don't want to prolong the discussion (like flaming), but I feel I have to defend myself a little.

Regarding the above: when I have to invent/introduce additional features/columns/attributes (like a key in addition to a sequence) which are not required by the design, but are necessary for the implementation, that is a workaround (almost by definition).

In my current implementations I capture the (rare) PK conflict exceptions and redo the INSERT at the application level. It works sufficiently well. I was just curious whether that usage scenario is also covered by the current ON CONFLICT, or not.

> You're asking for a feature that is completely unnecessary and is easily
> resolved. UPSERT is designed for a situation which is neither.

1. Despite possibly sounding like one, I wasn't actually asking for a feature - I wasn't sure whether it could be implemented using currently available PostgreSQL features. So I just explained a usage scenario (describing the semantics with an "invented pseudo-syntax") which I've experienced.

2. That usage scenario, IMHO, wasn't obviously covered (as of my first reading of "the upsert" implementation). It might have been covered - only I wasn't seeing it; so I brought it up.

3. And obviously that usage scenario (despite my personal experience) might actually be very rare - not worth the implementation effort and thus a candidate for workarounds. That happens, I understand.

I hope this explains my point better.

-R
On 19 July 2015 at 11:30, Rafal Pietrak <rafal@ztk-rp.eu> wrote:
when I have to invent/introduce additional features/columns/attributes (like a key in addition to a sequence) which are not required by the design, but are necessary for the implementation, that is a workaround (almost by definition).
I'm sorry that you feel defensive about this, and apologies for repeating myself, but the fact that the random key can be duplicated means it should not be used as a primary key, so using a sequence as a primary key is not a workaround, it's a correction to the design.
Notwithstanding that, the reason UPSERT is required is that two competing transactions can end up fighting over an INSERT, and the workarounds required are either highly complex or not 100% successful (e.g. http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/).
Conversely, the workaround in the above case (even if you don't want to change the primary key) is trivial - as you yourself described.
Geoff
Hi,

On 19.07.2015 14:10, Geoff Winkless wrote:
> On 19 July 2015 at 11:30, Rafal Pietrak <rafal@ztk-rp.eu> wrote:
>
>     when I have to invent/introduce additional features/columns/attributes
>     (like a key in addition to a sequence) which are not required by the design,
>     but are necessary for the implementation, that is a workaround (almost by
>     definition).
>
> I'm sorry that you feel defensive about this, and apologies for
> repeating myself, but the fact that the random key can be duplicated
> means it should not be used as a primary key, so using a sequence as a
> primary key is not a workaround, it's a correction to the design.

OK. I think I need to apologize too. I hope my defence wasn't too fierce.

But I need to clarify one thing:

Although "a random" can duplicate its previous values, "my random(s)" (which are created for this application purpose) cannot be duplicated once stored in the database as "live active data". I understand that the UNIQUE constraint is precisely the RDBMS tool to guarantee that.

Naturally, whether I put a UNIQUE constraint on that column or make it a PK is just a matter of choice here. That shouldn't raise concern. I just use the tools the RDBMS provides for the "semantics" the application needs.

> Notwithstanding that, the reason UPSERT is required is that two competing
> transactions can end up fighting over an INSERT, and the workarounds
> required are either highly complex or not 100% successful (e.g.
> http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/).

I knew that Depesz publication before. Actually it was the reason I brought up "my usage scenario" here now. I'm not as competent as Depesz, so:

1. I worry that, while restarting a failed INSERT transaction at the application level, I miss something important (that you people know by heart) and unwillingly corrupt and/or "suboptimize" my application/data (much to the point Depesz described).

2. But, since the majority of the hard work of the PostgreSQL UPSERT implementation is already done, I wanted to check whether the usage scenario I point out falls into it as a "case", is covered by it through some "idiomatic SQL sequence", or otherwise. From the current discussion I gather: "it's otherwise" - it isn't considered applicable. (So I conclude: I'll live with manual re-attempts of failed inserts.)

-R
On 07/19/2015 06:47 AM, Rafal Pietrak wrote:
> Hi,
>
> On 19.07.2015 14:10, Geoff Winkless wrote:
[---------------]
>
> Although "a random" can duplicate its previous values, "my random(s)"
> (which are created for this application purpose) cannot be duplicated
> once stored in the database as "live active data". I understand that
> the UNIQUE constraint is precisely the RDBMS tool to guarantee that.

From my perspective the issue is, you are using a 'unique' key generator that you know is not creating unique keys and then asking the database to make it right. Sort of like making a square peg fit a round hole by shaving the corners. It is possible, but it has sort of a messy feel to it.

> Naturally, whether I put a UNIQUE constraint on that column or make it a PK
> is just a matter of choice here. That shouldn't raise concern. I just use
> the tools the RDBMS provides for the "semantics" the application needs.
>
[---------------]
>
> 1. I worry that, while restarting a failed INSERT transaction at the
> application level, I miss something important (that you people know by
> heart) and unwillingly corrupt and/or "suboptimize" my application/data
> (much to the point Depesz described).
>
> 2. But, since the majority of the hard work of the PostgreSQL UPSERT
> implementation is already done, I wanted to check whether the usage
> scenario I point out falls into it as a "case", is covered by it through
> some "idiomatic SQL sequence", or otherwise. From the current discussion
> I gather: "it's otherwise" - it isn't considered applicable. (So I
> conclude: I'll live with manual re-attempts of failed inserts.)

As noted upstream, what you want is not an UPSERT. An UPSERT is based on the premise that if you try an INSERT where the unique value already exists, then the INSERT is turned into an UPDATE.

To be fair: http://www.postgresql.org/docs/9.5/static/sql-insert.html

"ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome - provided there is no independent error, one of those two outcomes is guaranteed, even under high concurrency. This feature is also known as UPSERT"

So an UPSERT is just one feature of ON CONFLICT, the other being DO NOTHING. Therefore I could see an argument being made for adding other ON CONFLICT clauses. How difficult/plausible that would be is above my level of expertise.
-- 
Adrian Klaver
adrian.klaver@aklaver.com
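For reference, these are the two actions that 9.5's ON CONFLICT does provide, shown on a hypothetical counters table (the table and values are illustrative, not from this thread):

  CREATE TABLE counters (key text PRIMARY KEY, hits int NOT NULL);

  -- UPSERT: insert, or on a duplicate key fold the new value into the existing row
  INSERT INTO counters (key, hits) VALUES ('home', 1)
  ON CONFLICT (key) DO UPDATE SET hits = counters.hits + EXCLUDED.hits;

  -- DO NOTHING: insert, or on a duplicate key silently skip the row
  INSERT INTO counters (key, hits) VALUES ('home', 1)
  ON CONFLICT (key) DO NOTHING;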
On 19.07.2015 16:33, Adrian Klaver wrote:
> On 07/19/2015 06:47 AM, Rafal Pietrak wrote:
[---------------]
>> Although "a random" can duplicate its previous values, "my random(s)"
>> (which are created for this application purpose) cannot be duplicated
>> once stored in the database as "live active data". I understand that
>> the UNIQUE constraint is precisely the RDBMS tool to guarantee that.
>
> From my perspective the issue is, you are using a 'unique' key generator
> that you know is not creating unique keys and then asking the database
> to make it right. Sort of like making a square peg fit a round hole by
> shaving the corners. It is possible, but it has sort of a messy feel to it.

Hmmm, yes.

Yet I don't feel all that guilty, since this is quite similar to a unique key on a database "username", where the "generator" (the human mind) does not guarantee uniqueness either. The DB just makes sure it holds.

[--------------]
>
> So an UPSERT is just one feature of ON CONFLICT, the other being DO
> NOTHING. Therefore I could see an argument being made for adding other ON
> CONFLICT clauses. How difficult/plausible that would be is above my
> level of expertise.

Mine too. But I'd say that the above wording makes exactly the point I was trying to make. Thank you.

-R
On 07/19/2015 08:04 AM, Rafal Pietrak wrote:
> On 19.07.2015 16:33, Adrian Klaver wrote:
>> On 07/19/2015 06:47 AM, Rafal Pietrak wrote:
[---------------]
>> From my perspective the issue is, you are using a 'unique' key generator
>> that you know is not creating unique keys and then asking the database
>> to make it right. Sort of like making a square peg fit a round hole by
>> shaving the corners. It is possible, but it has sort of a messy feel to it.
>
> Hmmm, yes.
>
> Yet I don't feel all that guilty, since this is quite similar to a unique
> key on a database "username", where the "generator" (the human mind) does
> not guarantee uniqueness either. The DB just makes sure it holds.

I think the argument to be made here is that you have no control over what people choose as a username; you do have control over what your key generator outputs.

[--------------]
>> So an UPSERT is just one feature of ON CONFLICT, the other being DO
>> NOTHING. Therefore I could see an argument being made for adding other ON
>> CONFLICT clauses. How difficult/plausible that would be is above my
>> level of expertise.
>
> Mine too. But I'd say that the above wording makes exactly the point I
> was trying to make. Thank you.
>
> -R

-- 
Adrian Klaver
adrian.klaver@aklaver.com
Aside from Tom Lane's comments, it seems to me you are reinventing the wheel by generating random values for keys. Why not just use UUID http://www.postgresql.org/docs/9.5/static/datatype-uuid.html
or serial http://www.postgresql.org/docs/9.5/static/datatype-numeric.html#DATATYPE-SERIAL? Wouldn't that simplify things by ensuring uniqueness?
--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.
Rafal Pietrak wrote:

> CREATE TABLE vouchers (a_voucher bigint PRIMARY KEY default
> (random()*1000000000)::bigint, issued date default now(), .....);

Generators of truly unique pseudo-random values provide a better ground for this. Consider for example:
https://wiki.postgresql.org/wiki/Pseudo_encrypt

You may replace the round function with your own secret function, so you'll get the required randomness, secrecy and uniqueness. No need to deal with collisions on insertion as there are none.

> 2. With the current (as of 9.5) implementation I think I can always use "ON
> CONFLICT DO NOTHING" and retry the INSERT at the application level.

Yes, but retrying is now easy, let's not underappreciate that.

As a test, with 9.5alpha1, I create a table for 100k unique random numbers:

  CREATE TABLE vouchers(id int primary key);

Then try to populate it immediately with 100k rows:

  INSERT INTO vouchers
    SELECT (random()*1000000000)::int
    FROM generate_series(1,100000)
  ON CONFLICT DO NOTHING;

psql result: INSERT 0 99995

Note how 5 values conflicted right from the beginning, even though we're claiming only 10^5 out of the 10^9 output range (or 0.01%). The probability of at least one collision is pretty high; see the "birthday paradox" for the theory on that. Anyway, the collisions got eliminated without any effort from me, and that's quite useful already.

Now trying to insert 10k rows at a time:

  INSERT INTO vouchers
    SELECT (random()*1000000000)::int
    FROM generate_series(1,10000)
  ON CONFLICT DO NOTHING
  RETURNING id;

When run repeatedly, it tends to return between 9995 and 10000 values. If we want exactly N rows and we get back N-epsilon, then we need to re-ask for epsilon rows, but this converges fast to completion (that is, until you have enough values that the birthday paradox effect really kicks in).

My point is that we can now achieve this without any exception handling or transaction retry, and no plpgsql function to create, so it's really a significant improvement in ease of use. And presumably in performance too.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @ManitouMail
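For readers who don't want to follow the link: the wiki's pseudo_encrypt is roughly the following Feistel construction, reproduced here only as a sketch. As Daniel says, the inner round expression is the part you would replace with a secret function of your own.

  CREATE FUNCTION pseudo_encrypt(value int) RETURNS int AS $$
  DECLARE
      l1 int; l2 int; r1 int; r2 int; i int := 0;
  BEGIN
      -- split the 32-bit input into two 16-bit halves
      l1 := (value >> 16) & 65535;
      r1 := value & 65535;
      WHILE i < 3 LOOP
          l2 := r1;
          -- the "round function": any 16-bit mix works; keep it secret
          -- if the output must be unguessable
          r2 := l1 # ((((1366 * r1 + 150889) % 714025) / 714025.0) * 32767)::int;
          l1 := l2;
          r1 := r2;
          i  := i + 1;
      END LOOP;
      RETURN ((r1 << 16) + l1);
  END;
  $$ LANGUAGE plpgsql STRICT IMMUTABLE;

Because the construction is reversible, distinct inputs always map to distinct outputs, so feeding it nextval() from an ordinary sequence yields values that look random but can never collide - which is exactly the property being discussed here.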
Melvin Davidson wrote:

> Aside from Tom Lane's comments, it seems to me you are reinventing the wheel
> by generating random values for keys. Why not just use UUID
> http://www.postgresql.org/docs/9.5/static/datatype-uuid.html
> or serial
> http://www.postgresql.org/docs/9.5/static/datatype-numeric.html#DATATYPE-SERIAL?
> Wouldn't that simplify things by ensuring uniqueness?

UUIDs are 36 characters wide; that's too boring and error-prone for a person to type on a keyboard or to spell over the phone to an operator.

For SERIAL, it's too obvious to guess what the next one is, so malicious people could claim access codes or vouchers they don't own.

The constraint is that such codes must be reasonably short, but someone who tries to make one up must have a near-zero chance of guessing one that actually exists.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @ManitouMail
Hi Daniel:

On Sun, Jul 19, 2015 at 9:03 PM, Daniel Verite <daniel@manitou-mail.org> wrote:
> For SERIAL, it's too obvious to guess what the next one is,
> so malicious people could claim access codes or vouchers
> they don't own.

Why don't you use encryption? Specifically, only on the external side. You use a serial in the DB and send the encrypted serial as the voucher code (this way you do not need to have database-resident encryption). Then when you receive the code in the program you decrypt it and are done. And having a serial in the DB can be good for your internal operations.

Encryption: reversible and collision free, not hashing.

> The constraint is that such codes must be reasonably short, but
> someone who tries to make one up must have a near-zero chance
> of guessing one that actually exists.

If you can live with a slightly longer voucher (it seems you use 10^9 in random), you can use 2^32, which is just 9.5 digits, and search for a 32-bit block cipher (or build it yourself; it's not that hard using stream ciphers or other tools).

I also think random UUIDs are not OK - not because they are long but because they are random, and can collide (encrypted serials cannot).

Francisco Olarte.
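A sketch of that split, assuming some reversible mapping is available (for instance a keyed variant of the pseudo_encrypt sketch above); encode_voucher/decode_voucher are placeholder names, and the decode step would normally live in the application, as Francisco describes:

  -- inside the database: nothing but an ordinary serial, so no conflicts ever
  CREATE TABLE vouchers (
      id     serial PRIMARY KEY,
      issued date NOT NULL DEFAULT now()
  );

  -- on the way out: derive the code the user sees from the serial
  --   SELECT encode_voucher(id) FROM vouchers WHERE id = 42;
  -- on the way back in: invert the mapping and look the row up again
  --   SELECT * FROM vouchers WHERE id = decode_voucher($1);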
For this purpose, I have seen it recommended to use a UUID instead of a randomly generated integer. I do this myself for production applications and over millions of records I have yet to log a conflict. Also, as stated above, you could create a plpgsql function which would achieve exactly what you want (retry insert until it succeeds).
Just my 2 cents,
Deven
If I'm not mistaken, the conclusions from the posts in this thread are:

1. Recognizing a "RETRY" action as a separate case of "ON CONFLICT" transaction continuation is not generally appreciated.

2. I shouldn't expect any "hidden corruption/performance" obstacles when simply re-attempting an INSERT at the application level (should a constraint conflict arise).

3. There are methods (like a cryptographic "random" sequence) which guarantee no conflicts, so one should resort to those.

Regarding the last point. Usually I implement one-time-use vouchers as rows in a table like:

  CREATE TABLE (voucher int not null, consumed bool,
    expire timestamp not null
      default timestamp_pl_interval(now()::timestamp, '2 min'::interval),
    ...,
    unique (voucher, consumed));

with the CONSUMED column nullified when the voucher is used. The entire row of a consumed voucher is purged after clearance and verification, which happen significantly later.

Such a short-lived (when active) voucher is usually just 6 digits long, to help people enter it.

I don't know much about cryptography, but would a generic encryption function (like the one indicated by Daniel) have the same "walking through the entire range space" behaviour as the original when that range space is externally (by my application) truncated to those 6 digits? If not, would it be as efficient at conflict avoidance as when used over the original 32-bit range space?

Then again: is it really "good practice" to rely on a programmer to pick a "proper/correct encryption helper", instead of providing him/her with a database-integrated tool for a well-defined and not-so-rare usage scenario such as a "random default" for a UNIQUE/PK column?

So my conclusion from this thread is that, as this usage scenario does not seem to be foreseen by the current implementation of ON CONFLICT, a workaround exists (like the cryptographic range-walker). It being a workaround, I'd vote for some direct support of that scenario at the database level in the future.

-R
If I'm not mistaken, the conclusions from posts in this thread are:
3. there are methods (like cryptographic "random" sequence), which
guarantee no conflicts. So one should resort to that.
Some web research suggests that random sequences are not great for indexes because of the resultant "keyspace fragmentation". I'm assuming that means a low number of entries in the btree leaves, so an increase in memory usage for the index?
Just a thought.
Geoff
Geoff Winkless wrote:
> On 20 July 2015 at 14:33, Rafal Pietrak <rafal@ztk-rp.eu> wrote:
>
>> If I'm not mistaken, the conclusions from the posts in this thread are:
>>
>> 3. There are methods (like a cryptographic "random" sequence) which
>> guarantee no conflicts, so one should resort to those.
>
> Some web research suggests that random sequences are not great for indexes
> because of the resultant "keyspace fragmentation". I'm assuming that means
> a low number of entries in the btree leaves, so an increase in memory usage
> for the index?

Not sure what type of indexes would be affected by that problem, but I don't think Postgres' btrees would be.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Not sure what type of indexes would be affected by that problem, but I
don't think Postgres' btrees would be.
I admit it's not really my area.
Take it up with Drew Blas, I guess :)
https://blog.starkandwayne.com/2015/05/23/uuid-primary-keys-in-postgresql/ (see the addendum near the bottom).
My first stab at comprehending what he means:
Imagine if you have around a thousand entries in your index. In a sequential set, the btree could be (with each leaf containing 100 values)
499
299 799
100 200 300 400 500 600 700 800 900
In a (16-bit, say) random set, the btree could be
33208
21728 45220
927 15927 21729 29661 33209 34917 40121 45221 49826
Inserting a new value in a sequential key is always going to go into the 900 node. When that fills, you can add a new node and just balance the parent branches (assuming postgres doesn't try to keep a proper balance across the nodes too?).
However in the random tree a new value could be inserted into _any_ of the leaf nodes, which means you either have to create more leaf nodes than you require (in order to leave plenty of blank space, and even if you do that you will probably end up with uneven fill, which means some searches will be slower than others) or you end up having to refactor all of your nodes every time you insert a value.
Geoff
Geoff Winkless wrote:
>> Some web research suggests that random sequences are not great for indexes
>> because of the resultant "keyspace fragmentation". I'm assuming that means
>> a low number of entries in the btree leaves, so an increase in memory usage
>> for the index?

Alvaro Herrera wrote:
> Not sure what type of indexes would be affected by that problem, but I
> don't think Postgres' btrees would be.

Well, there is a caveat.

If I create a table and a couple of indexes like this:

  create table test_index_size(c1 int, c2 int, constraint U2 unique (c2));
  create index U1 on test_index_size(c1);

and populate them:

  insert into test_index_size(c1, c2)
    select round(random()*1000000), a
    from generate_series(1,1000000) a limit 100000;

and then check the size of the indexes, "select pg_relation_size('U1')" returns 2834432, while "select pg_relation_size('U2')" returns 2285568.

So the index based on the randomly populated column is bigger than the one based on the sequentially populated column. But, on the other hand, after:

  reindex table test_index_size;

both indexes are of the same size: 2260992.

Regards,
Igor Neyman
On Mon, Jul 20, 2015 at 7:01 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
If I'm not mistaken, the conclusions from posts in this thread are:
3. there are methods (like cryptographic "random" sequence), which
guarantee no conflicts. So one should resort to that.

Some web research suggests that random sequences are not great for indexes because of the resultant "keyspace fragmentation". I'm assuming that means a low number of entries in the btree leaves, so an increase in memory usage for the index?

Just a thought.
If you are inserting random values into a very large btree index, you dirty at least one randomly-located block (the leaf page of the index) per inserted row. This can cause some serious IO pain when that dirty data needs to be written out without the benefit of write-combining or sequential writes.
However, this is usually only a concern for bulk loading operations, not real time operations. If you are generating money-equivalent vouchers at a rate of thousands per second, hopefully you can afford some nice hardware to run it on.
Cheers,
Jeff
Hi Alvaro.

On Mon, Jul 20, 2015 at 4:07 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Some web research suggests that random sequences are not great for indexes
>> because of the resultant "keyspace fragmentation". I'm assuming that means
>> a low number of entries in the btree leaves, so an increase in memory usage
>> for the index?
> Not sure what type of indexes would be affected by that problem, but I
> don't think Postgres' btrees would be.

I do not know if Postgres' btrees do it, but I know you can build btree insertion code in such a way that inserting keys sequentially leads to optimally filled leaf pages and denser trees, as an optimization for an easy and common case; those trees are better than the ones produced by random insertion. So it is not that random insertion is bad, it is that ordered insertion is very good - or, put another way, random insertion is not hit by a problem, it just does not get the advantage.

Francisco Olarte.
Hi Igor:

On Mon, Jul 20, 2015 at 4:56 PM, Igor Neyman <ineyman@perceptron.com> wrote:
> Well, there is a caveat.
> If I create a table and a couple of indexes like this:
..
> and populate them:
> and then check the size of the indexes, "select pg_relation_size('U1')"
> returns 2834432, while "select pg_relation_size('U2')" returns 2285568.
> So the index based on the randomly populated column is bigger than the one
> based on the sequentially populated column.
> But, on the other hand, after:
> reindex table test_index_size;
> both indexes are of the same size: 2260992.

I would totally expect this. On reindex you get the values from a tree walk, so both of them come in order, and being a reindex (where you know the full set of values in advance, so you can plan ahead where to put the leaves, how many levels you need and how many splits) you get an even bigger advantage than plain sequential insertion gives.

Francisco Olarte.
Hi Rafal:

On Mon, Jul 20, 2015 at 3:33 PM, Rafal Pietrak <rafal@ztk-rp.eu> wrote:
> 3. There are methods (like a cryptographic "random" sequence) which
> guarantee no conflicts, so one should resort to those.
>
> Regarding the last point. Usually I implement one-time-use vouchers as
> rows in a table like:
>
>   CREATE TABLE (voucher int not null, consumed bool,
>     expire timestamp not null
>       default timestamp_pl_interval(now()::timestamp, '2 min'::interval),
>     ..., unique (voucher, consumed));
>
> with the CONSUMED column nullified when the voucher is used. The entire row
> of a consumed voucher is purged after clearance and verification, which
> happen significantly later.

A random value as a primary key is considered bad practice by many people with much experience, and so is nullifying part of a key. Many people even frown on just changing a primary key (and one of the reasons for using serials as keys in many situations is to have a guaranteed not-null, unchanging value).

> Such a short-lived (when active) voucher is usually just 6 digits long, to
> help people enter it.

Then, being random and having a narrow value domain, they make, IMNSHO, a really bad choice for primary keys.

> I don't know much about cryptography, but would a generic encryption
> function (like the one indicated by Daniel) have the same "walking through
> the entire range space" behaviour as the original when that range space is
> externally (by my application) truncated to those 6 digits? If not, would
> it be as efficient at conflict avoidance as when used over the original
> 32-bit range space?

An encryption function never has collisions (do not confuse it with a hash); if it had, you would be unable to decrypt. The problem for you is the value domain. For your example you could choose a block cipher applied to a 20-bit value. This is a moderately complex thing to find or build (from the classic cryptographic primitives nearly every language includes). It will map every distinct 20-bit input value to a distinct 20-bit output value, so your value domain will be 20-bit numbers: your inputs will be the 10^6 six-digit numbers and the outputs will be 10^6 DIFFERENT 20-bit numbers, of which you could expect about 4.8% ((2^20 - 10^6)/10^6) to have 7 digits (with a leading one in this case). To avoid that you could use 19-bit input/output numbers, or try to find a decimal cipher which works on exactly the 10^6 input values.

If you use a 32-bit block cipher it will not have collisions, but if you TRUNCATE the 32-bit (~9.5 digit) output to 6 digits, you are no longer encrypting. You may call it hashing or whatever, but that is NOT encryption; you would have collisions.

> Then again: is it really "good practice" to rely on a programmer to pick a
> "proper/correct encryption helper", instead of providing him/her with a
> database-integrated tool for a well-defined and not-so-rare usage scenario
> such as a "random default" for a UNIQUE/PK column?

Many of us are too old to get caught by this. The question is like asking "Is it good practice to hit a person with a 10 pound hammer on the head instead of giving them a cookie?". There are other options. IMO, NOT modifying a very complex chunk of code (the server code doing the inserts and checking the collision cases and acting on them, plus the parser for insert queries, plus ...) and risking all the bugs it may introduce in order to help with inserting random PKs is good practice. It doesn't matter if the requesting programmer picks a bad encryption method, keeps his old code for inserting random IDs, or introduces bugs in his program; the potential harm to the rest of the users is too great.

> So my conclusion from this thread is that, as this usage scenario does not
> seem to be foreseen by the current implementation of ON CONFLICT, a
> workaround exists (like the cryptographic range-walker). It being a
> workaround, I'd vote for some direct support of that scenario at the
> database level in the future.

Bear in mind your problem is totally ill defined. I mean, you want to insert a random 6-digit ID and want the database to keep trying until it finds a unique one. What should it do if there are already 10^6 records in the DB? Stall until a free one is found? Abort? This kind of use is very narrow, very difficult to get right, and normally confined to the application domain. Even if you choose a totally correct encryption function for collision avoidance, like the identity, you are going to have problems with your scheme.

You are not voting for anything; you need a feature proposal to vote upon. So far the only one I could extract from this thread is "something which magically solves the current Rafal problem". I would vote against that.

Francisco Olarte.
On 7/20/2015 7:01 AM, Geoff Winkless wrote:
Some web research suggests that random sequences are not great for indexes because of the resultant "keyspace fragmentation". I'm assuming that means a low number of nodes in the btree leafs, so an increase in memory usage for the index?
That suggests some folks overthink their indexing strategies and end up "overoptimized".

Anyway, a simple REINDEX fixes all sorts of index fragmentation.
-- john r pierce, recycling bits in santa cruz
Hi Rafal:

On Mon, Jul 20, 2015 at 3:33 PM, Rafal Pietrak <rafal@ztk-rp.eu> wrote:
> Regarding the last point. Usually I implement one-time-use vouchers as
> rows in a table like:
>
>   CREATE TABLE (voucher int not null, consumed bool,
>     expire timestamp not null
>       default timestamp_pl_interval(now()::timestamp, '2 min'::interval),
>     ..., unique (voucher, consumed));
>
> with the CONSUMED column nullified when the voucher is used. The entire row
> of a consumed voucher is purged after clearance and verification, which
> happen significantly later.
> Such a short-lived (when active) voucher is usually just 6 digits long, to
> help people enter it.

In this case I think you are mixing vouchers with voucher-numbers. IMO you could get a better design by using an auxiliary table and not nullifying the number after it has been consumed. Having only 6 digits, I would try:

1. Add a serial PK column to the voucher table if needed to link it with the rest of the system.

2. Create an index on voucher where consumed is true.

3. Add another table, voucher_nums, with columns voucher, order, used. Populate it with the 10^6 voucher numbers and a random order value. This also lets you switch to alphanumeric vouchers, or zap the ones with two consecutive equal digits, or whatever.

4. Make a function to select a free voucher: you can do "select from voucher_nums where not used order by order limit 1". If you put this into a WITH clause of an UPDATE ... RETURNING that sets used to true, you get a one-shot way of getting a free voucher. If you add a partial index on order where not used, you get a fast way of getting it.

5. Make another function to free a voucher number, which sets consumed to true on vouchers, and sets used to false and order to a new random number on voucher_nums.

This way you keep the old voucher numbers and you get no collisions. If you run for some years, you can see which vouchers have been used, so you can debug potential problems.

Francisco Olarte.
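A sketch of steps 3 and 4 in SQL, with a couple of assumptions beyond Francisco's description: the column is called ord here because ORDER is a reserved word, and the FOR UPDATE SKIP LOCKED guard (new in 9.5) is an addition, there to keep two concurrent sessions from picking the same row.

  -- step 3: the pool of every possible voucher number, in random order
  CREATE TABLE voucher_nums (
      voucher int PRIMARY KEY,
      ord     double precision NOT NULL,
      used    boolean NOT NULL DEFAULT false
  );

  INSERT INTO voucher_nums (voucher, ord)
  SELECT n, random() FROM generate_series(0, 999999) AS n;

  -- the partial index that keeps the "first free number" lookup fast
  CREATE INDEX voucher_nums_free ON voucher_nums (ord) WHERE NOT used;

  -- step 4: claim a free voucher number in a single statement
  WITH pick AS (
      SELECT voucher
      FROM voucher_nums
      WHERE NOT used
      ORDER BY ord
      LIMIT 1
      FOR UPDATE SKIP LOCKED
  )
  UPDATE voucher_nums v
  SET used = true
  FROM pick
  WHERE v.voucher = pick.voucher
  RETURNING v.voucher;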
Francisco,

On 21.07.2015 09:34, Francisco Olarte wrote:
> Hi Rafal:
>
> On Mon, Jul 20, 2015 at 3:33 PM, Rafal Pietrak <rafal@ztk-rp.eu> wrote:
[---------------]
>
> In this case I think you are mixing vouchers with voucher-numbers. IMO
> you could get a better design by using an auxiliary table and not
> nullifying the number after it has been consumed.

Hmmm. I don't think so.

1. I'm not nullifying the number, just the CONSUMED flag. The row stays otherwise pretty much untouched until clearing time, when it's removed from the table.

2. And I don't think I mix vouchers with voucher-numbers, since there is no distinction. To back that statement with some real-life examples of "vouchers", we have:

1) a 6-digit authorization code (a voucher) used by a payment system to confirm payment authorization;
2) a 4-8 digit one-time PIN delivered by SMS, used to open "some accounts";
3) a 6-digit SMS confirmation code used by internet banking;
4) a 14-digit voucher used to top up mobile pre-paid accounts;
5) 4-8 digit vouchers used as lunch tickets at conferences (this could possibly be a printed QR code of a UUID, since cafeterias usually have bar-code readers, but having it as a "human-sized" 6-digit PIN has its benefits too).

In all those cases "the physical problem" needs just a single N-digit number (a voucher), which is as short as its lifespan/population allows while keeping it relatively safe. The application just needs to create a unique (for a period of time) number and "consume" it at a certain point. Everything else would be "implementation burden", which should be kept to a minimum.

> I would try:
>
> 1. Add a serial PK column to the voucher table if needed to link it with
> the rest of the system.
[---------------]
> 5. Make another function to free a voucher number, which sets consumed to
> true on vouchers, and sets used to false and order to a new random number
> on voucher_nums.

This looks a bit like overkill for the above examples.

But I have other thoughts on the use of cryptographic sequences here. They have the pitfall of being sensitive to out-of-sequence poisoning. I mean: when another instance of an application starts issuing another sequence of vouchers, at some point those sequences collide, and the applications, despite using "guaranteed lack of collisions", will have a collision. So the application *will have to have* re-issuing of an INSERT implemented anyway. If so, the whole point of using a cryptographic sequence is gone. So, even though such a collision is not statistically significant, its mere possibility means the application has to take care of re-issuing the INSERT. Using a database sequence() function to seed the cipher is not secure enough.

On the other hand, the "ON CONFLICT RETRY" has a nice feature for an application programmer (like myself): it leaves us free of implementing the re-issue of an INSERT. One database-schema designer does that for all of us.

But knowing whether that usage scenario is too rare to be worth the heavy lifting the implementation would require is beyond my experience.

-R
On 21 July 2015 at 11:43, Rafal Pietrak <rafal@ztk-rp.eu> wrote:
On the other hand, the "ON CONFLICT RETRY" has a nice feature for an
application programmer (like myself) that it leaves us free of the
implementation of the re-issue of an INSERT. One database-schema
designer does that for all of us.
But knowing if that usage scenario is too rare to match the heavy
lifting the implementation required, is beyond my experience.
The usage scenario is rare, and I'm sure that doesn't help you, but the point is that it's very, very easy to write a function that does what you want. It's not easy at all to write a function that does UPSERT or DO NOTHING consistently and efficiently.
The fact that you refuse to use the mechanism provided to you by the database developers doesn't invalidate the fact that that's the simplest and easiest way to achieve what you want.
I'm sorry to be harsh (again) about this but adding extra complexity to the PG system to achieve something that is _easily_ achieved through the existing mechanisms isn't something that is likely to get widespread support.
Geoff
Hi Rafal:

On Tue, Jul 21, 2015 at 12:43 PM, Rafal Pietrak <rafal@ztk-rp.eu> wrote:
> On 21.07.2015 09:34, Francisco Olarte wrote:
>> In this case I think you are mixing vouchers with voucher-numbers. IMO
>> you could get a better design by using an auxiliary table and not
>> nullifying the number after it has been consumed.
> Hmmm. I don't think so.
....
> The application just needs to create a unique (for a period of time)
> number and "consume" it at a certain point. Everything else would be
> "implementation burden", which should be kept to a minimum.

I see your points - totally opposite opinions - so there is no point in discussing it further; discard my suggestions as not applicable, SVP.

....
> This looks a bit like overkill for the above examples.

It certainly is for your style of design and working target; discard it.

> But I have other thoughts on the use of cryptographic sequences here.

I wouldn't call them that, it's misleading. They are just encrypted sequences.

> They have the pitfall of being sensitive to out-of-sequence poisoning. I
> mean: when another instance of an application starts issuing another
> sequence of vouchers, at some point those sequences collide, and the
> applications, despite using "guaranteed lack of collisions", will have a
> collision.

Well, if you have application-instance-specific sequences, of course you do. But in that case even plain unencrypted sequences collide.

> So the application *will have to have* re-issuing of an
> INSERT implemented anyway.

Of course, because the only point of using instance-specific sequences instead of a serial, as you normally would, must be having the possibility of collisions, to justify the existence of re-issuing code and exercise it.

> If so, the whole point of using a cryptographic
> sequence is gone.

No. The whole point of using a global sequence (in the DB) is avoiding collisions, be it encrypted or plain. The whole point of using crypto is to make it look random. If you use an application-specific cryptographic sequence it is because you want collisions (success, as told above) which look random (success too). If you do not want collisions, use a global sequence.

> So, even though such a collision is not statistically significant, its
> mere possibility means the application has to take care of re-issuing
> the INSERT.

I tend to tell people there are three meaningful cardinalities in computing: zero, one and many. And three probabilities (NOT possibilities): zero, one and other==(0,1). Except in some lucky domains you have to treat every (0,1) probability as possible (in fact my three probability values map nicely to impossible, possible and certain).

> Using a database sequence() function to seed the cipher is not secure enough.

What are you talking about? Where did you get that seeding idea? You do not seed the cipher, you use the enciphered sequence value as the voucher. In fact I've done this with session IDs: I use a sequence for the ID and send the user the enciphered value. When it comes back I just decipher it and search. I did not have your 6-digit problems, so I just used 128-bit blocks, and it worked nicely. And I did not have any ciphered data in the DB.

> On the other hand, the "ON CONFLICT RETRY" has a nice feature for an
> application programmer (like myself): it leaves us free of implementing
> the re-issue of an INSERT. One database-schema designer does that for all
> of us.
>
> But knowing whether that usage scenario is too rare to be worth the heavy
> lifting the implementation would require is beyond my experience.

Saying OCR is a nice feature is like saying MAGIC RAINBOW OVERDRIVE is a nice feature for a car. It does not exist, and nobody has described it with enough detail that people can assess its usefulness or implementation difficulty. A careful definition of the corner cases would be needed. And even with one, you still have the possibility of endless collisions (either from generating too many "vouchers" or from sheer bad luck, like collisions among application instances). If you write up a nicely wrapped description of the functionality, maybe someone will see its usefulness and implement it, but I think that is possible yet unlikely.

Francisco Olarte.