Thread: coalesce in plpgsql, and other style questions
I just wrote my first pl/pgsql function, and would appreciate any comments people have on it. I'll be writing a bunch of similar functions, with semantics "give me the id of the object if exists, otherwise create it and give me the id." My solution seems excessively procedural to me. I thought I could get the right semantics with something like select coalesce((select id from mytable where name='foo'), (insert into mytable ('name') values('foo') returning id)) but I could not get that to work in plgsql. Also, I wonder if calling a column 'name' is asking for trouble. The actual function is a little more complicated because the table has the possibility of canonical entries with other entries pointing to them, I use 'did' for database id to avoid confusion with various other message ids the system uses; it's for tracking emails. Here's the code. As I said, I'd love comments on any aspect of it. /* gom = get or make functions retrieve did if available, otherwise create record and return its did. */ create or replace function gom_hostids(IN hostname text, IN canonicalname text = NULL, OUT hostid bigint, OUT canonicalid bigint) language plpgsql as $$ DECLARE BEGIN select did, canonical into hostid, canonicalid from host where name = hostname; if FOUND then return; end if; if canonicalname is not NULL then select did into canonicalid from host where name = canonicalname; if not FOUND then insert into host (name) values(canonicalname) returning did into canonicalid; end if; end if; if hostname != canonical then insert into host (name, canonical) values(hostname, canonicalid) returning did into hostid; else hostid := canonicalid; return; END $$; Table definition: create table host ( did bigint primary key default (nextval('rb_id_seq')), ---database id, like oid name text, --IP address legal, usually enclosed in []. More commonly an internet domain. -- domain is a sql keyword and so I used host. canonical bigint references host (did) --if not null, preferred name for host --may refer to self ) inherits (RBObject); I'm using Posgresql 8.4. Thanks. Ross P.S. The recommended style of using "into", which puts the into near the start of selects but the end of inserts and deletes seems irregular to me. Also, I put the language clause of create function before the as clause because it seemed more natural to know that information before reading the body. I suppose if they are all the same language the language clause is just kind of noise, and that's why the examples have it at the end.
On Tue, 2012-06-12 at 10:46 -0700, Ross Boylan wrote: > I just wrote my first pl/pgsql function, and would appreciate any > comments people have on it. I'll be writing a bunch of similar > functions, with semantics "give me the id of the object if exists, > otherwise create it and give me the id." > > My solution seems excessively procedural to me. I think what you're trying to do is procedural, and that's not necessarily bad. You are trying to get an existing ID if available, or assign a new ID if not; and I assume you are doing so to save the space associated with storing the full names (essentially like dictionary encoding). > I thought I could get > the right semantics with something like > select coalesce((select id from mytable where name='foo'), > (insert into mytable ('name') values('foo') returning id)) > but I could not get that to work in plgsql. In pl/pgsql, you can't just use SELECT, you have to use something like SELECT INTO the_id ... or use PERFORM. Also, you have a potential race condition there, because someone might insert "foo" into mytable after the select (first arg to coalesce) and before the insert (second arg to coalesce). > Also, I wonder if calling a column 'name' is asking for trouble. I recommend against it. > The actual function is a little more complicated because the table has > the possibility of canonical entries with other entries pointing to > them, I use 'did' for database id to avoid confusion with various other > message ids the system uses; it's for tracking emails. > > Here's the code. As I said, I'd love comments on any aspect of it. > > /* gom = get or make functions retrieve did if available, otherwise create record > and return its did. */ > > create or replace function gom_hostids(IN hostname text, IN canonicalname text = NULL, > OUT hostid bigint, OUT canonicalid bigint) language plpgsql as $$ Sometimes it's good to declare your variables with names like in_hostname or out_canonicalid, to prevent confusion reading similarly-named variables coming from the tables. > DECLARE > BEGIN > select did, canonical into hostid, canonicalid from host > where name = hostname; > if FOUND then > return; > end if; > if canonicalname is not NULL then > select did into canonicalid from host where name = canonicalname; > if not FOUND then > insert into host (name) values(canonicalname) returning did into canonicalid; > end if; > end if; > if hostname != canonical then Is canonical a proper variable here? It's not in the argument list, and it's not DECLAREd. Did you mean canonicalname? > insert into host (name, canonical) values(hostname, canonicalid) > returning did into hostid; > else > hostid := canonicalid; I would recommend keeping more explicit track of NULL values and what should happen to them. It looks like a NULL canonicalname would fall into this else branch (I could be mistaken), but it's a little hard to follow exactly what you want to happen. A few comments would help. > return; > END > $$; Again, you have a potential race between checking for the canonicalname in host, and inserting it into host when it doesn't exist. If another transaction runs in between those two, you will get duplicates in the host table. > Table definition: > create table host ( > did bigint primary key default (nextval('rb_id_seq')), ---database id, like oid > name text, --IP address legal, usually enclosed in []. More commonly an internet domain. > -- domain is a sql keyword and so I used host. > canonical bigint references host (did) --if not null, preferred name for host > --may refer to self > ) inherits (RBObject); > > I'm using Posgresql 8.4. I recommend a unique constraint on host.name. That will catch the race conditions I mentioned above, and turn them into errors. You can catch those errors in plpgsql, allowing you to do what you want in that case. Regards, Jeff Davis
Thanks for your responses. Some followup below. On Tue, 2012-06-12 at 16:30 -0700, Jeff Davis wrote: > On Tue, 2012-06-12 at 10:46 -0700, Ross Boylan wrote: > > I just wrote my first pl/pgsql function, and would appreciate any > > comments people have on it. I'll be writing a bunch of similar > > functions, with semantics "give me the id of the object if exists, > > otherwise create it and give me the id." > > > > My solution seems excessively procedural to me. > > I think what you're trying to do is procedural, and that's not > necessarily bad. You are trying to get an existing ID if available, or > assign a new ID if not; and I assume you are doing so to save the space > associated with storing the full names (essentially like dictionary > encoding). > > > I thought I could get > > the right semantics with something like > > select coalesce((select id from mytable where name='foo'), > > (insert into mytable ('name') values('foo') returning id)) > > but I could not get that to work in plgsql. > > In pl/pgsql, you can't just use SELECT, you have to use something like > SELECT INTO the_id ... or use PERFORM. > > Also, you have a potential race condition there, because someone might > insert "foo" into mytable after the select (first arg to coalesce) and > before the insert (second arg to coalesce). Practically, it's just me so there shouldn't be any risk. But I'd like to understand the general issue. I thought transaction would take care of this, so that within a transaction the state of the database does not change from actions in other sessions. Then if I commit and have conflict, the commit fails. I guess the sequence I'm using for did assures that did is unique across all transactions, and so the 2 transactions would not be in conflict, since they have different primary keys. As you said later, a unique constraint on one/some of the other fields will at least prevent bogus records that are "the same" from my point of view. I was a bit on the fence about making it unique, since I thought I might have different hosts that shared the same name. But those scenarios are unlikely, and I really don't want to deal with it if it does happen. But is my whole model of how transactions are operating off? I'm basically generalizing from Gemstone, an object database. > > > Also, I wonder if calling a column 'name' is asking for trouble. > > I recommend against it. I'll change it. > > > The actual function is a little more complicated because the table has > > the possibility of canonical entries with other entries pointing to > > them, I use 'did' for database id to avoid confusion with various other > > message ids the system uses; it's for tracking emails. > > > > Here's the code. As I said, I'd love comments on any aspect of it. > > > > /* gom = get or make functions retrieve did if available, otherwise create record > > and return its did. */ > > > > create or replace function gom_hostids(IN hostname text, IN canonicalname text = NULL, > > OUT hostid bigint, OUT canonicalid bigint) language plpgsql as $$ > > Sometimes it's good to declare your variables with names like > in_hostname or out_canonicalid, to prevent confusion reading > similarly-named variables coming from the tables. I'm slowly discovering the problems of name clashes; thanks for the suggestion. > > > DECLARE BTW, is DECLARE necessary if there are no declarations? > > BEGIN > > select did, canonical into hostid, canonicalid from host > > where name = hostname; > > if FOUND then > > return; > > end if; > > if canonicalname is not NULL then > > select did into canonicalid from host where name = canonicalname; > > if not FOUND then > > insert into host (name) values(canonicalname) returning did into canonicalid; > > end if; > > end if; > > if hostname != canonical then > > Is canonical a proper variable here? It's not in the argument list, and > it's not DECLAREd. Did you mean canonicalname? canonical is a column name in the table. Perhaps canonical_did would be more appropriate for it (and rename the output parameter out_canonical_did from canonicalid). > > > insert into host (name, canonical) values(hostname, canonicalid) > > returning did into hostid; > > else > > hostid := canonicalid; > > I would recommend keeping more explicit track of NULL values and what > should happen to them. It looks like a NULL canonicalname would fall > into this else branch (I could be mistaken), that's right. > but it's a little hard to > follow exactly what you want to happen. A few comments would help. I trimmed a comment before the start of the function; it doesn't exactly hit that issue, but it does discuss NULL's: /* Note that if this function creates a canonical host, the canonical field of that host will be NULL rather than pointing to itself. This function may return canonicalid = NULL; it does not currently set canonicalid to hostid in that case. Also if caller supplies an existing hostname with a bogus canonicalname the canonicalid on file of that hostname will be returned. */ > > > return; > > END > > $$; > > Again, you have a potential race between checking for the canonicalname > in host, and inserting it into host when it doesn't exist. If another > transaction runs in between those two, you will get duplicates in the > host table. > > > Table definition: > > create table host ( > > did bigint primary key default (nextval('rb_id_seq')), ---database id, like oid > > name text, --IP address legal, usually enclosed in []. More commonly an internet domain. > > -- domain is a sql keyword and so I used host. > > canonical bigint references host (did) --if not null, preferred name for host > > --may refer to self > > ) inherits (RBObject); > > > > I'm using Posgresql 8.4. > > I recommend a unique constraint on host.name. That will catch the race > conditions I mentioned above, and turn them into errors. You can catch > those errors in plpgsql, allowing you to do what you want in that case. > > Regards, > Jeff Davis >
On Tue, 2012-06-12 at 17:42 -0700, Ross Boylan wrote: > Practically, it's just me so there shouldn't be any risk. But I'd like > to understand the general issue. I thought transaction would take care > of this, so that within a transaction the state of the database does not > change from actions in other sessions. Then if I commit and have > conflict, the commit fails. That is true in SERIALIZABLE mode in version 9.1 or later. Perhaps that will eventually be the default. But a UNIQUE constraint on host.name will fix your particular problem nicely and efficiently, so you don't have to be on 9.1 (though I encourage you to upgrade to 9.1.4 if you can). And it's just good practice to declare such constraints unless there's a reason not to. > I guess the sequence I'm using for did assures that did is unique across > all transactions, and so the 2 transactions would not be in conflict, > since they have different primary keys. You are using sequences (nextval() operates on a sequence) and a sequence will never produce the same value from nextval() (unless you explicitly reset it). You have a UNIQUE constraint on did, but even if you didn't, then the values should be unique. However, I still recommend keeping the UNIQUE constraint on host.did unless there's a reason not to. > But is my whole model of how transactions are operating off? I'm > basically generalizing from Gemstone, an object database. I think you have the right idea, and 9.1 SERIALIZABLE mode makes what you say true. But, unfortunately, that is not true in 9.0 or before (nor in lower isolation modes, like the READ COMMITTED default). If you're interested in the details, please read the 8.4 explanation here: http://www.postgresql.org/docs/8.4/static/transaction-iso.html thoroughly, and then see the 9.1 version: http://www.postgresql.org/docs/8.4/static/transaction-iso.html Each of these isolation modes has a purpose, but in my opinion, you should use SERIALIZABLE in version 9.1+ (a.k.a. "true serializability") unless you understand the other modes and have a reason to use one of the other ones. I say this because true serializability matches your intuitive understanding, while the other modes have some subtle behaviors that might surprise you. > > > DECLARE > BTW, is DECLARE necessary if there are no declarations? Nope. > > > BEGIN > > > select did, canonical into hostid, canonicalid from host > > > where name = hostname; > > > if FOUND then > > > return; > > > end if; > > > if canonicalname is not NULL then > > > select did into canonicalid from host where name = canonicalname; > > > if not FOUND then > > > insert into host (name) values(canonicalname) returning did into canonicalid; > > > end if; > > > end if; > > > if hostname != canonical then > > > > Is canonical a proper variable here? It's not in the argument list, and > > it's not DECLAREd. Did you mean canonicalname? > canonical is a column name in the table. Perhaps canonical_did would be > more appropriate for it (and rename the output parameter > out_canonical_did from canonicalid). Does the function actually work like it is now? It looks like "canonical" is being used outside of the scope of any query, so I don't see how it comes from the table. Maybe I'm still confused. Regards, Jeff Davis
On Tue, 2012-06-12 at 18:17 -0700, Jeff Davis wrote: > On Tue, 2012-06-12 at 17:42 -0700, Ross Boylan wrote: > > > > BEGIN > > > > select did, canonical into hostid, canonicalid from host > > > > where name = hostname; > > > > if FOUND then > > > > return; > > > > end if; > > > > if canonicalname is not NULL then > > > > select did into canonicalid from host where name = canonicalname; > > > > if not FOUND then > > > > insert into host (name) values(canonicalname) returning did into canonicalid; > > > > end if; > > > > end if; > > > > if hostname != canonical then > > > > > > Is canonical a proper variable here? It's not in the argument list, and > > > it's not DECLAREd. Did you mean canonicalname? > > canonical is a column name in the table. Perhaps canonical_did would be > > more appropriate for it (and rename the output parameter > > out_canonical_did from canonicalid). > > Does the function actually work like it is now? No :) Thanks for your persistence. I made some changes after I tested it. That last bit should be if hostname != canonicalname then insert into host (name, canonical) values(hostname, canonicalid) returning did into hostid; else hostid := canonicalid; end if; return; There was a missing end if too. I may rethink my indenting the "end if" with the previous block, at least if I have an "else". I'm also not crazy about how emacs is handling indentation. Ross > It looks like > "canonical" is being used outside of the scope of any query, so I don't > see how it comes from the table. Maybe I'm still confused. > > Regards, > Jeff Davis >
On 2012-06-12 20:42, Ross Boylan wrote: > Thanks for your responses. Some followup below. > On Tue, 2012-06-12 at 16:30 -0700, Jeff Davis wrote: >> On Tue, 2012-06-12 at 10:46 -0700, Ross Boylan wrote: >> > I just wrote my first pl/pgsql function, and would appreciate any >> > comments people have on it. I'll be writing a bunch of similar >> > functions, with semantics "give me the id of the object if exists, >> > otherwise create it and give me the id." >> > >> > My solution seems excessively procedural to me. >> >> I think what you're trying to do is procedural, and that's not >> necessarily bad. You are trying to get an existing ID if available, >> or >> assign a new ID if not; and I assume you are doing so to save the >> space >> associated with storing the full names (essentially like dictionary >> encoding). >> >> > I thought I could get >> > the right semantics with something like >> > select coalesce((select id from mytable where name='foo'), >> > (insert into mytable ('name') values('foo') returning id)) >> > but I could not get that to work in plgsql. >> >> In pl/pgsql, you can't just use SELECT, you have to use something >> like >> SELECT INTO the_id ... or use PERFORM. >> >> Also, you have a potential race condition there, because someone >> might >> insert "foo" into mytable after the select (first arg to coalesce) >> and >> before the insert (second arg to coalesce). > Practically, it's just me so there shouldn't be any risk. But I'd > like > to understand the general issue. I thought transaction would take > care > of this, so that within a transaction the state of the database does > not > change from actions in other sessions. Then if I commit and have > conflict, the commit fails. > > I guess the sequence I'm using for did assures that did is unique > across > all transactions, and so the 2 transactions would not be in conflict, > since they have different primary keys. > > As you said later, a unique constraint on one/some of the other > fields > will at least prevent bogus records that are "the same" from my point > of > view. I was a bit on the fence about making it unique, since I > thought > I might have different hosts that shared the same name. But those > scenarios are unlikely, and I really don't want to deal with it if it > does happen. > > But is my whole model of how transactions are operating off? I'm > basically generalizing from Gemstone, an object database. A transaction means you see a consistent view of the data in the database - that view is consistent and valid through the life of the transaction, but no other transaction needs to see the same view. Other transactions may be going on at the same time, and changing data as they see fit - you will see those changes within your transaction, but only once they've finished their transaction. Transactions do not lock tables or rows, even at the serializable level. What you appear to be thinking is that the transaction locks the tables, and then tries to do a resolution of the lock at the end of the transaction - serializable comes close to that, if everyone is modifying the same table, but not quite. Even under serializable, you'd still be able to insert invalid data using the 'select, then insert if not in select' logic. (The difference is that under serializable, you couldn't check to see if it had happened afterwards - at least not in the same transaction.) You need either a unique constraint - so the database checks as part of the insert operation whether there is another of the same value - or a table lock. A table lock would allow you to set that no one else is allowed to modify the table until you are done with it. Your basic assumption for how transactions work is close to how simple databases work, that basically lock the whole table or database while you are working on it. Which is simple and cheap to do - as long as you aren't doing a whole lot in the database. If you have a lot of action going on in the database, the lock resolution eventually overwhelms the simplicity - which is the whole reason why there are databases like Postgresql, which can maintain good performance and data integrity without that locking. Daniel T. Staal --------------------------------------------------------------- This email copyright the author. Unless otherwise noted, you are expressly allowed to retransmit, quote, or otherwise use the contents for non-commercial purposes. This copyright will expire 5 years after the author's death, or in 30 years, whichever is longer, unless such a period is in excess of local copyright law. ---------------------------------------------------------------
On Tue, Jun 12, 2012 at 12:46 PM, Ross Boylan <ross@biostat.ucsf.edu> wrote: > I just wrote my first pl/pgsql function, and would appreciate any > comments people have on it. I'll be writing a bunch of similar > functions, with semantics "give me the id of the object if exists, > otherwise create it and give me the id." > > My solution seems excessively procedural to me. I thought I could get > the right semantics with something like > select coalesce((select id from mytable where name='foo'), > (insert into mytable ('name') values('foo') returning id)) > but I could not get that to work in plgsql. for posterity: with a as (select id from mytable where name='foo'), b as ( insert into mytable (name) select 'foo' where not exists (select 1 from a) returning id ) select * from a union all select * from b; I definitely appreciate the desire to avoid procedural implementations of things like this. Just be advised that this is a postgresql-ism (data modifying 'with' is not standard syntax). This also (as Jeff notes) has no bearing on the race to the id: you must be prepared to retry the above statement in face of concurrent attempts to insert to the same unique value unless you have taken a lock to guard against this. I don't think it's possible to work that lock into the CTE. merlin
On Wed, 2012-06-13 at 09:02 -0500, Merlin Moncure wrote: > On Tue, Jun 12, 2012 at 12:46 PM, Ross Boylan <ross@biostat.ucsf.edu> wrote: > > I just wrote my first pl/pgsql function, and would appreciate any > > comments people have on it. I'll be writing a bunch of similar > > functions, with semantics "give me the id of the object if exists, > > otherwise create it and give me the id." > > > > My solution seems excessively procedural to me. I thought I could get > > the right semantics with something like > > select coalesce((select id from mytable where name='foo'), > > (insert into mytable ('name') values('foo') returning id)) > > but I could not get that to work in plgsql. > > for posterity: > > with a as (select id from mytable where name='foo'), > b as > ( > insert into mytable (name) > select 'foo' where not exists (select 1 from a) > returning id > ) > select * from a union all select * from b; Oh my! Is that legal plpgsql code, or just regular (postgres) sql? Also, what's CTE (below)? Thanks. Ross > > I definitely appreciate the desire to avoid procedural implementations > of things like this. Just be advised that this is a postgresql-ism > (data modifying 'with' is not standard syntax). This also (as Jeff > notes) has no bearing on the race to the id: you must be prepared to > retry the above statement in face of concurrent attempts to insert to > the same unique value unless you have taken a lock to guard against > this. I don't think it's possible to work that lock into the CTE. > > merlin
On Wed, Jun 13, 2012 at 1:34 PM, Ross Boylan <ross@biostat.ucsf.edu> wrote: > On Wed, 2012-06-13 at 09:02 -0500, Merlin Moncure wrote: >> > My solution seems excessively procedural to me. I thought I could get >> > the right semantics with something like >> > select coalesce((select id from mytable where name='foo'), >> > (insert into mytable ('name') values('foo') returning id)) >> > but I could not get that to work in plgsql. >> >> for posterity: >> >> with a as (select id from mytable where name='foo'), >> b as >> ( >> insert into mytable (name) >> select 'foo' where not exists (select 1 from a) >> returning id >> ) >> select * from a union all select * from b; > Oh my! > > Is that legal plpgsql code, or just regular (postgres) sql? It is both (but only in 9.1+, sorry!). It's plain SQL, so is acceptable in any place sql is allowed -- directly from the client, sql functions, plpgsql functions, etc. The ability to chain 'returning' into other queries via 'with' was a new feature which we call 'data modifying with' added as of postgresql 9.1. Vanilla CTEs aka common table exrpressions aka WITH statements -- were added in 8.4 but you can only use them with select statements. Aside: I encourage you to continue with pl/pgsql. It's the secret sauce. merlin
On Wed, 2012-06-13 at 09:52 -0400, Daniel Staal wrote: > What you appear to be thinking is that the transaction locks the > tables, and then tries to do a resolution of the lock at the end of the > transaction - serializable comes close to that, if everyone is modifying > the same table, but not quite. Even under serializable, you'd still be > able to insert invalid data using the 'select, then insert if not in > select' logic. (The difference is that under serializable, you couldn't > check to see if it had happened afterwards - at least not in the same > transaction.) Your statement was true in 9.0 and before, but in 9.1, SERIALIZABLE means *truly serializable*. Try it, and see. The 'select, then insert if not in select' logic will throw a serialization error if another transaction races it, even if there is no UNIQUE. > You need either a unique constraint I agree that a unique constraint is the right way to do it, because it's a declarative constraint. > Your basic assumption for how transactions work is close to how simple > databases work, that basically lock the whole table or database while > you are working on it. Which is simple and cheap to do - as long as you > aren't doing a whole lot in the database. If you have a lot of action > going on in the database, the lock resolution eventually overwhelms the > simplicity - which is the whole reason why there are databases like > Postgresql, which can maintain good performance and data integrity > without that locking. I strongly encourage you to do some experimentation on 9.1+ with serializable transactions (all transactions must be serializable for it to work). See if you can find any anomalies, or any performance degradation. The only expected performance degradation (aside from very strange cases) is that there will be serialization errors, and you'll need to retry those transactions. It does not cause any transactions to block that wouldn't otherwise. Think of it as a magic mode that turns SQL race conditions into errors. You should still use appropriate locking and declarative constraints, because those will allow more transactions to succeed (obviously, under intense workloads you don't want a high rollback rate). And declarative constraints also add to the readability/maintainability. Regards, Jeff Davis
--As of June 13, 2012 12:10:18 PM -0700, Jeff Davis is alleged to have said: > Your statement was true in 9.0 and before, but in 9.1, SERIALIZABLE > means *truly serializable*. You're right; sorry. Bad me for relying on faulty memory and an old manual. ;) >> You need either a unique constraint > > I agree that a unique constraint is the right way to do it, because it's > a declarative constraint. This would be my preference as well, regardless of the rest. If you find yourself trying to implement something that's solvable as an SQL constraint with application-level code, you are doing something wrong. >> Your basic assumption for how transactions work is close to how simple >> databases work, that basically lock the whole table or database while >> you are working on it. Which is simple and cheap to do - as long as you >> aren't doing a whole lot in the database. If you have a lot of action >> going on in the database, the lock resolution eventually overwhelms the >> simplicity - which is the whole reason why there are databases like >> Postgresql, which can maintain good performance and data integrity >> without that locking. > > I strongly encourage you to do some experimentation on 9.1+ with > serializable transactions (all transactions must be serializable for it > to work). > > See if you can find any anomalies, or any performance degradation. The > only expected performance degradation (aside from very strange cases) is > that there will be serialization errors, and you'll need to retry those > transactions. It does not cause any transactions to block that wouldn't > otherwise. I would in no way expect Postgresql to implement that change unless they were sure it didn't cause any major issues. I'm impressed though that they managed it in the first place. ;) My statement above was to address why you would tend to find this type of behavior on low-end databases as the default: Because it's cheap and easy to implement poorly. Just lock everything. Implementing it well is another thing entirely. Daniel T. Staal --------------------------------------------------------------- This email copyright the author. Unless otherwise noted, you are expressly allowed to retransmit, quote, or otherwise use the contents for non-commercial purposes. This copyright will expire 5 years after the author's death, or in 30 years, whichever is longer, unless such a period is in excess of local copyright law. ---------------------------------------------------------------