Thread: index(fct(primary key)) kills INSERTs
see attached bug-report
Frank Miles <fpm@u.washington.edu> writes:
> If an index is created based on a function of the primary key,
> you cannot insert new entries into the database.

I think the critical point here is that your "function of the primary key" is actually doing a SELECT from the table:

    SELECT INTO prec * FROM test_table WHERE tt_id = dum_int;
    IF NOT FOUND THEN
        RAISE EXCEPTION ''project % not found'', dum_int;
    END IF;

When I try your example, I get

    play=> INSERT INTO test_table (tt_descr) VALUES ('third - will fail');
    ERROR:  project 3 not found

which surprises me not at all, because at the point where this function is invoked, the new record with tt_id 3 hasn't been entered into the table yet.

I'm not sure what you are really trying to accomplish here --- as you say, it's a stripped-down example and not too intelligible. As far as the example goes, you could skip the SELECT and just use the passed-in parameter value. What was the original goal that made you feel you needed to SELECT the about-to-be-inserted row?

			regards, tom lane
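[The attached bug report is not reproduced in this archive. A hedged reconstruction of the kind of setup being discussed might look like the following; the table, columns, and SELECT fragment are taken from the thread, while the function name proj_label and its return value are invented here for illustration.]

    CREATE TABLE test_table (
        tt_id    serial PRIMARY KEY,
        tt_descr text
    );

    -- Index function that (mis)uses a SELECT to re-read the row it is
    -- supposed to be indexing:
    CREATE FUNCTION proj_label(int4) RETURNS text AS '
    DECLARE
        dum_int ALIAS FOR $1;
        prec    test_table%ROWTYPE;
    BEGIN
        SELECT INTO prec * FROM test_table WHERE tt_id = dum_int;
        IF NOT FOUND THEN
            RAISE EXCEPTION ''project % not found'', dum_int;
        END IF;
        RETURN prec.tt_descr;
    END;
    ' LANGUAGE 'plpgsql';

    INSERT INTO test_table (tt_descr) VALUES ('first - ok');
    INSERT INTO test_table (tt_descr) VALUES ('second - ok');

    -- Functional index on the primary key.  (On the 7.0-era server in
    -- this thread the index is accepted; current releases reject it
    -- unless the function is -- falsely -- declared IMMUTABLE.)
    CREATE INDEX test_table_label_idx ON test_table (proj_label(tt_id));

    -- The next INSERT fires proj_label() before the new row is visible:
    INSERT INTO test_table (tt_descr) VALUES ('third - will fail');
    -- ERROR:  project 3 not found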
On Fri, 10 Nov 2000, Tom Lane wrote:

> Frank Miles <fpm@u.washington.edu> writes:
> > If an index is created based on a function of the primary key,
> > you cannot insert new entries into the database.
>
> I think the critical point here is that your "function of the primary
> key" is actually doing a SELECT from the table:
>
>     SELECT INTO prec * FROM test_table WHERE tt_id = dum_int;
>     IF NOT FOUND THEN
>         RAISE EXCEPTION ''project % not found'', dum_int;
>     END IF;
>
> When I try your example, I get
>
>     play=> INSERT INTO test_table (tt_descr) VALUES ('third - will fail');
>     ERROR:  project 3 not found
>
> which surprises me not at all, because at the point where this function
> is invoked, the new record with tt_id 3 hasn't been entered into the
> table yet.

This makes sense if the index is updated before the new record is completely added. (Showing my ignorance here:) Is that necessary? Can't the index be updated after the entry? How can it be that the table has the item (it's attempting to update the index based on a table that includes the record '3'), yet the table doesn't have it (project 3 not found)? This smells like a race condition.

> I'm not sure what you are really trying to accomplish here --- as you
> say, it's a stripped-down example and not too intelligible. As far
> as the example goes, you could skip the SELECT and just use the
> passed-in parameter value. What was the original goal that made you
> feel you needed to SELECT the about-to-be-inserted row?

I don't need the function to work for the insertion; it's only used for queries. But having the index clearly makes insertion impossible. In the real tables, I have several indexes which depend on a variety of values in the table.

Part of the database (eventually) will be used to track projects done by my group for researchers at this University and elsewhere. Unfortunately, different people at different times have used different enumeration schemes for tracking these projects. The indexes allow me to have a simple SERIAL-based enumeration scheme while still allowing the records to be accessed using the older-format enumerations. They also keep query response time reasonable for these old formats.

The conversion is roughly a two-step process. I have a set of Python routines that convert the old data formats into a form suitable for Postgres. In this part of the process, the indices are critical in avoiding sequential scans, as the Python routines query the database in order to do their part of the conversion. This has been especially important when converting 'large numbers' (hundreds) of records from various legacy data sources. Without these indices, processing can take on the order of a second per record. Once a set has been prepared, the whole batch is inserted as a single transaction.

The original data have inconsistencies that are trapped by constraints set up in the database. While this is good -- we will end up with a cleaner database -- it means that we have to go back to the data source, diagnose the problem, make corrections, re-run the Python scripts (which need the indices), then try the insert transaction again (which can't have the indices). The whole process is ugly.

Clearly these extra indices aren't necessary while I'm doing the inserts. In fact, my (hopefully temporary) workaround is to drop the indices when I need to insert new records, then restore them afterwards.
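[A rough sketch of that drop-and-restore workaround, reusing the hypothetical names from the reconstruction above; the batch contents are placeholders.]

    -- Drop the functional index so the insert transaction cannot
    -- trigger the SELECT-based function on not-yet-visible rows:
    DROP INDEX test_table_label_idx;

    BEGIN;
    INSERT INTO test_table (tt_descr) VALUES ('converted record 1');
    INSERT INTO test_table (tt_descr) VALUES ('converted record 2');
    -- ... rest of the prepared batch ...
    COMMIT;

    -- Rebuilding the index now succeeds, because every row the
    -- function SELECTs is already committed and visible:
    CREATE INDEX test_table_label_idx ON test_table (proj_label(tt_id));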
In the long term it's not critical for me, since eventually we won't be importing 'large numbers' of records -- the functional index can be dropped, and people can simply wait the second or so for a query to be processed.

I regard this bug as a 'wishlist'-level item: it seems that you should be able to do this without the database choking. It's not critical for me -- I have a workaround. I clearly know little about databases; you are undoubtedly more knowledgeable about whether my use of this kind of index is something someone else might want.

Thank you for your attention to this matter.

	-frank
Frank Miles <fpm@u.washington.edu> writes:
>> which surprises me not at all, because at the point where this function
>> is invoked, the new record with tt_id 3 hasn't been entered into the
>> table yet.

> This makes sense if the index is updated before the new record is
> completely added. (Showing my ignorance here:) Is that necessary?
> Can't the index be updated after the entry? How can it be that the
> table has the item (it's attempting to update the index based on a
> table that includes the record '3'), yet the table doesn't have it
> (project 3 not found)? This smells like a race condition.

It's not a race condition, it's a visibility condition, and it's entirely intentional. The tuple is indeed physically there in the table (else we wouldn't know where to make the index entry point), but you can't see it until the command that's inserting it is completed. This is just like the normal rule that you can't see the results of an uncommitted transaction, except it applies to individual commands within a transaction. Without this rule we'd have all sorts of problems with ghost tuples, infinite loops during UPDATE, etc.

You still haven't explained why your index function should need to use a SELECT to retrieve the tuple it's going to index. It's supposed to get all the info it needs from the passed parameter(s). Doing a SELECT is not only extremely inefficient, but opens up all sorts of interesting questions about whether your function is depending only on the tuple it's allegedly indexing. (A "functional index" that depends on more inputs than just the indexed tuple is a logical contradiction that will cause all kinds of headaches; for example, the system won't know it needs to update the index entries when those other inputs change.) Why don't you just make the function receive all the columns it needs as passed parameters?

			regards, tom lane
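[A hedged sketch of what Tom is suggesting -- an index function that takes the columns it needs as parameters and depends on nothing else -- again using hypothetical names; the label-building expression is only an illustration.]

    CREATE FUNCTION proj_label_cols(int4, text) RETURNS text AS '
    DECLARE
        p_id    ALIAS FOR $1;
        p_descr ALIAS FOR $2;
    BEGIN
        -- Same kind of label-building logic, computed purely from the
        -- passed-in column values; no SELECT back into the table.
        RETURN p_descr || '' ('' || p_id::text || '')'';
    END;
    ' LANGUAGE 'plpgsql' IMMUTABLE;
    -- IMMUTABLE is the current spelling; 7.0-era releases used
    -- WITH (iscachable) instead.

    -- Because the function depends only on the indexed row''s own
    -- columns, the index entries stay correct under INSERT and UPDATE:
    CREATE INDEX test_table_cols_idx
        ON test_table (proj_label_cols(tt_id, tt_descr));

[Note the doubled quote in the comment above is only to survive this plain-text rendering; in real SQL the comment would read "row's".]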
On Fri, 10 Nov 2000, Tom Lane wrote:

> Frank Miles <fpm@u.washington.edu> writes:
> >> which surprises me not at all, because at the point where this function
> >> is invoked, the new record with tt_id 3 hasn't been entered into the
> >> table yet.
>
> > This makes sense if the index is updated before the new record is
> > completely added. (Showing my ignorance here:) Is that necessary?
> > Can't the index be updated after the entry? How can it be that the
> > table has the item (it's attempting to update the index based on a
> > table that includes the record '3'), yet the table doesn't have it
> > (project 3 not found)? This smells like a race condition.
>
> It's not a race condition, it's a visibility condition, and it's
> entirely intentional. The tuple is indeed physically there in the table
> (else we wouldn't know where to make the index entry point), but you
> can't see it until the command that's inserting it is completed.
> This is just like the normal rule that you can't see the results of an
> uncommitted transaction, except it applies to individual commands within
> a transaction. Without this rule we'd have all sorts of problems with
> ghost tuples, infinite loops during UPDATE, etc.

Good. Makes sense.

> You still haven't explained why your index function should need to use
> a SELECT to retrieve the tuple it's going to index. It's supposed to
> get all the info it needs from the passed parameter(s). Doing a SELECT
> is not only extremely inefficient, but opens up all sorts of interesting
> questions about whether your function is depending only on the tuple
> it's allegedly indexing. (A "functional index" that depends on more
> inputs than just the indexed tuple is a logical contradiction that
> will cause all kinds of headaches; for example, the system won't know
> it needs to update the index entries when those other inputs change.)
> Why don't you just make the function receive all the columns it needs as
> passed parameters?

OK, maybe the light is slowly dawning. I do need a function with the capabilities I described, but it is a misuse of it to use it for indexing. For that, I should create an additional (new) function that takes the relevant columns as arguments, much as you describe. It will do the same processing, but won't require the SELECT.

Thanks again! Sorry for being dense...

	-frank