Thread: Avoiding rewrite in ALTER TABLE ALTER TYPE
ALTER TABLE ALTER TYPE always rewrites the table heap and its indexes. In some cases, we can determine that doing so is unhelpful, and that the conversion shall always succeed:

  CREATE DOMAIN loosedom AS text;
  CREATE TABLE t (c varchar(2));
  ALTER TABLE t ALTER c TYPE varchar(4);
  ALTER TABLE t ALTER c TYPE text;
  ALTER TABLE t ALTER c TYPE loosedom;

In other cases, we can determine that the rewrite is unhelpful, but a cast could still throw an error:

  CREATE DOMAIN tightdom AS text CHECK (value LIKE '<%/>');
  CREATE TABLE t (c text);
  ALTER TABLE t ALTER c TYPE xml USING c::xml;
  ALTER TABLE t ALTER c TYPE varchar(64);
  ALTER TABLE t ALTER c TYPE tightdom;

I wish to replace table rewrites with table verification scans where possible, then skip those verification scans where possible. I've noted three subproblems that require distinct treatment. In the following discussion, "tuple" is the tuple on disk, and "tuple'" is the tuple we would form during a rewriting ALTER TABLE ALTER TYPE. Likewise "col" and "col'" for individual columns.

1. The table heap rewrite itself

To legitimately leave the table heap intact, we must confirm datumIsEqual(col, col') for every tuple and column. Without any specific configuration to suggest when this might be true, we can always determine its truth through a _verification scan_ of the table. If the datumIsEqual test fails for any tuple during the verification scan, we terminate the scan and proceed with a table rewrite.

When should we attempt the verification scan in the first place, as opposed to proceeding directly to a rewrite? For some time, I thought it best to attempt it automatically when the type change degenerates to a single binary coercion. Cases like text->xml and timestamp->timestamptz [timezone=UTC] are more difficult to identify mechanically, so I proposed letting the user identify them via a new keyword to ALTER TABLE ALTER TYPE.
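The verification-scan logic can be modeled with a small sketch (Python purely as illustration; the real code would live near ATRewriteTable in C, and the names below are hypothetical, with Python equality standing in for datumIsEqual):

```python
def verification_scan(values, transform):
    """Return True when every transformed value is bit-identical to the
    original (the no-rewrite case); stop at the first mismatch."""
    for col in values:
        col_prime = transform(col)   # plays the role of the USING expression
        if col_prime != col:         # stands in for datumIsEqual(col, col')
            return False             # negative scan: fall back to a rewrite
    return True                      # positive scan: leave the heap intact

# A widening change such as varchar(6)->varchar(8) never alters the bits:
assert verification_scan(['<abc/>', '<de/>'], lambda c: c) is True
# A truncating USING clause changes at least one value, forcing a rewrite:
assert verification_scan(['<abc/>', '<de/>'], lambda c: c[:5]) is False
```

The key property is that a negative result only costs the pages read before the first mismatch.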
Having thought on it more, though, it actually seems best to attempt the verification scan *every* time. In most ineligible conversions, an inequality will appear very early, so the scan is effectively O(1) in the negative case. A notable exception is something like char(6)->varchar(6) in a table with a billion tuples having length(col) = 6 and one with length(col) = 5. The verification scan might read most of the table before finding the one tuple that forces a rewrite. That isn't a particularly regular scenario in my experience, so the "just do the right thing" aspect of preceding every potential rewrite with a verification scan seems to win out.

Certain very popular type changes (see introduction) can be _exempt_ from the verification scan: we can determine that they will always succeed. To capture that, I propose extending CREATE CAST with the notion of an exemptor function:

  CREATE CAST (source_type AS target_type)
      { WITH FUNCTION function_name (argument_type [, ...])
            [ WITH EXEMPTOR function_name ] |
        WITHOUT FUNCTION |
        WITH INOUT }
      [ AS ASSIGNMENT | AS IMPLICIT ]

The exemptor shall have this signature:

  exemptor_func(
      integer,  -- source_typmod
      integer   -- dest_typmod
  ) RETURNS boolean

The exemptor shall return true iff datumIsEqual(x, x::target_type(dest_typmod)) for every x in source_type most recently coerced to source_type(source_typmod). When the cast is WITHOUT FUNCTION (a binary coercion) and target_type lacks a length coercion cast, the cast has an implicit exemption, and an exemptor is superfluous: code can assume an exemptor that always returns true. Use of WITH EXEMPTOR mainly makes sense alongside multi-arg WITH FUNCTION (we should probably raise an error upon other uses).

When ATPrepAlterColumnType determines that a transformation expression is merely a series of casts (or perhaps some stronger condition), and every such cast has an exemptor or implicit exemption, it will call those exemptors with the old and new typmods.
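To illustrate the intended exemptor semantics, here is a sketch of what the varchar length-coercion exemptor would decide (Python as pseudocode; the real exemptor would be a C function cataloged via pg_cast, and the VARHDRSZ framing reflects how varchar typmods are stored today):

```python
# A varchar typmod stores the declared length plus VARHDRSZ (4);
# -1 means "unconstrained length".
VARHDRSZ = 4

def varchar_exemptor(source_typmod: int, dest_typmod: int) -> bool:
    """True iff every value fitting the source typmod is bit-identical
    after coercion to the destination typmod, i.e. no rewrite needed."""
    if dest_typmod < 0:
        return True                      # varchar(n) -> varchar: pure loosening
    if source_typmod < 0:
        return False                     # unconstrained source may be too long
    return dest_typmod >= source_typmod  # widening or no-op never truncates

assert varchar_exemptor(2 + VARHDRSZ, 4 + VARHDRSZ)      # varchar(2)->varchar(4)
assert not varchar_exemptor(8 + VARHDRSZ, 4 + VARHDRSZ)  # varchar(8)->varchar(4)
assert varchar_exemptor(8 + VARHDRSZ, -1)                # varchar(8)->varchar
```

Note the asymmetry: a narrowing change returns false, but that only means "not provably safe"; the verification scan can still discover that the narrowing happens to be a no-op for the data at hand.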
If the exemptors all return true, and this continues to hold for all other ALTER TYPE subcommands of this ALTER TABLE, we will skip the verification scan and table rewrite. I will add exemptors to the length coercion casts of types other than "character" and "bit", which are exempt only in the no-change case. The candidates that are not pure length coercions, int8->bit and int4->bit, would not benefit.

2. CHECK constraints, index predicates, and expression indexes

CHECK constraints, index predicates, and indexed expressions can change behavior without limitation when the types of columns they reference change. An exempt typmod change alone will not affect them. While there are some other safe cases ("CHECK (length(col) = 2)" over a varchar->text conversion), I don't have a general framework in mind for identifying them. For now, an exempt change that includes a type change, such as varchar(2)->text, will still require a verification scan whenever a CHECK constraint, index predicate, or indexed expression references the column. The verification scan will confirm that datumIsEqual(expr(tuple), expr(tuple')) for all tuples and all affected expressions. Are there any nice ways to detect a broader selection of the expressions that do not require verification?

3. reindex when an operator class change crosses operator families

Having avoided a heap rewrite, we may still need to reindex indexes that just received new operator classes. Consider a type "reverseint4" that stores just like an int4, but whose default btree operator class sorts opposite to int4_ops. Converting int4->reverseint4 would not touch the table heap, but we would need to reindex any indexes on the changing column. I'm not aware of any affected core data types (albeit not having investigated systematically), but we still need to get this case right. Conveniently, operator families cover exactly this notion of operator class compatibility.
When the old and new operator classes share an operator family, we can skip the reindex without verification. This covers the primary case of varchar/text, plus a bonus of timestamp<->timestamptz and inet<->cidr. Indexes on expressions can use the same check; we will have verified the actual bits of the computed expressions during the verification scan. Index predicates need no special handling at this stage. For changes having no operator family to vindicate them, we can just reindex. Though we could introduce an index variant of the verification scan, confirming that v_n::target_type <= v_n+1::target_type for all values, it seems relatively unimportant: operator families cover the notable use cases nicely.

4. NOTE: other constraint types

UNIQUE, PRIMARY KEY and EXCLUDE constraints remain valid as long as the indexes implementing them remain valid. This proposal does not affect foreign key constraints.

I see this breaking down into various patches:

1. Add the verification scan to the vicinity of ATRewriteTable and skip the table rewrite when it finds no changes. Full reindex remains.
2. In the no-rewrite case, narrow us from reindexing the entire table to merely reindexing those indexes touched by the operator class changes.
3. Further skip reindex operations when the operator class has changed but the operator family has not.
4. Modify ATPrepAlterColumnType to detect implicit exemptions, and modify ATRewriteTable to skip the verification scan in light of those discoveries.
5. Add pg_cast.castexemptor and modify CREATE CAST to populate it. Define an exemptor for the varchar length coercion cast. Modify ATPrepAlterColumnType to use pg_cast.castexemptor and combine their outcomes with implicit exemptions.
6. Define exemptors for time, timetz, timestamp, timestamptz, and interval.
7. Define exemptor for bit varying.
8. Define exemptor for numeric.
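The operator-family test in patch 3 can be sketched as follows (Python as illustration; the mapping below is a hypothetical stand-in for the pg_opclass.opcfamily lookup, and the family names are invented for the example):

```python
# Hypothetical catalog fragment: operator class -> operator family.
# In PostgreSQL this mapping lives in pg_opclass (opcfamily column).
OPCLASS_FAMILY = {
    'text_ops':        'text_btree_family',
    'varchar_ops':     'text_btree_family',     # varchar shares text's family
    'timestamp_ops':   'datetime_btree_family',
    'timestamptz_ops': 'datetime_btree_family',
    'int4_ops':        'integer_btree_family',
    'reverseint4_ops': 'reverseint4_family',    # the hypothetical type above
}

def needs_reindex(old_opclass: str, new_opclass: str) -> bool:
    """Reindex only when the opclass change crosses operator families:
    same family guarantees compatible sort order, so the index stays valid."""
    return OPCLASS_FAMILY[old_opclass] != OPCLASS_FAMILY[new_opclass]

assert not needs_reindex('varchar_ops', 'text_ops')           # skip reindex
assert not needs_reindex('timestamp_ops', 'timestamptz_ops')  # skip reindex
assert needs_reindex('int4_ops', 'reverseint4_ops')           # must reindex
```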
These threads, referenced from the TODO entry concerning a subset of this plan, were most helpful in revealing the scope of problems to address:

http://archives.postgresql.org/message-id/200903040137.n241bAUV035002@wwwmaster.postgresql.org
http://archives.postgresql.org/pgsql-patches/2006-10/msg00154.php

What have I missed? I welcome any corrections/unaddressed problems/questions.

Thanks,
nm
On Dec 29, 2010, at 7:56 AM, Noah Misch <noah@leadboat.com> wrote:
> ALTER TABLE ALTER TYPE always rewrites the table heap and its indexes. In some cases, we can determine that doing so is unhelpful, and that the conversion shall always succeed:
>
>   CREATE DOMAIN loosedom AS text;
>   CREATE TABLE t (c varchar(2));
>   ALTER TABLE t ALTER c TYPE varchar(4);
>   ALTER TABLE t ALTER c TYPE text;
>   ALTER TABLE t ALTER c TYPE loosedom;
>
> In other cases, we can determine that the rewrite is unhelpful, but a cast could still throw an error:
>
>   CREATE DOMAIN tightdom AS text CHECK (value LIKE '<%/>');
>   CREATE TABLE t (c text);
>   ALTER TABLE t ALTER c TYPE xml USING c::xml;
>   ALTER TABLE t ALTER c TYPE varchar(64);
>   ALTER TABLE t ALTER c TYPE tightdom;
>
> I wish to replace table rewrites with table verification scans where possible, then skip those verification scans where possible.

Seems like a good idea.

> Having thought on it more, though, it actually seems best to attempt the verification scan *every* time. In most ineligible conversions, an inequality will appear very early, so the scan is effectively O(1) in the negative case. A notable exception is something like char(6)->varchar(6) in a table with a billion tuples having length(col) = 6 and one with length(col) = 5. The verification scan might read most of the table before finding the one tuple that forces a rewrite. That isn't a particularly regular scenario in my experience, so the "just do the right thing" aspect of preceding every potential rewrite with a verification scan seems to win out.

I think this scenario will be more common than you might think. Tables don't contain random data; they contain data that the DBA thinks is valid. The situation where the data is mostly as you expect but with a few kooky rows is, in my experience, extremely common. And it makes the worst case a LOT worse.
I really doubt this is worth the complexity anyway - converting between two types that are mostly-but-not-quite binary compatible seems like an edge case in every sense.

> Certain very popular type changes (see introduction) can be _exempt_ from the verification scan: we can determine that they will always succeed. To capture that, I propose extending CREATE CAST with the notion of an exemptor function:
>
>   CREATE CAST (source_type AS target_type)
>       { WITH FUNCTION function_name (argument_type [, ...])
>             [ WITH EXEMPTOR function_name ] |
>         WITHOUT FUNCTION |
>         WITH INOUT }
>       [ AS ASSIGNMENT | AS IMPLICIT ]
>
> The exemptor shall have this signature:
>
>   exemptor_func(
>       integer,  -- source_typmod
>       integer   -- dest_typmod
>   ) RETURNS boolean
>
> The exemptor shall return true iff datumIsEqual(x, x::target_type(dest_typmod)) for every x in source_type most recently coerced to source_type(source_typmod). When the cast is WITHOUT FUNCTION (a binary coercion) and target_type lacks a length coercion cast, the cast has an implicit exemption, and an exemptor is superfluous: code can assume an exemptor that always returns true. Use of WITH EXEMPTOR mainly makes sense alongside multi-arg WITH FUNCTION (should probably raise an error upon other uses).

I am not sure whether it's worth trying to be general here. Maybe we should just hard-code the known cases involving core datatypes.

...Robert
Noah Misch <noah@leadboat.com> writes:
> ALTER TABLE ALTER TYPE always rewrites the table heap and its indexes. In some cases, we can determine that doing so is unhelpful, and that the conversion shall always succeed:
> I wish to replace table rewrites with table verification scans where possible, then skip those verification scans where possible.

This has been discussed before; have you read the previous threads? I really really dislike the notion of a "verification scan": it's basically work that is going to be useless if it fails. I think your argument that it will usually fail quickly is quite unconvincing, and in any case the situations where it is useful at all are too thin on the ground to be worth the code space to implement it. It seems sufficient to me to skip the rewrite in cases of provable binary compatibility, with possibly an extra check for "safe" changes of typmod. With respect to the latter, I agree a type-specific function to compare the typmods would be the way to go, although "exemptor" seems a pretty badly chosen name for it.

regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Dec 29, 2010, at 7:56 AM, Noah Misch <noah@leadboat.com> wrote:
>> The exemptor shall have this signature:
>>
>>   exemptor_func(
>>       integer,  -- source_typmod
>>       integer   -- dest_typmod
>>   ) RETURNS boolean
>>
>> The exemptor shall return true iff datumIsEqual(x, x::target_type(dest_typmod)) for every x in source_type most recently coerced to source_type(source_typmod).

> I am not sure whether it's worth trying to be general here. Maybe we should just hard-code the known cases involving core datatypes.

I find the idea of hard coding to be pretty icky. However, the elephant in the room here is the possibility of typmod getting replaced by some other representation. It would make life simpler if we didn't invent this additional type-specific API until that dust has settled. So maybe the plan should be hard-coding in the short term and add an API later.

regards, tom lane
On Wed, Dec 29, 2010 at 11:16:23AM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > ALTER TABLE ALTER TYPE always rewrites the table heap and its indexes. In some cases, we can determine that doing so is unhelpful, and that the conversion shall always succeed:
> > I wish to replace table rewrites with table verification scans where possible, then skip those verification scans where possible.
>
> This has been discussed before; have you read the previous threads?

It would help a lot if, in future, you include references to the previous threads you have in mind rather than simply mention that they exist. Saying, in effect, "search the archives with our not-super-great search technology using keywords you didn't think of," comes off as pretty dismissive if not downright hostile.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Dec 29, 2010, at 11:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I really really dislike the notion of a "verification scan": it's basically work that is going to be useless if it fails.

I think it has potential in cases like text to xml. In that case it'll either work or fail, with no possibility of requiring a do-over. Scanning the whole table is a whole lot cheaper than rewriting it. But I agree with your assessment of the "optimistic" verification scan case.

...Robert
Robert Haas <robertmhaas@gmail.com> writes:
> On Dec 29, 2010, at 11:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I really really dislike the notion of a "verification scan": it's basically work that is going to be useless if it fails.

> I think it has potential in cases like text to xml. In that case it'll either work or fail, with no possibility of requiring a do-over. Scanning the whole table is a whole lot cheaper than rewriting it.

I don't believe avoiding the write part (but not the read part, nor the XML syntax verification part) is a sufficiently compelling argument to justify having that code path. There are not enough distinct datatypes sharing binary representations to make this a worthwhile thing to worry over.

Basically, I believe that the only use-case that will have more than epsilon number of users is "I want to make this varchar(5) into varchar(10), or possibly text". We can fix that case without adding a boatload more code that we'll have to maintain.

I do have some interest in the idea of having a type-specific function that can recognize no-op typmod changes, but I would envision that as being an expression evaluation optimization: let the planner throw away the call to the length-checking function when it isn't going to do anything. It's not by any means only useful in ALTER COLUMN TYPE --- and in fact probably doesn't even need any bespoke code there, if we put it into expression_planner() instead.

regards, tom lane
On Wed, Dec 29, 2010 at 10:56:39AM -0500, Robert Haas wrote:
> On Dec 29, 2010, at 7:56 AM, Noah Misch <noah@leadboat.com> wrote:
> > Having thought on it more, though, it actually seems best to attempt the verification scan *every* time. In most ineligible conversions, an inequality will appear very early, so the scan is effectively O(1) in the negative case. A notable exception is something like char(6)->varchar(6) in a table with a billion tuples having length(col) = 6 and one with length(col) = 5. The verification scan might read most of the table before finding the one tuple that forces a rewrite. That isn't a particularly regular scenario in my experience, so the "just do the right thing" aspect of preceding every potential rewrite with a verification scan seems to win out.
>
> I think this scenario will be more common than you might think. Tables don't contain random data; they contain data that the DBA thinks is valid. The situation where the data is mostly as you expect but with a few kooky rows is, in my experience, extremely common.

Perhaps. A few kooky rows is indeed common, but we're talking about a specific breed of kookiness: 99.9% of the rows have identical bits after an ALTER TYPE transformation expression, and 0.1% have different bits. Is that common?

In case it was not obvious, I'll note that any error thrown by a transformation expression during the verification scan still aborts the ALTER TABLE. A varchar(20)->varchar(10) that finds an 11-char string will fail permanently during the verification scan. Indeed, the primary value of the verification scan is to distinguish positive and error, not positive and negative.
Expanding on my introduction, none of the following can yield a negative verification scan; the result is always positive or an error:

  CREATE DOMAIN loosedom AS text;
  CREATE DOMAIN tightdom AS text CHECK (value LIKE '<%/>');
  CREATE TABLE t (c varchar(6));
  INSERT INTO t VALUES ('<abc/>'),('<de/>');
  ALTER TABLE t ALTER c TYPE varchar(8);
  ALTER TABLE t ALTER c TYPE text;
  ALTER TABLE t ALTER c TYPE loosedom;
  ALTER TABLE t ALTER c TYPE xml USING c::xml;
  ALTER TABLE t ALTER c TYPE varchar(64);
  ALTER TABLE t ALTER c TYPE tightdom;

Adding a bpchar into the mix makes a negative verification scan possible, as does a USING clause having a truncating effect. Continuing the example, these can and would get a negative verification scan:

  ALTER TABLE t ALTER c TYPE character(6);
  ALTER TABLE t ALTER c TYPE varchar(5) USING c::varchar(5);

Plenty of academic USING clause examples exist:

  ALTER TABLE t ALTER c TYPE varchar(8) USING CASE c WHEN '<de/>' THEN 'foo' ELSE c END;

Verification scans for conversions between fundamentally different types will generally end in the negative at the first tuple. Consider {timestamp, bigint, real, numeric, varbit}->text, int->bigint, interval->reltime, etc. I can't think of a decent non-academic example where heterogeneous conversions like these will dupe the verification scan for even a handful of tuples. Granted, one can flip that around as an argument for declaring the conversions that have a chance.

> And it makes the worst case a LOT worse.

I suppose the absolute worst case would involve an ALTER TABLE adding brutally expensive CHECK constraints, such that the cost of computing those constraints would dominate the cost of both the verification scan and the rewrite, yielding a 100% slower ALTER TABLE run. A more realistic bad case might be a table much larger than memory with no indexes, where the verification scan adds a full seq scan for nothing. A crude test here has rewriting such a table taking 7x as long as a seq scan on it.
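That crude measurement implies a bounded worst-case overhead; the arithmetic can be sketched like so (taking the 7x rewrite-to-scan ratio above at face value):

```python
# Back-of-the-envelope cost model for a failed (negative) verification scan.
# Crude measurement above: rewriting the table costs about 7x one seq scan.
scan_cost = 1.0
rewrite_cost = 7.0 * scan_cost

# Worst case: the scan reads nearly the whole table, finds a mismatch at the
# very end, and we rewrite anyway.
worst_case = scan_cost + rewrite_cost        # 8 units instead of 7
overhead = worst_case / rewrite_cost - 1.0   # fractional extra time

assert abs(overhead - 1.0 / 7.0) < 1e-9      # ~14%, i.e. "about 15% more"
```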
By the back of that envelope, we'll take about 15% more time. The right mix of expensive expressions will raise that percentage, and the presence of indexes will drop it. Remember though, we're still only talking about the relatively-rare cases that even can get a negative verification scan.

> I really doubt this is worth the complexity anyway -

We'd have the verification scan regardless of how we choose when to use it, because how else would we implement no-rewrite varchar(8)->varchar(4) or text->xml? An unconditional verification scan is merely the most trivial algorithm for deciding when to employ it. Those conversions are semantically similar to adding CHECK constraints, and in that sense we already have an initial verification scan implementation: ATRewriteTable(..., InvalidOid, ...).

> converting between two types that are mostly-but-not-quite binary compatible seems like an edge case in every sense.

Yes. Indeed, that's the intuitive basis for my hypothesis that the verification scan will usually fail early. I don't advocate this approach to pick up edge cases, but to pick up reasonable cases _without explicit annotations_ showing them to be achievable. Take the text->xml example, certainly of genuine value if not top frequency. I see three ways to ensure we do a verification scan for it:

1. Have the user tell us: ALTER TABLE t ALTER c TYPE xml USING c::xml IMPLICIT
2. Mark the text->xml cast as "possibly no-rewrite" and look for that
3. Do a verification scan every time

(1) gets the job done, but the quality of user experience is lacking. (2) is best for the user, as long as the annotations are well-maintained, but it adds the most maintenance burden. (3) has the lowest maintenance burden and, for common use cases, the user-experience quality of (2), but it can significantly add to the ALTER TABLE runtime in rare cases. One can also combine (1) with (2) to provide an override when the annotations are incomplete.
Ultimately, any of these would work for my own needs. What's your preference? Are there other notable options that preserve full functionality? Thanks, nm
On Wed, Dec 29, 2010 at 11:16:23AM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > ALTER TABLE ALTER TYPE always rewrites the table heap and its indexes. In some cases, we can determine that doing so is unhelpful, and that the conversion shall always succeed:
> > I wish to replace table rewrites with table verification scans where possible, then skip those verification scans where possible.
>
> This has been discussed before; have you read the previous threads?

I cited two threads I had read on the subject. Were there other important ones?

> I really really dislike the notion of a "verification scan": it's basically work that is going to be useless if it fails. I think your argument that it will usually fail quickly is quite unconvincing, and in any case the situations where it is useful at all are too thin on the ground to be worth the code space to implement it. It seems sufficient to me to skip the rewrite in cases of provable binary compatibility, with possibly an extra check for "safe" changes of typmod. With respect to the latter, I agree a type-specific function to compare the typmods would be the way to go, although "exemptor" seems a pretty badly chosen name for it.

I have attempted to expand on these problems in my reply to Robert.
On Wed, Dec 29, 2010 at 02:01:28PM -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Dec 29, 2010, at 11:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I really really dislike the notion of a "verification scan": it's basically work that is going to be useless if it fails.
>
> > I think it has potential in cases like text to xml. In that case it'll either work or fail, with no possibility of requiring a do-over. Scanning the whole table is a whole lot cheaper than rewriting it.
>
> I don't believe avoiding the write part (but not the read part, nor the XML syntax verification part) is a sufficiently compelling argument to justify having that code path. There are not enough distinct datatypes sharing binary representations to make this a worthwhile thing to worry over.
>
> Basically, I believe that the only use-case that will have more than epsilon number of users is "I want to make this varchar(5) into varchar(10), or possibly text". We can fix that case without adding a boatload more code that we'll have to maintain.

Those are certainly the big ones, but I've also hit these in the field:

  SET timezone = 'UTC';
  CREATE DOMAIN state AS text CHECK (VALUE ~ '[A-Z]{2}');
  CREATE TABLE t (c0 text, c1 text, c2 timestamp);
  ALTER TABLE t ALTER c0 TYPE state, ALTER c1 TYPE varchar(6), ALTER c2 TYPE timestamptz;

It felt normal, but I very well may have been in that epsilon unawares. I would be disappointed to end the project with no way to avoid rewrites for them. That being said, there are certainly more ways to achieve all of those than the one I have proposed.

> I do have some interest in the idea of having a type-specific function that can recognize no-op typmod changes, but I would envision that as being an expression evaluation optimization: let the planner throw away the call to the length-checking function when it isn't going to do anything.
> It's not by any means only useful in ALTER COLUMN TYPE --- and in fact probably doesn't even need any bespoke code there, if we put it into expression_planner() instead.

That sounds significantly better. Ignorant question: how often will expression_planner have the old typmod available? Also, would this be worth attempting unconditionally, or might some callers use the expression too few times and suffer from the overhead of deciding whether to use it?

Perhaps the order of patches I proposed was faulty and should have placed the most important use cases first, like this:

1. Modify ATPrepAlterColumnType to detect a bare RelabelType transformation expression and conditionally skip/downgrade ATRewriteTable. This would cover varchar->text and similar domain changes, but a full reindex remains.
2. In the no-rewrite case, modify the vicinity of finish_heap_swap to narrow us from reindexing the entire table to merely reindexing those indexes touched by the operator class changes.
3. Further skip reindex operations when the operator class has changed but the operator family has not.
4. Add pg_cast.castexemptor and modify CREATE CAST to populate it. Define an exemptor (or however we name it) for the varchar length coercion cast. Modify expression_planner to use it to strip out superfluous length coercion casts. This would cover varchar(10)->varchar(20) and similar domain changes.
5. Define exemptors for time, timetz, timestamp, timestamptz, and interval.
6. Define exemptor for bit varying.
7. Define exemptor for numeric.
8. (subject to further discussion) Add the verification scan to the vicinity of ATRewriteTable and skip the table rewrite when it finds no changes. This covers varchar(20)->varchar(10) and similar domain changes, timestamp->timestamptz, and text->xml.

Thanks,
nm
On Wed, Dec 29, 2010 at 6:46 PM, Noah Misch <noah@leadboat.com> wrote:
>> I think this scenario will be more common than you might think. Tables don't contain random data; they contain data that the DBA thinks is valid. The situation where the data is mostly as you expect but with a few kooky rows is, in my experience, extremely common.
>
> Perhaps. A few kooky rows is indeed common, but we're talking about a specific breed of kookiness: 99.9% of the rows have identical bits after an ALTER TYPE transformation expression, and 0.1% have different bits. Is that common?

I think it's common enough to be worth worrying about.

> Expanding on my introduction, none of the following can yield a negative verification scan; the result is always positive or an error:
>
>   CREATE DOMAIN loosedom AS text;
>   CREATE DOMAIN tightdom AS text CHECK (value LIKE '<%/>');
>   CREATE TABLE t (c varchar(6));
>   INSERT INTO t VALUES ('<abc/>'),('<de/>');
>   ALTER TABLE t ALTER c TYPE varchar(8);
>   ALTER TABLE t ALTER c TYPE text;
>   ALTER TABLE t ALTER c TYPE loosedom;
>   ALTER TABLE t ALTER c TYPE xml USING c::xml;
>   ALTER TABLE t ALTER c TYPE varchar(64);
>   ALTER TABLE t ALTER c TYPE tightdom;

+1 for trying to optimize these cases (but maybe after we optimize the varchar -> text and varchar(less) -> varchar(more) cases to skip the scan altogether).

> Adding a bpchar into the mix makes a negative verification scan possible, as does a USING clause having a truncating effect. Continuing the example, these can and would get a negative verification scan:
>
>   ALTER TABLE t ALTER c TYPE character(6);
>   ALTER TABLE t ALTER c TYPE varchar(5) USING c::varchar(5);
>
> Plenty of academic USING clause examples exist:
>
>   ALTER TABLE t ALTER c TYPE varchar(8) USING CASE c WHEN '<de/>' THEN 'foo' ELSE c END;

I am not really convinced that there's much value in optimizing these cases.
They're not likely to arise very often in practice, and DBAs like predictability. There's tangible value in being able to say "this is going to scan your table at most once - it might rewrite it, or it might just verify what's there, or it might decide no scan is necessary, but the absolute worst case is one scan with rewrite". That's simple to understand and simple to document and probably simpler to code too, and I think it covers very nearly all of the cases people are likely to care about in practice.

> Yes. Indeed, that's the intuitive basis for my hypothesis that the verification scan will usually fail early. I don't advocate this approach to pick up edge cases, but to pick up reasonable cases _without explicit annotations_ showing them to be achievable. Take the text->xml example, certainly of genuine value if not top frequency. I see three ways to ensure we do a verification scan for it:
>
> 1. Have the user tell us: ALTER TABLE t ALTER c TYPE xml USING c::xml IMPLICIT
> 2. Mark the text->xml cast as "possibly no-rewrite" and look for that
> 3. Do a verification scan every time

I think for any pair of types (T1, T2) we should first determine whether we can skip the scan altogether. If yes, we're done. If no, then we should have a way of determining whether a verify-only scan is guaranteed to be sufficient (in your terminology, the verification scan is guaranteed to return either positive or error, not negative). If yes, then we do a verification scan. If no, we do a rewrite.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
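The three-way decision procedure Robert describes can be sketched as follows (Python as illustration; the two predicates are placeholders for whatever catalog mechanism would answer each question, e.g. exemptors and binary coercibility):

```python
def alter_type_strategy(can_skip_scan: bool, scan_cannot_go_negative: bool) -> str:
    """Pick the cheapest safe strategy for a type pair (T1, T2):
    skip everything, do a verify-only scan, or rewrite outright."""
    if can_skip_scan:
        return 'skip'      # e.g. varchar(5) -> varchar(10): provably safe
    if scan_cannot_go_negative:
        return 'verify'    # e.g. text -> xml: succeeds bit-identically or errors
    return 'rewrite'       # e.g. text -> character(6): bits may change

assert alter_type_strategy(True, False) == 'skip'
assert alter_type_strategy(False, True) == 'verify'
assert alter_type_strategy(False, False) == 'rewrite'
```

The guarantee this buys is exactly the predictability argument above: at most one scan of the table, never a read-only pass followed by a surprise rewrite.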
On Wed, Dec 29, 2010 at 11:14:37PM -0500, Robert Haas wrote:
> On Wed, Dec 29, 2010 at 6:46 PM, Noah Misch <noah@leadboat.com> wrote:
> > Perhaps. A few kooky rows is indeed common, but we're talking about a specific breed of kookiness: 99.9% of the rows have identical bits after an ALTER TYPE transformation expression, and 0.1% have different bits. Is that common?
>
> I think it's common enough to be worth worrying about.

Okay. Could you give an example of a specific ALTER TABLE recipe worth worrying about and subject to degradation under my proposal?

> > Adding a bpchar into the mix makes a negative verification scan possible, as does a USING clause having a truncating effect. Continuing the example, these can and would get a negative verification scan:
> >
> >   ALTER TABLE t ALTER c TYPE character(6);
> >   ALTER TABLE t ALTER c TYPE varchar(5) USING c::varchar(5);
> >
> > Plenty of academic USING clause examples exist:
> >
> >   ALTER TABLE t ALTER c TYPE varchar(8) USING CASE c WHEN '<de/>' THEN 'foo' ELSE c END;
>
> I am not really convinced that there's much value in optimizing these cases. They're not likely to arise very often in practice,

Just to make sure we're clear: those were examples of what I had intended to pessimize for the sake of simplicity.

> and DBAs like predictability. There's tangible value in being able to say "this is going to scan your table at most once - it might rewrite it, or it might just verify what's there, or it might decide no scan is necessary, but the absolute worst case is one scan with rewrite". That's simple to understand and simple to document and probably simpler to code too, and I think it covers very nearly all of the cases people are likely to care about in practice.

> > 1. Have the user tell us: ALTER TABLE t ALTER c TYPE xml USING c::xml IMPLICIT
> > 2. Mark the text->xml cast as "possibly no-rewrite" and look for that
> > 3. Do a verification scan every time
>
> I think for any pair of types (T1, T2) we should first determine whether we can skip the scan altogether. If yes, we're done. If no, then we should have a way of determining whether a verify-only scan is guaranteed to be sufficient (in your terminology, the verification scan is guaranteed to return either positive or error, not negative). If yes, then we do a verification scan. If no, we do a rewrite.

How would we answer the second question in general?

Thanks,
nm
On Thu, Dec 30, 2010 at 12:24 AM, Noah Misch <noah@leadboat.com> wrote: > On Wed, Dec 29, 2010 at 11:14:37PM -0500, Robert Haas wrote: >> On Wed, Dec 29, 2010 at 6:46 PM, Noah Misch <noah@leadboat.com> wrote: >> > Perhaps. A few kooky rows is indeed common, but we're talking about a specific >> > breed of kookiness: 99.9% of the rows have identical bits after an ALTER TYPE >> > transformation expression, and 0.1% have different bits. Is that common? >> >> I think it's common enough to be worth worrying about. > > Okay. Could you give an example of a specific ALTER TABLE recipe worth worrying > about and subject to degradation under my proposal? Any of the ones you listed in your second set of examples, e.g.: ALTER TABLE t ALTER c TYPE character(6); Under your proposal, this can scan the whole table once in read-only mode, and then realize that it needs to go back and rewrite the whole table. >> I think for any pair of types (T1, T2) we should first determine >> whether we can skip the scan altogether. If yes, we're done. If no, >> then we should have a way of determining whether a verify-only scan is >> guaranteed to be sufficient (in your terminology, the verification >> scan is guaranteed to return either positive or error, not negative). >> If yes, then we do a verification scan. If no, we do a rewrite. > > How would we answer the second question in general? I am not sure - I guess we'd need to design some sort of mechanism for that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Dec 29, 2010, at 10:14 PM, Robert Haas wrote: > +1 for trying to optimize these cases (but maybe after we optimize the > varchar -> text and varchar(less) -> varchar(more) cases to skip the > scan altogether). +1 on getting the obvious cases of varchar and numeric done first; we run into those a lot at work and would be willing to sponsor work on a patch that makes those operations as fast as just adding a new column. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Thu, Dec 30, 2010 at 12:57:45AM -0500, Robert Haas wrote: > On Thu, Dec 30, 2010 at 12:24 AM, Noah Misch <noah@leadboat.com> wrote: > > On Wed, Dec 29, 2010 at 11:14:37PM -0500, Robert Haas wrote: > >> I think for any pair of types (T1, T2) we should first determine > >> whether we can skip the scan altogether. If yes, we're done. If no, > >> then we should have a way of determining whether a verify-only scan is > >> guaranteed to be sufficient (in your terminology, the verification > >> scan is guaranteed to return either positive or error, not negative). > >> If yes, then we do a verification scan. If no, we do a rewrite. > > > > How would we answer the second question in general? > > I am not sure - I guess we'd need to design some sort of mechanism for that. Okay, here goes. Given a Var "varexpr" representing the column we're changing and an expression tree "expr", we need to answer two questions (argument lists simplified -- assume the same RTEs in all cases): always-noop: "Will datumIsEqual(ExecEvalExpr(varexpr), ExecEvalExpr(expr)) return true or yield an error for all possible tuples?" never-error: "Will ExecEvalExpr(expr) never throw an error?" Currently we're only interested in the second question when the first is also true; I'm not sure if there's something fundamental there, or just an artifact of current needs. To support answering these questions, extend the CREATE CAST changes from my earlier proposal, modifying the exemptor signature to return an int, a bitmask containing one bit for each of these two questions. Call the function in find_typmod_coercion_function. If its return value answers "yes" to both questions, return COERCION_PATH_NONE, resulting in omission of the length coercion node. For other verdicts, generate the FuncExpr as normal and insert the verdict in a new FuncExpr field "funcexempt". (That need not increase the size of FuncExpr, if that's a concern.) 
ATPrepAlterColumnType, having generated its transformation expression, will call a new function that recursively walks the tree to answer the two questions. The walker will apply these rules:

1. For a Var with the varno/varattno in question, intrinsically "yes" to both.
2. A RelabelType node inherits the answers of its sole argument.
3. A CoerceToDomain node inherits the always-noop answer of its sole argument. When GetDomainConstraints() == NIL, it also inherits the never-error answer. Otherwise, never-error becomes "no".
4. A FuncExpr node has answers given by the bitwise-AND of its funcexempt field and the answers from its first argument.
5. Any other node answers "no" to both questions.

If the transformation expression root has "yes" to both questions, we're done with no scan. If only always-noop is true, we do a verification scan only. Otherwise, we optimize nothing and do a rewrite. Thoughts? Thanks, nm
On Thu, Dec 30, 2010 at 8:35 PM, Noah Misch <noah@leadboat.com> wrote: > 4. A FuncExpr node has answers given by the bitwise-AND of its funcexempt field > and the answers from its first argument. Why its first argument? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Dec 31, 2010 at 12:34:50AM -0500, Robert Haas wrote: > On Thu, Dec 30, 2010 at 8:35 PM, Noah Misch <noah@leadboat.com> wrote: > > 4. A FuncExpr node has answers given by the bitwise-AND of its funcexempt field > > and the answers from its first argument. > > Why its first argument? funcexempt would only be nonzero for FuncExpr of length coercion casts. Those have the subject datum as a first argument, typmod as second, and is-explicit boolean as third. The other arguments are effectively already validated. That brings up a point -- the exemptor function also needs an is-explicit argument, as that affects the decision for some types.
On Thu, Dec 30, 2010 at 08:35:34PM -0500, Noah Misch wrote: > On Thu, Dec 30, 2010 at 12:57:45AM -0500, Robert Haas wrote: > > On Thu, Dec 30, 2010 at 12:24 AM, Noah Misch <noah@leadboat.com> wrote: > > > On Wed, Dec 29, 2010 at 11:14:37PM -0500, Robert Haas wrote: > > >> I think for any pair of types (T1, T2) we should first determine > > >> whether we can skip the scan altogether. If yes, we're done. If no, > > >> then we should have a way of determining whether a verify-only scan is > > >> guaranteed to be sufficient (in your terminology, the verification > > >> scan is guaranteed to return either positive or error, not negative). > > >> If yes, then we do a verification scan. If no, we do a rewrite. > > > > > > How would we answer the second question in general? > > > > I am not sure - I guess we'd need to design some sort of mechanism for that. > > Okay, here goes. Given...<snip> That seems to be working decently. However, it turns out that changes like text->varchar(8) and varchar(8)->varchar(4) don't fall into either of those optimization categories. An implicit varchar length coercion will truncate trailing blanks to make the string fit, so this legitimately requires a rewrite: CREATE TEMP TABLE t (c) AS SELECT 'foo '::text; SELECT c || '<-' FROM t; ALTER TABLE t ALTER c TYPE varchar(4); SELECT c || '<-' FROM t; In light of that, I'm increasingly thinking we'll want a way for the user to request a scan in place of a rewrite. The scan would throw an error if a rewrite ends up being necessary. Adding a keyword for that purpose, the syntax would resemble: ALTER TABLE <name> ALTER [COLUMN] <colname> [SET DATA] TYPE <typename> [IMPLICIT] [ USING <expression> ] I had wished to avoid this as something of a UI wart, but I don't see a way to cover all important conversions automatically and with a single-pass guarantee. This would cover the rest. Thoughts?
On Wed, Jan 5, 2011 at 11:26 PM, Noah Misch <noah@leadboat.com> wrote: > On Thu, Dec 30, 2010 at 08:35:34PM -0500, Noah Misch wrote: >> On Thu, Dec 30, 2010 at 12:57:45AM -0500, Robert Haas wrote: >> > On Thu, Dec 30, 2010 at 12:24 AM, Noah Misch <noah@leadboat.com> wrote: >> > > On Wed, Dec 29, 2010 at 11:14:37PM -0500, Robert Haas wrote: >> > >> I think for any pair of types (T1, T2) we should first determine >> > >> whether we can skip the scan altogether. If yes, we're done. If no, >> > >> then we should have a way of determining whether a verify-only scan is >> > >> guaranteed to be sufficient (in your terminology, the verification >> > >> scan is guaranteed to return either positive or error, not negative). >> > >> If yes, then we do a verification scan. If no, we do a rewrite. >> > > >> > > How would we answer the second question in general? >> > >> > I am not sure - I guess we'd need to design some sort of mechanism for that. >> >> Okay, here goes. Given...<snip> > > That seems to be working decently. However, it turns out that changes like > text->varchar(8) and varchar(8)->varchar(4) don't fall into either of those > optimization categories. An implicit varchar length coercion will truncate > trailing blanks to make the string fit, so this legitimately requires a rewrite: > > CREATE TEMP TABLE t (c) AS SELECT 'foo '::text; > SELECT c || '<-' FROM t; > ALTER TABLE t ALTER c TYPE varchar(4); > SELECT c || '<-' FROM t; > > In light of that, I'm increasingly thinking we'll want a way for the user to > request a scan in place of a rewrite. The scan would throw an error if a > rewrite ends up being necessary. Adding a keyword for that purpose, the syntax > would resemble: > > ALTER TABLE <name> ALTER [COLUMN] <colname> [SET DATA] TYPE <typename> > [IMPLICIT] [ USING <expression> ] > > I had wished to avoid this as something of a UI wart, but I don't see a way to > cover all important conversions automatically and with a single-pass guarantee. 
> This would cover the rest. > > Thoughts? I still think you're better off focusing first on the case where we can skip the whole nine yards, and doing this stuff as a follow-on patch. Trying to do too many things, especially possibly controversial stuff, especially in the last CommitFest, often ends up with the whole patch getting rejected, which makes no one happy. Submit the smallest useful, self-contained change you can, and then work up from there. Or at least split out the patch into parts that can be applied independently, so that if the eventual committer likes A but not B you at least get A in. Don't take any of this as a rejection of any of what you're proposing; I haven't really made up my mind yet, and there are plenty of other people who would have a say even if I had. Rather, I'd like to maximize the chances of getting at least part of this work committed to 9.1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 06, 2011 at 12:24:19AM -0500, Robert Haas wrote: > I still think you're better off focusing first on the case where we > can skip the whole nine yards, and doing this stuff as a follow-on > patch. Trying to do too many things, especially possibly > controversial stuff, especially in the last CommitFest, often ends up > with the whole patch getting rejected, which makes no one happy. > Submitting the smallest useful, self-contained change you can and then > work up from there. Or at least split out the patch into parts that > can be applied independently, so that if the eventual committer likes > A but not B you at least get A in. Will do. Thanks.