Thread: pg_migrator and an 8.3-compatible tsvector data type
I found out at PGCon that the internal format of tsvector changed slightly from 8.3 to 8.4. Teodor gave me a conversion function and I have adjusted pg_migrator to install a v8_3_tsvector data type to be used during the load so the old user tables use that data type. You can see the code here at the bottom of this file: http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/pg-migrator/pg_migrator/src/pg_migrator.c?rev=1.36&content-type=text/x-cvsweb-markup I am not done yet because eventually v8_3_tsvector will be moved into a separate schema and the tsvector data type restored. The problem I have is that while I have created v8_3_tsvector (with casts), I have not created index routines for it, so I get this error when restoring a database that uses a GIN index on tsvector:

    Restoring database schema
    psql:/u/postgres/pg_migrator_dump_db.sql:7006: ERROR:  data type
    tsvector has no default operator class for access method "gin"
    HINT:  You must specify an operator class for the index or define a
    default operator class for the data type.

So, how do I create a compatible set of hooks for v8_3_tsvector? Because tsvector is a native data type I can't figure out how to set up a similar type. Also, will the indexes be OK if they are created from types that don't have the 8.3 format? What about new indexes created after the migration is done? Ideally the cast would just be called during index activity and the 8.4 tsvector index routines used. I am told by Teodor that GiST indexes would be fine with the 8.3 data format, and we are forcing the reindexing of GIN indexes anyway in 8.4. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
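For reference, what the error is asking for is a GIN operator class on the stand-in type. A minimal sketch of the DDL side might look like the following; all four support functions named here are hypothetical C functions that would still have to be written, and the exact support-function signatures should be checked against the 8.4 GIN documentation:

    CREATE OPERATOR CLASS v8_3_tsvector_gin_ops
        DEFAULT FOR TYPE v8_3_tsvector USING gin AS
            OPERATOR 1 @@ (v8_3_tsvector, tsquery),
            -- hypothetical support functions: compare, extractValue,
            -- extractQuery, consistent
            FUNCTION 1 gin_cmp_v83_lexeme(text, text),
            FUNCTION 2 gin_extract_v83_tsvector(v8_3_tsvector, internal),
            FUNCTION 3 gin_extract_v83_tsquery(tsquery, internal, int2),
            FUNCTION 4 gin_v83_tsquery_consistent(internal, int2, tsquery, internal),
            STORAGE text;

As the thread goes on to discuss, writing those support functions is the part that was never done.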
Because no one has responded, I am going to prevent pg_migrator from working with a cluster that uses tsvector. I realize this limits pg_migrator's usefulness, but I have to move on. --------------------------------------------------------------------------- Bruce Momjian wrote: > I found out at PGCon that the internal format of tsvector changed > slightly from 8.3 to 8.4. Teodor gave me a conversion function and I > have adjusted pg_migrator to install a v8_3_tsvector data type to be > used during the load so the old user tables use that data type. You can > see the code here at the bottom of this file: > > http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/pg-migrator/pg_migrator/src/pg_migrator.c?rev=1.36&content-type=text/x-cvsweb-markup > > I am not done yet because eventually v8_3_tsvector will be moved into a > separate schema and the tsvector data type restored. > > The problem I have is that while I have created v8_3_tsvector (with > casts), I have not created index routines for it, so I get this error > when restoring a database that uses a GIN index on tsvector: > > Restoring database schema > psql:/u/postgres/pg_migrator_dump_db.sql:7006: ERROR: data type > tsvector has no default operator class for access method "gin" > HINT: You must specify an operator class for the index or define a > default operator class for the data type. > > So, how do I create a compatible set of hooks for v8_3_tsvector? > Because tsvector is a native data type I can't figure out how to set up > a similar type. Also, will the indexes be OK if they are created from > types that don't have the 8.3 format? What about new indexes created > after the migration is done? > > Ideally the cast would just be called during index activity and the 8.4 > tsvector index routines used. I am told by Teodor that GiST indexes > would be fine with the 8.3 data format, and we are forcing the > reindexing of GIN indexes anyway in 8.4. > > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://enterprisedb.com > > + If your life is a hard drive, Christ can be your backup. + > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 5/28/09 2:30 PM, Bruce Momjian wrote: > Because no one has responded, I am going to prevent pg_migrator from > working with a cluster that uses tsvector. I realize this limits > pg_migrator's usefulness, but I have to move on. I don't know how to fix the issue, certainly. Why was there a format change to tsvector anyway? Was there an important reason for this? -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Josh Berkus wrote: > On 5/28/09 2:30 PM, Bruce Momjian wrote: > > Because no one has responded, I am going to prevent pg_migrator from > > working with a cluster that uses tsvector. I realize this limits > > pg_migrator's usefulness, but I have to move on. > > I don't know how to fix the issue, certainly. > > Why was there a format change to tsvector anyway? Was there an > important reason for this? The ordering of the lexemes was changed:

    /*
     * Convert 8.3 tsvector to 8.4
     *
     * 8.3 sorts lexemes by their length, and if lengths are the same it uses
     * alphabetic order; 8.4 sorts lexemes in lexicographical order, e.g.
     *
     * => SELECT 'c bb aaa'::tsvector;
     *        tsvector
     * ----------------
     *  'aaa' 'bb' 'c'    -- 8.4
     *  'c' 'bb' 'aaa'    -- 8.3
     */

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce, > The ordering of the lexems was changed: > > /* > * Convert 8.3 tsvector to 8.4 > * > * 8.3 sorts lexemes by its length and if lengths are the same then it uses > * alphabetic order; 8.4 sorts lexemes in lexicographical order, e.g. > * > * => SELECT 'c bb aaa'::tsvector; > * tsvector > * ---------------- > * 'aaa' 'bb' 'c' -- 8.4 > * 'c' 'bb' 'aaa' -- 8.3 > */ What does that get us in terms of performance etc.? Given that it's going to prevent users of Tsearch from upgrading-in-place, the format change ought to be giving us some serious gains. Otherwise we should put it off until we need to make other datatype changes. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Josh Berkus wrote: > Bruce, > > > The ordering of the lexems was changed: > > > > /* > > * Convert 8.3 tsvector to 8.4 > > * > > * 8.3 sorts lexemes by its length and if lengths are the same then it uses > > * alphabetic order; 8.4 sorts lexemes in lexicographical order, e.g. > > * > > * => SELECT 'c bb aaa'::tsvector; > > * tsvector > > * ---------------- > > * 'aaa' 'bb' 'c' -- 8.4 > > * 'c' 'bb' 'aaa' -- 8.3 > > */ > > What does that get us in terms of performance etc.? > > Given that it's going to prevent users of Tsearch from > upgrading-in-place, the format change ought to be giving us some serious > gains. Otherwise we should put it off until we need to make other > datatype changes. No idea but now that we are in beta we would then be invalidating beta tester data. The area I got stuck on is that there is no CAST behavior when creating an index. We are already invalidating GIN indexes, but I can't even create an index to support the old data type. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Josh Berkus <josh@agliodbs.com> writes: > Bruce, >> The ordering of the lexems was changed: > What does that get us in terms of performance etc.? It was changed to support partial-match tsvector queries. Without it, a partial match query would have to scan entire tsvectors instead of applying binary search. I don't know if Oleg and Teodor did any actual performance tests on the size of the hit, but it seems like it could be pretty awful for large documents. regards, tom lane
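For illustration, the kind of partial-match query the new ordering makes binary-searchable is 8.4's prefix search; a small example (table and column names here are hypothetical):

    -- 'post:*' matches any lexeme beginning with "post"
    SELECT title
      FROM docs
     WHERE tsv @@ to_tsquery('post:*');

Without the lexicographic ordering, evaluating this against a tsvector means scanning every lexeme it contains.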
Bruce Momjian <bruce@momjian.us> writes: > Josh Berkus wrote: >> Given that it's going to prevent users of Tsearch from >> upgrading-in-place, the format change ought to be giving us some serious >> gains. Otherwise we should put it off until we need to make other >> datatype changes. > No idea but now that we are in beta we would then be invalidating beta > tester data. The area I got stuck on is that there is no CAST behavior > when creating an index. We are already invalidating GIN indexes, but I > can't even create an index to support the old data type. It's certainly doable. Bruce is just applying the strategy he mentioned in our talk ;-) regards, tom lane
Tom, > It was changed to support partial-match tsvector queries. Without it, > a partial match query would have to scan entire tsvectors instead > of applying binary search. I don't know if Oleg and Teodor did any > actual performance tests on the size of the hit, but it seems like > it could be pretty awful for large documents. Ah, ok, that's pretty significant. It makes a huge difference for some users, one of whom will be quoted on the press release. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Josh Berkus wrote: > >> Given that it's going to prevent users of Tsearch from > >> upgrading-in-place, the format change ought to be giving us some serious > >> gains. Otherwise we should put it off until we need to make other > >> datatype changes. > > > No idea but now that we are in beta we would then be invalidating beta > > tester data. The area I got stuck on is that there is no CAST behavior > > when creating an index. We are already invalidating GIN indexes, but I > > can't even create an index to support the old data type. > > It's certainly doable. Bruce is just applying the strategy he mentioned > in our talk ;-) Kind of --- I am stuck because we don't auto-cast for index usage like we do for function calls and operators, so I will just keep going and disable tsvector. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Thu, May 28, 2009 at 11:26 PM, Bruce Momjian <bruce@momjian.us> wrote: > Kind of --- I am stuck because we don't auto-cast for index usage like > we do for function calls and operators, so I will just keep going and > disable tsvector. I was pretty stumped by what you meant by "auto-cast" but now I'm beginning to think what you're looking for are cross-type comparison operators? You could define a bunch of such operators which manually call the cast on one side and add them to the operator class. I'm not sure how to set up the operator class entries for GIN though. -- greg
Greg Stark <stark@enterprisedb.com> writes: > On Thu, May 28, 2009 at 11:26 PM, Bruce Momjian <bruce@momjian.us> wrote: >> Kind of --- I am stuck because we don't auto-cast for index usage like >> we do for function calls and operators, so I will just keep going and >> disable tsvector. > I was pretty stumped by what you meant by "auto-cast" but now I'm > beginning to think what you're looking for are cross-type comparison > operators? You could define a bunch of such operators which manually > call the cast on one side and add them to the operator class. I'm not > sure how to set up the operator class entries for GIN though. No, what we need is a GIN (and probably GIST) opclass to support old-style tsvectors. The user-level operators aren't really an issue because an implicit cast will get inserted in front of them when necessary, but we don't support that for indexes. People who want decent performance on partial match queries (or maybe even *any* TS queries) are going to have to convert to the new format anyway, though, so maybe this isn't something worth putting a whole lot of work into. It might be sufficient if the database conversion succeeds but leaves the tsvector columns unusable until they're converted. regards, tom lane
Tom Lane wrote: > People who want decent performance on partial match queries (or maybe > even *any* TS queries) are going to have to convert to the new format > anyway, though, so maybe this isn't something worth putting a whole lot > of work into. It might be sufficient if the database conversion > succeeds but leaves the tsvector columns unusable until they're > converted. There are so many caveats on pg_migrator (and things that need to be done after the migration is complete) that one starts to wonder if people are not better off just using parallel pg_restore. From Stefan's reported timings I'm not sure that pg_migrator is that much of a benefit in the first place ... unless copy mode can be made much faster. (On link mode it is so much faster that it's worth it, but then you don't have an escape hatch). -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Tom Lane wrote: > > > People who want decent performance on partial match queries (or maybe > > even *any* TS queries) are going to have to convert to the new format > > anyway, though, so maybe this isn't something worth putting a whole lot > > of work into. It might be sufficient if the database conversion > > succeeds but leaves the tsvector columns unusable until they're > > converted. > > There are so many caveats on pg_migrator (and things that need to be > done after the migration is complete) that one starts to wonder if > people is not better off just using parallel pg_restore. From Stefan's > reported timings I'm not sure that pg_migrator is that much of a benefit > in the first place ... unless copy mode can be made much faster. (On > link mode it is so much faster that it's worth it, but then you don't > have an escape hatch). That is accurate. I doubt copy mode speed can be improved. I think doing a backup, then using --link mode will be the most common usage pattern. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, May 29, 2009 at 12:53 AM, Bruce Momjian <bruce@momjian.us> wrote: > That is accurate. I doubt copy mode speed can be improved. I think > doing a backup, then using --link mode will be the most common usage > pattern. Well doing a backup will take more or less the same time as copy mode, no? Actually there's a big win for people who can take a snapshot using their SAN or filesystem, but that's not everyone. -- greg
Bruce Momjian wrote: > Alvaro Herrera wrote: > > There are so many caveats on pg_migrator (and things that need to be > > done after the migration is complete) that one starts to wonder if > > people is not better off just using parallel pg_restore. From Stefan's > > reported timings I'm not sure that pg_migrator is that much of a benefit > > in the first place ... unless copy mode can be made much faster. (On > > link mode it is so much faster that it's worth it, but then you don't > > have an escape hatch). > > That is accurate. I doubt copy mode speed can be improved. Why not? Right now it's single-threaded. Would it be faster if it ran several copies in parallel? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera wrote: > Bruce Momjian wrote: > > Alvaro Herrera wrote: > > > > There are so many caveats on pg_migrator (and things that need to be > > > done after the migration is complete) that one starts to wonder if > > > people is not better off just using parallel pg_restore. From Stefan's > > > reported timings I'm not sure that pg_migrator is that much of a benefit > > > in the first place ... unless copy mode can be made much faster. (On > > > link mode it is so much faster that it's worth it, but then you don't > > > have an escape hatch). > > > > That is accurate. I doubt copy mode speed can be improved. > > Why not? Right now it's single-threaded. Would it be faster if it ran > several copies in parallel? Sure, but that assumes you have parallel I/O channels; I assume right now it is I/O limited. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Alvaro Herrera wrote: > > Why not? Right now it's single-threaded. Would it be faster if it ran > > several copies in parallel? > > Sure, but that assumes you have parallel I/O channels; I assume right > now it is I/O limited. But so does parallel pg_restore, no? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Bruce, > Sure, but that assumes you have parallel I/O channels; I assume right > now it is I/O limited. Even a simple cp will be sped up by between 50% and 150% (depending on your storage) by doing several files in parallel. However, I'm not sure that's what you should be spending your time on. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Alvaro Herrera wrote: > Bruce Momjian wrote: > >> Alvaro Herrera wrote: >> > > >>> Why not? Right now it's single-threaded. Would it be faster if it ran >>> several copies in parallel? >>> >> Sure, but that assumes you have parallel I/O channels; I assume right >> now it is I/O limited. >> > > But so does parallel pg_restore, no? > > Well, no, it assumes you're not saturating your I/O bandwidth, which turns out to be the case more often than you might expect, even without multiple I/O channels. cheers andrew
On Thu, 28 May 2009, Alvaro Herrera wrote: > There are so many caveats on pg_migrator (and things that need to be > done after the migration is complete) that one starts to wonder if > people are not better off just using parallel pg_restore. If you've got something close to a 24x7 shop, where planned downtime can at worst be a small number of hours, the landscape breaks down something like this:

    <100GB:     Current dump/reload is probably fine
    100GB-1TB:  Aggressive pg_dump and parallel pg_restore might work
    >1TB:       Nothing but pg_migrator will save you

If you're doing a version migration, you need to block writer activity from the time you start dumping to get a perfectly consistent snapshot, and large databases can't be made available to users until their indexes are rebuilt, lest every client doing a "select * from x where..." kick off a giant sequential scan. Parallel restore only improves one section of that work. There is a significant difference between something being difficult or having caveats and it being impossible, and right now there's no good answer for TB-scale upgrades unless you're lucky enough to have an app that Slony can usefully replicate. If there's somebody out there who has a bunch of tsvector data they need to get upgraded in-place, they should step up to help out here. I think Bruce is right to move on to more popular issues until that happens. This issue can all get sorted out in user space at some later time, right? > From Stefan's reported timings I'm not sure that pg_migrator is that > much of a benefit in the first place ... unless copy mode can be made > much faster. Systems administrators have all sorts of techniques to speed up this sort of thing; it's a well-solved problem. The people who need in-place upgrade will figure out how to either make copy mode fast enough, or how to make link mode usable by filesystem tricks. We don't need to worry about that for them. I think it's premature and distracting to start talking about optimizing pg_migrator anyway; it's plenty fast for its intended audience. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Alvaro Herrera wrote: > Bruce Momjian wrote: >> Alvaro Herrera wrote: > >>> There are so many caveats on pg_migrator (and things that need to be >>> done after the migration is complete) that one starts to wonder if >>> people are not better off just using parallel pg_restore. From Stefan's >>> reported timings I'm not sure that pg_migrator is that much of a benefit >>> in the first place ... unless copy mode can be made much faster. (On >>> link mode it is so much faster that it's worth it, but then you don't >>> have an escape hatch). >> That is accurate. I doubt copy mode speed can be improved. > > Why not? Right now it's single-threaded. Would it be faster if it ran > several copies in parallel? I guess it would be much faster on powerful hardware - we also have to consider that copy mode now is really a no-op. If it had to do any actual page conversion too, it seems entirely possible that a parallel restore might be even faster than a single-threaded pg_migrator in copy mode. Stefan
Bruce Momjian wrote on Thu, 28 May 2009 at 17:42 -0400: > Josh Berkus wrote: > > On 5/28/09 2:30 PM, Bruce Momjian wrote: > > > Because no one has responded, I am going to prevent pg_migrator from > > > working with a cluster that uses tsvector. I realize this limits > > > pg_migrator's usefulness, but I have to move on. > > > > I don't know how to fix the issue, certainly. > > > > Why was there a format change to tsvector anyway? Was there an > > important reason for this? > > The ordering of the lexemes was changed: The biggest problem is a dictionary change. I'm not sure if it happened, but IIRC Teodor mentioned it in Ottawa. If it happened, it breaks tsvector compatibility entirely. Zdenek
Alvaro Herrera <alvherre@commandprompt.com> writes: > Bruce Momjian wrote: >> Alvaro Herrera wrote: >>> Why not? Right now it's single-threaded. Would it be faster if it ran >>> several copies in parallel? >> >> Sure, but that assumes you have parallel I/O channels; I assume right >> now it is I/O limited. > But so does parallel pg_restore, no? The point of parallel pg_restore is that COPY is frequently CPU-bound to some extent, and so you can put multiple CPUs to work by parallelizing. I find it much less probable that multiple "cp" operations can be parallelized, unless the DB is spread across multiple tablespaces and the code is smart enough to interleave by tablespace. regards, tom lane
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes: > The biggest problem is a dictionary change. I'm not sure if it happened, > but IIRC Teodor mentioned it in Ottawa. If it happened, it breaks > tsvector compatibility entirely. No more than changing dictionary behavior in an existing installation. What was stated when the issue came up during 8.3 development is that you don't normally need to worry about small changes in dictionary behavior because overall text search behavior will still be "close enough". (I seem to recall that I'd complained that any change in dictionary behavior would invalidate indexes based on the dictionary, and this was the answer.) regards, tom lane
Tom Lane wrote on Fri, 29 May 2009 at 11:28 -0400: > Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes: > > The biggest problem is a dictionary change. I'm not sure if it happened, > > but IIRC Teodor mentioned it in Ottawa. If it happened, it breaks > > tsvector compatibility entirely. > > No more than changing dictionary behavior in an existing installation. > What was stated when the issue came up during 8.3 development is that > you don't normally need to worry about small changes in dictionary > behavior because overall text search behavior will still be "close > enough". (I seem to recall that I'd complained that any change in > dictionary behavior would invalidate indexes based on the dictionary, > and this was the answer.) It seems to be OK then. However, pg_migrator should warn users about it. Zdenek
Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > > Bruce, > >> The ordering of the lexems was changed: > > > What does that get us in terms of performance etc.? > > It was changed to support partial-match tsvector queries. Without it, > a partial match query would have to scan entire tsvectors instead > of applying binary search. I don't know if Oleg and Teodor did any > actual performance tests on the size of the hit, but it seems like > it could be pretty awful for large documents. I started thinking about the performance issues of the tsvector changes. Teodor gave me this code for conversion that basically does: qsort_arg((void *) ARRPTR(t), t->size, sizeof(WordEntry), cmpLexeme, (void*) t); So, basically, every time there is a cast we have to do a sort, which for a large document would yield poor performance, and because we are not storing the sorted result, it happens for every access; this might be an unacceptable performance burden. So, one idea would be, instead of a cast, have pg_migrator rebuild the tsvector columns with ALTER TABLE, so then the 8.4 index code could be used. But then we might as well just tell the users to migrate the tsvector tables themselves, which is how pg_migrator behaves now. Obviously we are still trying to figure out the best way to handle data type changes; I think as soon as we figure out a plan for tsvector we can use that method for future changes. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce, > So, one idea would be, instead of a cast, have pg_migrator rebuild the > tsvector columns with ALTER TABLE, so then the 8.4 index code could be > used. But then we might as well just tell the users to migrate the > tsvector tables themselves, which is how pg_migrator behaves now. It would be nice to have pg_migrator handle this, especially if we could do it in parallel. Then we just have to warn users that migrating a database with tsvector columns takes significantly longer. That is, 1) do rest of catalog swap and link/copy of objects. 2) mark all tsvector columns as 83_tsvector and add new tsvector type (these columns will be unusable for queries) 3) bring up database 4) search for all 83_tsvector columns 5) do ALTER TABLE on each of these columns, in parallel, up to a configuration setting (default 3). However, I can also understand not having time to implement the above before 8.4 release. -- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
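Step 4 in that list could be a plain catalog query; a sketch (if the columns were first re-typed to an 83_tsvector stand-in, the regtype literal would name that type instead):

    SELECT n.nspname, c.relname, a.attname
      FROM pg_attribute a
      JOIN pg_class c ON c.oid = a.attrelid
      JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE a.atttypid = 'tsvector'::regtype
       AND c.relkind = 'r'          -- ordinary tables only
       AND a.attnum > 0
       AND NOT a.attisdropped;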
Josh Berkus <josh@agliodbs.com> writes: > It would be nice to have pg_migrator handle this, especially if we could > do it in parallel. Then we just have to warn users that migrating a > database with tsvector columns takes significantly longer. That is, > 1) do rest of catalog swap and link/copy of objects. > 2) mark all tsvector columns as 83_tsvector and add new tsvector type > (these columns will be unusable for queries) > 3) bring up database > 4) search for all 83_tsvector columns > 5) do ALTER TABLE on each of these columns, in parallel, up to a > configuration setting (default 3). pg_migrator is already emitting a script that is intended to be run after conversion, to handle REINDEXing of incompatible indexes. That could easily be made to do ALTER TYPE on old tsvector columns too, no? The parallel bit is pie in the sky and should not be considered even for a millisecond during this release cycle. Save it for 8.5, or suggest to people that they manually cut the script apart if they're desperate to have that. regards, tom lane
Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > > It would be nice to have pg_migrator handle this, especially if we could > > do it in parallel. Then we just have to warn users that migrating a > > database with tsvector columns takes significantly longer. That is, > > > 1) do rest of catalog swap and link/copy of objects. > > 2) mark all tsvector columns as 83_tsvector and add new tsvector type > > (these columns will be unusable for queries) > > 3) bring up database > > 4) search for all 83_tsvector columns > > 5) do ALTER TABLE on each of these columns, in parallel, up to a > > configuration setting (default 3). > > pg_migrator is already emitting a script that is intended to be run > after conversion, to handle REINDEXing of incompatible indexes. That > could easily be made to do ALTER TYPE on old tsvector columns too, no? Hmmm. Well, the problem right now with v8_3_tsvector is that it fails on index create, even when the index is empty, because I can't figure out how to simply set up the proper index catalog entries. Once that is fixed and I can bind tsvector to v8_3_tsvector on schema creation, I can easily emit ALTER TABLE to fix the issue. And, at that point the tsvector GIN indexes would be automatically created so I can skip that part. > The parallel bit is pie in the sky and should not be considered even > for a millisecond during this release cycle. Save it for 8.5, or > suggest to people that they manually cut the script apart if they're > desperate to have that. Agreed. FYI, this is 1% as hard as the Win32 port, so I am not discouraged. ;-) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Tom Lane wrote: > > Josh Berkus <josh@agliodbs.com> writes: > > > It would be nice to have pg_migrator handle this, especially if we could > > > do it in parallel. Then we just have to warn users that migrating a > > > database with tsvector columns takes significantly longer. That is, > > > > > 1) do rest of catalog swap and link/copy of objects. > > > 2) mark all tsvector columns as 83_tsvector and add new tsvector type > > > (these columns will be unusable for queries) > > > 3) bring up database > > > 4) search for all 83_tsvector columns > > > 5) do ALTER TABLE on each of these columns, in parallel, up to a > > > configuration setting (default 3). > > > > pg_migrator is already emitting a script that is intended to be run > > after conversion, to handle REINDEXing of incompatible indexes. That > > could easily be made to do ALTER TYPE on old tsvector columns too, no? > > Hmmm. Well, the problem right now with v8_3_tsvector is that it fails > on index create, even when the index is empty, because I can't figure > out how to simply set up the proper index catalog entries. Once that is > fixed and I can bind tsvector to v8_3_tsvector on schema creation, I can > easily emit ALTER TABLE to fix the issue. And, at that point the > tsvector GIN indexes would be automatically created so I can skip that > part. I have discovered a simpler solution using ALTER TABLE and calling a conversion function:

    test=> CREATE TABLE tsvector_test(x tsvector);
    CREATE TABLE
    test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector
    test-> USING conversion_func(x);
    ALTER TABLE

No need for a fake data type and the required index infrastructure. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Sat, May 30, 2009 at 1:11 PM, Bruce Momjian <bruce@momjian.us> wrote: > I have discovered a simpler solution using ALTER TABLE and calling a > conversion function: > > test=> CREATE TABLE tsvector_test(x tsvector); > CREATE TABLE > test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector > test-> USING conversion_func(x); > ALTER TABLE > > No need for a fake data type and the required index infrastructure. I assume you're putting this in the list of commands to run post-migration along with any reindex commands etc? Because it will take a while (still faster than dump/reload i think). For this case, assuming the new tsvector's output function doesn't get confused by the old ordering, I think you can just use USING x::text::tsvector as your conversion expression. For more complex cases you might need to package up the old output function. Also note that you'll want to do any other conversions in the same table at the same time rather than doing multiple conversions. Also, one gotcha to note is that tsvector data can appear inside composite data types or arrays. I don't think that's common so perhaps just a warning in the readme would suffice, but it's something to note at least. -- greg
Hi, On 30 May 09, at 16:02, Greg Stark wrote: > On Sat, May 30, 2009 at 1:11 PM, Bruce Momjian <bruce@momjian.us> > wrote: >> I have discovered a simpler solution using ALTER TABLE and calling a >> conversion function: >> >> test=> CREATE TABLE tsvector_test(x tsvector); >> CREATE TABLE >> test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector >> test-> USING conversion_func(x); >> ALTER TABLE >> >> No need for a fake data type and the required index infrastructure. > > I assume you're putting this in the list of commands to run > post-migration along with any reindex commands etc? Because it will > take a while (still faster than dump/reload i think). Just thinking some more about the idea to get all those post-processing steps running in parallel, it's occurring to me that we have all we need already: would it be possible for pg_migrator to issue a schema-only script with a catalog, in the custom archive format? Then we could use pg_restore -j <whatever> post_migrator.script to run the last migration step. Of course, people will want the custom script output of pg_migrator to be optional, I guess. Regards, -- dim
Bruce Momjian <bruce@momjian.us> writes: > I have discovered a simpler solution using ALTER TABLE and calling a > conversion function: > test=> CREATE TABLE tsvector_test(x tsvector); > CREATE TABLE > test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector > test-> USING conversion_func(x); > ALTER TABLE > No need for a fake data type and the required index infrastructure. I think this is basically a large-caliber foot gun. You're going to pretend that invalid data is valid, until the user gets around to fixing it? regards, tom lane
Dimitri Fontaine wrote: > Hi, > > On 30 May 09, at 16:02, Greg Stark wrote: > > On Sat, May 30, 2009 at 1:11 PM, Bruce Momjian <bruce@momjian.us> > > wrote: > >> I have discovered a simpler solution using ALTER TABLE and calling a > >> conversion function: > >> > >> test=> CREATE TABLE tsvector_test(x tsvector); > >> CREATE TABLE > >> test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector > >> test-> USING conversion_func(x); > >> ALTER TABLE > >> > >> No need for a fake data type and the required index infrastructure. > > > > I assume you're putting this in the list of commands to run > > post-migration along with any reindex commands etc? Because it will > > take a while (still faster than dump/reload i think). Yes. > Just thinking some more about the idea to get all those post- > processing steps running in parallel, it's occurring to me that we > have all we need already: would it be possible for pg_migrator to > issue a schema-only script with a catalog, in the custom archive format? > Then we could use pg_restore -j <whatever> post_migrator.script to run > the last migration step. > > Of course, people will want the custom script output of pg_migrator to > be optional, I guess. Yes, eventually that is possible. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > I have discovered a simpler solution using ALTER TABLE and calling a > > conversion function: > > > test=> CREATE TABLE tsvector_test(x tsvector); > > CREATE TABLE > > test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector > > test-> USING conversion_func(x); > > ALTER TABLE > > > No need for a fake data type and the required index infrastructure. > > I think this is basically a large-caliber foot gun. You're going to > pretend that invalid data is valid, until the user gets around to fixing > it? What choice do we have? While we can mark indexes as invalid (which we do), how do we mark a table's contents as invalid? Should we create rules so no one can see the data and then have the ALTER TABLE script remove the rules after it is rebuilt? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > > Just thinking some more about the idea to get all those post- > > processing steps running in parallel, it's occurring to me that we > > have all we need already: would it be possible for pg_migrator to > > issue a schema only script with a catalog, in the custom archive format? > > Then we could use pg_restore -j <whatever> post_migrator.script to run > > the last migration step. > > > > Of course, people will want the custom script output of pg_migrator to > > be optional, I guess. > > Yes, eventually that is possible. I have added that as a pg_migrator TODO item. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Tom Lane wrote: > > Bruce Momjian <bruce@momjian.us> writes: > > > I have discovered a simpler solution using ALTER TABLE and calling a > > > conversion function: > > > > > test=> CREATE TABLE tsvector_test(x tsvector); > > > CREATE TABLE > > > test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector > > > test-> USING conversion_func(x); > > > ALTER TABLE > > > > > No need for a fake data type and the required index infrastructure. > > > > I think this is basically a large-caliber foot gun. You're going to Uh, isn't pg_migrator itself a large-caliber foot gun? No need to call just part of it a foot gun. ;-) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > Tom Lane wrote: >> I think this is basically a large-caliber foot gun. You're going to >> pretend that invalid data is valid, until the user gets around to fixing >> it? > What choice do we have? Create a fake data type, just as you said before. regards, tom lane
On Sat, May 30, 2009 at 6:23 PM, Bruce Momjian <bruce@momjian.us> wrote: >> I think this is basically a large-caliber foot gun. You're going to >> pretend that invalid data is valid, until the user gets around to fixing >> it? > > What choice do we have? Well you can store the data in a new fake data type (or even just mark the column as a bytea -- since any varlena is as good as any other). Then provide this conversion function to create the new data. I suppose that means you should drop the indexes since if you leave them things could get weird. But doing the conversion would have to rebuild indexes anyways so CREATE INDEX should run in the same time as that step of the conversion would have taken. It would be nice if you could leave them around so the conversion would rebuild them automatically, but that would require creating operators and an opclass for the fake data type which would be more of a pain than just marking the column as a bytea or a data type with no operators. -- greg
Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Tom Lane wrote: > >> I think this is basically a large-caliber foot gun. You're going to > >> pretend that invalid data is valid, until the user gets around to fixing > >> it? > > > What choice do we have? > > Create a fake data type, just as you said before. But I have to create fake index infrastructure, and I have to then modify every reference to that data type at the end, even references in rules and expression indexes --- that seems too complex. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Greg Stark wrote: > On Sat, May 30, 2009 at 1:11 PM, Bruce Momjian <bruce@momjian.us> wrote: > > I have discovered a simpler solution using ALTER TABLE and calling a > > conversion function: > > > > test=> CREATE TABLE tsvector_test(x tsvector); > > CREATE TABLE > > test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector > > test-> USING conversion_func(x); > > ALTER TABLE > > > > No need for a fake data type and the required index infrastructure. > > I assume you're putting this in the list of commands to run > post-migration along with any reindex commands etc? Because it will > take a while (still faster than dump/reload i think). Yes, current CVS now does this. I added logic so if you rebuilt the table, the index will not be rebuilt again because it is a GIN index. > For this case, assuming the new tsvector's output function doesn't get > confused by the old ordering, I think you can just use USING > x::text::tsvector as your conversion expression. For more complex > cases you might need to package up the old output function. Wow, cool idea. I will test that. > Also note that you'll want to do any other conversions in the same > table at the same time rather than doing multiple conversions. Wow, I didn't know we supported that too:

    ALTER TABLE t
        ALTER COLUMN x TYPE tsvector USING x,
        ALTER COLUMN y TYPE tsvector USING y;

I will work on that also. > Also, one gotcha to note is that tsvector data can appear inside > composite data types or arrays. I don't think that's common so perhaps > just a warning in the readme would suffice, but it's something to note > at least. Wow, not sure how I would even find that. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
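For the last point, a pair of catalog queries along these lines might find such columns; a sketch:

    -- columns declared as arrays of tsvector
    SELECT c.relname, a.attname
      FROM pg_attribute a
      JOIN pg_class c ON c.oid = a.attrelid
     WHERE a.atttypid = 'tsvector[]'::regtype
       AND c.relkind = 'r' AND a.attnum > 0 AND NOT a.attisdropped;

    -- standalone composite types containing a tsvector attribute
    SELECT t.typname, a.attname
      FROM pg_type t
      JOIN pg_class c ON c.oid = t.typrelid
      JOIN pg_attribute a ON a.attrelid = c.oid
     WHERE t.typtype = 'c' AND c.relkind = 'c'
       AND a.atttypid = 'tsvector'::regtype
       AND a.attnum > 0 AND NOT a.attisdropped;

Rewriting such columns would still be manual work, but at least they could be flagged in the README.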
Greg Stark wrote: > On Sat, May 30, 2009 at 6:23 PM, Bruce Momjian <bruce@momjian.us> wrote: > >> I think this is basically a large-caliber foot gun. You're going to > >> pretend that invalid data is valid, until the user gets around to fixing > >> it? > > > > What choice do we have? > > Well you can store the data in a new fake data type (or even just mark > the column as a bytea -- since any varlena is as good as any other). > Then provide this conversion function to create the new data. > > I suppose that means you should drop the indexes since if you leave > them things could get weird. But doing the conversion would have to > rebuild indexes anyways so CREATE INDEX should run in the same time as > that step of the conversion would have taken. It would be nice if you > could leave them around so the conversion would rebuild them > automatically, but that would require creating operators and an > opclass for the fake data type which would be more of a pain than just > marking the column as a bytea or a data type with no operators. The way the restore works, you would have to call the data type 'tsvector', and then that oid is linked around in all sorts of odd places and you need to change them all, hence confusion. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Greg Stark wrote: > On Sat, May 30, 2009 at 1:11 PM, Bruce Momjian <bruce@momjian.us> wrote: > > I have discovered a simpler solution using ALTER TABLE and calling a > > conversion function: > > > > test=> CREATE TABLE tsvector_test(x tsvector); > > CREATE TABLE > > test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector > > test-> USING conversion_func(x); > > ALTER TABLE > > > > No need for a fake data type and the required index infrastructure. > > I assume you're putting this in the list of commands to run > post-migration along with any reindex commands etc? Because it will > take a while (still faster than dump/reload i think). > > For this case, assuming the new tsvector's output function doesn't get > confused by the old ordering, I think you can just use USING > x::text::tsvector as your conversion expression. For more complex > cases you might need to package up the old output function. > > Also note that you'll want to do any other conversions in the same > table at the same time rather than doing multiple conversions. I have implemented Greg's ideas using ::text::tsvector casting and it worked:

    $ psql test
    psql (8.4beta2)
    Type "help" for help.
    test=> select * from tsvector_test;
           x
    ----------------
     'c' 'bb' 'aaa'
    (1 row)
    test=> \q

    $ psql test < /u/postgres/rebuilt_tsvector_tables.txt
    ALTER TABLE
    ALTER TABLE
    ALTER TABLE

    $ psql test
    psql (8.4beta2)
    Type "help" for help.
    test=> select * from tsvector_test;
           x
    ----------------
     'aaa' 'bb' 'c'
    (1 row)

Here is the script that groups multiple tsvector columns into a single ALTER TABLE:

    ALTER TABLE "public"."tsvector_test"
        ALTER COLUMN "x" TYPE tsvector USING "x"::text::tsvector;
    ALTER TABLE "public"."tsvector_test2"
        ALTER COLUMN "x" TYPE tsvector USING "x"::text::tsvector,
        ALTER COLUMN "y" TYPE tsvector USING "y"::text::tsvector;
    ALTER TABLE "public"."tsvector_test3"
        ALTER COLUMN "x" TYPE tsvector USING "x"::text::tsvector,
        ALTER COLUMN "y" TYPE tsvector USING "y"::text::tsvector,
        ALTER COLUMN "z" TYPE tsvector USING "z"::text::tsvector;

I knew I was going to have to deal with this optimization but didn't realize ALTER TABLE already supported multiple TYPE changes. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Tom Lane wrote: > > Bruce Momjian <bruce@momjian.us> writes: > > > I have discovered a simpler solution using ALTER TABLE and calling a > > > conversion function: > > > > > test=> CREATE TABLE tsvector_test(x tsvector); > > > CREATE TABLE > > > test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector > > > test-> USING conversion_func(x); > > > ALTER TABLE > > > > > No need for a fake data type and the required index infrastructure. > > > > I think this is basically a large-caliber foot gun. You're going to > > pretend that invalid data is valid, until the user gets around to fixing > > it? > > What choice do we have? While we can mark indexes as invalid (which we > do), how do we mark a table's contents as invalid? Should we create > rules so no one can see the data and then have the ALTER TABLE script > remove the rules after it is rebuilt? OK, what ideas do people have to prevent access to tsvector columns? I am thinking of renaming the tables or something. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Sun, May 31, 2009 at 3:04 PM, Bruce Momjian <bruce@momjian.us> wrote: >> > I think this is basically a large-caliber foot gun. You're going to >> > pretend that invalid data is valid, until the user gets around to fixing >> > it? >> >> What choice do we have? I think in this case the caliber is pretty small and this might be sufficient. It might be nice if we had a check somewhere in the tsvector data type so people get informative errors if their tsvectors are old-style rather than random incorrect results, but that's mostly gilding. >> While we can mark indexes as invalid (which we >> do), how do we mark a table's contents as invalid? Should we create >> rules so no one can see the data and then have the ALTER TABLE script >> remove the rules after it is rebuilt? > > OK, what ideas do people have to prevent access to tsvector columns? I > am thinking of renaming the tables or something. In the general case of data type representation changes I think we need something like:

1. Change the catalog so all the tsvector columns are bytea.
2. Include a C function like migrate_tsvector(bytea) which contains a copy of the old data type's output function and calls the new data type's input function on the result.
3. Include an ALTER TABLE command which calls the C function.

The gotchas I can see with this are: 1) It only works for varlenas -- there isn't a universal fixed-length data type; you would probably have to invent one. 2) I'm not sure what will happen to rules and triggers which call functions on the old data type. If you restore the schema unchanged and modify the catalog directly then they will still be there but have mismatched types. Will users get errors? Will those errors be sensible errors or nonsensical ones? Will the conversion still go ahead or will it complain that there are things which depend on the column? If the problems in (2) prove surmountable then this provides a general solution for any varlena data type representation change. However it will still be an O(n) conversion plus an index rebuild. That's unfortunate, but unless we plan to ship the full set of operators, opclasses, opfamilies, cross-data-type operators, etc. for the old data type I see no way around it. I haven't heard anyone suggest we should roll back the tsvector changes and give up the features the changes provide -- and that's just a performance feature. If that's all it took to convince us to give up in-place upgrade for this data type, then imagine how easy it will be to justify it for actual functional features. (Personally I think we're fooling ourselves to think Postgres is mature enough that we won't come up with any new improvements which will justify a data format change. I would rather hope we'll keep coming up with massive improvements which require major changes in every release.) -- greg
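A sketch of what steps 2 and 3 above might look like in SQL, with the shared-library, function, table, and column names all hypothetical:

    -- step 2: C function wrapping the old output function and the new
    -- input function (the implementation would live in pg_migrator)
    CREATE FUNCTION migrate_tsvector(bytea) RETURNS tsvector
        AS 'pg_migrator', 'migrate_tsvector'
        LANGUAGE C IMMUTABLE STRICT;

    -- step 3: rewrite the column back to the real type
    ALTER TABLE documents
        ALTER COLUMN body_tsv TYPE tsvector
        USING migrate_tsvector(body_tsv);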
Bruce Momjian wrote: > Bruce Momjian wrote: > > Tom Lane wrote: > > > Bruce Momjian <bruce@momjian.us> writes: > > > > I have discovered a simpler solution using ALTER TABLE and calling a > > > > conversion function: > > > > > > > test=> CREATE TABLE tsvector_test(x tsvector); > > > > CREATE TABLE > > > > test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector > > > > test-> USING conversion_func(x); > > > > ALTER TABLE > > > > > > > No need for a fake data type and the required index infrastructure. > > > > > > I think this is basically a large-caliber foot gun. You're going to > > > pretend that invalid data is valid, until the user gets around to fixing > > > it? > > > > What choice do we have? While we can mark indexes as invalid (which we > > do), how do we mark a table's contents as invalid? Should we create > > rules so no one can see the data and then have the ALTER TABLE script > > remove the rules after it is rebuilt? > > OK, what ideas do people have to prevent access to tsvector columns? I > am thinking of renaming the tables or something. Another idea would be to rename just the _column_ (not the table), but that doesn't block SELECT * from showing the column. Another idea would be to replace tsvector's output function, but that would affect all tables, making it hard to incrementally rebuild tables using tsvector columns. Another idea would be to use column-level permissions to prevent column access, but that doesn't affect superusers. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
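For the column-level permissions idea (new in 8.4), the commands would be along these lines; table and column names are hypothetical, this assumes SELECT had been granted to PUBLIC in the first place, and as noted it does not restrict superusers or the table owner:

    -- remove table-level SELECT, then re-grant it column by column,
    -- leaving out the not-yet-converted tsvector column
    REVOKE SELECT ON public.documents FROM PUBLIC;
    GRANT SELECT (id, title) ON public.documents TO PUBLIC;

Column privileges are additive, so simply revoking a privilege on the one column would not help while table-level SELECT remained granted.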
On Sat, May 30, 2009 at 11:58 PM, Bruce Momjian <bruce@momjian.us> wrote: > The way the restore works, you would have to call the data type > 'tsvector', and then that oid is linked around in all sort of odd places > and you need to change them all, hence confusion. In theory, I think it would be possible to hack around this problem by doing the following: 1. Change the OID of 8.4's "tsvector" type to an OID that was not in use in 8.3. 2. Add a type called "tsvector83" to 8.4 using the same OID that the "tsvector" type used in 8.3. 3. Hack pg_dump so that when dumping in upgrade mode, tsvector references from 8.3 are emitted as tsvector83. This would mean that you'd be on line immediately following the upgrade, because the tsvector83 type would still be usable; it just wouldn't have the improved performance of the new tsvector. Unfortunately, it would also require a catversion bump (and some programming work besides, not sure how much). Given that we are 7 months after the start of the last commitfest, which was theoretically our final feature freeze, this is probably not realistic, but if it's not too ugly it might be considered if a similar situation comes up during 8.5 development. (At the risk of beating a dead horse, note if we were upgrading the catalog tables directly via SQL, this type of scenario could be handled cleanly without hacking pg_dump; I repeat my earlier critique that the pg_migrator approach consigns us to a never-ending series of pg_dump hacks, that or having it not work very well.) ...Robert
Robert Haas wrote: > On Sat, May 30, 2009 at 11:58 PM, Bruce Momjian <bruce@momjian.us> wrote: > > The way the restore works, you would have to call the data type > > 'tsvector', and then that oid is linked around in all sorts of odd places > > and you need to change them all, hence confusion. > > In theory, I think it would be possible to hack around this problem by > doing the following: > > 1. Change the OID of 8.4's "tsvector" type to an OID that was not in use in 8.3. > 2. Add a type called "tsvector83" to 8.4 using the same OID that the > "tsvector" type used in 8.3. > 3. Hack pg_dump so that when dumping in upgrade mode, tsvector > references from 8.3 are emitted as tsvector83. The pg_type oids are never transferred by pg_migrator; all schema information is transferred as pg_dump --schema stuff, i.e. ASCII. > This would mean that you'd be on line immediately following the > upgrade, because the tsvector83 type would still be usable; it just > wouldn't have the improved performance of the new tsvector. > Unfortunately, it would also require a catversion bump (and some > programming work besides, not sure how much). Given that we are 7 > months after the start of the last commitfest, which was theoretically > our final feature freeze, this is probably not realistic, but if it's > not too ugly it might be considered if a similar situation comes up > during 8.5 development. > > (At the risk of beating a dead horse, note if we were upgrading the > catalog tables directly via SQL, this type of scenario could be > handled cleanly without hacking pg_dump; I repeat my earlier critique > that the pg_migrator approach consigns us to a never-ending series of > pg_dump hacks, that or having it not work very well.) We could create infrastructure to handle the 8.3 storage format but my guess is that most people would just rather rebuild the tables. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Robert Haas <robertmhaas@gmail.com> writes: > (At the risk of beating a dead horse, note if we were upgrading the > catalog tables directly via SQL, this type of scenario could be > handled cleanly without hacking pg_dump; I repeat my earlier critique > that the pg_migrator approach consigns us to a never-ending series of > pg_dump hacks, that or having it not work very well.) "Updating the catalog tables directly via SQL"? Good luck making that work. If you ever get it to work at all, it'll require a pile of hacks that will make pg_migrator look like paradise. (Which is not to say that pg_migrator isn't a hack; it surely is. But it looks like the least painful approach available.) regards, tom lane
On Sun, May 31, 2009 at 11:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> (At the risk of beating a dead horse, note that if we were upgrading
>> the catalog tables directly via SQL, this type of scenario could be
>> handled cleanly without hacking pg_dump; I repeat my earlier critique
>> that the pg_migrator approach consigns us to a never-ending series of
>> pg_dump hacks, or else to having it not work very well.)
>
> "Updating the catalog tables directly via SQL"? Good luck making that
> work. If you ever get it to work at all, it'll require a pile of hacks
> that will make pg_migrator look like paradise.

For clarity, I really mean "from a standalone backend", but ideally I'd
like it to be SQL.

> (Which is not to say that pg_migrator isn't a hack; it surely is.
> But it looks like the least painful approach available.)

Maybe. It seems that we don't have a good way of handling datatype
conversions. The approaches that have been proposed for tsvector
wouldn't work at all but for the fact that the new output function can
handle the old internal representation, which is not something that we
can guarantee in every case. And, even so, they leave the database in a
broken state until the post-migration scripts have been run. The good
news is that tsvector is not a datatype that everyone uses, and those
who do probably don't use it in every table, but what happens when we
want to change numeric incompatibly?

We really need to figure out an approach that lets us keep the old
datatypes around under a different name while making the original name
be the new version of the datatype. That way people can migrate and be
up, and deal with the need to rewrite their tables at a later time.
Having the database come up quickly but in a state where queries may
return wrong answers is a fairly poor excuse for in-place upgrade.
This is not to say that I don't appreciate the work Bruce is putting
into it, but I fear that without some more engineering it's just going
to be a hack used only by the desperate.

...Robert
On Mon, Jun 1, 2009 at 4:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, May 31, 2009 at 11:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Updating the catalog tables directly via SQL"? Good luck making that
>> work. If you ever get it to work at all, it'll require a pile of hacks
>> that will make pg_migrator look like paradise.
>
> For clarity, I really mean "from a standalone backend", but ideally
> I'd like it to be SQL.

Keep in mind that you have catalogs in all the databases, and even in
standalone mode you need those catalogs to find the, er, catalogs.
There's a reason bootstrap mode is so limited, and even then it needs
some of the catalogs to already be in place.

>> (Which is not to say that pg_migrator isn't a hack; it surely is.
>> But it looks like the least painful approach available.)
>
> Maybe. It seems that we don't have a good way of handling datatype
> conversions. The approaches that have been proposed for tsvector
> wouldn't work at all but for the fact that the new output function can
> handle the old internal representation, which is not something that we
> can guarantee in every case.

Well, I think all we need for that is to have pg_migrator provide the
old output function wrapped up in a migrate_foo() C function.

> And, even so, they leave the database in a broken state until the
> post-migration scripts have been run. The good news is that tsvector
> is not a datatype that everyone uses, and those who do probably don't
> use it in every table, but what happens when we want to change numeric
> incompatibly?

Or, say, timestamp...

> We really need to figure out an approach that lets us keep the old
> datatypes around under a different name while making the original name
> be the new version of the datatype. That way people can migrate and
> be up, and deal with the need to rewrite their tables at a later time.

I do agree that having to rewrite the whole table isn't really
"upgrade-in-place".

But the work to support multiple versions of data types is more than
you're describing. You need to be concerned about things like joins
between tables when some columns are the old data type and some the
new, etc.

Ultimately people will have to convert the data types sometime.

--
greg
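[What Greg suggests would look roughly like this at the SQL level; the
library and function names are invented for illustration, and the C
code behind them (essentially the 8.3 output routine) is not shown:]

    -- Hypothetical wrapper exposing the old 8.3 output code so values
    -- still in the old internal format can always be rendered as text:
    CREATE FUNCTION migrate_tsvector_out(tsvector) RETURNS text
        AS '$libdir/pg_migrator_support', 'migrate_tsvector_out'
        LANGUAGE C STRICT IMMUTABLE;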
Greg Stark <stark@enterprisedb.com> writes:
> On Mon, Jun 1, 2009 at 4:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> We really need to figure out an approach that lets us keep the old
>> datatypes around under a different name while making the original name
>> be the new version of the datatype. That way people can migrate and
>> be up, and deal with the need to rewrite their tables at a later time.

> I do agree that having to rewrite the whole table isn't really
> "upgrade-in-place".

It's certainly the case that there is a lot more work to do before
pg_migrator could support everything that we reasonably want to be able
to do in a version update. As I see it, the reason it's getting revived
now is that 8.3->8.4 happens to be an update where most of what it
can't (yet) do isn't necessary. That means we can get it out there, get
the bugs out of the functionality it does have, and most importantly
try to set an expectation that future updates will also have some
degree of update-in-place capability. If we wait till it's perfect then
nothing will ever happen at all in this space.

regards, tom lane
On Mon, Jun 1, 2009 at 11:10 AM, Greg Stark <stark@enterprisedb.com> wrote:
> On Mon, Jun 1, 2009 at 4:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, May 31, 2009 at 11:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> "Updating the catalog tables directly via SQL"? Good luck making that
>>> work. If you ever get it to work at all, it'll require a pile of hacks
>>> that will make pg_migrator look like paradise.
>>
>> For clarity, I really mean "from a standalone backend", but ideally
>> I'd like it to be SQL.
>
> Keep in mind that you have catalogs in all the databases, and even in
> standalone mode you need those catalogs to find the, er, catalogs.
> There's a reason bootstrap mode is so limited, and even then it needs
> some of the catalogs to already be in place.

Yep. Changes to the schema of the bootstrap tables are definitely the
toughest nut to crack. I haven't been convinced that it's impossible,
but given my relative level of knowledge of the code compared to Tom,
that could well be an indication that I'm overly optimistic.

>>> (Which is not to say that pg_migrator isn't a hack; it surely is.
>>> But it looks like the least painful approach available.)
>>
>> Maybe. It seems that we don't have a good way of handling datatype
>> conversions. The approaches that have been proposed for tsvector
>> wouldn't work at all but for the fact that the new output function can
>> handle the old internal representation, which is not something that we
>> can guarantee in every case.
>
> Well, I think all we need for that is to have pg_migrator provide the
> old output function wrapped up in a migrate_foo() C function.

Well, that would be better, but it still leaves the database
temporarily broken.

>> And, even so, they leave the database in a broken state until the
>> post-migration scripts have been run. The good news is that tsvector
>> is not a datatype that everyone uses, and those who do probably don't
>> use it in every table, but what happens when we want to change numeric
>> incompatibly?
>
> Or, say, timestamp...

Yeah.

>> We really need to figure out an approach that lets us keep the old
>> datatypes around under a different name while making the original name
>> be the new version of the datatype. That way people can migrate and
>> be up, and deal with the need to rewrite their tables at a later time.
>
> I do agree that having to rewrite the whole table isn't really
> "upgrade-in-place".
>
> But the work to support multiple versions of data types is more than
> you're describing. You need to be concerned about things like joins
> between tables when some columns are the old data type and some the
> new, etc.

True.

> Ultimately people will have to convert the data types sometime.

Yes they will, but not having to do it as part of the upgrade is
important. What particularly bothers me is the possibility that the
database comes online and starts letting clients in (who don't know
that it's broken) while the breakage is still present.

...Robert
On Mon, Jun 1, 2009 at 11:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@enterprisedb.com> writes:
>> On Mon, Jun 1, 2009 at 4:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> We really need to figure out an approach that lets us keep the old
>>> datatypes around under a different name while making the original name
>>> be the new version of the datatype. That way people can migrate and
>>> be up, and deal with the need to rewrite their tables at a later time.
>
>> I do agree that having to rewrite the whole table isn't really
>> "upgrade-in-place".
>
> It's certainly the case that there is a lot more work to do before
> pg_migrator could support everything that we reasonably want to be
> able to do in a version update. As I see it, the reason it's getting
> revived now is that 8.3->8.4 happens to be an update where most of what
> it can't (yet) do isn't necessary. That means we can get it out there,
> get the bugs out of the functionality it does have, and most importantly
> try to set an expectation that future updates will also have some degree
> of update-in-place capability. If we wait till it's perfect then
> nothing will ever happen at all in this space.

I agree. I remain doubtful that dumping and reloading the schema is the
best way to go, but it's certainly a worthwhile experiment, because (a)
I might easily be wrong and (b) we'll hopefully learn some things that
will be useful going forward.

...Robert
Robert Haas wrote:
> > It's certainly the case that there is a lot more work to do before
> > pg_migrator could support everything that we reasonably want to be
> > able to do in a version update. As I see it, the reason it's getting
> > revived now is that 8.3->8.4 happens to be an update where most of
> > what it can't (yet) do isn't necessary. That means we can get it out
> > there, get the bugs out of the functionality it does have, and most
> > importantly try to set an expectation that future updates will also
> > have some degree of update-in-place capability. If we wait till it's
> > perfect then nothing will ever happen at all in this space.
>
> I agree. I remain doubtful that dumping and reloading the schema is
> the best way to go, but it's certainly a worthwhile experiment,
> because (a) I might easily be wrong and (b) we'll hopefully learn some
> things that will be useful going forward.

Doing an upgrade without dumping/reloading the schema would be overly
complex and would have to be adjusted for every major release, with a
high likelihood that it would either not get done or fail in random
ways.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Greg Stark wrote:
> On Mon, Jun 1, 2009 at 4:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Sun, May 31, 2009 at 11:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> "Updating the catalog tables directly via SQL"? Good luck making that
> >> work. If you ever get it to work at all, it'll require a pile of hacks
> >> that will make pg_migrator look like paradise.
> >
> > For clarity, I really mean "from a standalone backend", but ideally
> > I'd like it to be SQL.
>
> Keep in mind that you have catalogs in all the databases, and even in
> standalone mode you need those catalogs to find the, er, catalogs.
> There's a reason bootstrap mode is so limited, and even then it needs
> some of the catalogs to already be in place.
>
> >> (Which is not to say that pg_migrator isn't a hack; it surely is.
> >> But it looks like the least painful approach available.)
> >
> > Maybe. It seems that we don't have a good way of handling datatype
> > conversions. The approaches that have been proposed for tsvector
> > wouldn't work at all but for the fact that the new output function
> > can handle the old internal representation, which is not something
> > that we can guarantee in every case.
>
> Well, I think all we need for that is to have pg_migrator provide the
> old output function wrapped up in a migrate_foo() C function.

Yep, it is just a matter of using a custom function call rather than a
cast.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
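[Using the custom function rather than a cast would amount to something
like the following in a post-migration script, building on the wrapper
sketched earlier in the thread; the names are hypothetical, and the
exact argument type would depend on how the compatibility type is set
up:]

    -- Convert via the old output function instead of relying on a
    -- registered cast:
    ALTER TABLE docs
        ALTER COLUMN tsv TYPE tsvector
        USING migrate_tsvector_out(tsv)::tsvector;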