Thread: pg_migrator and an 8.3-compatible tsvector data type

pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
I found out at PGCon that the internal format of tsvector changed
slightly from 8.3 to 8.4.  Teodor gave me a conversion function and I
have adjusted pg_migrator to install a v8_3_tsvector data type to be
used during the load so the old user tables use that data type.  You can
see the code here at the bottom of this file:

http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/pg-migrator/pg_migrator/src/pg_migrator.c?rev=1.36&content-type=text/x-cvsweb-markup

I am not done yet because eventually v8_3_tsvector will be moved into a
separate schema and the tsvector data type restored.

The problem I have is that while I have created v8_3_tsvector (with
casts), I have not created index routines for it, so I get this error
when restoring a database that uses a GIN index on tsvector:
Restoring database schema                                  psql:/u/postgres/pg_migrator_dump_db.sql:7006: ERROR:  data
typetsvectorhas no default operator class for access method "gin"HINT:  You must specify an operator class for the
indexor define adefault operator class for the data type.
 

So, how do I create a compatible set of hooks for v8_3_tsvector? 
Because tsvector is a native data type I can't figure out how to set up
a similar type.  Also, will the indexes be OK if they are created from
types that don't have the 8.3 format?  What about new indexes created
after the migration is done?

Ideally the cast would just be called during index activity and the 8.4
tsvector index routines used.  I am told by Teodor that GiST indexes
would be fine with the 8.3 data format, and we are forcing the
reindexing of GIN indexes anyway in 8.4.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Because no one has responed

---------------------------------------------------------------------------

Bruce Momjian wrote:
> I found out at PGCon that the internal format of tsvector changed
> slightly from 8.3 to 8.4.  Teodor gave me a conversion function and I
> have adjusted pg_migrator to install a v8_3_tsvector data type to be
> used during the load so the old user tables use that data type.  You can
> see the code here at the bottom of this file:
> 
>
http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/pg-migrator/pg_migrator/src/pg_migrator.c?rev=1.36&content-type=text/x-cvsweb-markup
> 
> I am not done yet because eventually v8_3_tsvector will be moved into a
> separate schema and the tsvector data type restored.
> 
> The problem I have is that while I have created v8_3_tsvector (with
> casts), I have not created index routines for it, so I get this error
> when restoring a database that uses a GIN index on tsvector:
> 
>     Restoring database schema                                  
>     psql:/u/postgres/pg_migrator_dump_db.sql:7006: ERROR:  data type
>     tsvector has no default operator class for access method "gin"
>     HINT:  You must specify an operator class for the index or define a
>     default operator class for the data type.
> 
> So, how do I create a compatible set of hooks for v8_3_tsvector? 
> Because tsvector is a native data type I can't figure out how to set up
> a similar type.  Also, will the indexes be OK if they are created from
> types that don't have the 8.3 format?  What about new indexes created
> after the migration is done?
> 
> Ideally the cast would just be called during index activity and the 8.4
> tsvector index routines used.  I am told by Teodor that GiST indexes
> would be fine with the 8.3 data format, and we are forcing the
> reindexing of GIN indexes anyway in 8.4.
> 
> -- 
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com
> 
>   + If your life is a hard drive, Christ can be your backup. +
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Because no one has responded, I am going to prevent pg_migrator from
working with a cluster that uses tsvector.  I realize this limits
pg_migrator's usefulness, but I have to move on.

---------------------------------------------------------------------------

Bruce Momjian wrote:
> I found out at PGCon that the internal format of tsvector changed
> slightly from 8.3 to 8.4.  Teodor gave me a conversion function and I
> have adjusted pg_migrator to install a v8_3_tsvector data type to be
> used during the load so the old user tables use that data type.  You can
> see the code here at the bottom of this file:
> 
>
http://cvs.pgfoundry.org/cgi-bin/cvsweb.cgi/pg-migrator/pg_migrator/src/pg_migrator.c?rev=1.36&content-type=text/x-cvsweb-markup
> 
> I am not done yet because eventually v8_3_tsvector will be moved into a
> separate schema and the tsvector data type restored.
> 
> The problem I have is that while I have created v8_3_tsvector (with
> casts), I have not created index routines for it, so I get this error
> when restoring a database that uses a GIN index on tsvector:
> 
>     Restoring database schema                                  
>     psql:/u/postgres/pg_migrator_dump_db.sql:7006: ERROR:  data type
>     tsvector has no default operator class for access method "gin"
>     HINT:  You must specify an operator class for the index or define a
>     default operator class for the data type.
> 
> So, how do I create a compatible set of hooks for v8_3_tsvector? 
> Because tsvector is a native data type I can't figure out how to set up
> a similar type.  Also, will the indexes be OK if they are created from
> types that don't have the 8.3 format?  What about new indexes created
> after the migration is done?
> 
> Ideally the cast would just be called during index activity and the 8.4
> tsvector index routines used.  I am told by Teodor that GiST indexes
> would be fine with the 8.3 data format, and we are forcing the
> reindexing of GIN indexes anyway in 8.4.
> 
> -- 
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com
> 
>   + If your life is a hard drive, Christ can be your backup. +
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Josh Berkus
Date:
On 5/28/09 2:30 PM, Bruce Momjian wrote:
> Because no one has responded, I am going to prevent pg_migrator from
> working with a cluster that uses tsvector.  I realize this limits
> pg_migrator's usefulness, but I have to move on.

I don't know how to fix the issue, certainly.

Why was there a format change to tsvector anyway?  Was there an 
important reason for this?

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Josh Berkus wrote:
> On 5/28/09 2:30 PM, Bruce Momjian wrote:
> > Because no one has responded, I am going to prevent pg_migrator from
> > working with a cluster that uses tsvector.  I realize this limits
> > pg_migrator's usefulness, but I have to move on.
> 
> I don't know how to fix the issue, certainly.
> 
> Why was there a format change to tsvector anyway?  Was there an 
> important reason for this?

The ordering of the lexems was changed:
/* * Convert 8.3 tsvector to 8.4 * * 8.3 sorts lexemes by its length and if lengths are the same then it uses *
alphabeticorder;  8.4 sorts lexemes in lexicographical order, e.g. * * => SELECT 'c bb aaa'::tsvector; *     tsvector *
----------------*  'aaa' 'bb' 'c'         -- 8.4 *  'c' 'bb' 'aaa'         -- 8.3 */
 

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Josh Berkus
Date:
Bruce,

> The ordering of the lexems was changed:
>
>     /*
>      * Convert 8.3 tsvector to 8.4
>      *
>      * 8.3 sorts lexemes by its length and if lengths are the same then it uses
>      * alphabetic order;  8.4 sorts lexemes in lexicographical order, e.g.
>      *
>      * =>  SELECT 'c bb aaa'::tsvector;
>      *     tsvector
>      * ----------------
>      *  'aaa' 'bb' 'c'         -- 8.4
>      *  'c' 'bb' 'aaa'         -- 8.3
>      */

What does that get us in terms of performance etc.?

Given that it's going to prevent users of Tsearch from 
upgrading-in-place, the format change ought to be giving us some serious 
gains.  Otherwise we should put it off until we need to make other 
datatype changes.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Josh Berkus wrote:
> Bruce,
> 
> > The ordering of the lexems was changed:
> >
> >     /*
> >      * Convert 8.3 tsvector to 8.4
> >      *
> >      * 8.3 sorts lexemes by its length and if lengths are the same then it uses
> >      * alphabetic order;  8.4 sorts lexemes in lexicographical order, e.g.
> >      *
> >      * =>  SELECT 'c bb aaa'::tsvector;
> >      *     tsvector
> >      * ----------------
> >      *  'aaa' 'bb' 'c'         -- 8.4
> >      *  'c' 'bb' 'aaa'         -- 8.3
> >      */
> 
> What does that get us in terms of performance etc.?
> 
> Given that it's going to prevent users of Tsearch from 
> upgrading-in-place, the format change ought to be giving us some serious 
> gains.  Otherwise we should put it off until we need to make other 
> datatype changes.

No idea but now that we are in beta we would then be invalidating beta
tester data.  The area I got stuck on is that there is no CAST behavior
when creating an index.  We are already invalidating GIN indexes, but I
can't even create an index to support the old data type.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Bruce,
>> The ordering of the lexems was changed:

> What does that get us in terms of performance etc.?

It was changed to support partial-match tsvector queries.  Without it,
a partial match query would have to scan entire tsvectors instead
of applying binary search.  I don't know if Oleg and Teodor did any
actual performance tests on the size of the hit, but it seems like
it could be pretty awful for large documents.
        regards, tom lane


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Josh Berkus wrote:
>> Given that it's going to prevent users of Tsearch from 
>> upgrading-in-place, the format change ought to be giving us some serious 
>> gains.  Otherwise we should put it off until we need to make other 
>> datatype changes.

> No idea but now that we are in beta we would then be invalidating beta
> tester data.  The area I got stuck on is that there is no CAST behavior
> when creating an index.  We are already invalidating GIN indexes, but I
> can't even create an index to support the old data type.

It's certainly doable.  Bruce is just applying the strategy he mentioned
in our talk ;-)
        regards, tom lane


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Josh Berkus
Date:
Tom,

> It was changed to support partial-match tsvector queries.  Without it,
> a partial match query would have to scan entire tsvectors instead
> of applying binary search.  I don't know if Oleg and Teodor did any
> actual performance tests on the size of the hit, but it seems like
> it could be pretty awful for large documents.

Ah, ok, that's pretty significant.  It makes a huge difference for some 
users, one of whom will be quoted on the press release.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Josh Berkus wrote:
> >> Given that it's going to prevent users of Tsearch from 
> >> upgrading-in-place, the format change ought to be giving us some serious 
> >> gains.  Otherwise we should put it off until we need to make other 
> >> datatype changes.
> 
> > No idea but now that we are in beta we would then be invalidating beta
> > tester data.  The area I got stuck on is that there is no CAST behavior
> > when creating an index.  We are already invalidating GIN indexes, but I
> > can't even create an index to support the old data type.
> 
> It's certainly doable.  Bruce is just applying the strategy he mentioned
> in our talk ;-)

Kind of --- I am stuck because we don't auto-cast for index usage like
we do for function calls and operators, so I will just keep going and
disable tsvector.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Greg Stark
Date:
On Thu, May 28, 2009 at 11:26 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Kind of --- I am stuck because we don't auto-cast for index usage like
> we do for function calls and operators, so I will just keep going and
> disable tsvector.

I was pretty stumped by what you meant by "auto-cast" but now I'm
beginning to think what you're looking for are cross-type comparison
operators? You could define a bunch of such operators which manually
call the cast on one side and add them to the operator class. I'm not
sure how to set up the operator class entries for GIN though.

-- 
greg


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Tom Lane
Date:
Greg Stark <stark@enterprisedb.com> writes:
> On Thu, May 28, 2009 at 11:26 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> Kind of --- I am stuck because we don't auto-cast for index usage like
>> we do for function calls and operators, so I will just keep going and
>> disable tsvector.

> I was pretty stumped by what you meant by "auto-cast" but now I'm
> beginning to think what you're looking for are cross-type comparison
> operators? You could define a bunch of such operators which manually
> call the cast on one side and add them to the operator class. I'm not
> sure how to set up the operator class entries for GIN though.

No, what we need is a GIN (and probably GIST) opclass to support
old-style tsvectors.  The user-level operators aren't really an issue
because an implicit cast will get inserted in front of them when
necessary, but we don't support that for indexes.

People who want decent performance on partial match queries (or maybe
even *any* TS queries) are going to have to convert to the new format
anyway, though, so maybe this isn't something worth putting a whole lot
of work into.  It might be sufficient if the database conversion
succeeds but leaves the tsvector columns unusable until they're
converted.
        regards, tom lane


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Alvaro Herrera
Date:
Tom Lane wrote:

> People who want decent performance on partial match queries (or maybe
> even *any* TS queries) are going to have to convert to the new format
> anyway, though, so maybe this isn't something worth putting a whole lot
> of work into.  It might be sufficient if the database conversion
> succeeds but leaves the tsvector columns unusable until they're
> converted.

There are so many caveats on pg_migrator (and things that need to be
done after the migration is complete) that one starts to wonder if
people is not better off just using parallel pg_restore.  From Stefan's
reported timings I'm not sure that pg_migrator is that much of a benefit
in the first place ... unless copy mode can be made much faster.  (On
link mode it is so much faster that it's worth it, but then you don't
have an escape hatch).

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> Tom Lane wrote:
> 
> > People who want decent performance on partial match queries (or maybe
> > even *any* TS queries) are going to have to convert to the new format
> > anyway, though, so maybe this isn't something worth putting a whole lot
> > of work into.  It might be sufficient if the database conversion
> > succeeds but leaves the tsvector columns unusable until they're
> > converted.
> 
> There are so many caveats on pg_migrator (and things that need to be
> done after the migration is complete) that one starts to wonder if
> people is not better off just using parallel pg_restore.  From Stefan's
> reported timings I'm not sure that pg_migrator is that much of a benefit
> in the first place ... unless copy mode can be made much faster.  (On
> link mode it is so much faster that it's worth it, but then you don't
> have an escape hatch).

That is accurate.  I doubt copy mode speed can be improved.  I think
doing a backup, then using --link mode will be the most common usage
pattern.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Greg Stark
Date:
On Fri, May 29, 2009 at 12:53 AM, Bruce Momjian <bruce@momjian.us> wrote:
> That is accurate.  I doubt copy mode speed can be improved.  I think
> doing a backup, then using --link mode will be the most common usage
> pattern.

Well doing a backup will take more or less the same time as copy mode, no?

Actually there's a big win for people who can take a snapshot using
their SAN or filesystem, but that's not everyone.

--
greg


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> Alvaro Herrera wrote:

> > There are so many caveats on pg_migrator (and things that need to be
> > done after the migration is complete) that one starts to wonder if
> > people is not better off just using parallel pg_restore.  From Stefan's
> > reported timings I'm not sure that pg_migrator is that much of a benefit
> > in the first place ... unless copy mode can be made much faster.  (On
> > link mode it is so much faster that it's worth it, but then you don't
> > have an escape hatch).
> 
> That is accurate.  I doubt copy mode speed can be improved.

Why not?  Right now it's single-threaded.  Would it be faster if it ran
several copies in parallel?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > Alvaro Herrera wrote:
> 
> > > There are so many caveats on pg_migrator (and things that need to be
> > > done after the migration is complete) that one starts to wonder if
> > > people is not better off just using parallel pg_restore.  From Stefan's
> > > reported timings I'm not sure that pg_migrator is that much of a benefit
> > > in the first place ... unless copy mode can be made much faster.  (On
> > > link mode it is so much faster that it's worth it, but then you don't
> > > have an escape hatch).
> > 
> > That is accurate.  I doubt copy mode speed can be improved.
> 
> Why not?  Right now it's single-threaded.  Would it be faster if it ran
> several copies in parallel?

Sure, but that assumes you have parallel I/O channels;  I assume right
now it is I/O limited.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> Alvaro Herrera wrote:

> > Why not?  Right now it's single-threaded.  Would it be faster if it ran
> > several copies in parallel?
> 
> Sure, but that assumes you have parallel I/O channels;  I assume right
> now it is I/O limited.

But so does parallel pg_restore, no?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Josh Berkus
Date:
Bruce,

> Sure, but that assumes you have parallel I/O channels;  I assume right
> now it is I/O limited.

Even a simple cp will be speeded up between 50% and 150% (depending on 
your storage) by doing several files in parallel.  However, I'm not sure 
that's what you should be spending your time on.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Andrew Dunstan
Date:

Alvaro Herrera wrote:
> Bruce Momjian wrote:
>   
>> Alvaro Herrera wrote:
>>     
>
>   
>>> Why not?  Right now it's single-threaded.  Would it be faster if it ran
>>> several copies in parallel?
>>>       
>> Sure, but that assumes you have parallel I/O channels;  I assume right
>> now it is I/O limited.
>>     
>
> But so does parallel pg_restore, no?
>
>   

Well, no, it assumes you're not saturating your I/O bandwidth, which 
turns out to be the case more often than you might expect, even without 
multiple I/O channels.

cheers

andrew


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Greg Smith
Date:
On Thu, 28 May 2009, Alvaro Herrera wrote:

> There are so many caveats on pg_migrator (and things that need to be
> done after the migration is complete) that one starts to wonder if
> people is not better off just using parallel pg_restore.

If you've got something close to a 24x7 shop, where planned downtime can 
at worst in the small number of hours, the landscape breaks down something 
like this:

<100GB:  Current dump/reload is probably fine
100GB-1TB:  Aggressive pg_dump and parallel pg_restore might work
>1TB:  Nothing but pg_migrator will save you

If you're doing a version migration, you need to block writer activity 
from the time you start dumping to get a perfectly consistant snapshot, 
and large databases can't be made available to users until their indexes 
are rebuilt lest every client doing a "select * from x where..." kicks off 
a giant sequential scan.  Parallel restore only improves one section of 
that work.

There is a significant difference between something being difficult or 
having caveats and being it impossible, and right now there's no good 
answer for TB scale upgrades unless you've lucky enough to have an app 
that Slony can usefully replicate.  If there's somebody out there who has 
a bunch of tsvector data they need to get upgraded in-place, they should 
step up to help out here.  I think Bruce is right to move on to more 
popular issues until that happens.  This issue can all get sorted out in 
user space at some later time, right?

> From Stefan's reported timings I'm not sure that pg_migrator is that 
> much of a benefit in the first place ... unless copy mode can be made 
> much faster.

Systems administrators have all sorts of techniques to speed up this sort 
of thing, it's a well solved problem.  The people who need in-place 
upgrade will figure out how to either make copy mode fast enough, or how 
to make link mode usable by filesystem tricks.  We don't need to worry 
about that for them.  I think it's premature and distracting to start 
talking about optimizing pg_migrator anyway, it's plenty fast for its 
intended audience.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Stefan Kaltenbrunner
Date:
Alvaro Herrera wrote:
> Bruce Momjian wrote:
>> Alvaro Herrera wrote:
> 
>>> There are so many caveats on pg_migrator (and things that need to be
>>> done after the migration is complete) that one starts to wonder if
>>> people is not better off just using parallel pg_restore.  From Stefan's
>>> reported timings I'm not sure that pg_migrator is that much of a benefit
>>> in the first place ... unless copy mode can be made much faster.  (On
>>> link mode it is so much faster that it's worth it, but then you don't
>>> have an escape hatch).
>> That is accurate.  I doubt copy mode speed can be improved.
> 
> Why not?  Right now it's single-threaded.  Would it be faster if it ran
> several copies in parallel?

I guess it would be much faster on powerful hardware - we also have to 
consider that copy mode now is a no-op really.
If it had to do any actual page conversation too it seems entirely 
possible that a parallel restore might be even faster that a single 
threaded pg_migrator in copy mode.


Stefan


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Zdenek Kotala
Date:
Bruce Momjian píše v čt 28. 05. 2009 v 17:42 -0400:
> Josh Berkus wrote:
> > On 5/28/09 2:30 PM, Bruce Momjian wrote:
> > > Because no one has responded, I am going to prevent pg_migrator from
> > > working with a cluster that uses tsvector.  I realize this limits
> > > pg_migrator's usefulness, but I have to move on.
> > 
> > I don't know how to fix the issue, certainly.
> > 
> > Why was there a format change to tsvector anyway?  Was there an 
> > important reason for this?
> 
> The ordering of the lexems was changed:

The biggest problem is dictionary change. I'm not sure if it happened
but IIRC Teodor mentioned it in Ottawa. If it happened It hits down
tsvector compatibility at all.  
Zdenek



Re: pg_migrator and an 8.3-compatible tsvector data type

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Bruce Momjian wrote:
>> Alvaro Herrera wrote:
>>> Why not?  Right now it's single-threaded.  Would it be faster if it ran
>>> several copies in parallel?
>> 
>> Sure, but that assumes you have parallel I/O channels;  I assume right
>> now it is I/O limited.

> But so does parallel pg_restore, no?

The point of parallel pg_restore is that COPY is frequently CPU-bound to
some extent, and so you can put multiple CPUs to work by parallelizing.
I find it much less probable that multiple "cp" operations can be
parallelized, unless the DB is spread across multiple tablespaces and
the code is smart enough to interleave by tablespace.
        regards, tom lane


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Tom Lane
Date:
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> The biggest problem is dictionary change. I'm not sure if it happened
> but IIRC Teodor mentioned it in Ottawa. If it happened It hits down
> tsvector compatibility at all.  

No more than changing dictionary behavior in an existing installation.
What was stated when the issue came up during 8.3 development is that
you don't normally need to worry about small changes in dictionary
behavior because overall text search behavior will still be "close
enough".  (I seem to recall that I'd complained that any change in
dictionary behavior would invalidate indexes based on the dictionary,
and this was the answer.)
        regards, tom lane


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Zdenek Kotala
Date:
Tom Lane píše v pá 29. 05. 2009 v 11:28 -0400:
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> > The biggest problem is dictionary change. I'm not sure if it happened
> > but IIRC Teodor mentioned it in Ottawa. If it happened It hits down
> > tsvector compatibility at all.  
> 
> No more than changing dictionary behavior in an existing installation.
> What was stated when the issue came up during 8.3 development is that
> you don't normally need to worry about small changes in dictionary
> behavior because overall text search behavior will still be "close
> enough".  (I seem to recall that I'd complained that any change in
> dictionary behavior would invalidate indexes based on the dictionary,
> and this was the answer.)

It seems to be OK then. However, pg_migrator should inform about it.
Zdenek



Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > Bruce,
> >> The ordering of the lexems was changed:
> 
> > What does that get us in terms of performance etc.?
> 
> It was changed to support partial-match tsvector queries.  Without it,
> a partial match query would have to scan entire tsvectors instead
> of applying binary search.  I don't know if Oleg and Teodor did any
> actual performance tests on the size of the hit, but it seems like
> it could be pretty awful for large documents.

I started thinking about the performance issues of the tsvector changes.
Teodor gave me this code for conversion that basically does:
qsort_arg((void *) ARRPTR(t), t->size, sizeof(WordEntry), cmpLexeme, (void*) t);

So, basically, every time there is a cast we have to do a sort, which
for a large document would yield poor performance, and because we are
not storing the sorted result, it happens for every access;  this might
be an unacceptable performance burden.

So, one idea would be, instead of a cast, have pg_migrator rebuild the
tsvector columns with ALTER TABLE, so then the 8.4 index code could be
used.  But then we might as well just tell the users to migrate the
tsvector tables themselves, which is how pg_migrator behaves now.

Obviously we are still trying to figure out the best way to handle data
type changes;  I think as soon as we figure out a plan for tsvector we
can use that method for future changes.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Josh Berkus
Date:
Bruce,

> So, one idea would be, instead of a cast, have pg_migrator rebuild the
> tsvector columns with ALTER TABLE, so then the 8.4 index code could be
> used.  But then we might as well just tell the users to migrate the
> tsvector tables themselves, which is how pg_migrator behaves now.

It would be nice to have pg_migrator handle this, especially if we could 
do it in parallel.  Then we just have to warn users that migrating a 
database with tsvector columns takes significantly longer.  That is,

1) do rest of catalog swap and link/copy of objects.
2) mark all tsvector columns as 83_tsvector and add new tsvector type   (these columns will be unusable for queries)
3) bring up database
4) search for all 83_tsvector columns
5) do ALTER TABLE on each of these columns, in parallel, up to a 
configuration setting (default 3).

However, I can also understand not having time to implement the above 
before 8.4 release.


-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> It would be nice to have pg_migrator handle this, especially if we could 
> do it in parallel.  Then we just have to warn users that migrating a 
> database with tsvector columns takes significantly longer.  That is,

> 1) do rest of catalog swap and link/copy of objects.
> 2) mark all tsvector columns as 83_tsvector and add new tsvector type
>     (these columns will be unusable for queries)
> 3) bring up database
> 4) search for all 83_tsvector columns
> 5) do ALTER TABLE on each of these columns, in parallel, up to a 
> configuration setting (default 3).

pg_migrator is already emitting a script that is intended to be run
after conversion, to handle REINDEXing of incompatible indexes.  That
could easily be made to do ALTER TYPE on old tsvector columns too, no?

The parallel bit is pie in the sky and should not be considered even
for a millisecond during this release cycle.  Save it for 8.5, or
suggest to people that they manually cut the script apart if they're
desperate to have that.
        regards, tom lane


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > It would be nice to have pg_migrator handle this, especially if we could 
> > do it in parallel.  Then we just have to warn users that migrating a 
> > database with tsvector columns takes significantly longer.  That is,
> 
> > 1) do rest of catalog swap and link/copy of objects.
> > 2) mark all tsvector columns as 83_tsvector and add new tsvector type
> >     (these columns will be unusable for queries)
> > 3) bring up database
> > 4) search for all 83_tsvector columns
> > 5) do ALTER TABLE on each of these columns, in parallel, up to a 
> > configuration setting (default 3).
> 
> pg_migrator is already emitting a script that is intended to be run
> after conversion, to handle REINDEXing of incompatible indexes.  That
> could easily be made to do ALTER TYPE on old tsvector columns too, no?

Hmmm.  Well, the problem right now with v8_3_tsvector is that it fails
on index create, even when the index is empty, because I can't figure
out how to simply set up the proper index catalog entries.  Once that is
fixed and I can bind tsvector to v8_3_tsvector on schema creation, I can
easily emit ALTER TABLE to fix the issue.  And, at that point the
tsvector GIN indexes would be automatically created so I can skip that
part.

> The parallel bit is pie in the sky and should not be considered even
> for a millisecond during this release cycle.  Save it for 8.5, or
> suggest to people that they manually cut the script apart if they're
> desperate to have that.

Agreed.

FYI, this is 1% as hard as the Win32 port, so I am not discouraged.  ;-)

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Tom Lane wrote:
> > Josh Berkus <josh@agliodbs.com> writes:
> > > It would be nice to have pg_migrator handle this, especially if we could 
> > > do it in parallel.  Then we just have to warn users that migrating a 
> > > database with tsvector columns takes significantly longer.  That is,
> > 
> > > 1) do rest of catalog swap and link/copy of objects.
> > > 2) mark all tsvector columns as 83_tsvector and add new tsvector type
> > >     (these columns will be unusable for queries)
> > > 3) bring up database
> > > 4) search for all 83_tsvector columns
> > > 5) do ALTER TABLE on each of these columns, in parallel, up to a 
> > > configuration setting (default 3).
> > 
> > pg_migrator is already emitting a script that is intended to be run
> > after conversion, to handle REINDEXing of incompatible indexes.  That
> > could easily be made to do ALTER TYPE on old tsvector columns too, no?
> 
> Hmmm.  Well, the problem right now with v8_3_tsvector is that it fails
> on index create, even when the index is empty, because I can't figure
> out how to simply set up the proper index catalog entries.  Once that is
> fixed and I can bind tsvector to v8_3_tsvector on schema creation, I can
> easily emit ALTER TABLE to fix the issue.  And, at that point the
> tsvector GIN indexes would be automatically created so I can skip that
> part.

I have discovered a simpler solution using ALTER TABLE and calling a
conversion function:
test=> CREATE TABLE tsvector_test(x tsvector);CREATE TABLEtest=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector
test->USING conversion_func(x);ALTER TABLE
 

No need for a fake data type and the required index infrastructure.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Greg Stark
Date:
On Sat, May 30, 2009 at 1:11 PM, Bruce Momjian <bruce@momjian.us> wrote:
> I have discovered a simpler solution using ALTER TABLE and calling a
> conversion function:
>
>        test=> CREATE TABLE tsvector_test(x tsvector);
>        CREATE TABLE
>        test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector
>        test-> USING conversion_func(x);
>        ALTER TABLE
>
> No need for a fake data type and the required index infrastructure.

I assume you're putting this in the list of commands to run
post-migration along with any reindex commands etc? Because it will
take a while (still faster than dump/reload i think).

For this case, assuming the new tsvector's output function doesn't get
confused by the old ordering, I think you can just use USING
x::text::tsvector as your conversion expression. For more complex
cases you might need to package up the old output function.

Also note that you'll want to do any other conversions in the same
table at the same time rather than doing multiple conversions.

Also, one gotcha to note is that tsvector data can appear inside
composite data types or arrays. I don't think that's common so perhaps
just a warning in the readme would suffice, but it's something to note
at least.

--
greg


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Dimitri Fontaine
Date:
Hi,

Le 30 mai 09 à 16:02, Greg Stark a écrit :
> On Sat, May 30, 2009 at 1:11 PM, Bruce Momjian <bruce@momjian.us>
> wrote:
>> I have discovered a simpler solution using ALTER TABLE and calling a
>> conversion function:
>>
>>        test=> CREATE TABLE tsvector_test(x tsvector);
>>        CREATE TABLE
>>        test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector
>>        test-> USING conversion_func(x);
>>        ALTER TABLE
>>
>> No need for a fake data type and the required index infrastructure.
>
> I assume you're putting this in the list of commands to run
> post-migration along with any reindex commands etc? Because it will
> take a while (still faster than dump/reload i think).

Just thinking some more about the idea to get all those post-
processing steps running in parallel, it's occurring to me that we
have all we need already: would it be possible for pg_migrator to
issue a schema only script with a catalog, in the custom archive format?
Then we could use pg_restore -j <whatever> post_migrator.script to run
the last migration step.

Of course, people will want the custom script output of pg_migrator to
be optional, I guess.

Regards,
--
dim

Re: pg_migrator and an 8.3-compatible tsvector data type

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> I have discovered a simpler solution using ALTER TABLE and calling a
> conversion function:

>     test=> CREATE TABLE tsvector_test(x tsvector);
>     CREATE TABLE
>     test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector 
>     test-> USING conversion_func(x);
>     ALTER TABLE

> No need for a fake data type and the required index infrastructure.

I think this is basically a large-caliber foot gun.  You're going to
pretend that invalid data is valid, until the user gets around to fixing
it?
        regards, tom lane


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Dimitri Fontaine wrote:
> Hi,
> 
> Le 30 mai 09 ? 16:02, Greg Stark a ?crit :
> > On Sat, May 30, 2009 at 1:11 PM, Bruce Momjian <bruce@momjian.us>  
> > wrote:
> >> I have discovered a simpler solution using ALTER TABLE and calling a
> >> conversion function:
> >>
> >>        test=> CREATE TABLE tsvector_test(x tsvector);
> >>        CREATE TABLE
> >>        test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector
> >>        test-> USING conversion_func(x);
> >>        ALTER TABLE
> >>
> >> No need for a fake data type and the required index infrastructure.
> >
> > I assume you're putting this in the list of commands to run
> > post-migration along with any reindex commands etc? Because it will
> > take a while (still faster than dump/reload i think).

Yes.

> Just thinking some more about the idea to get all those post- 
> processing steps running in parallel, it's occurring to me that we  
> have all we need already: would it be possible for pg_migrator to  
> issue a schema only script with a catalog, in the custom archive format?
> Then we could use pg_restore -j <whatever> post_migrator.script to run  
> the last migration step.
> 
> Of course, people will want the custom script output of pg_migrator to  
> be optional, I guess.

Yes, eventually that is possible.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > I have discovered a simpler solution using ALTER TABLE and calling a
> > conversion function:
> 
> >     test=> CREATE TABLE tsvector_test(x tsvector);
> >     CREATE TABLE
> >     test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector 
> >     test-> USING conversion_func(x);
> >     ALTER TABLE
> 
> > No need for a fake data type and the required index infrastructure.
> 
> I think this is basically a large-caliber foot gun.  You're going to
> pretend that invalid data is valid, until the user gets around to fixing
> it?

What choice do we have?  While we can mark indexes as invalid (which we
do), how do we mark a table's contents as invalid?  Should we create
rules so no one can see the data and then have the ALTER TABLE script
remove the rules after it is rebuilt?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> > Just thinking some more about the idea to get all those post- 
> > processing steps running in parallel, it's occurring to me that we  
> > have all we need already: would it be possible for pg_migrator to  
> > issue a schema only script with a catalog, in the custom archive format?
> > Then we could use pg_restore -j <whatever> post_migrator.script to run  
> > the last migration step.
> > 
> > Of course, people will want the custom script output of pg_migrator to  
> > be optional, I guess.
> 
> Yes, eventually that is possible.

I have added that as a pg_migrator TODO item.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > I have discovered a simpler solution using ALTER TABLE and calling a
> > > conversion function:
> > 
> > >     test=> CREATE TABLE tsvector_test(x tsvector);
> > >     CREATE TABLE
> > >     test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector 
> > >     test-> USING conversion_func(x);
> > >     ALTER TABLE
> > 
> > > No need for a fake data type and the required index infrastructure.
> > 
> > I think this is basically a large-caliber foot gun.  You're going to

Uh, isn't pg_migrator itself a large-caliber foot gun?  No need to call
just part of it a foot gun.  ;-)

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> I think this is basically a large-caliber foot gun.  You're going to
>> pretend that invalid data is valid, until the user gets around to fixing
>> it?

> What choice do we have?

Create a fake data type, just as you said before.
        regards, tom lane


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Greg Stark
Date:
On Sat, May 30, 2009 at 6:23 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> I think this is basically a large-caliber foot gun.  You're going to
>> pretend that invalid data is valid, until the user gets around to fixing
>> it?
>
> What choice do we have?

Well you can store the data in a new fake data type (or even just mark
the column as a bytea -- since any varlena is as good as any other).
Then provide this conversion function to create the new data.

I suppose that means you should drop the indexes since if you leave
them things could get weird. But doing the conversion would have to
rebuild indexes anyways so CREATE INDEX should run in the same time as
that step of the conversion would have taken. It would be nice if you
could leave them around so the conversion would rebuild them
automatically, but that would require creating operators and an
opclass for the fake data type which would be more of a pain than just
marking the column as a bytea or a data type with no operators.



--
greg


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Tom Lane wrote:
> >> I think this is basically a large-caliber foot gun.  You're going to
> >> pretend that invalid data is valid, until the user gets around to fixing
> >> it?
> 
> > What choice do we have?
> 
> Create a fake data type, just as you said before.

But I have to create fake index infastructure, and I have to then modify
every reference to that data time at the end, even references in rules
and expression indexes --- that seems too complex.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Greg Stark wrote:
> On Sat, May 30, 2009 at 1:11 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > I have discovered a simpler solution using ALTER TABLE and calling a
> > conversion function:
> >
> > ? ? ? ?test=> CREATE TABLE tsvector_test(x tsvector);
> > ? ? ? ?CREATE TABLE
> > ? ? ? ?test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector
> > ? ? ? ?test-> USING conversion_func(x);
> > ? ? ? ?ALTER TABLE
> >
> > No need for a fake data type and the required index infrastructure.
> 
> I assume you're putting this in the list of commands to run
> post-migration along with any reindex commands etc? Because it will
> take a while (still faster than dump/reload i think).

Yes, current CVS now does this.  I added logic so if you rebuilt the
table, the index will not be rebuilt again because it is a GIN index.

> For this case, assuming the new tsvector's output function doesn't get
> confused by the old ordering, I think you can just use USING
> x::text::tsvector as your conversion expression. For more complex
> cases you might need to package up the old output function.

Wow, cool idea.  I will test that.

> Also note that you'll want to do any other conversions in the same
> table at the same time rather than doing multiple conversions.

Wow, I didn't know we supported that too:
 ALTER TABLE t ALTER COLUMN x TYPE tsvector USING x,  ALTERCOLUMN y TYPE tsvector USING y;

I will work on that also.

> Also, one gotcha to note is that tsvector data can appear inside
> composite data types or arrays. I don't think that's common so perhaps
> just a warning in the readme would suffice, but it's something to note
> at least.

Wow, not sure how I would even find that.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Greg Stark wrote:
> On Sat, May 30, 2009 at 6:23 PM, Bruce Momjian <bruce@momjian.us> wrote:
> >> I think this is basically a large-caliber foot gun. ?You're going to
> >> pretend that invalid data is valid, until the user gets around to fixing
> >> it?
> >
> > What choice do we have?
> 
> Well you can store the data in a new fake data type (or even just mark
> the column as a bytea -- since any varlena is as good as any other).
> Then provide this conversion function to create the new data.
> 
> I suppose that means you should drop the indexes since if you leave
> them things could get weird. But doing the conversion would have to
> rebuild indexes anyways so CREATE INDEX should run in the same time as
> that step of the conversion would have taken. It would be nice if you
> could leave them around so the conversion would rebuild them
> automatically, but that would require creating operators and an
> opclass for the fake data type which would be more of a pain than just
> marking the column as a bytea or a data type with no operators.

The way the restore works, you would have to call the data type
'tsvector', and then that oid is linked around in all sort of odd places
and you need to change them all, hence confusion.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Greg Stark wrote:
> On Sat, May 30, 2009 at 1:11 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > I have discovered a simpler solution using ALTER TABLE and calling a
> > conversion function:
> >
> > ? ? ? ?test=> CREATE TABLE tsvector_test(x tsvector);
> > ? ? ? ?CREATE TABLE
> > ? ? ? ?test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector
> > ? ? ? ?test-> USING conversion_func(x);
> > ? ? ? ?ALTER TABLE
> >
> > No need for a fake data type and the required index infrastructure.
> 
> I assume you're putting this in the list of commands to run
> post-migration along with any reindex commands etc? Because it will
> take a while (still faster than dump/reload i think).
> 
> For this case, assuming the new tsvector's output function doesn't get
> confused by the old ordering, I think you can just use USING
> x::text::tsvector as your conversion expression. For more complex
> cases you might need to package up the old output function.
> 
> Also note that you'll want to do any other conversions in the same
> table at the same time rather than doing multiple conversions.

I have implemented Greg's ideas using ::text::tsvector casting and it
worked:
$ psql testpsql (8.4beta2)Type "help" for help.test=> select * from tsvector_test;       x---------------- 'c' 'bb'
'aaa'(1row)test=>test=> \q
 
$ psql test < /u/postgres/rebuilt_tsvector_tables.txtALTER TABLEALTER TABLEALTER TABLE$ psql testpsql (8.4beta2)Type
"help"for help.test=> select * from tsvector_test;       x---------------- 'aaa' 'bb' 'c'(1 row)
 

Here is the script that groups multiple tsvector columns into a single
ALTER TABLE:
ALTER TABLE "public"."tsvector_test"  ALTER COLUMN "x" TYPE tsvector USING "x"::text::tsvector;ALTER TABLE
"public"."tsvector_test2" ALTER COLUMN "x" TYPE tsvector USING "x"::text::tsvector,  ALTER COLUMN "y" TYPE tsvector
USING"y"::text::tsvector;ALTER TABLE "public"."tsvector_test3"  ALTER COLUMN "x" TYPE tsvector USING
"x"::text::tsvector, ALTER COLUMN "y" TYPE tsvector USING "y"::text::tsvector,  ALTER COLUMN "z" TYPE tsvector USING
"z"::text::tsvector;

I knew I was going to have to deal with this optimization but didn't
realize ALTER TABLE already supported multiple TYPE changes.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > I have discovered a simpler solution using ALTER TABLE and calling a
> > > conversion function:
> > 
> > >     test=> CREATE TABLE tsvector_test(x tsvector);
> > >     CREATE TABLE
> > >     test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector 
> > >     test-> USING conversion_func(x);
> > >     ALTER TABLE
> > 
> > > No need for a fake data type and the required index infrastructure.
> > 
> > I think this is basically a large-caliber foot gun.  You're going to
> > pretend that invalid data is valid, until the user gets around to fixing
> > it?
> 
> What choice do we have?  While we can mark indexes as invalid (which we
> do), how do we mark a table's contents as invalid?  Should we create
> rules so no one can see the data and then have the ALTER TABLE script
> remove the rules after it is rebuilt?

OK, what ideas do people have to prevent access to tsvector columns?  I
am thinking of renaming the tables or something.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Greg Stark
Date:
On Sun, May 31, 2009 at 3:04 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> > I think this is basically a large-caliber foot gun.  You're going to
>> > pretend that invalid data is valid, until the user gets around to fixing
>> > it?
>>
>> What choice do we have?

I think in this case the caliber is pretty small and this might be
sufficient. It might be nice if we had a check somewhere in the
tsvector data types so people get informative errors if their
tsvectors are old-style rather than random incorrect results, but
that's mostly gilding.

In the general case of data type representation changes I think we
need something like: While we can mark indexes as invalid (which we
>> do), how do we mark a table's contents as invalid?  Should we create
>> rules so no one can see the data and then have the ALTER TABLE script
>> remove the rules after it is rebuilt?
>
> OK, what ideas do people have to prevent access to tsvector columns?  I
> am thinking of renaming the tables or something.

1 Change the catalog so all the tsvector colums are bytea.

2 Include a c function like migrate_tsvector(bytea) which contains a
copy of the old data type's output function and calls the new  data
type's input function on the result.

3 Include an ALTER TABLE command which calls the c function.

The gotchas I can see with this is:

1) It only works for varlenas -- There isn't a universal fixed length
data  type. You would probably have to invent one.

2) I'm not sure what will happen to rules and triggers which call
functions on the old data type. If you restore the schema unchanged
and modify the catalog directly then they will still be there but have
mismatched types. Will users get errors? Will those errors be sensible
errors or nonsensical ones? Will the conversion still go ahead or will
it complain that there are things which depend on the column?

If the problems in (2) prove surmountable then this provides a general
solution for any varlena data type representation change. However it
will still be a O(n) conversion plus an index rebuild. That's
unfortunate but unless we plan to ship the full set of operators,
opclasses, opfamilies, cross-data-type operators, etc for the old data
type I see no way around it.

I haven't heard anyone suggest we should roll back the tsvector
changes and give up the features the changes  provide -- and that's
just a performance feature. If that's all it took to convince us to
give up in-place-upgrade for this data type then imagine how easy it
will be to justify for actual functional features.

(Personally I think we're fooling ourselves to think Postgres is
mature enough that we won't come up with any new improvements which
will justify a data format change. I would rather hope we'll keep
coming up with massive improvements which require major changes in
every release.)

--
greg


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Bruce Momjian wrote:
> Bruce Momjian wrote:
> > Tom Lane wrote:
> > > Bruce Momjian <bruce@momjian.us> writes:
> > > > I have discovered a simpler solution using ALTER TABLE and calling a
> > > > conversion function:
> > > 
> > > >     test=> CREATE TABLE tsvector_test(x tsvector);
> > > >     CREATE TABLE
> > > >     test=> ALTER TABLE tsvector_test ALTER COLUMN x TYPE tsvector 
> > > >     test-> USING conversion_func(x);
> > > >     ALTER TABLE
> > > 
> > > > No need for a fake data type and the required index infrastructure.
> > > 
> > > I think this is basically a large-caliber foot gun.  You're going to
> > > pretend that invalid data is valid, until the user gets around to fixing
> > > it?
> > 
> > What choice do we have?  While we can mark indexes as invalid (which we
> > do), how do we mark a table's contents as invalid?  Should we create
> > rules so no one can see the data and then have the ALTER TABLE script
> > remove the rules after it is rebuilt?
> 
> OK, what ideas do people have to prevent access to tsvector columns?  I
> am thinking of renaming the tables or something.

Another idea would be to rename just the _column_ (not the table), but
that doesn't block SELECT * from showing the column.  Another idea would
be to replace tsvector's out function, but that would affect all tables
meaning it would be hard to incrementally rebuild tables using tsvector
columns.  Another idea would be to use column-level permissions to
prevent column access, but that doesn't affect super-users.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Robert Haas
Date:
On Sat, May 30, 2009 at 11:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
> The way the restore works, you would have to call the data type
> 'tsvector', and then that oid is linked around in all sort of odd places
> and you need to change them all, hence confusion.

In theory, I think it would be possible to hack around this problem by
doing the following:

1. Change the OID of 8.4's "tsvector" type to an OID that was not in use in 8.3.
2. Add a type called "tsvector83" to 8.4 using the same OID that the
"tsvector" type used in 8.3.
3. Hack pg_dump so that when dumping in upgrade mode, tsvector
references from 8.3 are emitted as tsvector83.

This would mean that you'd be on line immediately following the
upgrade, because the tsvector83 type would still be usable; it just
wouldn't have the improved performance of the new tsvector.
Unfortunately, it would also require a catversion bump (and some
programming work besides, not sure how much).  Given that we are 7
months after the start of the last commitfest, which was theoretically
our final feature freeze, this is probably not realistic, but if it's
not too ugly it might be considered if a similar situation comes up
during 8.5 development.

(At the risk of beating a dead horse, note if we were upgrading the
catalog tables directly via SQL, this type of scenario could be
handled cleanly without hacking pg_dump; I repeat my earlier critique
that the pg_migrator approach consigns us to a never-ending series of
pg_dump hacks, that or having it not work very well.)

...Robert


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Robert Haas wrote:
> On Sat, May 30, 2009 at 11:58 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > The way the restore works, you would have to call the data type
> > 'tsvector', and then that oid is linked around in all sort of odd places
> > and you need to change them all, hence confusion.
> 
> In theory, I think it would be possible to hack around this problem by
> doing the following:
> 
> 1. Change the OID of 8.4's "tsvector" type to an OID that was not in use in 8.3.
> 2. Add a type called "tsvector83" to 8.4 using the same OID that the
> "tsvector" type used in 8.3.
> 3. Hack pg_dump so that when dumping in upgrade mode, tsvector
> references from 8.3 are emitted as tsvector83.

The pg_type oids are never transfered by pg_migrator; all schema
information is transfered as pg_dump --schema stuff, i.e. ASCII.

> This would mean that you'd be on line immediately following the
> upgrade, because the tsvector83 type would still be usable; it just
> wouldn't have the improved performance of the new tsvector.
> Unfortunately, it would also require a catversion bump (and some
> programming work besides, not sure how much).  Given that we are 7
> months after the start of the last commitfest, which was theoretically
> our final feature freeze, this is probably not realistic, but if it's
> not too ugly it might be considered if a similar situation comes up
> during 8.5 development.
> 
> (At the risk of beating a dead horse, note if we were upgrading the
> catalog tables directly via SQL, this type of scenario could be
> handled cleanly without hacking pg_dump; I repeat my earlier critique
> that the pg_migrator approach consigns us to a never-ending series of
> pg_dump hacks, that or having it not work very well.)

We could create infrastructure to handle the 8.3 storage format but my
guess is that most people would just rather rebuild the tables.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> (At the risk of beating a dead horse, note if we were upgrading the
> catalog tables directly via SQL, this type of scenario could be
> handled cleanly without hacking pg_dump; I repeat my earlier critique
> that the pg_migrator approach consigns us to a never-ending series of
> pg_dump hacks, that or having it not work very well.)

"Updating the catalog tables directly via SQL"?  Good luck making that
work.  If you ever get it to work at all, it'll require a pile of hacks
that will make pg_migrator look like paradise.

(Which is not to say that pg_migrator isn't a hack; it surely is.
But it looks like the least painful approach available.)
        regards, tom lane


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Robert Haas
Date:
On Sun, May 31, 2009 at 11:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> (At the risk of beating a dead horse, note if we were upgrading the
>> catalog tables directly via SQL, this type of scenario could be
>> handled cleanly without hacking pg_dump; I repeat my earlier critique
>> that the pg_migrator approach consigns us to a never-ending series of
>> pg_dump hacks, that or having it not work very well.)
>
> "Updating the catalog tables directly via SQL"?  Good luck making that
> work.  If you ever get it to work at all, it'll require a pile of hacks
> that will make pg_migrator look like paradise.

For clarity, I really mean "from a standalone backend", but ideally
I'd like it to be SQL.

> (Which is not to say that pg_migrator isn't a hack; it surely is.
> But it looks like the least painful approach available.)

Maybe.  It seems that we don't have a good way of handling datatype
conversions.  The approaches that have been proposed for tsvector
wouldn't work at all but for the fact that the new output function can
handle the old internal representation, which is not something that we
can guarantee in every case.  And, even so, they leave the database in
a broken state until the post-migration scripts have been run.  The
good news is that tsvector is not a datatype that everyone uses, and
those who do probably don't use it in every table, but what happens
when we want to change numeric incompatibly?

We really need to figure out an approach that lets us keep the old
datatypes around under a different name while making the original name
be the new version of the datatype.  That way people can migrate and
be up, and deal with the need to rewrite their tables at a later time.Having the database come up quickly but in a
statewhere queries may 
return wrong answers is a fairly poor excuse for in-place upgrade.
This is not to say that I don't appreciate the work Bruce is putting
into it, but I fear that without some more engineering it's just going
to be a hack used only by the desperate.

...Robert


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Greg Stark
Date:
On Mon, Jun 1, 2009 at 4:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, May 31, 2009 at 11:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Updating the catalog tables directly via SQL"?  Good luck making that
>> work.  If you ever get it to work at all, it'll require a pile of hacks
>> that will make pg_migrator look like paradise.
>
> For clarity, I really mean "from a standalone backend", but ideally
> I'd like it to be SQL.

Keep in mind that you have catalogs in all the databases, and even in
standalone mode you need those catalogs to find the, er, catalogs.
There's a reason bootstrap mode is so limited and even then it needs
some of the catalogs to already be in place.

>> (Which is not to say that pg_migrator isn't a hack; it surely is.
>> But it looks like the least painful approach available.)
>
> Maybe.  It seems that we don't have a good way of handling datatype
> conversions.  The approaches that have been proposed for tsvector
> wouldn't work at all but for the fact that the new output function can
> handle the old internal representation, which is not something that we
> can guarantee in every case.

Well I think all we need for that is to have pg_migrator provide the
old output function wrapped up in a migrate_foo() C function.


> And, even so, they leave the database in
> a broken state until the post-migration scripts have been run.  The
> good news is that tsvector is not a datatype that everyone uses, and
> those who do probably don't use it in every table, but what happens
> when we want to change numeric incompatibly?

Or, say, timestamp...

> We really need to figure out an approach that lets us keep the old
> datatypes around under a different name while making the original name
> be the new version of the datatype.  That way people can migrate and
> be up, and deal with the need to rewrite their tables at a later time.

I do agree that having to rewrite the whole table isn't really
"upgrade-in-place".

But the work to support multiple versions of data types is more than
you're describing. You need to be concerned about things like joins
between tables when some columns are the old data type and some the
new, etc.

Ultimately people will have to convert the data types sometime.

--
greg


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Tom Lane
Date:
Greg Stark <stark@enterprisedb.com> writes:
> On Mon, Jun 1, 2009 at 4:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> We really need to figure out an approach that lets us keep the old
>> datatypes around under a different name while making the original name
>> be the new version of the datatype. �That way people can migrate and
>> be up, and deal with the need to rewrite their tables at a later time.

> I do agree that having to rewrite the whole table isn't really
> "upgrade-in-place".

It's certainly the case that there is a lot more work to do before
pg_migrator could support everything that we reasonably want to be
able to do in a version update.  As I see it, the reason it's getting
revived now is that 8.3->8.4 happens to be an update where most of what
it can't (yet) do isn't necessary.  That means we can get it out there,
get the bugs out of the functionality it does have, and most importantly
try to set an expectation that future updates will also have some degree
of update-in-place capability.  If we wait till it's perfect then
nothing will ever happen at all in this space.
        regards, tom lane


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Robert Haas
Date:
On Mon, Jun 1, 2009 at 11:10 AM, Greg Stark <stark@enterprisedb.com> wrote:
> On Mon, Jun 1, 2009 at 4:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, May 31, 2009 at 11:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> "Updating the catalog tables directly via SQL"?  Good luck making that
>>> work.  If you ever get it to work at all, it'll require a pile of hacks
>>> that will make pg_migrator look like paradise.
>>
>> For clarity, I really mean "from a standalone backend", but ideally
>> I'd like it to be SQL.
>
> Keep in mind that you have catalogs in all the databases, and even in
> standalone mode you need those catalogs to find the, er, catalogs.
> There's a reason bootstrap mode is so limited and even then it needs
> some of the catalogs to already be in place.

Yep.  Changes to the schema of the bootstrap tables are definitely the
toughest nut to crack.  I haven't been convinced that it's impossible,
but given my relative level of knowledge of the code compared to Tom,
that could well be an indication that I'm overly optimistic.

>>> (Which is not to say that pg_migrator isn't a hack; it surely is.
>>> But it looks like the least painful approach available.)
>>
>> Maybe.  It seems that we don't have a good way of handling datatype
>> conversions.  The approaches that have been proposed for tsvector
>> wouldn't work at all but for the fact that the new output function can
>> handle the old internal representation, which is not something that we
>> can guarantee in every case.
>
> Well I think all we need for that is to have pg_migrator provide the
> old output function wrapped up in a migrate_foo() C function

Well that would be better, but it still leaves the database temporarily broken.

>> And, even so, they leave the database in
>> a broken state until the post-migration scripts have been run.  The
>> good news is that tsvector is not a datatype that everyone uses, and
>> those who do probably don't use it in every table, but what happens
>> when we want to change numeric incompatibly?
>
> Or, say, timestamp...

Yeah.

>> We really need to figure out an approach that lets us keep the old
>> datatypes around under a different name while making the original name
>> be the new version of the datatype.  That way people can migrate and
>> be up, and deal with the need to rewrite their tables at a later time.
>
> I do agree that having to rewrite the whole table isn't really
> "upgrade-in-place".
>
> But the work to support multiple versions of data types is more than
> you're describing. You need to be concerned about things like joins
> between tables when some columns are the old data type and some the
> new, etc.

True.

> Ultimately people will have to convert the data types sometime.

Yes they will, but not having to do it as part of the upgrade is
important.  What particularly bothers me is the possibility that the
database comes on line and starts letting clients in (who don't know
that it's broken) while the breakage is still present.

...Robert


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Robert Haas
Date:
On Mon, Jun 1, 2009 at 11:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@enterprisedb.com> writes:
>> On Mon, Jun 1, 2009 at 4:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> We really need to figure out an approach that lets us keep the old
>>> datatypes around under a different name while making the original name
>>> be the new version of the datatype.  That way people can migrate and
>>> be up, and deal with the need to rewrite their tables at a later time.
>
>> I do agree that having to rewrite the whole table isn't really
>> "upgrade-in-place".
>
> It's certainly the case that there is a lot more work to do before
> pg_migrator could support everything that we reasonably want to be
> able to do in a version update.  As I see it, the reason it's getting
> revived now is that 8.3->8.4 happens to be an update where most of what
> it can't (yet) do isn't necessary.  That means we can get it out there,
> get the bugs out of the functionality it does have, and most importantly
> try to set an expectation that future updates will also have some degree
> of update-in-place capability.  If we wait till it's perfect then
> nothing will ever happen at all in this space.

I agree.  I remain doubtful that dumping and reloading the schema is
the best way to go, but it's certainly a worthwhile experiment,
because (a) I might easily be wrong and (b) we'll hopefully learn some
things that will be useful going forward.

...Robert


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Robert Haas wrote:
> > It's certainly the case that there is a lot more work to do before
> > pg_migrator could support everything that we reasonably want to be
> > able to do in a version update. ?As I see it, the reason it's getting
> > revived now is that 8.3->8.4 happens to be an update where most of what
> > it can't (yet) do isn't necessary. ?That means we can get it out there,
> > get the bugs out of the functionality it does have, and most importantly
> > try to set an expectation that future updates will also have some degree
> > of update-in-place capability. ?If we wait till it's perfect then
> > nothing will ever happen at all in this space.
> 
> I agree.  I remain doubtful that dumping and reloading the schema is
> the best way to go, but it's certainly a worthwhile experiment,
> because (a) I might easily be wrong and (b) we'll hopefully learn some
> things that will be useful going forward.

The amount of work to do an upgrade without dump/reloading the schema is
overly complex and would have to be adjusted for every major release,
with the high likelihood it would not be done or fail in random ways.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: pg_migrator and an 8.3-compatible tsvector data type

From
Bruce Momjian
Date:
Greg Stark wrote:
> On Mon, Jun 1, 2009 at 4:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Sun, May 31, 2009 at 11:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> "Updating the catalog tables directly via SQL"? ?Good luck making that
> >> work. ?If you ever get it to work at all, it'll require a pile of hacks
> >> that will make pg_migrator look like paradise.
> >
> > For clarity, I really mean "from a standalone backend", but ideally
> > I'd like it to be SQL.
> 
> Keep in mind that you have catalogs in all the databases, and even in
> standalone mode you need those catalogs to find the, er, catalogs.
> There's a reason bootstrap mode is so limited and even then it needs
> some of the catalogs to already be in place.
> 
> >> (Which is not to say that pg_migrator isn't a hack; it surely is.
> >> But it looks like the least painful approach available.)
> >
> > Maybe. ?It seems that we don't have a good way of handling datatype
> > conversions. ?The approaches that have been proposed for tsvector
> > wouldn't work at all but for the fact that the new output function can
> > handle the old internal representation, which is not something that we
> > can guarantee in every case.
> 
> Well I think all we need for that is to have pg_migrator provide the
> old output function wrapped up in a migrate_foo() C function.

Yep, it is just a matter of using a custom function call rather than a
cast.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +