Thread: Best practices for moving UTF8 databases

Best practices for moving UTF8 databases

From
Phoenix Kiula
Date:
Hi. I *always* get an error moving my current fully utf-8 database
data into a new DB.

My server runs version 8.3 with a five-year-old DB. Everything,
collation, LC_LOCALE etc., is utf8.

When I install a new PostgreSQL 8.4 on my home Mac OS X machine (after
losing some hair) I set everything about the new database to be utf8, at
least anything I could see in pgAdmin.

But when I pull in the data dump from the server I always see that
error about the utf8 mismatch and such.

So, my question: what is a good way to make sure that error does NOT
occur? I simply wish to replicate the server database on another PG
installation. What should one do?

Re: Best practices for moving UTF8 databases

From
"Albe Laurenz"
Date:
Phoenix Kiula wrote:
> Hi. I *always* get an error moving my current fully utf-8 database
> data into a new DB.
>
> My server runs version 8.3 with a five-year-old DB. Everything,
> collation, LC_LOCALE etc., is utf8.
>
> When I install a new PostgreSQL 8.4 on my home Mac OS X machine (after
> losing some hair) I set everything about the new database to be utf8, at
> least anything I could see in pgAdmin.
>
> But when I pull in the data dump from the server I always see that
> error about the utf8 mismatch and such.
>
> So, my question: what is a good way to make sure that error does NOT
> occur? I simply wish to replicate the server database on another PG
> installation. What should one do?

Improve PostgreSQL so that there are no more loopholes through which
you can enter incorrect data :^(

8.3 is about one and a half years old, so your database must have
originated with an older release. Correctness checks have improved
considerably over the years.

What you will have to do is track down the rows containing incorrect data
and fix them.
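
One way to see exactly which rows trip the check before fixing anything
(a sketch; the database names are illustrative, and the COPY error on
restore reports the offending table and input line):

  pg_dump -Fp -E UTF8 mydb > db.sql
  psql -v ON_ERROR_STOP=1 -f db.sql newdb
  # stops at the first bad row with something like:
  #   ERROR:  invalid byte sequence for encoding "UTF8": 0x80
  #   CONTEXT:  COPY sometable, line 1234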

Yours,
Laurenz Albe

Request for features

From
Michael Gould
Date:
I would like to know what the formal method of requesting new features are
for Postgres and how are decisions made which features are included in a
future release or not.

Best Regards
--
Michael Gould, Managing Partner
Intermodal Software Solutions, LLC
904.226.0978
904.592.5250 fax



Re: Request for features

From
Bruce Momjian
Date:
Michael Gould wrote:
> I would like to know what the formal method of requesting new features are
> for Postgres and how are decisions made which features are included in a
> future release or not.

Uh, you usually ask for the feature on the bugs or hackers email lists,
and if we consider it useful it will be added to the TODO list.  Then,
someday, someone might implement your feature.  Not very rigorous, but
that is the only method we have.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Best practices for moving UTF8 databases

From
Andres Freund
Date:
On Sunday 12 July 2009 13:19:50 Phoenix Kiula wrote:
> Hi. I *always* get an error moving my current fully utf-8 database
> data into a new DB.
>
> My server runs version 8.3 with a five-year-old DB. Everything,
> collation, LC_LOCALE etc., is utf8.
>
> When I install a new PostgreSQL 8.4 on my home Mac OS X machine (after
> losing some hair) I set everything about the new database to be utf8, at
> least anything I could see in pgAdmin.
>
> But when I pull in the data dump from the server I always see that
> error about the utf8 mismatch and such.
>
> So, my question: what is a good way to make sure that error does NOT
> occur? I simply wish to replicate the server database on another PG
> installation. What should one do?
What is the _exact_ error you get? During which statement?

Andres

Re: Best practices for moving UTF8 databases

From
Jasen Betts
Date:
On 2009-07-13, Andres Freund <andres@anarazel.de> wrote:
> On Sunday 12 July 2009 13:19:50 Phoenix Kiula wrote:
>> Hi. I *always* get an error moving my current fully utf-8 database
>> data into a new DB.
>>
>> My server runs version 8.3 with a five-year-old DB. Everything,
>> collation, LC_LOCALE etc., is utf8.
>>
>> When I install a new PostgreSQL 8.4 on my home Mac OS X machine (after
>> losing some hair) I set everything about the new database to be utf8, at
>> least anything I could see in pgAdmin.
>>
>> But when I pull in the data dump from the server I always see that
>> error about the utf8 mismatch and such.
>>
>> So, my question: what is a good way to make sure that error does NOT
>> occur? I simply wish to replicate the server database on another PG
>> installation. What should one do?
> What is the _exact_ error you get? During which statement?
>
> Andres
>

If you do an ascii dump and the dump starts out "SET CLIENT ENCODING 'UTF8'"
or similar but you still get errors:

run it through "iconv -f UTF8 -t UTF8//IGNORE";
that'll drop any illegal byte sequences. In theory that's a reduction
in data integrity.

iconv doesn't seem to have an option to replace them with U+FFFD :(
I had hoped that //TRANSLIT would do that, but no.
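
Spelled out as a pipeline (a sketch; database names are illustrative,
and //IGNORE makes the transfer lossy):

  pg_dump -Fp olddb | iconv -f UTF8 -t UTF8//IGNORE | psql newdb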


Re: Best practices for moving UTF8 databases

From
Andres Freund
Date:
On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:
> On 2009-07-13, Andres Freund <andres@anarazel.de> wrote:
> > On Sunday 12 July 2009 13:19:50 Phoenix Kiula wrote:
> >> Hi. I *always* get an error moving my current fully utf-8 database
> >> data into a new DB.
> >>
> >> My server runs version 8.3 with a five-year-old DB. Everything,
> >> collation, LC_LOCALE etc., is utf8.
> >>
> >> When I install a new PostgreSQL 8.4 on my home Mac OS X machine (after
> >> losing some hair) I set everything about the new database to be utf8, at
> >> least anything I could see in pgAdmin.
> >>
> >> But when I pull in the data dump from the server I always see that
> >> error about the utf8 mismatch and such.
> >>
> >> So, my question: what is a good way to make sure that error does NOT
> >> occur? I simply wish to replicate the server database on another PG
> >> installation. What should one do?
> >
> > What is the _exact_ error you get? During which statement?
> if you do an ascii dump and the dump starts out "SET CLIENT ENCODING
> 'UTF8'" or similar but you still get errors.
Do you mean that a dump from SQL_ASCII can yield non-utf8 data? Right. But
according to the OP his 8.3 database is UTF8...
So there should not be invalid data in there.

Andres

Re: Request for features

From
Jasen Betts
Date:
On 2009-07-13, Michael Gould <mgould@intermodalsoftwaresolutions.net> wrote:
> I would like to know what the formal method of requesting new features are
> for Postgres and how are decisions made which features are included in a
> future release or not.

Formal?

For open source software in general, changes are requested using
the bug reporting system.

Change requests accompanied by working source patches have an advantage.

Grants of cash to the developers tend to help too.

Re: Best practices for moving UTF8 databases

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:

> > if you do an ascii dump and the dump starts out "SET CLIENT ENCODING
> > 'UTF8'" or similar but you still get errors.
> Do you mean that a dump from SQL_ASCII can yield non-utf8 data? Right. But
> according to the OP his 8.3 database is UTF8...
> So there should not be invalid data in there.

I haven't followed this thread, but older PG versions had less strict
checks on UTF8 data, which meant that some invalid data could creep in.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Best practices for moving UTF8 databases

From
Andres Freund
Date:
On Tuesday 14 July 2009 15:52:29 Alvaro Herrera wrote:
> Andres Freund wrote:
> > On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:
> > > if you do an ascii dump and the dump starts out "SET CLIENT ENCODING
> > > 'UTF8'" or similar but you still get errors.
> >
> > Do you mean that a dump from SQL_ASCII can yield non-utf8 data? Right.
> > But according to the OP his 8.3 database is UTF8...
> > So there should not be invalid data in there.
> I haven't followed this thread, but older PG versions had less strict
> checks on UTF8 data, which meant that some invalid data could creep in.
But that was from 8.2 -> 8.3 and not 8.3 -> 8.4, I think?

Andres

Re: Request for features

From
David Fetter
Date:
On Tue, Jul 14, 2009 at 10:00:04AM +0000, Jasen Betts wrote:
> On 2009-07-13, Michael Gould <mgould@intermodalsoftwaresolutions.net> wrote:
> > I would like to know what the formal method of requesting new features are
> > for Postgres and how are decisions made which features are included in a
> > future release or not.
>
> Formal?
>
> For open source software in general, changes are requested using
> the bug reporting system.

Not in PostgreSQL, generally.

> Change requests accompanied by working source patches have an
> advantage.

Not always.  It's much better to discuss the design of the feature on
-hackers, come to some rough consensus, *then* write some code than to
come up with some huge wonk of code which no one is interested in
understanding and which will languish.

> Grants of cash to the developers tend to help too.

There are several outfits that can help you organize your donations.
Software in the Public Interest, a non-profit company in the US, has
a way to earmark donations for the PostgreSQL project.

Other ways to contribute resources include hardware and/or people
available to the project.

Being a pleasant, helpful part of the community will incline everyone
else favorably toward setting your ideas at a higher priority, even if
some of them don't fly :)

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Best practices for moving UTF8 databases

From
Phoenix Kiula
Date:
On Tue, Jul 14, 2009 at 9:52 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Andres Freund wrote:
>> On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:
>
>> > if you do an ascii dump and the dump starts out "SET CLIENT ENCODING
>> > 'UTF8'" or similar but you still get errors.
>> Do you mean that a dump from SQL_ASCII can yield non-utf8 data? Right. But
>> according to the OP his 8.3 database is UTF8...
>> So there should not be invalid data in there.
>
> I haven't followed this thread, but older PG versions had less strict
> checks on UTF8 data, which meant that some invalid data could creep in.



If so, how can I check for them in my old database, which is 8.2.9?
I'm now moving first to 8.3 (then to 8.4).

Really, PG absolutely needs a way to upgrade the database without so
much data-related downtime and all these silly woes. Several competing
database systems are a cinch to upgrade.

Anyway, this is the annoying error I always see:

  ERROR:  invalid byte sequence for encoding "UTF8": 0x80

I think my old DB is all utf8. If there are a few characters that are
not, how can I work with this? I've done everything I can to take care
of the encoding and such. This command was used to run initdb:

 initdb --locale=en_US.UTF-8 --encoding=UTF8

Locale environment variables are all "en_US.UTF-8" too.
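
For reference, what each cluster actually reports can be checked with
standard commands (run in psql on both sides):

  SHOW server_encoding;
  SELECT datname, pg_encoding_to_char(encoding) FROM pg_database;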

Thanks for any pointers!

Re: Best practices for moving UTF8 databases

From
Martijn van Oosterhout
Date:
On Sun, Jul 19, 2009 at 10:16:17AM +0800, Phoenix Kiula wrote:
> If so, how can I check for them in my old database, which is 8.2.9?
> I'm now moving first to 8.3 (then to 8.4).
>
> Really, PG absolutely needs a way to upgrade the database without so
> much data-related downtime and all these silly woes. Several competing
> database systems are a cinch to upgrade.

Look through the archives, there are scripts that will scan all your
text fields for UTF-8 problems. If you run them once you can clear out
all the problems prior to upgrading.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


Re: Best practices for moving UTF8 databases

From
Phoenix Kiula
Date:
On Sun, Jul 19, 2009 at 7:08 PM, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Sun, Jul 19, 2009 at 10:16:17AM +0800, Phoenix Kiula wrote:

> Look through the archives, there are scripts that will scan all your
> text fields for UTF-8 problems. If you run them once you can clear out
> all the problems prior to upgrading.


Thanks Martijn. I tried searching through the archives and could only
come up with something like this:

http://docs.moodle.org/en/UTF-8_PostgreSQL

But this only has the usual iconv stuff suggested.

Could you please suggest some specific scripts or URLs? Or let me know
the keywords I should look for?

Thanks!

Re: Best practices for moving UTF8 databases

From
"Albe Laurenz"
Date:
Phoenix Kiula wrote:
> Really, PG absolutely needs a way to upgrade the database without so
> much data-related downtime and all these silly woes. Several competing
> database systems are a cinch to upgrade.

I'd call it data corruption, not a silly woe.

I know that Oracle, for example, would not make that much fuss about
your data: it would be imported without even a warning, and
depending on your encoding settings the bad bytes would either be
imported as-is or tacitly changed to inverted (or normal) question
marks.

It's basically a design choice that PostgreSQL made: we think that
an error is preferable to clandestinely modifying the user's data
or accepting input that cannot possibly make any sense when it is
retrieved at a future time.

> Anyway, this is the annoying error I always see:
>
>   ERROR:  invalid byte sequence for encoding "UTF8": 0x80
>
> I think my old DB is all utf8. If there are a few characters that are
> not, how can I work with this? I've done everything I can to take care
> of the encoding and such. This command was used to run initdb:
>
>  initdb --locale=en_US.UTF-8 --encoding=UTF8
>
> Locale environment variables are all "en_US.UTF-8" too.

"0x80" makes me think of the following:
The data originate from a Windows system, where 0x80 is a Euro
sign. Somehow these were imported into PostgreSQL without the
appropriate translation into UTF-8 (how I do not know).

I wonder: why do you spend so much time complaining instead of
simply locating the buggy data and fixing them?

This does not incur any downtime (you can fix the data in the old
database before migrating), and it will definitely enhance the fun
your users have with your database (if they actually see Euros where
they should be).

Yours,
Laurenz Albe

Re: Best practices for moving UTF8 databases

From
Martijn van Oosterhout
Date:
On Mon, Jul 20, 2009 at 10:32:15AM +0800, Phoenix Kiula wrote:
> Thanks Martin. I tried searching through the archives and could only
> come with something like this:
>
> http://docs.moodle.org/en/UTF-8_PostgreSQL
>
> But this only has the usual iconv stuff suggested.
>
> Could you pls suggest some specific scripts or URLs? Or let me know
> the keywords I should look for?

I was thinking of this one:

http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg68541.html

You can use it to find the bogus strings in your existing database and
fix them. It's not really fast, but maybe it'll help.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


Re: Best practices for moving UTF8 databases

From
Phoenix Kiula
Date:
> "0x80" makes me think of the following:
> The data originate from a Windows system, where 0x80 is a Euro
> sign. Somehow these were imported into PostgreSQL without the
> appropriate translation into UTF-8 (how I do not know).
>
> I wonder: why do you spend so much time complaining instead of
> simply locating the buggy data and fixing them?


I'd love to fix them. But if I do a search for

SELECT * FROM xyz WHERE col like '%0x80%'

it doesn't work. How should I search for these characters?

Thanks much

Re: Best practices for moving UTF8 databases

From
"Daniel Verite"
Date:
    Phoenix Kiula wrote:


> I'd love to fix them. But if I do a search for
>
> SELECT * FROM xyz WHERE col like '%0x80%'
>
> it doesn't work. How should I search for these characters?

In 8.2, try: WHERE strpos(col, E'\x80') > 0

Note that this may find valid data as well, because the error you get
is when 0x80 is the first byte of a character in UTF8; when it's at
another position, you don't want to change it.

Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage:
http://www.manitou-mail.org

Re: Best practices for moving UTF8 databases

From
Sam Mason
Date:
On Tue, Jul 21, 2009 at 09:37:04AM +0200, Daniel Verite wrote:
> >I'd love to fix them. But if I do a search for
> >SELECT * FROM xyz WHERE col like '%0x80%'
> >
> >it doesn't work. How should I search for these characters?
>
> In 8.2, try: WHERE strpos(col, E'\x80') > 0
>
> Note that this may find valid data as well, because the error you get
> is when 0x80 is the first byte of a character in UTF8; when it's at
> another position, you don't want to change it.

There are various regexes around to check for valid UTF-8 encoding; one
appears to be:

  http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex

One translation into PG would be:

  WHERE NOT col ~ ( '^('||
    $$[\x09\x0A\x0D\x20-\x7E]|$$||            -- ASCII
    $$[\xC2-\xDF][\x80-\xBF]|$$||             -- non-overlong 2-byte
     $$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
    $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$||  -- straight 3-byte
     $$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
     $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
    $$[\xF1-\xF3][\x80-\xBF]{3}|$$||          -- planes 4-15
     $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
   '*)$' );

This seems to do the right thing for me in an SQL_ASCII database.

--
  Sam  http://samason.me.uk/

Re: Best practices for moving UTF8 databases

From
"Albe Laurenz"
Date:
Phoenix Kiula wrote:
> > I wonder: why do you spend so much time complaining instead of
> > simply locating the buggy data and fixing them?
>
>
> I'd love to fix them. But if I do a search for
>
> SELECT * FROM xyz WHERE col like '%0x80%'
>
> it doesn't work. How should I search for these characters?

I would get GNU libiconv:
http://www.gnu.org/software/libiconv/

Then take a plain text pg_dump of the database in UTF-8.
Let's call it "db.orig.dmp".

Strip all problem characters:
iconv -f UTF-8 -t UTF-8 -c <db.orig.dmp >db.stripped.sql

Compare both files:
diff -u db.orig.dmp db.stripped.sql

That output will show all lines containing a problem character.

Now the tedious part:
Use "db.orig.dmp" to find out which tables they belong to,
locate the records in the database by primary key and fix them.

Alternatively, you can use iconv's auto-repair if you know which
bytes give you a problem.
For example, if you know that the trouble stems only from 0x80 bytes
that should be Euro symbols, you could:
iconv -f UTF-8 -t UTF-8 --byte-subst="<0x%x>" <db.orig.dmp | sed -e 's/<0x80>/EUR/g' >db.fixed.sql

The resulting "db.fixed.sql" could then be loaded into the new database.

Yours,
Laurenz Albe

Re: Best practices for moving UTF8 databases

From
Phoenix Kiula
Date:
On Tue, Jul 21, 2009 at 6:35 PM, Sam Mason <sam@samason.me.uk> wrote:
> On Tue, Jul 21, 2009 at 09:37:04AM +0200, Daniel Verite wrote:
>> >I'd love to fix them. But if I do a search for
>> >SELECT * FROM xyz WHERE col like '%0x80%'
>> >
>> >it doesn't work. How should I search for these characters?
>>
>> In 8.2, try: WHERE strpos(col, E'\x80') > 0
>>
>> Note that this may find valid data as well, because the error you get
>> is when 0x80 is the first byte of a character in UTF8; when it's at
>> another position, you don't want to change it.
>
> There are various regexes around to check for valid UTF-8 encoding; one
> appears to be:
>
>  http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex
>
> One translation into PG would be:
>
>  WHERE NOT col ~ ( '^('||
>    $$[\x09\x0A\x0D\x20-\x7E]|$$||           -- ASCII
>    $$[\xC2-\xDF][\x80-\xBF]|$$||             -- non-overlong 2-byte
>     $$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
>    $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$||  -- straight 3-byte
>     $$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
>     $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
>    $$[\xF1-\xF3][\x80-\xBF]{3}|$$||          -- planes 4-15
>     $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
>   '*)$' );
>
> This seems to do the right thing for me in an SQL_ASCII database.
>




I tried this and get an error.


mypg=# select * from interesting WHERE NOT description ~ ( '^('||
mypg(#    $$[\x09\x0A\x0D\x20-\x7E]|$$||            -- ASCII
mypg(#    $$[\xC2-\xDF][\x80-\xBF]|$$||             -- non-overlong 2-byte
mypg(#     $$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
mypg(#    $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$||  -- straight 3-byte
mypg(#     $$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
mypg(#     $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
mypg(#    $$[\xF1-\xF3][\x80-\xBF]{3}|$$||          -- planes 4-15
mypg(#     $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
mypg(#   '*)$' )
mypg-#
mypg-#   ;
ERROR:  invalid regular expression: quantifier operand invalid

Re: Best practices for moving UTF8 databases

From
Justin Pasher
Date:
Phoenix Kiula wrote:
> I tried this and get an error.
>
>
> mypg=# select * from interesting WHERE NOT description ~ ( '^('||
> mypg(#    $$[\x09\x0A\x0D\x20-\x7E]|$$||            -- ASCII
> mypg(#    $$[\xC2-\xDF][\x80-\xBF]|$$||             -- non-overlong 2-byte
> mypg(#     $$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
> mypg(#    $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$||  -- straight 3-byte
> mypg(#     $$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
> mypg(#     $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
> mypg(#    $$[\xF1-\xF3][\x80-\xBF]{3}|$$||          -- planes 4-15
> mypg(#     $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
> mypg(#   '*)$' )
> mypg-#
> mypg-#   ;
> ERROR:  invalid regular expression: quantifier operand invalid
>

If you really don't want to go the "pg_dump -> iconv (remove invalid
characters) -> diff the dump files" route, a stored procedure that
attempts to find the invalid characters was posted a few years back.

http://archives.postgresql.org/pgsql-hackers/2005-12/msg00511.php

http://svana.org/kleptog/pgsql/utf8_verify.sql
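
For flavor, the core trick such scripts use can be sketched as a small
helper (hypothetical code, not the linked script; it relies on
convert_from() raising an error for bytes that are not valid UTF-8, so
it needs 8.3+ and is meant for a staging copy loaded as SQL_ASCII):

CREATE OR REPLACE FUNCTION is_valid_utf8(t text) RETURNS boolean AS $f$
BEGIN
    -- convert_from() validates its input against the given encoding
    -- and raises "invalid byte sequence" if the bytes are not UTF-8
    PERFORM convert_from(convert_to(t, 'SQL_ASCII'), 'UTF8');
    RETURN true;
EXCEPTION WHEN OTHERS THEN
    RETURN false;
END;
$f$ LANGUAGE plpgsql IMMUTABLE;

-- e.g. (names from the thread):
-- SELECT * FROM interesting WHERE NOT is_valid_utf8(description);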

--
Justin Pasher

Re: Best practices for moving UTF8 databases

From
Sam Mason
Date:
On Wed, Jul 22, 2009 at 05:26:37PM +0800, Phoenix Kiula wrote:
> I tried this and get an error.
>
> mypg=# select * from interesting WHERE NOT description ~ ( '^('||
> mypg(#    $$[\x09\x0A\x0D\x20-\x7E]|$$||            -- ASCII
> mypg(#    $$[\xC2-\xDF][\x80-\xBF]|$$||             -- non-overlong 2-byte
> mypg(#     $$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
> mypg(#    $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$||  -- straight 3-byte
> mypg(#     $$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
> mypg(#     $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
> mypg(#    $$[\xF1-\xF3][\x80-\xBF]{3}|$$||          -- planes 4-15
> mypg(#     $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
> mypg(#   '*)$' )

Doh, I put the * in the wrong place! That last line should be:

  ')*$' )

At least that's what looks strange to me now -- not sure how it got
moved though!

--
  Sam  http://samason.me.uk/

two questions about fulltext searching / tsvector indexes

From
Jonathan Vanasco
Date:
I'm having some issues with fulltext searching.

I've gone through the list archives and Stack Overflow, but can't seem to
get the exact answers.  Hoping someone can help.

Thanks in advance, and apologies for these questions being rather basic.
I just felt the docs and some online posts are leading me into possibly
making the wrong decision, and I want to make sure I'm doing this right.


1.  I need to make both 'title' and 'description' searchable.  What is the
current proper way to index multiple columns of a table (i.e., not just one)?

    I've essentially seen the following in the docs, mailing list, and various websites:

    A unified index
        CREATE INDEX CONCURRENTLY unified_tsvector_idx ON mytable
            USING gin(to_tsvector('english', title || ' ' || description));

    Individual indexes
        CREATE INDEX CONCURRENTLY title_tsvector_idx ON mytable USING gin(to_tsvector('english', title ));
        CREATE INDEX CONCURRENTLY description_tsvector_idx ON mytable USING gin(to_tsvector('english', description ));

    Using dedicated columns (one or more)
        ALTER TABLE ....
        create trigger ....

    I can't figure out which one to use.  This is on a steadily growing
table of around 20MM rows that gets 20-80k new records a day, but
existing records are rarely updated.


2. I've been getting a handful of 'can not index words longer than 2047
characters' errors in my tests.

    If this 2047-character max is on tokens, is there a way to lower it?
Or to profile the index for the distribution of token lengths?  I don't
think we have to support any tokens larger than 20 chars or so.

3a. What should EXPLAIN ANALYZE show if it is using the index?  I couldn't
find an example.

3b. Depending on how I index the column, what do I need to pass into the
query so that it uses the index?

    1.  If the index is created like
            gin(to_tsvector('english', title ));

        do I have to search in this format?
            to_tsvector('english', title) @@ to_tsquery('english', 'dog');

    2.  If I use an index like
            gin(to_tsvector('english', title || ' ' || description ));

        what is the correct way to query the database and let the planner
        know I want to use the index?





Re: two questions about fulltext searching / tsvector indexes

From
Vick Khera
Date:
On Mon, Jun 9, 2014 at 8:55 PM, Jonathan Vanasco <postgres@2xlp.com> wrote:
>         I can't figure out which one to use.  This is on a steadily
> growing table of around 20MM rows that gets 20-80k new records a day,
> but existing records are rarely updated.

The question as always is a time-space trade-off. How frequently do
you make the full text search? If you do it frequently, then with a
pre-computed tsv column you save all that time per row of computing
the tsvector on every search. If you do it infrequently, the space
savings (and not needing to maintain that column) may benefit you.

Personally, in these days of cheap disks I'd go with the dedicated
column. Given that, you want to just have a GIN index on that one
column, and the query you want, given some plain text string like
"fluffy dog", is this:

select * from mytable where plainto_tsquery('fluffy dog') @@ my_tsv_column;

I always use a trigger on insert and update to maintain the tsvector
column, so there is no doubt about how it was computed by various
programs.
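
Putting those pieces together (a sketch only; table and column names
follow the thread, and the built-in tsvector_update_trigger is just one
way to maintain the column):

alter table mytable add column my_tsv_column tsvector;

-- the built-in trigger recomputes the tsvector on every insert/update
create trigger mytable_tsv_update
    before insert or update on mytable
    for each row execute procedure
    tsvector_update_trigger(my_tsv_column, 'pg_catalog.english',
                            title, description);

-- fire the trigger once to backfill existing rows, then index the column
update mytable set title = title;
create index concurrently mytable_tsv_idx on mytable
    using gin (my_tsv_column);

-- queries of this form can use the index
select * from mytable
where my_tsv_column @@ plainto_tsquery('english', 'fluffy dog');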


Re: two questions about fulltext searching / tsvector indexes

From
Kevin Grittner
Date:
Vick Khera <vivek@khera.org> wrote:

> Jonathan Vanasco <postgres@2xlp.com> wrote:

> Personally, in these days of cheap disks I'd go with the dedicated
> column. Given that, you want to just have a GIN index on that one
> column, and the query you want, given some plain text string like
> "fluffy dog", is this:
>
> select * from mytable where plainto_tsquery('fluffy dog') @@ my_tsv_column;
>
> I always use a trigger on insert and update to maintain the tsvector
> column, so there is no doubt about how it was computed by various
> programs.

I was going to make a similar recommendation, but Vick beat me to
it.  The only thing I would add is that you might want to consider
whether a match in one column should carry more weight than a match
in the other column.  If so, you should convert each to a tsvector
separately, and give each one a different weight before
concatenating the tsvector objects and storing the result.

http://www.postgresql.org/docs/current/interactive/textsearch-features.html#TEXTSEARCH-MANIPULATE-TSVECTOR
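
A sketch of that weighting idea (names as above; note the built-in
trigger shown earlier does not apply weights, so a custom trigger
function would be needed to keep a weighted column current):

update mytable set my_tsv_column =
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(description, '')), 'B');

-- ts_rank() then favors title matches when ordering results
select title, ts_rank(my_tsv_column, q) as rank
from mytable, plainto_tsquery('english', 'fluffy dog') as q
where my_tsv_column @@ q
order by rank desc;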

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: two questions about fulltext searching / tsvector indexes

From
Jonathan Vanasco
Date:

On Jun 10, 2014, at 8:26 AM, Vick Khera wrote:

Thanks so much for this.

We do a lot of searching on this column, so pre-computing seems to be the way.

I'm not worried about disk space for now, and can revisit that later if
there is a problem.

Just for clarification on this:

Option A (less fast):
create a GIN index directly on to_tsvector('english', searchable_column)

Option B (faster):
add a tsvector column computed from `searchable_column`
create a GIN index on that tsvector column
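
As concrete DDL (a sketch; names are illustrative):

-- Option A: expression index; a query must repeat the exact expression
-- to_tsvector('english', searchable_column) for the planner to use it
create index concurrently a_idx on mytable
    using gin (to_tsvector('english', searchable_column));

-- Option B: stored, trigger-maintained column; index the column itself
alter table mytable add column searchable_tsv tsvector;
create index concurrently b_idx on mytable using gin (searchable_tsv);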






// Jonathan Vanasco

c. 646.729.6436  |  415.501.9815