Thread: Best practices for moving UTF8 databases
Hi. I *always* get an error moving my current fully UTF-8 database data into a new DB.

My server runs version 8.3 with a five-year-old DB. Everything (collation, LC_LOCALE, etc.) is UTF-8.

When I install a new PostgreSQL 8.4 on my home Mac OS X machine (after losing some hair) I set everything about the new database to be UTF-8, at least anything I could see in pgAdmin.

But when I pull in the data dump from the server I always get the UTF-8 mismatch error.

So, my question: what is a good way to make sure that error does NOT occur? I simply wish to replicate the server database on another PG installation. What should one do?
Phoenix Kiula wrote:
> Hi. I *always* get an error moving my current fully UTF-8 database
> data into a new DB.
>
> My server runs version 8.3 with a five-year-old DB. Everything
> (collation, LC_LOCALE, etc.) is UTF-8.
>
> When I install a new PostgreSQL 8.4 on my home Mac OS X machine (after
> losing some hair) I set everything about the new database to be UTF-8,
> at least anything I could see in pgAdmin.
>
> But when I pull in the data dump from the server I always get the
> UTF-8 mismatch error.
>
> So, my question: what is a good way to make sure that error does NOT
> occur? I simply wish to replicate the server database on another PG
> installation. What should one do?

Improve PostgreSQL so that there are no more loopholes through which you can enter incorrect data :^(

8.3 is about one and a half years old, so your database must have originated with an older release. Correctness checks have improved considerably over the years.

What you will have to do is track down the rows containing incorrect data and fix them.

Yours,
Laurenz Albe
I would like to know the formal method of requesting new features for Postgres, and how decisions are made about which features are included in a future release.

Best Regards

--
Michael Gould, Managing Partner
Intermodal Software Solutions, LLC
904.226.0978
904.592.5250 fax
Michael Gould wrote:
> I would like to know the formal method of requesting new features for
> Postgres, and how decisions are made about which features are included
> in a future release.

Uh, you usually ask for the feature on the bugs or hackers email lists, and if we consider it useful it will be added to the TODO list. Then, someday, someone might implement your feature.

Not very rigorous, but that is the only method we have.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Sunday 12 July 2009 13:19:50 Phoenix Kiula wrote:
> Hi. I *always* get an error moving my current fully UTF-8 database
> data into a new DB.
> [...]
> But when I pull in the data dump from the server I always get the
> UTF-8 mismatch error.
>
> So, my question: what is a good way to make sure that error does NOT
> occur? I simply wish to replicate the server database on another PG
> installation. What should one do?

What is the _exact_ error you get? During which statement?

Andres
On 2009-07-13, Andres Freund <andres@anarazel.de> wrote:
> On Sunday 12 July 2009 13:19:50 Phoenix Kiula wrote:
>> Hi. I *always* get an error moving my current fully UTF-8 database
>> data into a new DB.
>> [...]
>
> What is the _exact_ error you get? During which statement?
>
> Andres

If you do an ascii dump and the dump starts out "SET CLIENT ENCODING 'UTF8'" or similar but you still get errors, run it through

  iconv -f UTF8 -t UTF8//IGNORE

That'll drop any illegal symbols. In theory that's a reduction in data integrity. iconv doesn't seem to have an option to replace them with U+FFFD :( I had hoped that //TRANSLIT would do that, but no.
On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:
> On 2009-07-13, Andres Freund <andres@anarazel.de> wrote:
>> What is the _exact_ error you get? During which statement?
>
> If you do an ascii dump and the dump starts out "SET CLIENT ENCODING
> 'UTF8'" or similar but you still get errors.

Do you mean that a dump from SQL_ASCII can yield non-UTF8 data? Right, but according to the OP his 8.3 database is UTF8, so there should not be invalid data in there.

Andres
On 2009-07-13, Michael Gould <mgould@intermodalsoftwaresolutions.net> wrote:
> I would like to know the formal method of requesting new features for
> Postgres, and how decisions are made about which features are included
> in a future release.

Formal?

For open source software in general, changes are requested using the bug reporting system. Change requests accompanied by working source patches have an advantage. Grants of cash to the developers tend to help too.
Andres Freund wrote:
> On Tuesday 14 July 2009 11:36:57 Jasen Betts wrote:
>> If you do an ascii dump and the dump starts out "SET CLIENT ENCODING
>> 'UTF8'" or similar but you still get errors.
>
> Do you mean that a dump from SQL_ASCII can yield non-UTF8 data? Right,
> but according to the OP his 8.3 database is UTF8, so there should not
> be invalid data in there.

I haven't followed this thread, but older PG versions had less strict checks on UTF8 data, which meant that some invalid data could creep in.

--
Alvaro Herrera                 http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Tuesday 14 July 2009 15:52:29 Alvaro Herrera wrote:
> Andres Freund wrote:
>> [...] according to the OP his 8.3 database is UTF8, so there should
>> not be invalid data in there.
>
> I haven't followed this thread, but older PG versions had less strict
> checks on UTF8 data, which meant that some invalid data could creep in.

But that tightening was from 8.2 -> 8.3, not 8.3 -> 8.4, I think?

Andres
On Tue, Jul 14, 2009 at 10:00:04AM +0000, Jasen Betts wrote:
> On 2009-07-13, Michael Gould <mgould@intermodalsoftwaresolutions.net> wrote:
>> I would like to know the formal method of requesting new features for
>> Postgres, and how decisions are made about which features are included
>> in a future release.
>
> Formal?
>
> For open source software in general, changes are requested using
> the bug reporting system.

Not in PostgreSQL, generally.

> Change requests accompanied by working source patches have an
> advantage.

Not always. It's much better to discuss the design of the feature on -hackers, come to some rough consensus, *then* write some code than to come up with some huge wonk of code which no one is interested in understanding and which will languish.

> Grants of cash to the developers tend to help too.

There are several outfits that can help you organize your donations. Software in the Public Interest, a non-profit company in the US, has a way to earmark donations for the PostgreSQL project.

Other ways to contribute resources include hardware and/or people available to the project. Being a pleasant, helpful part of the community will incline everyone else favorably toward setting your ideas at a higher priority, even if some of them don't fly :)

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Tue, Jul 14, 2009 at 9:52 PM, Alvaro Herrera<alvherre@commandprompt.com> wrote:
> I haven't followed this thread, but older PG versions had less strict
> checks on UTF8 data, which meant that some invalid data could creep in.

If so, how can I check for them in my old database, which is 8.2.9? I'm now moving first to 8.3 (then to 8.4).

Really, PG absolutely needs a way to upgrade the database without so much data-related downtime and all these silly woes. Several competing database systems are a cinch to upgrade.

Anyway, this is the annoying error I see, as always:

  ERROR: invalid byte sequence for encoding "UTF8": 0x80

I think my old DB is all UTF-8. If there are a few characters that are not, how can I work with this? I've done everything I can to take care of the encoding and such. This command was used to initdb:

  initdb --locale=en_US.UTF-8 --encoding=UTF8

Locale environment variables are all "en_US.UTF-8" too.

Thanks for any pointers!
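A side note not in the original thread: the error comes from the server validating incoming bytes against the database encoding, so it is easy to reproduce by hand. On a reasonably recent server with a UTF8 database, a bare 0x80 byte in a string literal trips the same check that rejects the OP's dump:

  -- in a UTF8 database on a modern server (the literal is validated on input):
  SELECT E'\x80';
  -- ERROR:  invalid byte sequence for encoding "UTF8": 0x80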
On Sun, Jul 19, 2009 at 10:16:17AM +0800, Phoenix Kiula wrote:
> If so, how can I check for them in my old database, which is 8.2.9?
> I'm now moving first to 8.3 (then to 8.4).
>
> Really, PG absolutely needs a way to upgrade the database without so
> much data-related downtime and all these silly woes. Several competing
> database systems are a cinch to upgrade.

Look through the archives; there are scripts that will scan all your text fields for UTF-8 problems. If you run them once you can clear out all the problems prior to upgrading.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
On Sun, Jul 19, 2009 at 7:08 PM, Martijn van Oosterhout<kleptog@svana.org> wrote:
> Look through the archives; there are scripts that will scan all your
> text fields for UTF-8 problems. If you run them once you can clear out
> all the problems prior to upgrading.

Thanks Martijn. I tried searching through the archives and could only come up with something like this:

http://docs.moodle.org/en/UTF-8_PostgreSQL

But this only has the usual iconv stuff suggested.

Could you please suggest some specific scripts or URLs? Or let me know the keywords I should look for? Thanks!
Phoenix Kiula wrote:
> Really, PG absolutely needs a way to upgrade the database without so
> much data-related downtime and all these silly woes. Several competing
> database systems are a cinch to upgrade.

I'd call it data corruption, not a silly woe.

I know that Oracle, for example, would not make that much fuss about your data: it would be imported without even a warning, and depending on your encoding settings the bad bytes would either be imported as-is or tacitly changed to inverted (or normal) question marks.

It's basically a design choice that PostgreSQL made: we think that an error is preferable to clandestinely modifying the user's data, or to accepting input that cannot possibly make any sense when it is retrieved at a future time.

> Anyway, this is the annoying error I see, as always:
>
>   ERROR: invalid byte sequence for encoding "UTF8": 0x80
>
> I think my old DB is all UTF-8. If there are a few characters that are
> not, how can I work with this? I've done everything I can to take care
> of the encoding and such. This command was used to initdb:
>
>   initdb --locale=en_US.UTF-8 --encoding=UTF8
>
> Locale environment variables are all "en_US.UTF-8" too.

"0x80" makes me think of the following: the data originate from a Windows system, where 0x80 is a Euro sign (in the Windows-1252 code page). Somehow these were imported into PostgreSQL without the appropriate translation into UTF-8 (how, I do not know).

I wonder: why do you spend so much time complaining instead of simply locating the buggy data and fixing them? This does not incur any downtime (you can fix the data in the old database before migrating), and it will definitely enhance the fun your users have with your database (if they actually see Euros where they should be).

Yours,
Laurenz Albe
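A quick way to see Laurenz's point, not from the original mail (convert_from and the hex bytea literal assume a reasonably modern server): decoding the offending byte as Windows-1252 instead of UTF-8 yields exactly the Euro sign:

  -- interpret byte 0x80 as Windows-1252; the result is the Euro sign
  SELECT convert_from('\x80'::bytea, 'WIN1252');
  -- returns: €

Which suggests the data was written by a Windows client without any encoding translation along the way.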
On Mon, Jul 20, 2009 at 10:32:15AM +0800, Phoenix Kiula wrote:
> Thanks Martijn. I tried searching through the archives and could only
> come up with something like this:
>
> http://docs.moodle.org/en/UTF-8_PostgreSQL
>
> But this only has the usual iconv stuff suggested.
>
> Could you please suggest some specific scripts or URLs? Or let me know
> the keywords I should look for?

I was thinking of this one:

http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg68541.html

You can use it to find the bogus strings in your existing database and fix them. It's not really fast, but maybe it'll help.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
> "0x80" makes me think of the following: > The data originate from a Windows system, where 0x80 is a Euro > sign. Somehow these were imported into PostgreSQL without the > appropriate translation into UTF-8 (how I do not know). > > I wonder: why do you spend so much time complaining instead of > simply locating the buggy data and fixing them? I'd love to fix them. But if I do a search for SELECT * FROM xyz WHERE col like '%0x80%' it doesn't work. How should I search for these characters? Thanks much
Phoenix Kiula wrote:
> I'd love to fix them. But if I do a search for
>
>   SELECT * FROM xyz WHERE col LIKE '%0x80%'
>
> it doesn't work. How should I search for these characters?

In 8.2, try:

  WHERE strpos(col, E'\x80') > 0

Note that this may find valid data as well: the error you get is raised when 0x80 is the *first* byte of a character in UTF-8; when it appears at another position it is a legitimate continuation byte, and you don't want to change it.

Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
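Building on that, a small sketch (not from the original mail) for eyeballing each hit before changing anything; it assumes the OP's table `xyz` has a primary key `id`:

  -- show where the suspect byte sits and a few characters around it
  SELECT id,
         strpos(col, E'\x80') AS pos,
         substr(col, greatest(strpos(col, E'\x80') - 5, 1), 12) AS context
  FROM xyz
  WHERE strpos(col, E'\x80') > 0;

The context column makes it easier to judge whether a given 0x80 is a stray Windows-1252 Euro sign or part of a legitimate multibyte character.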
On Tue, Jul 21, 2009 at 09:37:04AM +0200, Daniel Verite wrote:
> > I'd love to fix them. But if I do a search for
> > SELECT * FROM xyz WHERE col LIKE '%0x80%'
> > it doesn't work. How should I search for these characters?
>
> In 8.2, try:
>
>   WHERE strpos(col, E'\x80') > 0
>
> Note that this may find valid data as well: the error you get is
> raised when 0x80 is the *first* byte of a character in UTF-8; when it
> appears at another position it is a legitimate continuation byte, and
> you don't want to change it.

There are various regexes around to check for valid UTF-8 encoding; one appears to be:

http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex

One translation into PG would be:

  WHERE NOT col ~ ( '^('||
    $$[\09\0A\0D\x20-\x7E]|$$||              -- ASCII
    $$[\xC2-\xDF][\x80-\xBF]|$$||            -- non-overlong 2-byte
    $$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
    $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$|| -- straight 3-byte
    $$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
    $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
    $$[\xF1-\xF3][\x80-\xBF]{3}|$$||         -- planes 4-15
    $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
    '*)$' );

This seems to do the right thing for me in an SQL_ASCII database.

--
Sam  http://samason.me.uk/
Phoenix Kiula wrote:
> > I wonder: why do you spend so much time complaining instead of
> > simply locating the buggy data and fixing them?
>
> I'd love to fix them. But if I do a search for
>
>   SELECT * FROM xyz WHERE col LIKE '%0x80%'
>
> it doesn't work. How should I search for these characters?

I would get GNU libiconv: http://www.gnu.org/software/libiconv/

Then take a plain text pg_dump of the database in UTF-8. Let's call it "db.orig.dmp".

Strip all problem characters:

  iconv -f UTF-8 -t UTF-8 -c <db.orig.dmp >db.stripped.sql

Compare both files:

  diff -u db.orig.dmp db.stripped.sql

That output will show all lines containing a problem character. Now the tedious part: use "db.orig.dmp" to find out which tables they belong to, locate the records in the database by primary key, and fix them.

Alternatively, you can use iconv's auto-repair if you know which bytes give you a problem. For example, if you know that the trouble stems only from 0x80 bytes that should be Euro symbols, you could:

  iconv -f UTF-8 -t UTF-8 --byte-subst="<0x%x>" <db.orig.dmp | sed -e 's/<0x80>/EUR/g' >db.fixed.sql

The resulting "db.fixed.sql" could then be loaded into the new database.

Yours,
Laurenz Albe
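If the 0x80 bytes really are all stray Euro signs, the same "EUR" substitution can also be done in place in the old 8.2 database before dumping. A sketch, not from the original mail, reusing the hypothetical table and column names from earlier in the thread:

  -- assumes every 0x80 byte is a misencoded Euro sign, as in Laurenz's sed example
  UPDATE xyz
  SET col = replace(col, E'\x80', 'EUR')
  WHERE strpos(col, E'\x80') > 0;

Afterwards a fresh pg_dump should restore cleanly into the 8.4 database, since the offending bytes are gone at the source.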
On Tue, Jul 21, 2009 at 6:35 PM, Sam Mason<sam@samason.me.uk> wrote:
> There are various regexes around to check for valid UTF-8 encoding; one
> appears to be:
>
> http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex
>
> One translation into PG would be: [...]

I tried this. Get an error.

mypg=# select * from interesting WHERE NOT description ~ ( '^('||
mypg(# $$[\09\0A\0D\x20-\x7E]|$$|| -- ASCII
mypg(# $$[\xC2-\xDF][\x80-\xBF]|$$|| -- non-overlong 2-byte
mypg(# $$\xE0[\xA0-\xBF][\x80-\xBF]|$$|| -- excluding overlongs
mypg(# $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$|| -- straight 3-byte
mypg(# $$\xED[\x80-\x9F][\x80-\xBF]|$$|| -- excluding surrogates
mypg(# $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$|| -- planes 1-3
mypg(# $$[\xF1-\xF3][\x80-\xBF]{3}|$$|| -- planes 4-15
mypg(# $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$|| -- plane 16
mypg(# '*)$' )
mypg-#
mypg-# ;
ERROR: invalid regular expression: quantifier operand invalid
Phoenix Kiula wrote:
> I tried this. Get an error.
>
> mypg=# select * from interesting WHERE NOT description ~ ( '^('||
> [...]
> ERROR: invalid regular expression: quantifier operand invalid

If you really don't want to go the "pg_dump -> iconv (remove invalid characters) -> diff the dump files" route, a stored procedure that attempts to find the invalid characters was posted a few years back:

http://archives.postgresql.org/pgsql-hackers/2005-12/msg00511.php
http://svana.org/kleptog/pgsql/utf8_verify.sql

--
Justin Pasher
On Wed, Jul 22, 2009 at 05:26:37PM +0800, Phoenix Kiula wrote:
> I tried this. Get an error.
>
> mypg=# select * from interesting WHERE NOT description ~ ( '^('||
> [...]
> mypg(# '*)$' )
> mypg-# ;
> ERROR: invalid regular expression: quantifier operand invalid

Doh, I put the * in the wrong place! That last line should be:

  ')*$' )

At least that's what looks strange to me now; not sure how it got moved though!

--
Sam  http://samason.me.uk/
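For reference, the corrected query in full: a sketch against the OP's table `interesting`, with the quantifier moved inside the final string literal as Sam describes, and his character classes kept verbatim:

  select * from interesting WHERE NOT description ~ ( '^('||
    $$[\09\0A\0D\x20-\x7E]|$$||              -- ASCII
    $$[\xC2-\xDF][\x80-\xBF]|$$||            -- non-overlong 2-byte
    $$\xE0[\xA0-\xBF][\x80-\xBF]|$$||        -- excluding overlongs
    $$[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|$$|| -- straight 3-byte
    $$\xED[\x80-\x9F][\x80-\xBF]|$$||        -- excluding surrogates
    $$\xF0[\x90-\xBF][\x80-\xBF]{2}|$$||     -- planes 1-3
    $$[\xF1-\xF3][\x80-\xBF]{3}|$$||         -- planes 4-15
    $$\xF4[\x80-\x8F][\x80-\xBF]{2}$$||      -- plane 16
    ')*$' );

The pattern now reads ^(alt1|alt2|...)*$, i.e. the whole string must be a sequence of valid UTF-8 characters, so any row it returns contains at least one invalid byte sequence.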
I'm having some issues with full-text searching. I've gone through the list archives and Stack Overflow, but can't seem to get the exact answers; hoping someone can help. Thanks in advance, and apologies for these questions being rather basic. I just felt the docs and some online posts were leading me into possibly making the wrong decision, and I want to make sure I'm doing this right.

1. I need to make both 'title' and 'description' searchable. What is the current proper way to index multiple columns of a table (i.e., not one)? I've essentially seen the following in the docs, mailing list, and various websites:

A unified index:

  CREATE INDEX CONCURRENTLY unified_tsvector_idx ON mytable USING gin(to_tsvector('english', title || ' ' || description));

Individual indexes:

  CREATE INDEX CONCURRENTLY title_tsvector_idx ON mytable USING gin(to_tsvector('english', title ));
  CREATE INDEX CONCURRENTLY description_tsvector_idx ON mytable USING gin(to_tsvector('english', description ));

Using dedicated columns (one or more):

  ALTER TABLE ....
  CREATE TRIGGER ....

I can't figure out which one to use. This is on a steadily growing table of around 20MM rows that gets 20-80k new records a day, but existing records are rarely updated.

2. I've been getting a handful of 'can not index words longer than 2047 characters' errors in my tests. If this 2047-character max is on tokens, is there a way to lower it? Or to profile the index for the distribution of tokens? I don't think we have to support any tokens larger than 20 chars or so.

3a. What should EXPLAIN ANALYZE show if it is using the index? I couldn't find an example.

3b. Depending on how I index the column, what do I need to pass into the query so that it uses the index?

  1. If the index is created like gin(to_tsvector('english', title )), do I have to search in this format?

       to_tsvector('english', title) @@ to_tsquery('english', 'dog');

  2. If I use an index like gin(to_tsvector('english', title || ' ' || description )), what is the correct way to query the database and let the planner know I want to use the index?
On Mon, Jun 9, 2014 at 8:55 PM, Jonathan Vanasco <postgres@2xlp.com> wrote:
> I can't figure out which one to use. This is on a steadily growing
> table of around 20MM rows that gets 20-80k new records a day, but
> existing records are rarely updated.

The question as always is a time-space trade-off. How frequently do you make the full text search? If you do it frequently, then with a pre-computed tsv column you save all that time per row of computing the tsvector on every search. If you do it infrequently, the space savings (and not needing to maintain that column) may benefit you.

Personally, in these days of cheap disks I'd go with the dedicated column. Given that, you want to just have a GIN index on that one column, and the query you want, given some plain text string like "fluffy dog", is this:

  select plainto_tsquery('fluffy dog') @@ my_tsv_column;

I always use a trigger on insert and update to maintain the ts_vector column, so there is no doubt of how it was computed by various programs.
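A minimal sketch of that setup, not from the original mail; it uses the built-in tsvector_update_trigger and assumes the table and column names from Jonathan's question:

  -- dedicated, pre-computed tsvector column
  ALTER TABLE mytable ADD COLUMN tsv tsvector;

  -- keep the column current on insert/update (built-in trigger function)
  CREATE TRIGGER mytable_tsv_update
  BEFORE INSERT OR UPDATE ON mytable
  FOR EACH ROW EXECUTE PROCEDURE
  tsvector_update_trigger(tsv, 'pg_catalog.english', title, description);

  -- populate existing rows once; the no-op update fires the trigger
  UPDATE mytable SET tsv = NULL;

  -- GIN index on the dedicated column
  CREATE INDEX CONCURRENTLY mytable_tsv_idx ON mytable USING gin(tsv);

  -- a query of this shape can use the index
  SELECT * FROM mytable WHERE tsv @@ plainto_tsquery('english', 'fluffy dog');

On a 20MM-row table the one-off UPDATE and the index build will both take a while, but CONCURRENTLY (which the question already uses) avoids blocking writers during the build.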
Vick Khera <vivek@khera.org> wrote:
> Personally, in these days of cheap disks I'd go with the dedicated
> column. Given that, you want to just have a GIN index on that one
> column, and the query you want, given some plain text string like
> "fluffy dog", is this:
>
>   select plainto_tsquery('fluffy dog') @@ my_tsv_column;
>
> I always use a trigger on insert and update to maintain the ts_vector
> column, so there is no doubt of how it was computed by various
> programs.

I was going to make a similar recommendation, but Vick beat me to it.

The only thing I would add is that you might want to consider whether a match in one column should carry more weight than a match in the other column. If so, you should convert each to a tsvector separately, and give each one a different weight before concatenating the tsvector objects and storing the result.

http://www.postgresql.org/docs/current/interactive/textsearch-features.html#TEXTSEARCH-MANIPULATE-TSVECTOR

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
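A sketch of that weighted variant, again not from the original mail, with the same assumed names plus a primary key `id`; title matches rank above description matches:

  -- weighted population: 'A' outranks 'B' in ts_rank
  UPDATE mytable
  SET tsv = setweight(to_tsvector('english', coalesce(title, '')), 'A')
         || setweight(to_tsvector('english', coalesce(description, '')), 'B');

  -- rank matches, best first
  SELECT id, title, ts_rank(tsv, q) AS rank
  FROM mytable, plainto_tsquery('english', 'fluffy dog') AS q
  WHERE tsv @@ q
  ORDER BY rank DESC;

Note that the built-in tsvector_update_trigger does not apply weights, so with this scheme the column would have to be maintained by a small custom plpgsql trigger instead.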
On Jun 10, 2014, at 8:26 AM, Vick Khera wrote:
> On Mon, Jun 9, 2014 at 8:55 PM, Jonathan Vanasco <postgres@2xlp.com> wrote:
>> I can't figure out which one to use. This is on a steadily growing
>> table of around 20MM rows that gets 20-80k new records a day, but
>> existing records are rarely updated.
>
> The question as always is a time-space trade-off. How frequently do
> you make the full text search? If you do it frequently, then with a
> pre-computed tsv column you save all that time per row of computing
> the tsvector on every search. If you do it infrequently, the space
> savings (and not needing to maintain that column) may benefit you.
>
> Personally, in these days of cheap disks I'd go with the dedicated
> column. Given that, you want to just have a GIN index on that one
> column, and the query you want, given some plain text string like
> "fluffy dog", is this:
>
>   select plainto_tsquery('fluffy dog') @@ my_tsv_column;
>
> I always use a trigger on insert and update to maintain the ts_vector
> column, so there is no doubt of how it was computed by various
> programs.

Thanks so much for this.

We do a lot of searching on this column, so pre-computing seems to be the way. I'm not worried about disk space for now, and can revisit that later if there is a problem.

Just for clarification on this:

Option A (less fast):
  create a GIN index on to_tsvector('english', searchable_column)

Option B (faster):
  create a tsvector column for `searchable_column`
  create a GIN index on that tsvector column
// Jonathan Vanasco
c. 646.729.6436 | 415.501.9815
linkedin. http://linkedin.com/in/jonathanvanasco
blog. http://destructuring.net