Thread: directional marks
Greetings,

I've run into problems handling directional marks (i.e. the Unicode characters U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT MARK, used with scripts such as Arabic). The db I exported from Microsoft SQL 7.0 contained a great many directional marks, some even inside words (e.g. "foo" -> "f(200e)oo"). This is probably due to an external program that was used to fill the db with values.

Directional marks are not shown in the client browser, but to PostgreSQL each one is a character. This is a problem when I try to query the db:

SELECT * FROM table WHERE field = 'foo';

If "table" contains a record whose value for "field" is "foo" but with a directional mark in it (e.g. "f(200e)oo"), I can't get any result.

The only way to fix the problem is to remove every directional mark occurrence, or to make PostgreSQL ignore that kind of character during UNICODE queries.

What do you think about it?

Thx.

---
Nhan NGO DINH
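The mismatch described in this message, and the "remove every occurrence" option, can be illustrated outside the database. The following is only a minimal Python sketch: the helper name strip_directional_marks is hypothetical, and the only details taken from the message are the two code points U+200E and U+200F.

```python
# Hypothetical helper; only the two code points come from the message above.
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

# Translation table that maps both marks to None, i.e. deletes them.
_STRIP_MARKS = {ord(LRM): None, ord(RLM): None}


def strip_directional_marks(text: str) -> str:
    """Return text with every U+200E / U+200F removed."""
    return text.translate(_STRIP_MARKS)


# What an exported value such as "f(200e)oo" looks like as a string.
stored = "f" + LRM + "oo"

# The mark is invisible when rendered but still counts as a character,
# so a plain equality test against "foo" fails until it is removed.
print(stored == "foo")
print(strip_directional_marks(stored) == "foo")
```

Running a cleanup like this once over the exported data, before loading it, would leave the equality query unchanged; whether that is acceptable depends on whether any of the marks carry real meaning in the text.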
nngodinh@tiscali.it writes:

> The only way to fix the problem is to remove any directional mark occurrence,
> or to make PostgreSQL ignore that kind of characters during UNICODE queries.
>
> What do you think about it?

Either remove the directional marks or consistently use them in all your queries (or use wildcards to paint over the difference). The directional mark characters aren't just for amusement -- they contain real information so they cannot be ignored.

-- 
Peter Eisentraut peter_e@gmx.net
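One possible reading of the "wildcards" suggestion is a pattern that tolerates directional marks anywhere in the search term. The sketch below shows that idea in Python with a regular expression rather than in SQL; the function name mark_tolerant_pattern is hypothetical and this is not taken from the thread.

```python
import re

# The two directional marks discussed in the thread.
MARKS = "\u200e\u200f"


def mark_tolerant_pattern(term: str) -> re.Pattern:
    """Build a regex matching term with optional marks around each character."""
    optional = f"[{MARKS}]*"
    # Escape each character so regex metacharacters in the term stay literal.
    body = optional.join(re.escape(ch) for ch in term)
    return re.compile(f"{optional}{body}{optional}")


pattern = mark_tolerant_pattern("foo")

# Matches the clean word and the word with an embedded mark,
# but not a different word.
print(bool(pattern.fullmatch("foo")))
print(bool(pattern.fullmatch("f\u200eoo")))
print(bool(pattern.fullmatch("fo")))
```

A pattern like this has to be built for every query, which is why cleaning the stored data once is usually the simpler of the two options.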
I'm speaking about directional marks that are ignored by, for instance, Microsoft SQL 7.0 because they are useless in that position (such as when they sit inside text that runs in a single direction, either left-to-right or right-to-left). Symbols of this kind can end up inserted more or less at random. For example: the person entering data types an English text like "test". At the end he switches the keyboard layout to Arabic and types something in Arabic, but then he realizes he doesn't want that, erases the Arabic text, switches the keyboard layout back and continues with English text after "test". Some directional marks get inserted, but they are useless.

The problem is that sometimes the directional mark is inside a word, not just at its end. Moreover, if you try to index using txt2txtidx, directional marks are not recognized as delimiters (and indeed they aren't delimiters), so the txtidx array will contain the adjacent word with a directional mark appended.

You may say that the source I exported the db from is malformed, and you would be absolutely right. Still, I know that some programs (especially Microsoft ones) make this mistake. I'm not speaking of PHP.

Bye.

>-- Original Message --
>Date: Mon, 16 Sep 2002 19:25:30 +0200 (CEST)
>From: Peter Eisentraut <peter_e@gmx.net>
>To: nngodinh@tiscali.it
>cc: pgsql-hackers@postgresql.org
>Subject: Re: [HACKERS] directional marks
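The indexing problem described in this message can be reproduced with any tokenizer that does not treat directional marks as delimiters. The sketch below uses Python's whitespace split purely as a stand-in; it is not how txt2txtidx parses text, and both function names are hypothetical.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def naive_tokens(text: str) -> list:
    """Split on whitespace only; directional marks are not whitespace."""
    return text.split()


def cleaned_tokens(text: str) -> list:
    """Drop both directional marks first, then split on whitespace."""
    cleaned = text.replace(LRM, "").replace(RLM, "")
    return cleaned.split()


# "test" followed by a stray mark, as in the keyboard-switching scenario.
sample = "test" + LRM + " more text"

# The first naive token still carries the mark, so it would not match
# an index lookup for the plain word "test".
print(naive_tokens(sample)[0] == "test")
print(cleaned_tokens(sample)[0] == "test")
```

Stripping the marks before the text reaches the indexer keeps the stored words identical to what a user would type in a search.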
nngodinh@tiscali.it writes:

> I'm speaking about directional marks that are ignored by - for instance
> - by Microsoft SQL 7.0 because they're unuseful in that position (like when
> they're in a one way text either left-to-right or right-to-left). It may
> happen that this kind of symbols are randomly inserted: for example...

To me this sounds analogous to inserting tons of <space><backspace> sequences into a string and expecting the software to automatically figure out that they cancel. It would be possible, but it would probably add a lot of overhead and it doesn't seem to be requested a lot. The best solution is probably to fix your data, unless you can point to a Unicode standard that states that such cancellation should happen.

-- 
Peter Eisentraut peter_e@gmx.net