Thread: Regex Replace with 2 conditions
Hi all,
Is there a way to specify 2 conditions in regexp_replace?
I need an SQL function that eliminates all ASCII characters from 1-255 that are not A-Z, a-z, 0-9, and special characters % and _ so something like:
SELECT regexp_replace(p_string, E'[' || CHR(1) || '-' || CHR(255) || '&&[^A-Za-z0-9%_]]', '', 'g'));
But this syntax is not really working.
I have written a SQL function that achieves this, but I am not happy with it because it is hard to read and maintain:
-- Eliminates all ASCII characters from 1-255 that are not A-z, a-z, 0-9, and special characters % and _
-- The computed regex expression that is between E[] is CHR(1)-$&-/:-@[-^`{-ÿ].
CREATE OR REPLACE FUNCTION testFunction(p_string CHARACTER VARYING) RETURNS VARCHAR AS $$
SELECT regexp_replace(p_string, E'[' || CHR(1) || '-' || CHR(36) || CHR(38) || '-' || CHR(47) || CHR(58) || '-' || CHR(64) || CHR(91) || '-' || CHR(94) || CHR(96) || CHR(123) || '-' || CHR(255) || ']', '', 'g');
$$ LANGUAGE sql IMMUTABLE;
Please help me figure out how to achieve this.
Thanks a lot,
Denisa Cîrstescu
Denisa: On Mon, Feb 5, 2018 at 2:34 PM, Denisa Cirstescu <Denisa.Cirstescu@tangoe.com> wrote: > I need an SQL function that eliminates all ASCII characters from 1-255 that > are not A-Z, a-z, 0-9, and special characters % and _ so something like: Are you aware ASCII is a SEVEN bit code ? And now, why don't you just write the negated condition, maybe throwing in a null to avoid it? Do you have codes above 255 which you do not need replacing? I.e., something like SELECT regexp_replace(p_string, E'[^A-Za-z0-9%_]', '', 'g')); This will also zap \0 and all chars >255 if you are using unicode, if this is not a problem that's all there is to it. If you are using it you could throw a null plus a character range from 256 to the largest one, but I doubt this is useful. Which is the character set of your source data? ( It can NOT be ascii if you are worried about 128-255, but is it a single byte one or is it unicode or something wide ? ) Also, it may perform a bit faster if you throw a + after the character class ( for >1 char runs ). Francisco Olarte.
Denisa Cirstescu <Denisa.Cirstescu@tangoe.com> writes: > Is there a way to specify 2 conditions in regexp_replace? > I need an SQL function that eliminates all ASCII characters from 1-255 that are not A-Z, a-z, 0-9, and special characters% and _ so something like: > SELECT regexp_replace(p_string, E'[' || CHR(1) || '-' || CHR(255) || '&&[^A-Za-z0-9%_]]', '', 'g')); > But this syntax is not really working. Nope, because there's no && operator in regexes. But I think you could get what you want by using lookahead or lookbehind to combine additional condition(s) with a basic character-class pattern. Something like (?=[\001-\377])[^A-Za-z0-9%_] regards, tom lane
Is there a way to specify 2 conditions in regexp_replace?
I have written a SQL function that achieves this, but I am not happy with it because it is hard to read and maintain:
-- Eliminates all ASCII characters from 1-255 that are not A-z, a-z, 0-9, and special characters % and _
-- The computed regex expression that is between E[] is CHR(1)-$&-/:-@[-^`{-ÿ].
CREATE OR REPLACE FUNCTION testFunction(p_string CHARACTER VARYING) RETURNS VARCHAR AS $$
SELECT regexp_replace(p_string, E'[' || CHR(1) || '-' || CHR(36) || CHR(38) || '-' || CHR(47) || CHR(58) || '-' || CHR(64) || CHR(91) || '-' || CHR(94) || CHR(96) || CHR(123) || '-' || CHR(255) || ']', '', 'g');
$$ LANGUAGE sql IMMUTABLE;
Francisco, I've tried the version that you are proposing before posting this question, but it is not good as it is removing charactersthat have ASCII code greater than 255 and those are characters that I need to keep, such as "ă". SELECT regexp_replace(p_string, E'[^A-Za-z0-9%_]', '', 'g')); This is the request that I have: write a function that eliminates all ASCII characters from 1-255 that are not A-Z, a-z,0-9, and special characters % and _ Tom, I have tried what you suggested with the lookahead and it is working. It is exactly what I needed. The final version of the function is: CREATE OR REPLACE FUNCTION testFunction(p_string CHARACTER VARYING) RETURNS VARCHAR AS $$ SELECT regexp_replace(p_string, E'(?=[' || CHR(1) || '-' || CHR(255) || '])[^A-Za-z0-9%_]', '', 'g'); $$ LANGUAGE sql IMMUTABLE; Thanks a lot, Denisa Cîrstescu -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Monday, February 5, 2018 4:43 PM To: Denisa Cirstescu <Denisa.Cirstescu@tangoe.com> Cc: pgsql-general@postgresql.org Subject: Re: Regex Replace with 2 conditions Denisa Cirstescu <Denisa.Cirstescu@tangoe.com> writes: > Is there a way to specify 2 conditions in regexp_replace? > I need an SQL function that eliminates all ASCII characters from 1-255 that are not A-Z, a-z, 0-9, and special characters% and _ so something like: > SELECT regexp_replace(p_string, E'[' || CHR(1) || '-' || CHR(255) || > '&&[^A-Za-z0-9%_]]', '', 'g')); But this syntax is not really working. Nope, because there's no && operator in regexes. But I think you could get what you want by using lookahead or lookbehind to combine additional condition(s) with a basiccharacter-class pattern. Something like (?=[\001-\377])[^A-Za-z0-9%_] regards, tom lane
Denisa: 1.- Please, do not top-post, it makes seeing what you arereplying to difficult. 2.- Do not reply to several messages in one. Nobody reading this knows my suggestions. Having said that, regarding my part: On Mon, Feb 5, 2018 at 5:54 PM, Denisa Cirstescu <Denisa.Cirstescu@tangoe.com> wrote: > I've tried the version that you are proposing before posting this question, but it is not good as it is removing charactersthat have ASCII code greater than 255 and those are characters that I need to keep, such as "ă". > SELECT regexp_replace(p_string, E'[^A-Za-z0-9%_]', '', 'g')); > This is the request that I have: write a function that eliminates all ASCII characters from 1-255 that are not A-Z, a-z,0-9, and special characters % and _ I repeat for the last time. YOU ARE NOT USING ASCII. ASCII IS A SEVEN BIT CODE, 0-128. "ă" IS NOT IN THE ASCII CHARACTER SET. Having said that I asked which character set you were using, you didn't bother to answer. This is neccessary to know jow to express the characters you want to preserve. We could assume unicode, but from your examples it is not clear whether you want to preserve all codepoints above 255 or just the alphabetic ones. I suspect the later. The easier one can be, as I suggested earlier, adding a range from codepoint 256 to the last one in your regexp. Anyway, RTFM. I suggest starting at https://www.postgresql.org/docs/10/static/functions-matching.html#POSIX-BRACKET-EXPRESSIONS and following on to see the many ways you have to expresss characters in a regexp. Read carefully, as the documentation is rather dense, but if you are not willing to answer simple questions like what is your character set, or database encoding, it is the most I can do. Francisco Olarte. P.S. I have not commented on the rest of the message, directed to Tom, but seeing you want something equivalent to the functionality of "tr [-c] -d" I suspect lookahead/behind is overkill ( and from your message I would assume well beyond your regexp skills ). FO
On Mon, 5 Feb 2018 18:22:02 +0100, Francisco Olarte <folarte@peoplecall.com> wrote: >I repeat for the last time. YOU ARE NOT USING ASCII. ASCII IS A SEVEN >BIT CODE, 0-128. "?" IS NOT IN THE ASCII CHARACTER SET. What ASCII table are you reading? The question mark symbol is #63. It lies between the numbers and the capital letter set. George
On Mon, 5 Feb 2018 18:22:02 +0100, Francisco Olarte
<folarte@peoplecall.com> wrote:
>I repeat for the last time. YOU ARE NOT USING ASCII. ASCII IS A SEVEN
>BIT CODE, 0-128. "?" IS NOT IN THE ASCII CHARACTER SET.
What ASCII table are you reading? The question mark symbol is #63. It
lies between the numbers and the capital letter set.
On Mon, 5 Feb 2018 18:22:02 +0100, Francisco Olarte
<folarte@peoplecall.com> wrote:
>I repeat for the last time. YOU ARE NOT USING ASCII. ASCII IS A SEVEN
>BIT CODE, 0-128. "?" IS NOT IN THE ASCII CHARACTER SET.
What ASCII table are you reading? The question mark symbol is #63. It
lies between the numbers and the capital letter set.Your mail client mangled that sentence - the "?" you are seeing is a placeholder for the non-ASCII character "'a' with a bowl on top of it"...
Thanks David. Apologies to everyone for the noise.
George
George: On Tue, Feb 6, 2018 at 4:46 PM, George Neuner <gneuner2@comcast.net> wrote: > On Mon, 5 Feb 2018 18:22:02 +0100, Francisco Olarte > <folarte@peoplecall.com> wrote: >>I repeat for the last time. YOU ARE NOT USING ASCII. ASCII IS A SEVEN >>BIT CODE, 0-128. "?" IS NOT IN THE ASCII CHARACTER SET. I made a typo there, 0..127, not 128 ( or [0,128) ;-) ) > What ASCII table are you reading? The question mark symbol is #63. It > lies between the numbers and the capital letter set. I'm not reading any ascii table, and I did NOT send a question mark. IIRC I copied an a with something looking like an inverted circumflex above. I was using gmail in ubuntu in firefox, wihich I think works in unicode and sends mail in UTF-8, AAMOF I've looked at it and I see: >>> From: Francisco Olarte <folarte@peoplecall.com> To: Denisa Cirstescu <Denisa.Cirstescu@tangoe.com> Cc: Tom Lane <tgl@sss.pgh.pa.us>, "pgsql-general@postgresql.org" <pgsql-general@postgresql.org> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable <<< and a little below: >>> I repeat for the last time. YOU ARE NOT USING ASCII. ASCII IS A SEVEN BIT CODE, 0-128. "=C4=83" IS NOT IN THE ASCII CHARACTER SET. <<< So, no question mark sent, I suspect your mail chain may be playing tricks on you, or may be you are translating to 7 bits on purpose since your mail came with the headers: >>> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit <<< I'll suggest you fix that before participating in threads with unicode content. Also, many programs use ? as a placeholder for something not in its charset, so always suspect you are not seeing the right char when you encounter one of this things. Francisco Olarte.
On Tue, 6 Feb 2018 17:57:33 +0100, Francisco Olarte <folarte@peoplecall.com> wrote: >So, no question mark sent, I suspect your mail chain may be playing >tricks on you, or may be you are translating to 7 bits on purpose >since your mail came with the headers: > >>>> >Content-Type: text/plain; charset=us-ascii >Content-Transfer-Encoding: 7bit ><<< > >I'll suggest you fix that before participating in threads with unicode >content. Also, many programs use ? as a placeholder for something not >in its charset, so always suspect you are not seeing the right char >when you encounter one of this things. Usually I do see unicode characters correctly. My news client defaults to *sending* in ISO-8859-1 (US acsii), but it displays incoming messages in UTF-8, and in HTML if applicable ... so I'm not sure why I'm not seeing whatever it was that you actually typed. It does keep coming through as a question mark in all the responses. I read this group through the Gmane mail->news reflector ... maybe that has something to do with it? George