Thread: Regex Replace with 2 conditions

Regex Replace with 2 conditions

From

Denisa Cirstescu

Date:

05 February 2018, 16:34:37

Hi all,

Is there a way to specify 2 conditions in regexp_replace?

I need an SQL function that eliminates all ASCII characters from 1-255 that are not A-Z, a-z, 0-9, and special characters % and _ so something like:

SELECT regexp_replace(p_string, E'[' || CHR(1) || '-' || CHR(255) || '&&[^A-Za-z0-9%_]]', '', 'g'));

But this syntax is not really working.

I have written a SQL function that achieves this, but I am not happy with it because it is hard to read and maintain:

-- Eliminates all ASCII characters from 1-255 that are not A-z, a-z, 0-9, and special characters % and _

-- The computed regex expression that is between E[] is CHR(1)-$&-/:-@[-^`{-ÿ].

CREATE OR REPLACE FUNCTION testFunction(p_string CHARACTER VARYING) RETURNS VARCHAR AS $$

SELECT regexp_replace(p_string, E'[' || CHR(1) || '-' || CHR(36) || CHR(38) || '-' || CHR(47) || CHR(58) || '-' || CHR(64) || CHR(91) || '-' || CHR(94) || CHR(96) || CHR(123) || '-' || CHR(255) || ']', '', 'g');

$$ LANGUAGE sql IMMUTABLE;

Please help me figure out how to achieve this.

Thanks a lot,

Denisa Cîrstescu

Re: Regex Replace with 2 conditions

From

Francisco Olarte

Date:

05 February 2018, 17:26:46

Denisa:

On Mon, Feb 5, 2018 at 2:34 PM, Denisa Cirstescu
<Denisa.Cirstescu@tangoe.com> wrote:
> I need an SQL function that eliminates all ASCII characters from 1-255 that
> are not A-Z, a-z, 0-9, and special characters % and _  so something like:

Are you aware ASCII is a SEVEN bit code ?

And now, why don't you just write the negated condition, maybe
throwing in a null to avoid it? Do you have codes above 255 which you
do not need replacing?

I.e., something like

SELECT regexp_replace(p_string, E'[^A-Za-z0-9%_]', '', 'g'));

This will also zap \0 and all chars >255 if you are using unicode, if
this is not a problem that's all there is to it.

If you are using it you could throw a null plus a character range from
256 to the largest one, but I doubt this is useful. Which is the
character set of your source data? ( It can NOT be ascii if you are
worried about 128-255, but is it a single byte one or is it unicode or
something wide ? )

Also, it may perform a bit faster if you throw a + after the character
class ( for >1 char runs ).

Francisco Olarte.

Re: Regex Replace with 2 conditions

From

Tom Lane

Date:

05 February 2018, 17:43:06

Denisa Cirstescu <Denisa.Cirstescu@tangoe.com> writes:
> Is there a way to specify 2 conditions in regexp_replace?
> I need an SQL function that eliminates all ASCII characters from 1-255 that are not A-Z, a-z, 0-9, and special
characters% and _  so something like: 
> SELECT regexp_replace(p_string, E'[' || CHR(1) || '-' || CHR(255) || '&&[^A-Za-z0-9%_]]', '', 'g'));
> But this syntax is not really working.

Nope, because there's no && operator in regexes.

But I think you could get what you want by using lookahead or lookbehind
to combine additional condition(s) with a basic character-class pattern.
Something like

    (?=[\001-\377])[^A-Za-z0-9%_]

            regards, tom lane

Re: Regex Replace with 2 conditions

From

"David G. Johnston"

Date:

05 February 2018, 17:53:24

On Mon, Feb 5, 2018 at 6:34 AM, Denisa Cirstescu <Denisa.Cirstescu@tangoe.com> wrote:

Is there a way to specify 2 conditions in regexp_replace?

Tom and Francisco both give excellent responses.

I have written a SQL function that achieves this, but I am not happy with it because it is hard to read and maintain:

-- Eliminates all ASCII characters from 1-255 that are not A-z, a-z, 0-9, and special characters % and _
-- The computed regex expression that is between E[] is CHR(1)-$&-/:-@[-^`{-ÿ].
CREATE OR REPLACE FUNCTION testFunction(p_string CHARACTER VARYING) RETURNS VARCHAR AS $$
SELECT regexp_replace(p_string, E'[' || CHR(1) || '-' || CHR(36) || CHR(38) || '-' || CHR(47) || CHR(58) || '-' || CHR(64) || CHR(91) || '-' || CHR(94) || CHR(96) || CHR(123) || '-' || CHR(255) || ']', '', 'g');
$$ LANGUAGE sql IMMUTABLE;

I'm not seeing what kind of maintenance would be involved here - and you have various string tricks to use to make the expression itself more comprehensible (at the possible cost of performance).

control_codes_1 := CHR(1) || '-' || CHR(36)

control_codes_2 := CHR(38) || '-' || CHR(47)

regexp_replace(

p_string,

format('[%s%s%s%s%s%s]',

control_codes_1,

control_codes_2,

blah1,

blah2,

blah3,

blah4

'x')

Add a code comment and the next person to read this should be able to understand its purpose.

Note, as a matter of course I try to avoid E'' strings whenever I write regular expressions - since backslash is special to both I have to escape the ones being passed to the regex engine and that is undesirable.

David J.

RE: Regex Replace with 2 conditions

From

Denisa Cirstescu

Date:

05 February 2018, 19:54:18

Francisco,

I've tried the version that you are proposing before posting this question, but it is not good as it is removing
charactersthat have ASCII code greater than 255 and those are characters that I need to keep, such as "ă". 

    SELECT regexp_replace(p_string, E'[^A-Za-z0-9%_]', '', 'g'));

This is the request that I have: write a function that eliminates all ASCII characters from 1-255 that are not A-Z,
a-z,0-9, and special characters % and _ 

Tom,

I have tried what you suggested with the lookahead and it is working.
It is exactly what I needed. The final version of the function is:

    CREATE OR REPLACE FUNCTION testFunction(p_string CHARACTER VARYING) RETURNS VARCHAR AS $$
        SELECT regexp_replace(p_string, E'(?=[' || CHR(1) || '-' || CHR(255) || '])[^A-Za-z0-9%_]', '', 'g');
    $$ LANGUAGE sql IMMUTABLE;

Thanks a lot,
Denisa Cîrstescu

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Monday, February 5, 2018 4:43 PM
To: Denisa Cirstescu <Denisa.Cirstescu@tangoe.com>
Cc: pgsql-general@postgresql.org
Subject: Re: Regex Replace with 2 conditions

Denisa Cirstescu <Denisa.Cirstescu@tangoe.com> writes:
> Is there a way to specify 2 conditions in regexp_replace?
> I need an SQL function that eliminates all ASCII characters from 1-255 that are not A-Z, a-z, 0-9, and special
characters% and _  so something like: 
> SELECT regexp_replace(p_string, E'[' || CHR(1) || '-' || CHR(255) ||
> '&&[^A-Za-z0-9%_]]', '', 'g')); But this syntax is not really working.

Nope, because there's no && operator in regexes.

But I think you could get what you want by using lookahead or lookbehind to combine additional condition(s) with a
basiccharacter-class pattern. 
Something like

    (?=[\001-\377])[^A-Za-z0-9%_]

            regards, tom lane

Re: Regex Replace with 2 conditions

From

Francisco Olarte

Date:

05 February 2018, 20:22:02

Denisa:

1.- Please, do not top-post, it makes seeing what you arereplying to difficult.

2.- Do not reply to several messages in one. Nobody reading this knows
my suggestions.

Having said that, regarding my part:

On Mon, Feb 5, 2018 at 5:54 PM, Denisa Cirstescu
<Denisa.Cirstescu@tangoe.com> wrote:
> I've tried the version that you are proposing before posting this question, but it is not good as it is removing
charactersthat have ASCII code greater than 255 and those are characters that I need to keep, such as "ă".
> SELECT regexp_replace(p_string, E'[^A-Za-z0-9%_]', '', 'g'));
> This is the request that I have: write a function that eliminates all ASCII characters from 1-255 that are not A-Z,
a-z,0-9, and special characters % and _

I repeat for the last time. YOU ARE NOT USING ASCII. ASCII IS A SEVEN
BIT CODE, 0-128. "ă" IS NOT IN THE ASCII CHARACTER SET.

Having said that I asked which character set you were using, you
didn't bother to answer. This is neccessary to know jow to express the
characters you want to preserve.

We could assume unicode, but from your examples it is not clear
whether you want to preserve all codepoints above 255 or just the
alphabetic ones. I suspect the later. The easier one can be, as I
suggested earlier, adding a range from codepoint 256 to the last one
in your regexp.

Anyway, RTFM. I suggest starting at
https://www.postgresql.org/docs/10/static/functions-matching.html#POSIX-BRACKET-EXPRESSIONS
and following on to see the many ways you have to expresss characters
in a regexp. Read carefully, as the documentation is rather dense, but
if you are not willing to answer simple questions like what is your
character set, or database encoding, it is the most I can do.

Francisco Olarte.

P.S. I have not commented on the rest of the message, directed to Tom,
but seeing you want something equivalent to the functionality of "tr
[-c] -d" I suspect lookahead/behind is overkill ( and from your
message I would assume well beyond your regexp skills ).

Re: Regex Replace with 2 conditions

From

George Neuner

Date:

06 February 2018, 18:46:02

On Mon, 5 Feb 2018 18:22:02 +0100, Francisco Olarte
<folarte@peoplecall.com> wrote:

>I repeat for the last time. YOU ARE NOT USING ASCII. ASCII IS A SEVEN
>BIT CODE, 0-128. "?" IS NOT IN THE ASCII CHARACTER SET.

What ASCII table are you reading?  The question mark symbol is #63. It
lies between the numbers and the capital letter set.

George

Re: Regex Replace with 2 conditions

From

"David G. Johnston"

Date:

06 February 2018, 18:52:45

On Tue, Feb 6, 2018 at 8:46 AM, George Neuner <gneuner2@comcast.net> wrote:

On Mon, 5 Feb 2018 18:22:02 +0100, Francisco Olarte
<folarte@peoplecall.com> wrote:

>I repeat for the last time. YOU ARE NOT USING ASCII. ASCII IS A SEVEN
>BIT CODE, 0-128. "?" IS NOT IN THE ASCII CHARACTER SET.

What ASCII table are you reading? The question mark symbol is #63. It
lies between the numbers and the capital letter set.

Your mail client mangled that sentence - the "?" you are seeing is a placeholder for the non-ASCII character "'a' with a bowl on top of it"...

David J.

Re: Regex Replace with 2 conditions

From

George Neuner

Date:

06 February 2018, 19:13:08

On 2/6/2018 10:52 AM, David G. Johnston wrote:

On Tue, Feb 6, 2018 at 8:46 AM, George Neuner <gneuner2@comcast.net> wrote:
On Mon, 5 Feb 2018 18:22:02 +0100, Francisco Olarte
<folarte@peoplecall.com> wrote:

>I repeat for the last time. YOU ARE NOT USING ASCII. ASCII IS A SEVEN
>BIT CODE, 0-128. "?" IS NOT IN THE ASCII CHARACTER SET.

What ASCII table are you reading? The question mark symbol is #63. It
lies between the numbers and the capital letter set.

Your mail client mangled that sentence - the "?" you are seeing is a placeholder for the non-ASCII character "'a' with a bowl on top of it"...

Thanks David. Apologies to everyone for the noise.

George

Re: Regex Replace with 2 conditions

From

Francisco Olarte

Date:

06 February 2018, 19:57:33

George:

On Tue, Feb 6, 2018 at 4:46 PM, George Neuner <gneuner2@comcast.net> wrote:
> On Mon, 5 Feb 2018 18:22:02 +0100, Francisco Olarte
> <folarte@peoplecall.com> wrote:
>>I repeat for the last time. YOU ARE NOT USING ASCII. ASCII IS A SEVEN
>>BIT CODE, 0-128. "?" IS NOT IN THE ASCII CHARACTER SET.

I made a typo there, 0..127, not 128 ( or [0,128)  ;-) )

> What ASCII table are you reading?  The question mark symbol is #63. It
> lies between the numbers and the capital letter set.

I'm not reading any ascii table, and I did NOT send a question mark.
IIRC I copied an a with something looking like an inverted circumflex
above. I was using gmail in ubuntu in firefox, wihich I think works in
unicode and sends mail in UTF-8, AAMOF I've looked at it and I see:

>>>
From: Francisco Olarte <folarte@peoplecall.com>
To: Denisa Cirstescu <Denisa.Cirstescu@tangoe.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, "pgsql-general@postgresql.org"
<pgsql-general@postgresql.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
<<<

and a little below:

>>>
I repeat for the last time. YOU ARE NOT USING ASCII. ASCII IS A SEVEN
BIT CODE, 0-128. "=C4=83" IS NOT IN THE ASCII CHARACTER SET.
<<<

So, no question mark sent, I suspect your mail chain may be playing
tricks on you, or may be you are translating to 7 bits on purpose
since your mail came with the headers:

>>>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
<<<

I'll suggest you fix that before participating in threads with unicode
content. Also, many programs use ? as a placeholder for something not
in its charset, so always suspect you are not seeing the right char
when you encounter one of this things.

Francisco Olarte.

Re: Regex Replace with 2 conditions

From

George Neuner

Date:

06 February 2018, 20:18:52

On Tue, 6 Feb 2018 17:57:33 +0100, Francisco Olarte
<folarte@peoplecall.com> wrote:

>So, no question mark sent, I suspect your mail chain may be playing
>tricks on you, or may be you are translating to 7 bits on purpose
>since your mail came with the headers:
>
>>>>
>Content-Type: text/plain; charset=us-ascii
>Content-Transfer-Encoding: 7bit
><<<
>
>I'll suggest you fix that before participating in threads with unicode
>content. Also, many programs use ? as a placeholder for something not
>in its charset, so always suspect you are not seeing the right char
>when you encounter one of this things.

Usually I do see unicode characters correctly.

My news client defaults to *sending* in ISO-8859-1 (US acsii), but it
displays incoming messages in UTF-8, and in HTML if applicable ... so
I'm not sure why I'm not seeing whatever it was that you actually
typed.  It does keep coming through as a question mark in all the
responses.

I read this group through the Gmane mail->news reflector ... maybe
that has something to do with it?

George