Thread: Regular expression to UPPER() a lower case string

Regular expression to UPPER() a lower case string

From
Eagna
Date:

Hi all,

as per the subject, I want a regular expression to do what the UPPER() function does.

Obviously, I know about that function and it is not what I want.

This should be very (very) easy - I don't know what I'm missing - I've done quite complex regular expressions before and I don't know what I'm doing wrong. Brain burping this morning!

Here's a dbfiddle where I've tried a couple of things (https://dbfiddle.uk/G5c_CycU) - but I just can't seem to get it working...

Any input appreciated.


E.



Sent with Proton Mail secure email.

Re: Regular expression to UPPER() a lower case string

From
Gianni Ceccarelli
Date:
On 2022-12-10 Eagna <eagna@protonmail.com> wrote:
> This should be very (very) easy - I don't know what I'm missing -
> I've done quite complex regular expressions before and I don't know
> what I'm doing wrong. Brain burping this morning!

You're missing that:

* `regexp_replace` doesn't work like that, at all
* your logic only works by accident for some languages (try to upcase
  a `ß` or a `ı`)

--
    Dakkar - <Mobilis in mobile>
    GPG public key fingerprint = A071 E618 DD2C 5901 9574
                                 6FE2 40EA 9883 7519 3F88
                        key id = 0x75193F88




Re: Regular expression to UPPER() a lower case string

From
Eagna
Date:

Ciao and thanks for your input.


> * `regexp_replace` doesn't work like that, at all
> * your logic only works by accident for some languages (try to upcase
> a `ß` or a `ı`)

If you have any ideas how it could be done indirectly/different strategy - I'm all ears.

You can assume all English characters [a-z][A-Z].

E.


> Dakkar - <Mobilis in mobile>




Re: Regular expression to UPPER() a lower case string

From
Gianni Ceccarelli
Date:
On 2022-12-10 Eagna <eagna@protonmail.com> wrote:
> If you have any ideas how it could be done indirectly/different
> strategy - I'm all ears.

You haven't explained what you're trying to accomplish.

-- 
    Dakkar - <Mobilis in mobile>
    GPG public key fingerprint = A071 E618 DD2C 5901 9574
                                 6FE2 40EA 9883 7519 3F88
                        key id = 0x75193F88




Re: Regular expression to UPPER() a lower case string

From
"Peter J. Holzer"
Date:
On 2022-12-10 10:41:41 +0000, Gianni Ceccarelli wrote:
> On 2022-12-10 Eagna <eagna@protonmail.com> wrote:
> > This should be very (very) easy - I don't know what I'm missing -
> > I've done quite complex regular expressions before and I don't know
> > what I'm doing wrong. Brain burping this morning!
>
> You're missing that:
>
> * `regexp_replace` doesn't work like that, at all

No it doesn't. But the equivalent operation in some other languages and
tools does, so there is no reason that it couldn't do that (of course
neither is that a reason why it should).

> * your logic only works by accident for some languages (try to upcase
>   a `ß` or a `ı`)

This is also true of upper() and lower() and SQL does provide those.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

Re: Regular expression to UPPER() a lower case string

From
Eagna
Date:

Hi again, and thanks for sticking with this.

> You haven't explained what you're trying to accomplish.

Ok.

CREATE TABLE test(x TEXT);

INSERT INTO test VALUES ('abc');

SELECT   REGEXP_REPLACE(x, '<something>', '<something_else>', 'g')  FROM test;

Expected result: ABC

See fiddle here: https://dbfiddle.uk/Q2qXXwtF

David Johnston suggested something along these lines:


==========
> RegExp by itself cannot do this. You have to match all parts of the input into different capturing groups, then use
lower()combined with format() to build a new string. Putting the capturing groups into an array is the most useful
option.
===========

But it's a bit above my pay grade to do this - I've tried, but no go! :-( It *_appears_* to me that the string's length
wouldhave to be hard coded under this strategy - but if that's the only way, then so be it. 


I'd just be interested to see a solution based on DJ's suggestion or any other code that would use REGEXP_REPLACE() to
dowhat I want - preferably without hard coding, but if it's absolutely necessary. 

Thanks for any input.

E.






Re: Regular expression to UPPER() a lower case string

From
Gianni Ceccarelli
Date:
On 2022-12-10 "Peter J. Holzer" <hjp-pgsql@hjp.at> wrote:
> > * your logic only works by accident for some languages (try to
> > upcase a `ß` or a `ı`)
>
> This is also true of upper() and lower() and SQL does provide those.

Well…

> select upper('ı');
┌───────┐
│ upper │
├───────┤
│ I     │
└───────┘
(1 row)

> select upper('ß');
┌───────┐
│ upper │
├───────┤
│ ß     │
└───────┘
(1 row)

> select upper('ä');
┌───────┐
│ upper │
├───────┤
│ Ä     │
└───────┘
(1 row)

Of course all of this is dependent of locale, too.

--
    Dakkar - <Mobilis in mobile>
    GPG public key fingerprint = A071 E618 DD2C 5901 9574
                                 6FE2 40EA 9883 7519 3F88
                        key id = 0x75193F88




Re: Regular expression to UPPER() a lower case string

From
Gianni Ceccarelli
Date:
On 2022-12-10 Eagna <eagna@protonmail.com> wrote:
> Hi again, and thanks for sticking with this.
> > You haven't explained what you're trying to accomplish.  
> 
> Ok.
> 
> CREATE TABLE test(x TEXT); 
> 
> INSERT INTO test VALUES ('abc');
> 
> SELECT   REGEXP_REPLACE(x, '<something>', '<something_else>', 'g')
> FROM test;
> 
> Expected result: ABC

`select upper(x) from test`

You haven't explained why you need regexes at all. If you want to
convert the value of a text column to uppercase, there's a function
for that.

If you want to do something else, please describe the actual thing you
want to do. Not "how", but "what".

> David Johnston suggested something along these lines:
> 
> ==========
> > RegExp by itself cannot do this. You have to match all parts of the
> > input into different capturing groups, then use lower() combined
> > with format() to build a new string. Putting the capturing groups
> > into an array is the most useful option.  
> ===========
> 
> But it's a bit above my pay grade to do this - I've tried, but no go!
> :-( It *_appears_* to me that the string's length would have to be
> hard coded under this strategy - but if that's the only way, then so
> be it.

Maybe you need to look at `regexp_split_to_array` or
`regexp_split_to_table`?

-- 
    Dakkar - <Mobilis in mobile>
    GPG public key fingerprint = A071 E618 DD2C 5901 9574
                                 6FE2 40EA 9883 7519 3F88
                        key id = 0x75193F88




Re: Regular expression to UPPER() a lower case string

From
"Peter J. Holzer"
Date:
On 2022-12-10 13:44:37 +0000, Gianni Ceccarelli wrote:
> On 2022-12-10 "Peter J. Holzer" <hjp-pgsql@hjp.at> wrote:
> > > * your logic only works by accident for some languages (try to
> > > upcase a `ß` or a `ı`)
> >
> > This is also true of upper() and lower() and SQL does provide those.
>
> Well…
>
> > select upper('ı');
> ┌───────┐
> │ upper │
> ├───────┤
> │ I     │
> └───────┘
> (1 row)

This is I think universally correct. A better example would be
upper('i') which should be 'İ' in Turkish and 'I' in most other
languages.


> > select upper('ß');
> ┌───────┐
> │ upper │
> ├───────┤
> │ ß     │
> └───────┘
> (1 row)

This is incorrect according to German spelling rules. It should be
either 'SS' (traditionally) or 'ẞ' (since the introduction of the
upper-case sharp s). However, given the long absence of the ẞ from
official German orthography and the lack of reversability of the ß → SS
mapping it has been (and still is) quite common to leave the ß in lower
case.


> > select upper('ä');
> ┌───────┐
> │ upper │
> ├───────┤
> │ Ä     │
> └───────┘
> (1 row)

Correct (in German[1] and probably any other language).


So, what's the point you are trying to make?


> Of course all of this is dependent of locale, too.

Right. But why would that be different for regexp_replace than for
upper/lower)?

        hp

[1] Although I have one book which uses ä, ö, ü for lower case but Ae,
    Oe, Ue for upper case letters.

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

Re: Regular expression to UPPER() a lower case string

From
Eagna
Date:
> `select upper(x) from test`

I know about the UPPER() and LOWER() functions - I don't want them!

> If you want to do something else, please describe the actual thing you
> want to do. Not "how", but "what".


I have described it - I want to do the *_same_* thing as UPPER() does using REGEXP_REPLACE().

In the duplicate thread (apologies to the list for that - my bad!), I wrote:


====================

> I want to index on a REGEXP_REPLACE() - I thought using lower -> upper would be a good test.

> I could always have used another REGEXP_REPLACE() for my testing, but I then became "obsessed"  with the idea of
usingREGEXP_REPLACE() as a substitute for UPPER() - kind of an obfuscated code competition with myself! :-) 

========================

So, I have no actual *_need_* for this, other than a desire to learn and understand what's going on and why.

> Maybe you need to look at `regexp_split_to_array` or
> `regexp_split_to_table`?

I'm pretty sure that I have a good understanding of these. I thought I understood REGEXP_REPLACE(), but obviously not.

Thanks for any further input.

E.


> Dakkar - <Mobilis in mobile>




Re: Regular expression to UPPER() a lower case string

From
"Peter J. Holzer"
Date:
On 2022-12-10 14:36:04 +0000, Eagna wrote:
> > I want to index on a REGEXP_REPLACE() - I thought using lower -> upper would be a good test.
>
> > I could always have used another REGEXP_REPLACE() for my testing,
> > but I then became "obsessed"  with the idea of using
> > REGEXP_REPLACE() as a substitute for UPPER() - kind of an obfuscated
> > code competition with myself! :-)
>
> ========================
>
> So, I have no actual *_need_* for this, other than a desire to learn
> and understand what's going on and why.

You can't do that. Well, theoretically you could replace every
individual lower case letter with it's upper case equivalent:

select regexp_replace(...regexp_replace(regexp_replace(s, 'a', 'A'), 'b', 'B')... 'z', 'Z') ...

but that would be insane even for the 26 letters of the basic Latin
alphabet, much less the myriad of accented letters (and other alphabets
like Cyrillic or Greek ...).

On second thought you could probably use NFD normalization to separate
base letters from accents, uppercase the base letters and then
(optionally) NFC normalize everything again. Still insane ;-).

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

Re: Regular expression to UPPER() a lower case string

From
Gianni Ceccarelli
Date:
On 2022-12-10 "Peter J. Holzer" <hjp-pgsql@hjp.at> wrote:
> So, what's the point you are trying to make?

I suspect I mis-understood a thing you wrote… ignore me.

> > Of course all of this is dependent of locale, too.
> Right. But why would that be different for regexp_replace than for
> upper/lower?

I was trying to point out (badly) that the code shown, based
on a simplistic application of `regex_replace`, is going to work
even worse than `upper` (as you point out elsethread).

--
    Dakkar - <Mobilis in mobile>
    GPG public key fingerprint = A071 E618 DD2C 5901 9574
                                 6FE2 40EA 9883 7519 3F88
                        key id = 0x75193F88




Re: Regular expression to UPPER() a lower case string

From
Eagna
Date:
Hi, and thanks for all of the input - I think I'm beginning to grok it.

> On second thought you could probably use NFD normalization to separate
> base letters from accents, uppercase the base letters and then
> (optionally) NFC normalize everything again. Still insane ;-).

As far as I can see, I can only do what I want by using the following.

If there's a monthly prize on this list for the most insanely contrived piece of code, I think this is a strong
contender:

test.x = 'abc'

SELECT
  UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\1')) ||
  UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\2')) ||
  UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\3'))
FROM test;

Result: ABC

(see https://dbfiddle.uk/LQ-6udga).

Still a bit of work to do, but I'm getting there :-)

E.


> hp




Re: Regular expression to UPPER() a lower case string

From
"Peter J. Holzer"
Date:
On 2022-12-10 15:37:01 +0000, Eagna wrote:
> Hi, and thanks for all of the input - I think I'm beginning to grok it.

I'm not so sure about that

> As far as I can see, I can only do what I want by using the following.
>
> If there's a monthly prize on this list for the most insanely
> contrived piece of code, I think this is a strong contender:
>
> test.x = 'abc'
>
> SELECT
>   UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\1')) ||
>   UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\2')) ||
>   UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\3'))
> FROM test;
>
> Result: ABC

I don't think this does what you want it to do:

wds=> SELECT x,
  UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\1')) ||
  UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\2')) ||
  UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\3'))
FROM test;
╔═════════════╤═════════════════════════════╗
║      x      │          ?column?           ║
╟─────────────┼─────────────────────────────╢
║ abc_def_ghi │ A_DEF_GHIB_DEF_GHIC_DEF_GHI ║
╚═════════════╧═════════════════════════════╝

(I admit it took me at least a minute to figure out what was happening
here)

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment

Re: Regular expression to UPPER() a lower case string

From
"Peter J. Holzer"
Date:
On 2022-12-10 15:48:58 +0100, Peter J. Holzer wrote:
> On second thought you could probably use NFD normalization to separate
> base letters from accents, uppercase the base letters and then
> (optionally) NFC normalize everything again.

Of course I had to try that:

wds=> select
    normalize(
        replace(
            replace(
                replace(
                    replace(
                        normalize('Käse', NFD),
                        's', 'S'
                    ),
                    'k', 'K'
                ),
                'e', 'E'
            ),
            'a', 'A'
        ),
    NFC
    )
;
╔═══════════╗
║ normalize ║
╟───────────╢
║ KÄSE      ║
╚═══════════╝
(1 row)

Works as expected.

> Still insane ;-).

I haven't changed my mind about that.

        hp


--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"

Attachment