Thread: Regular expression to UPPER() a lower case string
Hi all,
as per the subject, I want a regular expression to do what the UPPER() function does.
Obviously, I know about that function and it is not what I want.
This should be very (very) easy - I don't know what I'm missing - I've done quite complex regular expressions before and I don't know what I'm doing wrong. Brain burping this morning!
Here's a dbfiddle where I've tried a couple of things (https://dbfiddle.uk/G5c_CycU) - but I just can't seem to get it working...
Any input appreciated.
E.
Sent with Proton Mail secure email.
On 2022-12-10 Eagna <eagna@protonmail.com> wrote: > This should be very (very) easy - I don't know what I'm missing - > I've done quite complex regular expressions before and I don't know > what I'm doing wrong. Brain burping this morning! You're missing that: * `regexp_replace` doesn't work like that, at all * your logic only works by accident for some languages (try to upcase a `ß` or a `ı`) -- Dakkar - <Mobilis in mobile> GPG public key fingerprint = A071 E618 DD2C 5901 9574 6FE2 40EA 9883 7519 3F88 key id = 0x75193F88
Ciao and thanks for your input. > * `regexp_replace` doesn't work like that, at all > * your logic only works by accident for some languages (try to upcase > a `ß` or a `ı`) If you have any ideas how it could be done indirectly/different strategy - I'm all ears. You can assume all English characters [a-z][A-Z]. E. > Dakkar - <Mobilis in mobile>
On 2022-12-10 Eagna <eagna@protonmail.com> wrote: > If you have any ideas how it could be done indirectly/different > strategy - I'm all ears. You haven't explained what you're trying to accomplish. -- Dakkar - <Mobilis in mobile> GPG public key fingerprint = A071 E618 DD2C 5901 9574 6FE2 40EA 9883 7519 3F88 key id = 0x75193F88
On 2022-12-10 10:41:41 +0000, Gianni Ceccarelli wrote: > On 2022-12-10 Eagna <eagna@protonmail.com> wrote: > > This should be very (very) easy - I don't know what I'm missing - > > I've done quite complex regular expressions before and I don't know > > what I'm doing wrong. Brain burping this morning! > > You're missing that: > > * `regexp_replace` doesn't work like that, at all No it doesn't. But the equivalent operation in some other languages and tools does, so there is no reason that it couldn't do that (of course neither is that a reason why it should). > * your logic only works by accident for some languages (try to upcase > a `ß` or a `ı`) This is also true of upper() and lower() and SQL does provide those. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
Hi again, and thanks for sticking with this. > You haven't explained what you're trying to accomplish. Ok. CREATE TABLE test(x TEXT); INSERT INTO test VALUES ('abc'); SELECT REGEXP_REPLACE(x, '<something>', '<something_else>', 'g') FROM test; Expected result: ABC See fiddle here: https://dbfiddle.uk/Q2qXXwtF David Johnston suggested something along these lines: ========== > RegExp by itself cannot do this. You have to match all parts of the input into different capturing groups, then use lower()combined with format() to build a new string. Putting the capturing groups into an array is the most useful option. =========== But it's a bit above my pay grade to do this - I've tried, but no go! :-( It *_appears_* to me that the string's length wouldhave to be hard coded under this strategy - but if that's the only way, then so be it. I'd just be interested to see a solution based on DJ's suggestion or any other code that would use REGEXP_REPLACE() to dowhat I want - preferably without hard coding, but if it's absolutely necessary. Thanks for any input. E.
On 2022-12-10 "Peter J. Holzer" <hjp-pgsql@hjp.at> wrote: > > * your logic only works by accident for some languages (try to > > upcase a `ß` or a `ı`) > > This is also true of upper() and lower() and SQL does provide those. Well… > select upper('ı'); ┌───────┐ │ upper │ ├───────┤ │ I │ └───────┘ (1 row) > select upper('ß'); ┌───────┐ │ upper │ ├───────┤ │ ß │ └───────┘ (1 row) > select upper('ä'); ┌───────┐ │ upper │ ├───────┤ │ Ä │ └───────┘ (1 row) Of course all of this is dependent of locale, too. -- Dakkar - <Mobilis in mobile> GPG public key fingerprint = A071 E618 DD2C 5901 9574 6FE2 40EA 9883 7519 3F88 key id = 0x75193F88
On 2022-12-10 Eagna <eagna@protonmail.com> wrote: > Hi again, and thanks for sticking with this. > > You haven't explained what you're trying to accomplish. > > Ok. > > CREATE TABLE test(x TEXT); > > INSERT INTO test VALUES ('abc'); > > SELECT REGEXP_REPLACE(x, '<something>', '<something_else>', 'g') > FROM test; > > Expected result: ABC `select upper(x) from test` You haven't explained why you need regexes at all. If you want to convert the value of a text column to uppercase, there's a function for that. If you want to do something else, please describe the actual thing you want to do. Not "how", but "what". > David Johnston suggested something along these lines: > > ========== > > RegExp by itself cannot do this. You have to match all parts of the > > input into different capturing groups, then use lower() combined > > with format() to build a new string. Putting the capturing groups > > into an array is the most useful option. > =========== > > But it's a bit above my pay grade to do this - I've tried, but no go! > :-( It *_appears_* to me that the string's length would have to be > hard coded under this strategy - but if that's the only way, then so > be it. Maybe you need to look at `regexp_split_to_array` or `regexp_split_to_table`? -- Dakkar - <Mobilis in mobile> GPG public key fingerprint = A071 E618 DD2C 5901 9574 6FE2 40EA 9883 7519 3F88 key id = 0x75193F88
On 2022-12-10 13:44:37 +0000, Gianni Ceccarelli wrote: > On 2022-12-10 "Peter J. Holzer" <hjp-pgsql@hjp.at> wrote: > > > * your logic only works by accident for some languages (try to > > > upcase a `ß` or a `ı`) > > > > This is also true of upper() and lower() and SQL does provide those. > > Well… > > > select upper('ı'); > ┌───────┐ > │ upper │ > ├───────┤ > │ I │ > └───────┘ > (1 row) This is I think universally correct. A better example would be upper('i') which should be 'İ' in Turkish and 'I' in most other languages. > > select upper('ß'); > ┌───────┐ > │ upper │ > ├───────┤ > │ ß │ > └───────┘ > (1 row) This is incorrect according to German spelling rules. It should be either 'SS' (traditionally) or 'ẞ' (since the introduction of the upper-case sharp s). However, given the long absence of the ẞ from official German orthography and the lack of reversability of the ß → SS mapping it has been (and still is) quite common to leave the ß in lower case. > > select upper('ä'); > ┌───────┐ > │ upper │ > ├───────┤ > │ Ä │ > └───────┘ > (1 row) Correct (in German[1] and probably any other language). So, what's the point you are trying to make? > Of course all of this is dependent of locale, too. Right. But why would that be different for regexp_replace than for upper/lower)? hp [1] Although I have one book which uses ä, ö, ü for lower case but Ae, Oe, Ue for upper case letters. -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
> `select upper(x) from test` I know about the UPPER() and LOWER() functions - I don't want them! > If you want to do something else, please describe the actual thing you > want to do. Not "how", but "what". I have described it - I want to do the *_same_* thing as UPPER() does using REGEXP_REPLACE(). In the duplicate thread (apologies to the list for that - my bad!), I wrote: ==================== > I want to index on a REGEXP_REPLACE() - I thought using lower -> upper would be a good test. > I could always have used another REGEXP_REPLACE() for my testing, but I then became "obsessed" with the idea of usingREGEXP_REPLACE() as a substitute for UPPER() - kind of an obfuscated code competition with myself! :-) ======================== So, I have no actual *_need_* for this, other than a desire to learn and understand what's going on and why. > Maybe you need to look at `regexp_split_to_array` or > `regexp_split_to_table`? I'm pretty sure that I have a good understanding of these. I thought I understood REGEXP_REPLACE(), but obviously not. Thanks for any further input. E. > Dakkar - <Mobilis in mobile>
On 2022-12-10 14:36:04 +0000, Eagna wrote: > > I want to index on a REGEXP_REPLACE() - I thought using lower -> upper would be a good test. > > > I could always have used another REGEXP_REPLACE() for my testing, > > but I then became "obsessed" with the idea of using > > REGEXP_REPLACE() as a substitute for UPPER() - kind of an obfuscated > > code competition with myself! :-) > > ======================== > > So, I have no actual *_need_* for this, other than a desire to learn > and understand what's going on and why. You can't do that. Well, theoretically you could replace every individual lower case letter with it's upper case equivalent: select regexp_replace(...regexp_replace(regexp_replace(s, 'a', 'A'), 'b', 'B')... 'z', 'Z') ... but that would be insane even for the 26 letters of the basic Latin alphabet, much less the myriad of accented letters (and other alphabets like Cyrillic or Greek ...). On second thought you could probably use NFD normalization to separate base letters from accents, uppercase the base letters and then (optionally) NFC normalize everything again. Still insane ;-). hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
On 2022-12-10 "Peter J. Holzer" <hjp-pgsql@hjp.at> wrote: > So, what's the point you are trying to make? I suspect I mis-understood a thing you wrote… ignore me. > > Of course all of this is dependent of locale, too. > Right. But why would that be different for regexp_replace than for > upper/lower? I was trying to point out (badly) that the code shown, based on a simplistic application of `regex_replace`, is going to work even worse than `upper` (as you point out elsethread). -- Dakkar - <Mobilis in mobile> GPG public key fingerprint = A071 E618 DD2C 5901 9574 6FE2 40EA 9883 7519 3F88 key id = 0x75193F88
Hi, and thanks for all of the input - I think I'm beginning to grok it. > On second thought you could probably use NFD normalization to separate > base letters from accents, uppercase the base letters and then > (optionally) NFC normalize everything again. Still insane ;-). As far as I can see, I can only do what I want by using the following. If there's a monthly prize on this list for the most insanely contrived piece of code, I think this is a strong contender: test.x = 'abc' SELECT UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\1')) || UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\2')) || UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\3')) FROM test; Result: ABC (see https://dbfiddle.uk/LQ-6udga). Still a bit of work to do, but I'm getting there :-) E. > hp
On 2022-12-10 15:37:01 +0000, Eagna wrote: > Hi, and thanks for all of the input - I think I'm beginning to grok it. I'm not so sure about that > As far as I can see, I can only do what I want by using the following. > > If there's a monthly prize on this list for the most insanely > contrived piece of code, I think this is a strong contender: > > test.x = 'abc' > > SELECT > UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\1')) || > UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\2')) || > UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\3')) > FROM test; > > Result: ABC I don't think this does what you want it to do: wds=> SELECT x, UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\1')) || UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\2')) || UPPER(REGEXP_REPLACE(x, '(.)(.)(.)', '\3')) FROM test; ╔═════════════╤═════════════════════════════╗ ║ x │ ?column? ║ ╟─────────────┼─────────────────────────────╢ ║ abc_def_ghi │ A_DEF_GHIB_DEF_GHIC_DEF_GHI ║ ╚═════════════╧═════════════════════════════╝ (I admit it took me at least a minute to figure out what was happening here) hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
Attachment
On 2022-12-10 15:48:58 +0100, Peter J. Holzer wrote: > On second thought you could probably use NFD normalization to separate > base letters from accents, uppercase the base letters and then > (optionally) NFC normalize everything again. Of course I had to try that: wds=> select normalize( replace( replace( replace( replace( normalize('Käse', NFD), 's', 'S' ), 'k', 'K' ), 'e', 'E' ), 'a', 'A' ), NFC ) ; ╔═══════════╗ ║ normalize ║ ╟───────────╢ ║ KÄSE ║ ╚═══════════╝ (1 row) Works as expected. > Still insane ;-). I haven't changed my mind about that. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | hjp@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"