Thread: Funtion to clean up strings?

Funtion to clean up strings?

From
Andreas
Date:
Hi,

I need a function that removes characters in strings that aren't in a 
given set of chars.
e.g
Input:    12-34/ 56(8)
I want just numbers so Output should in this case be:  1234568

Is there a solution?



Re: Funtion to clean up strings?

From
"Dickson S. Guedes"
Date:
2009/2/12 Andreas <maps.on@gmx.net>:
> Hi,
>
> I need a function that removes characters in strings that aren't in a given
> set of chars.
> e.g
> Input:    12-34/ 56(8)
> I want just numbers so Output should in this case be:  1234568
>
> Is there a solution?

One is:

SELECT regexp_replace('12-34/ 56(8)',E'[^0-9]','','g');

-- 
Dickson S. Guedes
-
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://planeta.postgresql.org.br


Re: Funtion to clean up strings?

From
Andreas
Date:
Wow, that was fast
THANKS   :)

now ... lets get more complicated.
Phone numbers are entered:
0123/4567-89          national number
0049/123/4567-89         the same number
+49/123/4567-89         still the same number

should come out as   0123456789  to search in this column.
"0049" and "+49"  --> 0

while international numbers
+33/123456789
0033/123456789

should come as
+33123456789

I fear the hard part is that international codes have 1-3 digits   :(


Regards
Andreas


ivan Stoykov wrote:
> regexp_replace('120323423 23(fdf_)fd','[^[:digit:]]','','gi')
>
> Andreas wrote:
>   
>> Hi,
>>
>> I need a function that removes characters in strings that aren't in a
>> given set of chars.
>> e.g
>> Input:    12-34/ 56(8)
>> I want just numbers so Output should in this case be:  1234568
>>
>> Is there a solution?


Re: Funtion to clean up strings?

From
Raj Mathur
Date:
On Friday 13 Feb 2009, Andreas wrote:
> now ... lets get more complicated.
> Phone numbers are entered:
> 0123/4567-89          national number
> 0049/123/4567-89         the same number
> +49/123/4567-89         still the same number
>
> should come out as   0123456789  to search in this column.
> "0049" and "+49"  --> 0
>
> while international numbers
> +33/123456789
> 0033/123456789
>
> should come as
> +33123456789

TEST=> create table foo(p text);

TEST=> insert into foo (select regexp_split_to_table('0123/4567-89 
0049/123/4567-89 +49/123/4567-89 +33/123456789 0033/123456789',' '));

TEST=> select * from foo;       p
------------------0123/4567-890049/123/4567-89+49/123/4567-89+33/1234567890033/123456789
(5 rows)

TEST=> select
(case when p ~ E'^(\\+|00)49'   then '0'||regexp_replace(regexp_replace(p, E'[^0-9+]', '', 'g'), 
E'^(?:\\+|00)49(.*)', E'\\1') when p ~ E'^(\\+|00)'   then '+'||regexp_replace(regexp_replace(p, E'[^0-9+]', '', 'g'),

E'^(?:\\+||00)(.*)', E'\\1') else   regexp_replace(p, E'[^0-9]', '', 'g')
end)
from foo;regexp_replace
----------------012345678901234567890123456789+33123456789+33123456789
(5 rows)

That do what you want?  (Apologies for the wrapped lines.)

Regards,

-- Raju
-- 
Raj Mathur                raju@kandalaya.org      http://kandalaya.org/      GPG: 78D4 FC67 367F 40E2 0DD5  0FEF C968
D0EFCC68 D17F
 
PsyTrance & Chill: http://schizoid.in/   ||   It is the mind that moves


Re: Funtion to clean up strings?

From
Andreas
Date:
Hi
Thanks, that really works   :)

Now a last extension.
Some numbers were entered in a "110% perfect" way with an excessive (0).  
+49 (0) 123 / 456 789
I have to suspect the source liked to express that it's either +49  or  
0 if the +49 isn't applicable, but not both.
Both together are semantically wrong and your function results therefore 
to "00123456789".
Correct was "0123456789" or e.g. "+33123456789" if it were an 
international number.

This (0) should be silently dropped as long as the endresult has at 
least one 0 or + like in the allready covered cases.

I tried to use this RegEx magic myself as far as I could figure it out, 
yet and came up with replacing every p in your solution with another regex

case when regexp_replace(p, E'[^0-9+]', '', 'g') ~ E'^(\\+|00)49'   then '0'||      regexp_replace(
regexp_replace(           regexp_replace(p, E'[^0-9+()]', '', 'g')         , '\\(0\\)||\\(||\\)', '', 'g')      ,
E'^(?:\\+|00)49(.*)',E'\\1') when regexp_replace(p, E'[^0-9+]', '', 'g') ~ E'^(\\+|00)'   then '+'||
regexp_replace(           regexp_replace(               regexp_replace(p, E'[^0-9+()]', '', 'g')            ,
'\\(0\\)||\\(||\\)','', 'g')      , E'^(?:\\+||00)(.*)', E'\\1') else   regexp_replace(p, E'[^0-9]', '', 'g')
 
end

That would catch the leading spaces in "  00 49 ( 0 ) 1 2 3 456 -0", too.
Creating a sql-function thows a WARNING:  nonstandard use of \\ in a 
string literal
but it still works. Do you know a better or more correct way to reach 
the same?

Perhaps one could find a way with less calls to regexp_replace ?

Regards
Andreas  :)


Raj Mathur wrote:
> On Friday 13 Feb 2009, Andreas wrote:
>   
>> now ... lets get more complicated.
>> Phone numbers are entered:
>> 0123/4567-89          national number
>> 0049/123/4567-89         the same number
>> +49/123/4567-89         still the same number
>>
>> should come out as   0123456789  to search in this column.
>> "0049" and "+49"  --> 0
>>
>> while international numbers
>> +33/123456789
>> 0033/123456789
>>
>> should come as
>> +33123456789
>>     
>
> TEST=> create table foo(p text);
>
> TEST=> insert into foo (select regexp_split_to_table('0123/4567-89 
> 0049/123/4567-89 +49/123/4567-89 +33/123456789 0033/123456789',' '));
>
> TEST=> select * from foo;
>         p
> ------------------
>  0123/4567-89
>  0049/123/4567-89
>  +49/123/4567-89
>  +33/123456789
>  0033/123456789
> (5 rows)
>
> TEST=> select
> (case
>   when p ~ E'^(\\+|00)49'
>     then '0'||regexp_replace(regexp_replace(p, E'[^0-9+]', '', 'g'), 
> E'^(?:\\+|00)49(.*)', E'\\1')
>   when p ~ E'^(\\+|00)'
>     then '+'||regexp_replace(regexp_replace(p, E'[^0-9+]', '', 'g'), 
> E'^(?:\\+||00)(.*)', E'\\1')
>   else
>     regexp_replace(p, E'[^0-9]', '', 'g')
> end)
> from foo;
>  regexp_replace
> ----------------
>  0123456789
>  0123456789
>  0123456789
>  +33123456789
>  +33123456789
> (5 rows)
>
> That do what you want?  (Apologies for the wrapped lines.)
>
> Regards,
>
> -- Raju
>   



Re: Funtion to clean up strings?

From
Raj Mathur
Date:
On Tuesday 17 Feb 2009, Andreas wrote:
> [snip]
> case
>   when regexp_replace(p, E'[^0-9+]', '', 'g') ~ E'^(\\+|00)49'
>     then '0'||
>        regexp_replace(
>           regexp_replace(
>              regexp_replace(p, E'[^0-9+()]', '', 'g')
>           , '\\(0\\)||\\(||\\)', '', 'g')
>        , E'^(?:\\+|00)49(.*)', E'\\1')
>   when regexp_replace(p, E'[^0-9+]', '', 'g') ~ E'^(\\+|00)'
>     then '+'||
>        regexp_replace(
>              regexp_replace(
>                 regexp_replace(p, E'[^0-9+()]', '', 'g')
>              , '\\(0\\)||\\(||\\)', '', 'g')
>        , E'^(?:\\+||00)(.*)', E'\\1')
>   else
>     regexp_replace(p, E'[^0-9]', '', 'g')
> end
>
> That would catch the leading spaces in "  00 49 ( 0 ) 1 2 3 456 -0",
> too. Creating a sql-function thows a WARNING:  nonstandard use of \\
> in a string literal
> but it still works. Do you know a better or more correct way to reach
> the same?
>
> Perhaps one could find a way with less calls to regexp_replace ?

That is what I would have tried too :)  The only improvement I can think 
of is to replace one instance of regex_replace with a string replace, 
since the string (0) is fixed.

On the other hand, I'm not an expert at Pg functions by any means, so 
someone else may have a faster or more elegant solution.

Regards,

-- Raju
-- 
Raj Mathur                raju@kandalaya.org      http://kandalaya.org/      GPG: 78D4 FC67 367F 40E2 0DD5  0FEF C968
D0EFCC68 D17F
 
PsyTrance & Chill: http://schizoid.in/   ||   It is the mind that moves


Re: Funtion to clean up strings?

From
Andreas
Date:
string_replace would only replace one searchstring at a time.
In this case I need to replace 3 :   (0)      (     )
because there could be some braces not just as (0) since the innermost 
replace spares braces regardless where they are.

Could one express the following in one expression
1)  remove all spaces
2)  remove everything but  0-9+  or  the first occurance  (0)

Leading spaces had to be removed anyway because the distract the CASE 
... WHEN when it checks for "+49..." but gets "     +49...".

Then I could use string_replace at the end.

Do you know the correct way to code the backslashes to avoid the Warning?


Regards
Andreas


Raj Mathur schrieb:
> On Tuesday 17 Feb 2009, Andreas wrote:
>   
>> [snip]
>> case
>>   when regexp_replace(p, E'[^0-9+]', '', 'g') ~ E'^(\\+|00)49'
>>     then '0'||
>>        regexp_replace(
>>           regexp_replace(
>>              regexp_replace(p, E'[^0-9+()]', '', 'g')
>>           , '\\(0\\)||\\(||\\)', '', 'g')
>>        , E'^(?:\\+|00)49(.*)', E'\\1')
>>   when regexp_replace(p, E'[^0-9+]', '', 'g') ~ E'^(\\+|00)'
>>     then '+'||
>>        regexp_replace(
>>              regexp_replace(
>>                 regexp_replace(p, E'[^0-9+()]', '', 'g')
>>              , '\\(0\\)||\\(||\\)', '', 'g')
>>        , E'^(?:\\+||00)(.*)', E'\\1')
>>   else
>>     regexp_replace(p, E'[^0-9]', '', 'g')
>> end
>>
>> That would catch the leading spaces in "  00 49 ( 0 ) 1 2 3 456 -0",
>> too. Creating a sql-function thows a WARNING:  nonstandard use of \\
>> in a string literal
>> but it still works. Do you know a better or more correct way to reach
>> the same?
>>
>> Perhaps one could find a way with less calls to regexp_replace ?
>>     
>
> That is what I would have tried too :)  The only improvement I can think 
> of is to replace one instance of regex_replace with a string replace, 
> since the string (0) is fixed.
>
> On the other hand, I'm not an expert at Pg functions by any means, so 
> someone else may have a faster or more elegant solution.
>
> Regards,
>
> -- Raju
>   



Re: Funtion to clean up strings?

From
Gregory Stark
Date:
Andreas <maps.on@gmx.net> writes:

> Hi
> Thanks, that really works   :)
>
> Now a last extension.
> Some numbers were entered in a "110% perfect" way with an excessive (0).  +49
> (0) 123 / 456 789
> I have to suspect the source liked to express that it's either +49  or  0 if
> the +49 isn't applicable, but not both.

This is the standard format for phone numbers. Parenthesized digits -- as you
suspected -- represent digits which must only be dialled when using the number
locally and must be omitted from outside.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about
EnterpriseDB'sPostgreSQL training!