Re: SELECT substring with regex - Mailing list pgsql-sql

From T E Schmitz
Subject Re: SELECT substring with regex
Date
Msg-id 44AEC8C5.80103@numerixtechnology.de
Whole thread Raw
In response to Re: SELECT substring with regex  (Erik Jones <erik@myemma.com>)
Responses Re: SELECT substring with regex  (Richard Broersma Jr <rabroersma@yahoo.com>)
Re: SELECT substring with regex  ("Aaron Bono" <postgresql@aranya.com>)
Re: SELECT substring with regex  (Emils <gnudiff@gmail.com>)
Re: SELECT substring with regex  (Alvaro Herrera <alvherre@commandprompt.com>)
List pgsql-sql
Erik Jones wrote:
> T E Schmitz wrote:
> 
>> Gary Stainburn wrote:
>>
>>> On Friday 07 July 2006 14:51, T E Schmitz wrote:
>>>
>>>> I would like to split the contents of a column using substring with a
>>>> regular expression:
>>>>
>>>> The column contains something like
>>>> "150mm LD AD Asp XR Macro"
>>>> I want to split this into
>>>> "150mm", "LD AD Asp XR Macro"
>>>>
>>>
>>>
>>> select substring('150mm LD AD Asp XR Macro','^\\d+mm') as BASE_NAME,
>>>        substring('150mm LD AD Asp XR Macro','^\\d+mm (.*)$') as SUFFIX;
>>>  base_name |       suffix
>>> -----------+--------------------
>>>  150mm     | LD AD Asp XR Macro
>>> (1 row)
>>>
>>> The brackets surround the required match
>> But that takes me to the next problem:
>>
>> For the sake of the example I simplified the regular pattern.
>> In reality, BASE_NAME might be:
>>
>> 28mm
>> 28-70mm
>>
>> So the reg. expr. requires brackets:
>>
>> substring (NAME, '^(\\d+(-\\d+)?mm)' ) as BASE_NAME
>>
> Will the mm always be the end of the base name?

I had thought it best to simplify the problem for the purposes of the
mailing list but maybe I should supply the complete problem and describe
the purpose of the exercise:

I am trying to come up with a semi-automatic solution to tidy up some
data. If it's got to be done manually via the GUI it would mean a lot of
dummy work [for the customer].

First of all I did a 5 table join to select those NAMEs which don't
follow the required pattern: the pattern describes a photographic lens
(focal length followed by lens speed (aperture)) and nothing else.
Unfortuantely, there are a few hundred occurences where a few attributes
have been appended which should have been stored elsewhere.


valid entries would be:
"28mm F2.8" (prime lens)
"30-70mm F4" (zoom lens)
"30-70mm F2.8" (zoom lens)
"30-100mm F4.5-5.6" (zoom lens with variable speed)


In the WHERE clause I have specified all those NAMEs, which follow that
pattern but have some gubbins appended:

WHERE NAME ~
'^((\\d+(-\\d+)?)mm F((\\d+(\.\\d+)?)+(\.(\\d+(\.\\d+)?)+)?))\\D+$'


which gives me a listing of those candidates that need to be amended -
manually or otherwise.

Next, I wanted to produce a result which splits NAME into what it should
be (BASE) and attributes (SUFFIX). Maybe I can generate some SQL from
that to tidy up the data.

-- 


Regards,

Tarlika Elisabeth Schmitz



pgsql-sql by date:

Previous
From: "Jim Buttafuoco"
Date:
Subject: Re: SELECT substring with regex
Next
From: "David Clarke"
Date:
Subject: Re: Alternative to serial primary key