Thread: atoi-like function: is there a better way to do this?

atoi-like function: is there a better way to do this?

From
Chris Angelico
Date:
One of our tables has a few columns that may be interpreted as strings
or may be numbers (data type is varchar, numbers are stored as
decimal). Generally, operations are performed on the string, but
sometimes we need to parse out a number - without it failing on error.
I wrote the following function to approximate to the semantics of
atoi:

create or replace function str2int(val varchar) returns bigint immutable as $$
begin
    val=substring(val from '[0-9]*');
    if length(val) between 1 and 19 then return val::bigint; end if;
    return 0;
end;
$$ language plpgsql;

It uses a regular expression to chop the string down to just the
leading digits (leading only - '123.45' should parse as 123). Is there
a more efficient way to achieve this?

Thanks!

Chris Angelico

Re: atoi-like function: is there a better way to do this?

From
Chris Angelico
Date:
(Hoping you meant for that reply to be on-list as I'm here responding on-list.)

On Mon, Mar 5, 2012 at 2:16 PM, A.M. <agentm@themactionfaction.com> wrote:
>
> On Mar 4, 2012, at 9:13 PM, Chris Angelico wrote:
>
>> One of our tables has a few columns that may be interpreted as strings
>> or may be numbers (data type is varchar, numbers are stored as
>> decimal). Generally, operations are performed on the string, but
>> sometimes we need to parse out a number - without it failing on error.
>> I wrote the following function to approximate to the semantics of
>> atoi:
>
> I would recommend against such a schema since different data types should warrant their own columns, but if you are
leftwith no choice... 

The values have to be strings for other reasons (eg '' is valid
everywhere, and this is subsequently processed by a script that
expects all strings). So yeah, no choice there. But I agree that
normally you DO want integers stored in integer columns, and we're
paying a performance penalty for this.

>>
>> create or replace function str2int(val varchar) returns bigint immutable as $$
>> begin
>>       val=substring(val from '[0-9]*');
>>       if length(val) between 1 and 19 then return val::bigint; end if;
>>       return 0;
>> end;
>> $$ language plpgsql;
>
> This can be written as:
> select substring('35345345.45645654' from '\d{1,19}')::bigint;
>
> Be aware that this does not account for:
> 3dogs
> 3 dogs
> 3,dogs
> 3.5.6.7
>
> all of which will return 3::bigint, but I assume that the column is not completely free-form or maybe this is what
youwant. (?) 
>
> Cheers,
> M

That is in fact the behaviour I want. Trouble is, that simpler version
returns NULL if given 'asdf' as its input - I want it to return 0.
It's also about the same performance (slightly slower in a quick test)
than the original; it's still doing a regular expression parse. I was
hoping very much to avoid the regex altogether.

ChrisA

Re: atoi-like function: is there a better way to do this?

From
David Johnston
Date:
On Mar 4, 2012, at 22:31, Chris Angelico <rosuav@gmail.com> wrote:

> (Hoping you meant for that reply to be on-list as I'm here responding on-list.)
>
> On Mon, Mar 5, 2012 at 2:16 PM, A.M. <agentm@themactionfaction.com> wrote:
>>
>> On Mar 4, 2012, at 9:13 PM, Chris Angelico wrote:
>>
>>> One of our tables has a few columns that may be interpreted as strings
>>> or may be numbers (data type is varchar, numbers are stored as
>>> decimal). Generally, operations are performed on the string, but
>>> sometimes we need to parse out a number - without it failing on error.
>>> I wrote the following function to approximate to the semantics of
>>> atoi:
>>
>> I would recommend against such a schema since different data types should warrant their own columns, but if you are
leftwith no choice... 
>
> The values have to be strings for other reasons (eg '' is valid
> everywhere, and this is subsequently processed by a script that
> expects all strings). So yeah, no choice there. But I agree that
> normally you DO want integers stored in integer columns, and we're
> paying a performance penalty for this.
>
>>>
>>> create or replace function str2int(val varchar) returns bigint immutable as $$
>>> begin
>>>       val=substring(val from '[0-9]*');
>>>       if length(val) between 1 and 19 then return val::bigint; end if;
>>>       return 0;
>>> end;
>>> $$ language plpgsql;
>>
>> This can be written as:
>> select substring('35345345.45645654' from '\d{1,19}')::bigint;
>>
>> Be aware that this does not account for:
>> 3dogs
>> 3 dogs
>> 3,dogs
>> 3.5.6.7
>>
>> all of which will return 3::bigint, but I assume that the column is not completely free-form or maybe this is what
youwant. (?) 
>>
>> Cheers,
>> M
>
> That is in fact the behaviour I want. Trouble is, that simpler version
> returns NULL if given 'asdf' as its input - I want it to return 0.

COALESCE is your friend


> It's also about the same performance (slightly slower in a quick test)
> than the original; it's still doing a regular expression parse. I was
> hoping very much to avoid the regex altogether.
>
> ChrisA
>

Any efficient, non-RegEx, alternative would require more context to evaluate than you provide.  Mainly, would it be
fasterto have a separate field to store the parsed (at input) number and then query that field directly (even if it is
atext field as well)?  Basically cache the parse. 

David J..



Re: atoi-like function: is there a better way to do this?

From
Chris Angelico
Date:
On Mon, Mar 5, 2012 at 2:50 PM, David Johnston <polobo@yahoo.com> wrote:
> Any efficient, non-RegEx, alternative would require more context to evaluate than you provide.  Mainly, would it be
fasterto have a separate field to store the parsed (at input) number and then query that field directly (even if it is
atext field as well)?  Basically cache the parse. 

Caching's looking tempting, but I don't know if it'll be worth it
(these fields won't be searched-as-int very often compared to
search-as-string, and there's potentially a lot of such fields). All I
need out of it is the leading digits - I can strip them with trim(),
but I can't keep _only_ those digits.

The other possibility that may be of value is to write the function in
C instead of pl/pgsql, which will then actually call atoi() itself. Is
that going to be a better option?

ChrisA

Re: atoi-like function: is there a better way to do this?

From
Tom Molesworth
Date:
On 05/03/12 04:06, Chris Angelico wrote:
> On Mon, Mar 5, 2012 at 2:50 PM, David Johnston<polobo@yahoo.com>  wrote:
>> Any efficient, non-RegEx, alternative would require more context to evaluate than you provide.  Mainly, would it be
fasterto have a separate field to store the parsed (at input) number and then query that field directly (even if it is
atext field as well)?  Basically cache the parse. 
> Caching's looking tempting, but I don't know if it'll be worth it
> (these fields won't be searched-as-int very often compared to
> search-as-string, and there's potentially a lot of such fields). All I
> need out of it is the leading digits - I can strip them with trim(),
> but I can't keep _only_ those digits.
>
> The other possibility that may be of value is to write the function in
> C instead of pl/pgsql, which will then actually call atoi() itself. Is
> that going to be a better option?

Can you use to_number() here? It sounds like something along the lines
of cast(to_number('0' || field::varchar, '999999999.') as int) might
give the behaviour you're after, and a quick test seems to indicate that
it's about 4x faster than the original function:

postgres=# explain analyze select cast(to_number('0' ||
generate_series::varchar, '999999999.') as int) from
generate_series(1,1000000);
                                                          QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------
  Function Scan on generate_series  (cost=0.00..22.50 rows=1000 width=4)
(actual time=137.720..1065.752 rows=1000000 loops=1)
  Total runtime: 1144.993 ms
(2 rows)

postgres=# explain analyze select str2int(generate_series::varchar) from
generate_series(1,1000000);
                                                           QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------
  Function Scan on generate_series  (cost=0.00..265.00 rows=1000
width=4) (actual time=135.180..4022.408 rows=1000000 loops=1)
  Total runtime: 4121.233 ms
(2 rows)

Hopefully there's a cleaner way of writing that without a long list of
9s in the format string, and if the field is nullable I'd guess you
probably need a coalesce(..., 0) around that as well.

cheers,

Tom


Re: atoi-like function: is there a better way to do this?

From
Chris Angelico
Date:
On Mon, Mar 5, 2012 at 3:15 PM, Tom Molesworth <tom@audioboundary.com> wrote:
> Can you use to_number() here? It sounds like something along the lines of
> cast(to_number('0' || field::varchar, '999999999.') as int) might give the
> behaviour you're after, and a quick test seems to indicate that it's about
> 4x faster than the original function:

I looked at to_number but it seems to find digits anywhere inside the
field - to_number('12.34','99999') returns 1234, but I want it to stop
at the decimal.

Nice trick with prepending a zero though - I didn't think of that.
That may save some hassle!

ChrisA

Re: atoi-like function: is there a better way to do this?

From
David Johnston
Date:
On Mar 5, 2012, at 0:08, Chris Angelico <rosuav@gmail.com> wrote:

> On Mon, Mar 5, 2012 at 3:15 PM, Tom Molesworth <tom@audioboundary.com> wrote:
>> Can you use to_number() here? It sounds like something along the lines of
>> cast(to_number('0' || field::varchar, '999999999.') as int) might give the
>> behaviour you're after, and a quick test seems to indicate that it's about
>> 4x faster than the original function:
>
> I looked at to_number but it seems to find digits anywhere inside the
> field - to_number('12.34','99999') returns 1234, but I want it to stop
> at the decimal.
>
> Nice trick with prepending a zero though - I didn't think of that.
> That may save some hassle!
>
> ChrisA
>
>

0alpha999 -> 0
alpha999 -> 999


Re: atoi-like function: is there a better way to do this?

From
Tom Molesworth
Date:
On 05/03/12 05:08, Chris Angelico wrote:
> On Mon, Mar 5, 2012 at 3:15 PM, Tom Molesworth<tom@audioboundary.com>  wrote:
>> Can you use to_number() here? It sounds like something along the lines of
>> cast(to_number('0' || field::varchar, '999999999.') as int) might give the
>> behaviour you're after, and a quick test seems to indicate that it's about
>> 4x faster than the original function:
> I looked at to_number but it seems to find digits anywhere inside the
> field - to_number('12.34','99999') returns 1234, but I want it to stop
> at the decimal.

Right - that's why I included the . in the format string. Haven't done
exhaustive testing but it seemed to do what you were after.

cheers,

Tom