Thread: [PATCH] avoid double scanning in function byteain

[PATCH] avoid double scanning in function byteain

From
Steven Niu
Date:
Hi,

The byteain function converts a string input into a bytea type.
The original implementation processes two input formats:
a hex format (starting with \x) and a traditional escaped format.
For the escaped format, the function scans the input string twice
— once to calculate the exact size of the output and allocate memory,
and again to fill the allocated memory with the parsed data.

This double scanning can be inefficient, especially for large inputs.
So I optimized the function to eliminate the need for two scans,
while preserving correctness and efficiency.

Please help review it and share your valuable comments.

Thanks,
Steven Niu
https://www.highgo.com/
Attachment

Re: [PATCH] avoid double scanning in function byteain

From
Kirill Reshke
Date:
On Wed, 26 Mar 2025 at 12:17, Steven Niu <niushiji@gmail.com> wrote:
>
> Hi,

Hi!

> This double scanning can be inefficient, especially for large inputs.
> So I optimized the function to eliminate the need for two scans,
> while preserving correctness and efficiency.

While the argument that processing input once not twice is fast is
generally true, may we have some simple bench here just to have an
idea how valuable this patch is?


Patch:


>+ /* Handle traditional escaped style in a single pass */
>+ input_len = strlen(inputText);
>+ result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
>  rp = VARDATA(result);
>+ tp = inputText;
>+
>  while (*tp != '\0')


Isn't this `strlen` O(n) + `while` O(n)? Where is the speed up?



[0] https://github.com/bminor/glibc/blob/master/string/strlen.c#L43-L45

-- 
Best regards,
Kirill Reshke



Re: [PATCH] avoid double scanning in function byteain

From
Steven Niu
Date:
在 2025/3/26 16:37, Kirill Reshke 写道:
> On Wed, 26 Mar 2025 at 12:17, Steven Niu <niushiji@gmail.com> wrote:
>>
>> Hi,
> 
> Hi!
> 
>> This double scanning can be inefficient, especially for large inputs.
>> So I optimized the function to eliminate the need for two scans,
>> while preserving correctness and efficiency.
> 
> While the argument that processing input once not twice is fast is
> generally true, may we have some simple bench here just to have an
> idea how valuable this patch is?
> 
> 
> Patch:
> 
> 
>> + /* Handle traditional escaped style in a single pass */
>> + input_len = strlen(inputText);
>> + result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
>>   rp = VARDATA(result);
>> + tp = inputText;
>> +
>>   while (*tp != '\0')
> 
> 
> Isn't this `strlen` O(n) + `while` O(n)? Where is the speed up?
> 
> 
> 
> [0] https://github.com/bminor/glibc/blob/master/string/strlen.c#L43-L45
> 



Hi, Kirill,

Your deep insight suprised me!

Yes, you are correct that strlen() actually performed a loop operation.
So maybe the performance difference is not so obvious.

However, there are some other reasons that drive me to make this change.

1. The author of original code left comment: "BUGS: The input is scanned 
twice." .
You can find this line of code in my patch. This indicates a left work 
to be done.

2. If I were the author of this function, I would not be satisfied with 
myself that I used two loops to do something which actually can be done 
with one loop.
I prefer to choose a way that would not add more burden to readers.

3. The while (*tp != '\0') loop has some unnecessary codes and I made 
some change.

Thanks,
Steven