Re: [PATCH] avoid double scanning in function byteain - Mailing list pgsql-hackers

From Stepan Neretin
Subject Re: [PATCH] avoid double scanning in function byteain
Date
Msg-id CA+Yyo5TK-1iTEKT=VGwawTZ5bX9cjuhDcgp4z0GhFwDSG7SdJA@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] avoid double scanning in function byteain  (Steven Niu <niushiji@gmail.com>)
List pgsql-hackers


On Wed, Mar 26, 2025 at 9:39 PM Steven Niu <niushiji@gmail.com> wrote:

在 2025/3/26 16:37, Kirill Reshke 写道:
> On Wed, 26 Mar 2025 at 12:17, Steven Niu <niushiji@gmail.com> wrote:
>>
>> Hi,
>
> Hi!
>
>> This double scanning can be inefficient, especially for large inputs.
>> So I optimized the function to eliminate the need for two scans,
>> while preserving correctness and efficiency.
>
> While the argument that processing input once not twice is fast is
> generally true, may we have some simple bench here just to have an
> idea how valuable this patch is?
>
>
> Patch:
>
>
>> + /* Handle traditional escaped style in a single pass */
>> + input_len = strlen(inputText);
>> + result = palloc(input_len + VARHDRSZ);  /* Allocate max possible size */
>>   rp = VARDATA(result);
>> + tp = inputText;
>> +
>>   while (*tp != '\0')
>
>
> Isn't this `strlen` O(n) + `while` O(n)? Where is the speed up?
>
>
>
> [0] https://github.com/bminor/glibc/blob/master/string/strlen.c#L43-L45
>



Hi, Kirill,

Your deep insight suprised me!

Yes, you are correct that strlen() actually performed a loop operation.
So maybe the performance difference is not so obvious.

However, there are some other reasons that drive me to make this change.

1. The author of original code left comment: "BUGS: The input is scanned
twice." .
You can find this line of code in my patch. This indicates a left work
to be done.

2. If I were the author of this function, I would not be satisfied with
myself that I used two loops to do something which actually can be done
with one loop.
I prefer to choose a way that would not add more burden to readers.

3. The while (*tp != '\0') loop has some unnecessary codes and I made
some change.

Thanks,
Steven





Hi hackers,

This is a revised version (v2) of the patch that optimizes the `byteain()` function.

The original implementation handled escaped input by scanning the string twice — first to determine the output size and again to fill in the bytea. This patch eliminates the double scan by using a single-pass approach with `StringInfo`, simplifying the logic and improving maintainability.

Changes since v1 (originally by Steven Niu):
- Use `StringInfo` instead of manual memory allocation.
- Remove redundant code and improve readability.
- Add regression tests for both hex and escaped formats.

This version addresses performance and clarity while ensuring compatibility with existing behavior. The patch also reflects discussion on the original version, including feedback from Kirill Reshke.

Looking forward to your review and comments.

Best regards,  
Stepan Neretin
Attachment

pgsql-hackers by date:

Previous
From: Sutou Kouhei
Date:
Subject: Re: Make COPY format extendable: Extract COPY TO format implementations
Next
From: Yasir
Date:
Subject: Re: Why our Valgrind reports suck