Andres Freund <andres@anarazel.de> writes:
> On 2024-04-09 20:12:48 -0400, Tom Lane wrote:
>> In any case, this is all moot unless we can come to a new design for
>> how ecpg does its string-mashing. Thoughts?
> Am I missing something, or is ecpg string handling almost comically
> inefficient? Building up strings in tiny increments, which then get mashed
> together to get slightly larger pieces, just to then be mashed together again?
> It's like an intentional allocator stress test.
> It's particularly absurd because in the end we just print those strings, after
> carefully assembling them...
It is that. Here's what I'm thinking: probably 90% of what ecpg
does is to verify that a chunk of its input represents a valid bit
of SQL (or C) syntax and then emit a representation of that chunk.
Currently, that representation tends to case-normalize tokens and
smash inter-token whitespace and comments to a single space.
I propose though that neither of those behaviors is mission-critical,
or even all that desirable. I think few users would complain if
ecpg preserved the input's casing and spacing and comments.
Given that definition, most of ecpg's productions (certainly just
about all the auto-generated ones) would simply need to return a
pointer and length describing a part of the input string. There are
places where ecpg wants to insert some text it generates, and I think
it might need to re-order text in a few places, so we need a
production result representation that can cope with those cases ---
but if we can make "regurgitate the input" cases efficient, I think
we'll have licked the performance problem.
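To make that concrete, here is a rough sketch of the kind of result representation I have in mind. All the names (Piece, Result, result_span, result_cat, result_render) are invented for illustration and nothing like them exists in ecpg today. A production result is a chain of pieces, each just a pointer and length, aimed either into the input buffer or at some generated text. Concatenating two results is O(1) and copies no bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: one piece of output text.  It points either into the
 * original input buffer or at text the preprocessor generated. */
typedef struct Piece
{
    const char   *start;    /* not necessarily NUL-terminated */
    size_t        len;
    struct Piece *next;
} Piece;

/* Hypothetical: a production's result is a chain of pieces. */
typedef struct Result
{
    Piece *head;
    Piece *tail;
} Result;

/* Build a one-piece result describing len bytes at start. */
static Result
result_span(const char *start, size_t len)
{
    Piece *p = malloc(sizeof(Piece));
    Result r;

    assert(p != NULL);
    p->start = start;
    p->len = len;
    p->next = NULL;
    r.head = r.tail = p;
    return r;
}

/* Append b to a by relinking; no string data is copied. */
static Result
result_cat(Result a, Result b)
{
    if (a.head == NULL)
        return b;
    if (b.head == NULL)
        return a;
    a.tail->next = b.head;
    a.tail = b.tail;
    return a;
}

/* Copy the pieces into buf (capacity cap), NUL-terminate, and return
 * the number of bytes written.  Stops early if a piece will not fit. */
static size_t
result_render(Result r, char *buf, size_t cap)
{
    size_t used = 0;

    for (const Piece *p = r.head; p != NULL; p = p->next)
    {
        if (used + p->len >= cap)
            break;
        memcpy(buf + used, p->start, p->len);
        used += p->len;
    }
    if (cap > 0)
        buf[used] = '\0';
    return used;
}
```

The real thing would presumably write the pieces straight to the output file rather than render them into a buffer; the buffer is only there to make the sketch easy to check. Note that the input's spacing, casing and comments pass through untouched, and inserted text is just one more piece in the chain.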
With that in mind, I wonder whether we couldn't make the simple
cases depend on bison's existing support for location tracking.
In which case, the actual actions for all those cases could be
default, achieving one of the goals you mention.
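As an illustration of why no explicit action would be needed, here is a small C model of what bison's default location computation does for a rule. The Loc type with byte offsets and the function names are assumptions made for this sketch: bison's stock YYLTYPE tracks lines and columns, so we would have to supply our own location type. The merging rule itself mirrors the default YYLLOC_DEFAULT, where rhs[0] is the symbol preceding the rule and rhs[1..n] are the rule's right-hand-side symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical location type: byte offsets into the input buffer,
 * covering the half-open range [first, last). */
typedef struct Loc
{
    size_t first;
    size_t last;
} Loc;

/* Model of bison's default location merge for a rule with n RHS symbols.
 * Non-empty rule: span from the first symbol's start to the last one's end.
 * Empty rule: a zero-width span at the end of the preceding symbol. */
static Loc
loc_default(const Loc *rhs, int n)
{
    Loc cur;

    assert(n >= 0);
    if (n > 0)
    {
        cur.first = rhs[1].first;
        cur.last = rhs[n].last;
    }
    else
        cur.first = cur.last = rhs[0].last;
    return cur;
}

/* The "result" of a production is then just the input slice its
 * location covers: set *start and return the length. */
static size_t
loc_text(const char *input, Loc loc, const char **start)
{
    *start = input + loc.first;
    return loc.last - loc.first;
}
```

Because the merged location runs from the first token's start to the last token's end, whatever whitespace and comments sat between the tokens come along for free, which is exactly the "regurgitate the input" behavior.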
Obviously, this is not going to be a small lift, but it kind
of seems do-able.
regards, tom lane