Re: Cleaning up array_in() - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Cleaning up array_in()
Date
Msg-id 5859ce4e-2be4-92b0-c85c-e1e24eab57c6@iki.fi
Whole thread Raw
In response to Re: Cleaning up array_in()  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Cleaning up array_in()
Re: Cleaning up array_in()
List pgsql-hackers
On 08/07/2023 19:08, Tom Lane wrote:
> I wrote:
>> So I end up with the attached.  I went ahead and dropped
>> ArrayGetOffset0() as part of 0001, and I split 0002 into two patches
>> where the new 0002 avoids re-indenting any existing code in order
>> to ease review, and then 0003 is just a mechanical application
>> of pgindent.
> 
> That got sideswiped by ae6d06f09, so here's a trivial rebase to
> pacify the cfbot.
> 
> #text/x-diff; name="v3-0001-Simplify-and-speed-up-ReadArrayStr.patch"
[v3-0001-Simplify-and-speed-up-ReadArrayStr.patch]/home/tgl/pgsql/v3-0001-Simplify-and-speed-up-ReadArrayStr.patch
 
> #text/x-diff; name="v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch"
[v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch]
/home/tgl/pgsql/v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch
> #text/x-diff; name="v3-0003-Re-indent-ArrayCount.patch" [v3-0003-Re-indent-ArrayCount.patch]
/home/tgl/pgsql/v3-0003-Re-indent-ArrayCount.patch

Something's wrong with your attachments.

Hmm, I wonder if ae6d06f09 had a negative performance impact. In an 
unquoted array element, scanner_isspace() function is called for every 
character, so it might be worth inlining.

On the patches: They are a clear improvement, thanks for that. That 
said, I still find the logic very hard to follow, and there are some 
obvious performance optimizations that could be made.

ArrayCount() interprets low-level quoting and escaping, and tracks the 
dimensions at the same time. The state machine is pretty complicated. 
And when you've finally finished reading and grokking that function, you 
see that ReadArrayStr() repeats most of the same logic. Ugh.

I spent some time today refactoring it for readability and speed. I 
introduced a separate helper function to tokenize the input. It deals 
with whitespace, escapes, and backslashes. Then I merged ArrayCount() 
and ReadArrayStr() into one function that parses the elements and 
determines the dimensions in one pass. That speeds up parsing large 
arrays. With the tokenizer function, the logic in ReadArrayStr() is 
still quite readable, even though it's now checking the dimensions at 
the same time.

I also noticed that we used atoi() to parse the integers in the 
dimensions, which doesn't do much error checking. Some funny cases were 
accepted because of that, for example:

postgres=# select '[1+-+-+-+-+-+:2]={foo,bar}'::text[];
    text
-----------
  {foo,bar}
(1 row)

I tightened that up in the passing.

Attached are your patches, rebased to fix the conflicts with ae6d06f09 
like you intended. On top of that, my patches. My patches need more 
testing, benchmarking, and review, so if we want to sneak something into 
v16, better go with just your patches. If we're tightening up the 
accepted inputs, maybe fix that atoi() sloppiness, though.

-- 
Heikki Linnakangas
Neon (https://neon.tech)

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Cleaning up array_in()
Next
From: Gurjeet Singh
Date:
Subject: Re: DecodeInterval fixes