Re: Cleaning up array_in() - Mailing list pgsql-hackers

From jian he
Subject Re: Cleaning up array_in()
Date
Msg-id CACJufxEZrqEcUckVuHBQLVgZxoQZSEV7L8cRKhBNn-ciD86WSA@mail.gmail.com
Whole thread Raw
In response to Re: Cleaning up array_in()  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Cleaning up array_in()
List pgsql-hackers
On Sun, Jul 9, 2023 at 3:38 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 08/07/2023 19:08, Tom Lane wrote:
> > I wrote:
> >> So I end up with the attached.  I went ahead and dropped
> >> ArrayGetOffset0() as part of 0001, and I split 0002 into two patches
> >> where the new 0002 avoids re-indenting any existing code in order
> >> to ease review, and then 0003 is just a mechanical application
> >> of pgindent.
> >
> > That got sideswiped by ae6d06f09, so here's a trivial rebase to
> > pacify the cfbot.
> >
> > #text/x-diff; name="v3-0001-Simplify-and-speed-up-ReadArrayStr.patch"
[v3-0001-Simplify-and-speed-up-ReadArrayStr.patch]/home/tgl/pgsql/v3-0001-Simplify-and-speed-up-ReadArrayStr.patch 
> > #text/x-diff; name="v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch"
[v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch]
/home/tgl/pgsql/v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch
> > #text/x-diff; name="v3-0003-Re-indent-ArrayCount.patch" [v3-0003-Re-indent-ArrayCount.patch]
/home/tgl/pgsql/v3-0003-Re-indent-ArrayCount.patch
>
> Something's wrong with your attachments.
>
> Hmm, I wonder if ae6d06f09 had a negative performance impact. In an
> unquoted array element, scanner_isspace() function is called for every
> character, so it might be worth inlining.
>
> On the patches: They are a clear improvement, thanks for that. That
> said, I still find the logic very hard to follow, and there are some
> obvious performance optimizations that could be made.
>
> ArrayCount() interprets low-level quoting and escaping, and tracks the
> dimensions at the same time. The state machine is pretty complicated.
> And when you've finally finished reading and grokking that function, you
> see that ReadArrayStr() repeats most of the same logic. Ugh.
>
> I spent some time today refactoring it for readability and speed. I
> introduced a separate helper function to tokenize the input. It deals
> with whitespace, escapes, and backslashes. Then I merged ArrayCount()
> and ReadArrayStr() into one function that parses the elements and
> determines the dimensions in one pass. That speeds up parsing large
> arrays. With the tokenizer function, the logic in ReadArrayStr() is
> still quite readable, even though it's now checking the dimensions at
> the same time.
>
> I also noticed that we used atoi() to parse the integers in the
> dimensions, which doesn't do much error checking. Some funny cases were
> accepted because of that, for example:
>
> postgres=# select '[1+-+-+-+-+-+:2]={foo,bar}'::text[];
>     text
> -----------
>   {foo,bar}
> (1 row)
>
> I tightened that up in the passing.
>
> Attached are your patches, rebased to fix the conflicts with ae6d06f09
> like you intended. On top of that, my patches. My patches need more
> testing, benchmarking, and review, so if we want to sneak something into
> v16, better go with just your patches. If we're tightening up the
> accepted inputs, maybe fix that atoi() sloppiness, though.
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)

your idea is so clear!!!
all the Namings are way more descriptive. ArrayToken, personally
something with "state", "type" will be more clear.

> /*
> * FIXME: Is this still required? I believe all the checks it performs are
> * redundant with other checks in ReadArrayDimension() and ReadArrayStr()
> */
> nitems_according_to_dims = ArrayGetNItemsSafe(ndim, dim, escontext);
> if (nitems_according_to_dims < 0)
> PG_RETURN_NULL();
> if (nitems != nitems_according_to_dims)
> elog(ERROR, "mismatch nitems, %d vs %d", nitems, nitems_according_to_dims);
> if (!ArrayCheckBoundsSafe(ndim, dim, lBound, escontext))
> PG_RETURN_NULL();

--first time run
select '[0:3][0:2]={{1,2,3}, {4,5,6}, {7,8,9},{1,2,3}}'::int[];
INFO:  253 after ReadArrayDimensions dim:  4 3 71803430 21998
103381120 21998 ndim: 2
INFO:  770 after ReadArrayStr: dim:  4 3 71803430 21998 103381120
21998 nitems:12, ndim:2

--second time run.
INFO:  253 after ReadArrayDimensions dim:  4 3 0 0 0 0 ndim: 2
INFO:  770 after ReadArrayStr: dim:  4 3 0 0 0 0 nitems:12, ndim:2

select '{{1,2,3}, {4,5,6}, {7,8,9},{1,2,3}}'::int[]; --every time run,
the result is the same.
INFO:  253 after ReadArrayDimensions dim:  0 0 0 0 0 0 ndim: 0
INFO:  770 after ReadArrayStr: dim:  4 3 -1 -1 -1 -1 nitems:12, ndim:2

I think the reason is that the dim int array didn't explicitly assign
value when initializing it.

> /* Now it's safe to compute ub + 1 */
> if (ub + 1 < lBound[ndim])
> ereturn(escontext, false,
> (errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
> errmsg("upper bound cannot be less than lower bound minus one")));

This part seems safe against cases like select
'[-2147483649:-2147483648]={1,2}'::int[];
but I am not sure. If so, then ArrayCheckBoundsSafe is unnecessary.

Another corner case successed: select '{1,}'::int[]; should fail.



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Make pgbench exit on SIGINT more reliably
Next
From: Amit Kapila
Date:
Subject: Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication