Re: Cleaning up array_in() - Mailing list pgsql-hackers
From | jian he |
---|---|
Subject | Re: Cleaning up array_in() |
Date | |
Msg-id | CACJufxEZrqEcUckVuHBQLVgZxoQZSEV7L8cRKhBNn-ciD86WSA@mail.gmail.com Whole thread Raw |
In response to | Re: Cleaning up array_in() (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: Cleaning up array_in()
|
List | pgsql-hackers |
On Sun, Jul 9, 2023 at 3:38 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > On 08/07/2023 19:08, Tom Lane wrote: > > I wrote: > >> So I end up with the attached. I went ahead and dropped > >> ArrayGetOffset0() as part of 0001, and I split 0002 into two patches > >> where the new 0002 avoids re-indenting any existing code in order > >> to ease review, and then 0003 is just a mechanical application > >> of pgindent. > > > > That got sideswiped by ae6d06f09, so here's a trivial rebase to > > pacify the cfbot. > > > > #text/x-diff; name="v3-0001-Simplify-and-speed-up-ReadArrayStr.patch" [v3-0001-Simplify-and-speed-up-ReadArrayStr.patch]/home/tgl/pgsql/v3-0001-Simplify-and-speed-up-ReadArrayStr.patch > > #text/x-diff; name="v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch" [v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch] /home/tgl/pgsql/v3-0002-Rewrite-ArrayCount-to-make-dimensionality-checks-.patch > > #text/x-diff; name="v3-0003-Re-indent-ArrayCount.patch" [v3-0003-Re-indent-ArrayCount.patch] /home/tgl/pgsql/v3-0003-Re-indent-ArrayCount.patch > > Something's wrong with your attachments. > > Hmm, I wonder if ae6d06f09 had a negative performance impact. In an > unquoted array element, scanner_isspace() function is called for every > character, so it might be worth inlining. > > On the patches: They are a clear improvement, thanks for that. That > said, I still find the logic very hard to follow, and there are some > obvious performance optimizations that could be made. > > ArrayCount() interprets low-level quoting and escaping, and tracks the > dimensions at the same time. The state machine is pretty complicated. > And when you've finally finished reading and grokking that function, you > see that ReadArrayStr() repeats most of the same logic. Ugh. > > I spent some time today refactoring it for readability and speed. I > introduced a separate helper function to tokenize the input. It deals > with whitespace, escapes, and backslashes. Then I merged ArrayCount() > and ReadArrayStr() into one function that parses the elements and > determines the dimensions in one pass. That speeds up parsing large > arrays. With the tokenizer function, the logic in ReadArrayStr() is > still quite readable, even though it's now checking the dimensions at > the same time. > > I also noticed that we used atoi() to parse the integers in the > dimensions, which doesn't do much error checking. Some funny cases were > accepted because of that, for example: > > postgres=# select '[1+-+-+-+-+-+:2]={foo,bar}'::text[]; > text > ----------- > {foo,bar} > (1 row) > > I tightened that up in the passing. > > Attached are your patches, rebased to fix the conflicts with ae6d06f09 > like you intended. On top of that, my patches. My patches need more > testing, benchmarking, and review, so if we want to sneak something into > v16, better go with just your patches. If we're tightening up the > accepted inputs, maybe fix that atoi() sloppiness, though. > > -- > Heikki Linnakangas > Neon (https://neon.tech) your idea is so clear!!! all the Namings are way more descriptive. ArrayToken, personally something with "state", "type" will be more clear. > /* > * FIXME: Is this still required? I believe all the checks it performs are > * redundant with other checks in ReadArrayDimension() and ReadArrayStr() > */ > nitems_according_to_dims = ArrayGetNItemsSafe(ndim, dim, escontext); > if (nitems_according_to_dims < 0) > PG_RETURN_NULL(); > if (nitems != nitems_according_to_dims) > elog(ERROR, "mismatch nitems, %d vs %d", nitems, nitems_according_to_dims); > if (!ArrayCheckBoundsSafe(ndim, dim, lBound, escontext)) > PG_RETURN_NULL(); --first time run select '[0:3][0:2]={{1,2,3}, {4,5,6}, {7,8,9},{1,2,3}}'::int[]; INFO: 253 after ReadArrayDimensions dim: 4 3 71803430 21998 103381120 21998 ndim: 2 INFO: 770 after ReadArrayStr: dim: 4 3 71803430 21998 103381120 21998 nitems:12, ndim:2 --second time run. INFO: 253 after ReadArrayDimensions dim: 4 3 0 0 0 0 ndim: 2 INFO: 770 after ReadArrayStr: dim: 4 3 0 0 0 0 nitems:12, ndim:2 select '{{1,2,3}, {4,5,6}, {7,8,9},{1,2,3}}'::int[]; --every time run, the result is the same. INFO: 253 after ReadArrayDimensions dim: 0 0 0 0 0 0 ndim: 0 INFO: 770 after ReadArrayStr: dim: 4 3 -1 -1 -1 -1 nitems:12, ndim:2 I think the reason is that the dim int array didn't explicitly assign value when initializing it. > /* Now it's safe to compute ub + 1 */ > if (ub + 1 < lBound[ndim]) > ereturn(escontext, false, > (errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR), > errmsg("upper bound cannot be less than lower bound minus one"))); This part seems safe against cases like select '[-2147483649:-2147483648]={1,2}'::int[]; but I am not sure. If so, then ArrayCheckBoundsSafe is unnecessary. Another corner case successed: select '{1,}'::int[]; should fail.
pgsql-hackers by date: