Thread: Patch for SortSupport implementation on inet/cdir
Hi list,
I've attached a patch that implements SortSupport for the
inet/cidr types. It has the effect of typically reducing
the time taken to sort these types by ~50-60% (as measured
by `SELECT COUNT(DISTINCT ...)` which will carry over to
common operations like index creation, `ORDER BY`, and
`DISTINCT`.
As with other SortSupport implementations, this one
proposes an abbreviated key design that packs in as much
sorting-relevant information into the available datum as
possible while remaining faithful to the existing sorting
rules for the types. The key format depends on datum size
and whether working with IPv4 or IPv6. In most cases that
involves storing as many netmask bits as we can fit, but we
can do something a little more complete with IPv4 on 64-bit
— because inet data is small and the datum is large, we
can store enough information for a successful key-only
comparison in the majority of cases. Precise details
including background and bit-level diagrams are included in
the comments of the attached patch.
I've tried to take a variety of steps to build confidence
that the proposed SortSupport-based sort is correct:
1. It passes the existing inet regression suite (which was
pretty complete already).
2. I added a few new test cases the suite, specifically
trying to target edge cases like the minimums and
maximums of each abbreviated key bit "boundary". The new
cases were run against the master branch to double-check
that they're right.
3. I've added a variety of assertions throughout the
implementation to make sure that each step is seeing
inputs with expected parameters, and only manipulates
the bits that it's supposed to be manipulating.
4. I built large random data sets (10M rows) and sorted
them against a development build to try and trigger the
aforementioned assertions.
5. I built an index on 10M values and ran amcheck against
it.
6. I wrote some unit tests to verify that the bit-level
representation of the new abbreviated keys are indeed
what we expect. They're available here [3]. I didn't
include them in the patch because I couldn't find an
easy way to produce an expected `.out` file for a 32-bit
machine (experiments building with `setarch` didn't
succeed). They might be overkill anyway. I can continue
to pursue getting them working if reviewers think they'd
be useful.
My benchmarking methodology and script is available here
[1], and involves gathering statistics for 100
`count(distinct ...)` queries at various data sizes. I've
saved the results I got on my machine here [2].
Hat tip to Peter Geoghegan for advising on an appropriate
abbreviated key design and helping answer many questions
along the way.
Brandur
Attachment
Attached a V2 patch: identical to V1 except rebased and
with a new OID selected.
Attachment
On Sat, 9 Feb 2019 at 04:48, Brandur Leach <brandur@mutelight.org> wrote:
> I've attached a patch that implements SortSupport for the
> inet/cidr types. It has the effect of typically reducing
> the time taken to sort these types by ~50-60% (as measured
> by `SELECT COUNT(DISTINCT ...)` which will carry over to
> common operations like index creation, `ORDER BY`, and
> `DISTINCT`.
Hi Brandur,
I had a look at this. Your V2 patch applies cleanly, and the code was straightforward and well commented. I appreciate the big comment at the top of network_abbrev_convert explaining how you encode the data.
The tests pass. I ran a couple of large scale tests myself and didn't find any problems. Sorting a million random inets in work_mem = 256MB goes from roughty 3670ms to 1620ms with the SortSupport, which is pretty impressive. (But that's in my debug build, so not a serious benchmark.)
An interesting thing about sorting IPv4 inets on 64-bit machines is that when the inets are the same, the abbreviated comparitor will return 0 which is taken by the sorting machinery to mean "the datums are the same up to this point, so you need to call the full comparitor" -- but, in this case, 0 means "the datums truly are the same, no need to call the full comparitor". Since the full comparitor assumes its arguments to be the original (typically pass-by-reference) datums, you can't do it there. You'd need to add another optional comparitor to be called after the abbreviated one. In inet's case on a 64-bit machine, it would look at the abbreviated datums and if they're both in the IPv4 family, would return 0 (knowing that the abbreviated comparitor has already done the real work). I have no reason to believe this particular optimisation is worth anything much, though; it's outside the scope of this patch, besides.
I have some comments on the comments:
network.c:552
* SortSupport conversion routine. Converts original inet/cidr representations
* to abbreviated keys . The inet/cidr types are pass-by-reference, so is an
* optimization so that sorting routines don't have to pull full values from
* the heap to compare.
Looks like you have an extra space before the "." on line 553. And abbreviated keys being an optimisation for pass-by-reference types can be taken for granted, so I think the last sentence is redundant.
network.c::567
* IPv4 and IPv6 are identical in this makeup, with the difference being that
* IPv4 addresses have a maximum of 32 bits compared to IPv6's 64 bits, so in
* IPv6 each part may be larger.
IPv6's addresses are 128 bit. I'm not sure sure if "maximum" is accurate, or whether you should just say "IPv4 addresses have 32 bits".
network.c::571
* inet/cdir types compare using these sorting rules. If inequality is detected
* at any step, comparison is done. If any rule is a tie, the algorithm drops
* through to the next to break it:
When you say "comparison is done" it sounds like more comparing is going to be done, but what I think you mean is that comparison is finished.
> [...]
> My benchmarking methodology and script is available here
> [1], and involves gathering statistics for 100
> `count(distinct ...)` queries at various data sizes. I've
> saved the results I got on my machine here [2].
I didn't see any links for [1], [2] and [3] in your email.
Finally, there's a duplicate CF entry: https://commitfest.postgresql.org/22/1990/ .
Since you're updating https://commitfest.postgresql.org/22/1991/ , I suggest you mark 1990 as Withdrawn to avoid confusion. If there's a way to remove it from the CF list, that would be even better.
Edmund
Hi Brandur, On 2019-02-09 20:12:53 +1300, Edmund Horner wrote: > I had a look at this. Your V2 patch applies cleanly, and the code was > straightforward and well commented. I appreciate the big comment at the > top of network_abbrev_convert explaining how you encode the data. I've marked the CF entry as waiting-on-author, due to this review. Brandur, unfortunately this patch has only been submitted to the last commitfest for v12. Our policy is that all nontrivial patches should be submitted to at least one earlier commitfest. Therefore I unfortunately think that v12 is not a realistic target, sorry :( Greetings, Andres Freund
On 2/16/19 4:13 AM, Andres Freund wrote: > On 2019-02-09 20:12:53 +1300, Edmund Horner wrote: >> I had a look at this. Your V2 patch applies cleanly, and the code was >> straightforward and well commented. I appreciate the big comment at the >> top of network_abbrev_convert explaining how you encode the data. > > I've marked the CF entry as waiting-on-author, due to this review. > > Brandur, unfortunately this patch has only been submitted to the last > commitfest for v12. Our policy is that all nontrivial patches should be > submitted to at least one earlier commitfest. Therefore I unfortunately > think that v12 is not a realistic target, sorry :( I agree, and since there has been no response from the author I have pushed the target version to PG13. Regards, -- -David david@pgmasters.net
On Fri, Feb 8, 2019 at 11:13 PM Edmund Horner <ejrh00@gmail.com> wrote: > I have some comments on the comments: Seems reasonable to me. Where are we on this? I'd like to get the patch committed soon. -- Peter Geoghegan
On Fri, Feb 8, 2019 at 10:20 AM Brandur Leach <brandur@mutelight.org> wrote: > Attached a V2 patch: identical to V1 except rebased and > with a new OID selected. Attached is a revised version that I came up with, based on your v2. I found this part of your approach confusing: > + /* > + * Number of bits in subnet. e.g. An IPv4 that's /24 is 32 - 24 = 8. > + * > + * However, only some of the bits may have made it into the fixed sized > + * datum, so take the smallest number between bits in the subnet and bits > + * in the datum which are not part of the network. > + */ > + datum_subnet_size = Min(ip_maxbits(authoritative) - ip_bits(authoritative), > + SIZEOF_DATUM * BITS_PER_BYTE - ip_bits(authoritative)); The way that you put a Min() on the subnet size potentially constrains the size of the bitmask used for the network component of the abbreviated key (the component that comes immediately after the ipfamily status bit). Why not just let the bitmask be a bitmask, without bringing SIZEOF_DATUM into it? Doing it that way allowed for a more streamlined approach, with significantly fewer special cases. I'm not sure whether or not your approach had bugs, but I didn't like the way you sometimes did a straight "network = ipaddr_datum" assignment without masking. I really liked your diagrams, but much of the text that went with them either seemed redundant (it described established rules about how the underlying types sort), or seemed to discuss things that were better discussed next to the relevant network_abbrev_convert() code. Thoughts? -- Peter Geoghegan
Attachment
On Fri, Jul 26, 2019 at 6:58 PM Peter Geoghegan <pg@bowt.ie> wrote: > I found this part of your approach confusing: > > > + /* > > + * Number of bits in subnet. e.g. An IPv4 that's /24 is 32 - 24 = 8. > > + * > > + * However, only some of the bits may have made it into the fixed sized > > + * datum, so take the smallest number between bits in the subnet and bits > > + * in the datum which are not part of the network. > > + */ > > + datum_subnet_size = Min(ip_maxbits(authoritative) - ip_bits(authoritative), > > + SIZEOF_DATUM * BITS_PER_BYTE - ip_bits(authoritative)); > > The way that you put a Min() on the subnet size potentially constrains > the size of the bitmask used for the network component of the > abbreviated key (the component that comes immediately after the > ipfamily status bit). Why not just let the bitmask be a bitmask, > without bringing SIZEOF_DATUM into it? Doing it that way allowed for a > more streamlined approach, with significantly fewer special cases. I'm > not sure whether or not your approach had bugs, but I didn't like the > way you sometimes did a straight "network = ipaddr_datum" assignment > without masking. I guess that the idea here was to prevent masking on ipv6 addresses, though not on ipv4 addresses. Obviously we're only dealing with a prefix with ipv6 addresses, whereas we usually have the whole raw ipaddr with ipv4. Not sure if I'm doing the right thing there in v3, even though the tests pass. In any case, this will need to be a lot clearer in the final version. -- Peter Geoghegan
Thanks the follow ups on this one Edmund/Peter!
I've attached a new V4 variant of the patch based on
Peter's V3, mostly containing comment amendments and a few
other minor stylistic fixes.
> An interesting thing about sorting IPv4 inets on 64-bit machines is that when the inets are the same, the abbreviated comparitor will return 0 which is taken by the sorting machinery to mean "the datums are the same up to this point, so you need to call the full comparitor" — but, in this case, 0 means "the datums truly are the same, no need to call the full comparitor". Since the full comparitor assumes its arguments to be the original (typically pass-by-reference) datums, you can't do it there. You'd need to add another optional comparitor to be called after the abbreviated one. In inet's case on a 64-bit machine, it would look at the abbreviated datums and if they're both in the IPv4 family, would return 0 (knowing that the abbreviated comparitor has already done the real work). I have no reason to believe this particular optimisation is worth anything much, though; it's outside the scope of this patch, besides.
Edmund: thanks a lot for the really quick review turn
around, and apologies for not following up sooner!
Agreed that this change is out-of-scope, but it could be an
interesting improvement. You'd have similar potential speed
improvements for other SortSupport data types like uuid,
strings (short ones), and macaddr. Low cardinality data
sets would probably benefit the most.
> When you say "comparison is done" it sounds like more comparing is going to be done, but what I think you mean is that comparison is finished.
Peter's version of my patch ended up stripping out and/or
changing some of my original comments, so most of them are
fixed by virtue of that. And agreed about the ambiguity in
wording above — I changed "done" to "finished".
> I didn't see any links for [1], [2] and [3] in your email.
Good catch. Here are the original footnote links:
[1] https://github.com/brandur/inet-sortsupport-test
[2] https://github.com/brandur/inet-sortsupport-test/blob/master/results.md
[3] https://github.com/brandur/postgres/compare/master...brandur-inet-sortsupport-unit#diff-a28824d1339d3bb74bb0297c60140dd1
> The way that you put a Min() on the subnet size potentially constrains
> the size of the bitmask used for the network component of the
> abbreviated key (the component that comes immediately after the
> ipfamily status bit). Why not just let the bitmask be a bitmask,
> without bringing SIZEOF_DATUM into it? Doing it that way allowed for a
> more streamlined approach, with significantly fewer special cases.
Peter: thanks a lot for the very thorough look and revised
version! Generally agreed that fewer special cases is good,
but I was also trying to make sure that we didn't
compromise the code's understandability by optimizing for
fewer special cases above everything else (especially for
this sort of thing where tricky bit manipulation is
involved).
But I traced through your variant and it looks fine to me
(still looks correct, and readability is still good). I've
pulled most of your changes into V4.
> I'm not sure whether or not your approach had bugs, but I
> didn't like the way you sometimes did a straight "network
> = ipaddr_datum" assignment without masking.
For what it's worth, I believe this did work, even if it
did depend on being within that one branch of code. Agreed
though that avoiding it (as in the new version) is more
hygienic.
> I really liked your diagrams, but much of the text that went with them
> either seemed redundant (it described established rules about how the
> underlying types sort), or seemed to discuss things that were better
> discussed next to the relevant network_abbrev_convert() code.
Thanks! I kept most of your changes, but resurrected some
of my original introductory text. The fine points of the
code's implementation are intricate enough that I think
having some background included is useful to new entrants;
specifically:
1. Norms for naming the different "parts" (network, size,
subnet) of an inet/cidr value aren't broadly
well-established, so defining what they are before
showing diagrams is helpful.
2. Having examples of each part (`1.2.3.0`, `/24`,
`0.0.0.4`) helps mentally cement them.
3. I know that it's in the PG documentation, but the rules
for sorting inet/cidr are not very intuitive. Spending a
few lines re-iterating them so that they don't have to
be cross-referenced elsewhere is worth the space.
Anyway, once again appreciate the extreme attention to
detail on these reviews — this level of rigor would be a
very rare find in projects outside of the Postgres
community!
Brandur
I've attached a new V4 variant of the patch based on
Peter's V3, mostly containing comment amendments and a few
other minor stylistic fixes.
> An interesting thing about sorting IPv4 inets on 64-bit machines is that when the inets are the same, the abbreviated comparitor will return 0 which is taken by the sorting machinery to mean "the datums are the same up to this point, so you need to call the full comparitor" — but, in this case, 0 means "the datums truly are the same, no need to call the full comparitor". Since the full comparitor assumes its arguments to be the original (typically pass-by-reference) datums, you can't do it there. You'd need to add another optional comparitor to be called after the abbreviated one. In inet's case on a 64-bit machine, it would look at the abbreviated datums and if they're both in the IPv4 family, would return 0 (knowing that the abbreviated comparitor has already done the real work). I have no reason to believe this particular optimisation is worth anything much, though; it's outside the scope of this patch, besides.
Edmund: thanks a lot for the really quick review turn
around, and apologies for not following up sooner!
Agreed that this change is out-of-scope, but it could be an
interesting improvement. You'd have similar potential speed
improvements for other SortSupport data types like uuid,
strings (short ones), and macaddr. Low cardinality data
sets would probably benefit the most.
> When you say "comparison is done" it sounds like more comparing is going to be done, but what I think you mean is that comparison is finished.
Peter's version of my patch ended up stripping out and/or
changing some of my original comments, so most of them are
fixed by virtue of that. And agreed about the ambiguity in
wording above — I changed "done" to "finished".
> I didn't see any links for [1], [2] and [3] in your email.
Good catch. Here are the original footnote links:
[1] https://github.com/brandur/inet-sortsupport-test
[2] https://github.com/brandur/inet-sortsupport-test/blob/master/results.md
[3] https://github.com/brandur/postgres/compare/master...brandur-inet-sortsupport-unit#diff-a28824d1339d3bb74bb0297c60140dd1
> The way that you put a Min() on the subnet size potentially constrains
> the size of the bitmask used for the network component of the
> abbreviated key (the component that comes immediately after the
> ipfamily status bit). Why not just let the bitmask be a bitmask,
> without bringing SIZEOF_DATUM into it? Doing it that way allowed for a
> more streamlined approach, with significantly fewer special cases.
Peter: thanks a lot for the very thorough look and revised
version! Generally agreed that fewer special cases is good,
but I was also trying to make sure that we didn't
compromise the code's understandability by optimizing for
fewer special cases above everything else (especially for
this sort of thing where tricky bit manipulation is
involved).
But I traced through your variant and it looks fine to me
(still looks correct, and readability is still good). I've
pulled most of your changes into V4.
> I'm not sure whether or not your approach had bugs, but I
> didn't like the way you sometimes did a straight "network
> = ipaddr_datum" assignment without masking.
For what it's worth, I believe this did work, even if it
did depend on being within that one branch of code. Agreed
though that avoiding it (as in the new version) is more
hygienic.
> I really liked your diagrams, but much of the text that went with them
> either seemed redundant (it described established rules about how the
> underlying types sort), or seemed to discuss things that were better
> discussed next to the relevant network_abbrev_convert() code.
Thanks! I kept most of your changes, but resurrected some
of my original introductory text. The fine points of the
code's implementation are intricate enough that I think
having some background included is useful to new entrants;
specifically:
1. Norms for naming the different "parts" (network, size,
subnet) of an inet/cidr value aren't broadly
well-established, so defining what they are before
showing diagrams is helpful.
2. Having examples of each part (`1.2.3.0`, `/24`,
`0.0.0.4`) helps mentally cement them.
3. I know that it's in the PG documentation, but the rules
for sorting inet/cidr are not very intuitive. Spending a
few lines re-iterating them so that they don't have to
be cross-referenced elsewhere is worth the space.
Anyway, once again appreciate the extreme attention to
detail on these reviews — this level of rigor would be a
very rare find in projects outside of the Postgres
community!
Brandur
Attachment
And a slightly amended version of the last patch with a bug
fixed where IPv4 abbreviated keys were were not being
initialized correctly on big-endian machines.
fixed where IPv4 abbreviated keys were were not being
initialized correctly on big-endian machines.
Attachment
On Fri, Jul 26, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > I guess that the idea here was to prevent masking on ipv6 addresses, > though not on ipv4 addresses. Obviously we're only dealing with a > prefix with ipv6 addresses, whereas we usually have the whole raw > ipaddr with ipv4. Not sure if I'm doing the right thing there in v3, > even though the tests pass. In any case, this will need to be a lot > clearer in the final version. This turned out to be borked for certain IPv6 cases, as suspected. Attached is a revised v6, which fixes the issue by adding the explicit handling needed when ipaddr_datum is just a prefix of the full ipaddr from the authoritative representation. Also made sure that the tests will catch issues like this. Separately, it occurred to me that it's probably not okay to do straight type punning of the ipaddr unsigned char array to a Datum on alignment-picky platforms. Using a memcpy() seems like the right approach, which is what we do in the latest revision. I accepted almost all of Brandur's comment revisions from v5 for v6. I'm probably going to commit this tomorrow morning Pacific time. Do you have any final input on the testing, Brandur? I would like to hear your thoughts on the possibility of edge cases that still don't have coverage. The tests will break if the new "if (ip_bits(authoritative) == 0)" branch is removed, but only at one exact point. I'm pretty sure that there are no remaining subtleties like that one, but two heads are better than one. -- Peter Geoghegan
Attachment
Thanks Peter. V6 is pretty uncontroversial by me — the new
conditional ladder broken cleanly into cases of (1) all
subnet, (2) network/subnet mix, and (3) all network is a
little more verbose, but all in all makes things easier to
reason about.
> Do you have any final input on the testing, Brandur? I
> would like to hear your thoughts on the possibility of
> edge cases that still don't have coverage. The tests will
> break if the new "if (ip_bits(authoritative) == 0)"
> branch is removed, but only at one exact point. I'm
> pretty sure that there are no remaining subtleties like
> that one, but two heads are better than one.
(Discussed a little offline already, but) the new, more
exhaustive test suite combined with your approach of
synthesizing many random values and comparing before and
after sorting seems sufficient to me. The approach was
meant to suss out any remaining edge cases, and seems to
have done its job.
Brandur
conditional ladder broken cleanly into cases of (1) all
subnet, (2) network/subnet mix, and (3) all network is a
little more verbose, but all in all makes things easier to
reason about.
> Do you have any final input on the testing, Brandur? I
> would like to hear your thoughts on the possibility of
> edge cases that still don't have coverage. The tests will
> break if the new "if (ip_bits(authoritative) == 0)"
> branch is removed, but only at one exact point. I'm
> pretty sure that there are no remaining subtleties like
> that one, but two heads are better than one.
(Discussed a little offline already, but) the new, more
exhaustive test suite combined with your approach of
synthesizing many random values and comparing before and
after sorting seems sufficient to me. The approach was
meant to suss out any remaining edge cases, and seems to
have done its job.
Brandur
On Wed, Jul 31, 2019 at 10:49 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Jul 26, 2019 at 7:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I guess that the idea here was to prevent masking on ipv6 addresses,
> though not on ipv4 addresses. Obviously we're only dealing with a
> prefix with ipv6 addresses, whereas we usually have the whole raw
> ipaddr with ipv4. Not sure if I'm doing the right thing there in v3,
> even though the tests pass. In any case, this will need to be a lot
> clearer in the final version.
This turned out to be borked for certain IPv6 cases, as suspected.
Attached is a revised v6, which fixes the issue by adding the explicit
handling needed when ipaddr_datum is just a prefix of the full ipaddr
from the authoritative representation. Also made sure that the tests
will catch issues like this. Separately, it occurred to me that it's
probably not okay to do straight type punning of the ipaddr unsigned
char array to a Datum on alignment-picky platforms. Using a memcpy()
seems like the right approach, which is what we do in the latest
revision.
I accepted almost all of Brandur's comment revisions from v5 for v6.
I'm probably going to commit this tomorrow morning Pacific time. Do
you have any final input on the testing, Brandur? I would like to hear
your thoughts on the possibility of edge cases that still don't have
coverage. The tests will break if the new "if (ip_bits(authoritative)
== 0)" branch is removed, but only at one exact point. I'm pretty sure
that there are no remaining subtleties like that one, but two heads
are better than one.
--
Peter Geoghegan
On Thu, Aug 1, 2019 at 8:34 AM Brandur Leach <brandur@mutelight.org> wrote: > Thanks Peter. V6 is pretty uncontroversial by me — the new > conditional ladder broken cleanly into cases of (1) all > subnet, (2) network/subnet mix, and (3) all network is a > little more verbose, but all in all makes things easier to > reason about. Pushed. Thanks! -- Peter Geoghegan