Re: Patch for SortSupport implementation on inet/cdir - Mailing list pgsql-hackers
From | Edmund Horner |
---|---|
Subject | Re: Patch for SortSupport implementation on inet/cdir |
Date | |
Msg-id | CAMyN-kAOw-LY1ZKJ9-R4uYdX3d2=_2SiYntpKCHhrCe1Md1NLw@mail.gmail.com Whole thread Raw |
In response to | Patch for SortSupport implementation on inet/cdir (Brandur Leach <brandur@mutelight.org>) |
Responses |
Re: Patch for SortSupport implementation on inet/cdir
Re: Patch for SortSupport implementation on inet/cdir |
List | pgsql-hackers |
On Sat, 9 Feb 2019 at 04:48, Brandur Leach <brandur@mutelight.org> wrote:
> I've attached a patch that implements SortSupport for the
> inet/cidr types. It has the effect of typically reducing
> the time taken to sort these types by ~50-60% (as measured
> by `SELECT COUNT(DISTINCT ...)` which will carry over to
> common operations like index creation, `ORDER BY`, and
> `DISTINCT`.
Hi Brandur,
I had a look at this. Your V2 patch applies cleanly, and the code was straightforward and well commented. I appreciate the big comment at the top of network_abbrev_convert explaining how you encode the data.
The tests pass. I ran a couple of large scale tests myself and didn't find any problems. Sorting a million random inets in work_mem = 256MB goes from roughty 3670ms to 1620ms with the SortSupport, which is pretty impressive. (But that's in my debug build, so not a serious benchmark.)
An interesting thing about sorting IPv4 inets on 64-bit machines is that when the inets are the same, the abbreviated comparitor will return 0 which is taken by the sorting machinery to mean "the datums are the same up to this point, so you need to call the full comparitor" -- but, in this case, 0 means "the datums truly are the same, no need to call the full comparitor". Since the full comparitor assumes its arguments to be the original (typically pass-by-reference) datums, you can't do it there. You'd need to add another optional comparitor to be called after the abbreviated one. In inet's case on a 64-bit machine, it would look at the abbreviated datums and if they're both in the IPv4 family, would return 0 (knowing that the abbreviated comparitor has already done the real work). I have no reason to believe this particular optimisation is worth anything much, though; it's outside the scope of this patch, besides.
I have some comments on the comments:
network.c:552
* SortSupport conversion routine. Converts original inet/cidr representations
* to abbreviated keys . The inet/cidr types are pass-by-reference, so is an
* optimization so that sorting routines don't have to pull full values from
* the heap to compare.
Looks like you have an extra space before the "." on line 553. And abbreviated keys being an optimisation for pass-by-reference types can be taken for granted, so I think the last sentence is redundant.
network.c::567
* IPv4 and IPv6 are identical in this makeup, with the difference being that
* IPv4 addresses have a maximum of 32 bits compared to IPv6's 64 bits, so in
* IPv6 each part may be larger.
IPv6's addresses are 128 bit. I'm not sure sure if "maximum" is accurate, or whether you should just say "IPv4 addresses have 32 bits".
network.c::571
* inet/cdir types compare using these sorting rules. If inequality is detected
* at any step, comparison is done. If any rule is a tie, the algorithm drops
* through to the next to break it:
When you say "comparison is done" it sounds like more comparing is going to be done, but what I think you mean is that comparison is finished.
> [...]
> My benchmarking methodology and script is available here
> [1], and involves gathering statistics for 100
> `count(distinct ...)` queries at various data sizes. I've
> saved the results I got on my machine here [2].
I didn't see any links for [1], [2] and [3] in your email.
Finally, there's a duplicate CF entry: https://commitfest.postgresql.org/22/1990/ .
Since you're updating https://commitfest.postgresql.org/22/1991/ , I suggest you mark 1990 as Withdrawn to avoid confusion. If there's a way to remove it from the CF list, that would be even better.
Edmund
pgsql-hackers by date: