Re: [PATCH] Refactor bytea_sortsupport(), take two - Mailing list pgsql-hackers
| From | John Naylor |
|---|---|
| Subject | Re: [PATCH] Refactor bytea_sortsupport(), take two |
| Date | |
| Msg-id | CANWCAZbhLUxK4X+3w0Xdru74mobsLF7Esr7NQtYGjnZ6w03uhg@mail.gmail.com Whole thread Raw |
| In response to | Re: [PATCH] Refactor bytea_sortsupport(), take two (Aleksander Alekseev <aleksander@tigerdata.com>) |
| List | pgsql-hackers |
On Tue, Sep 16, 2025 at 4:33 PM Aleksander Alekseev
<aleksander@tigerdata.com> wrote:
> > Some comments are randomly different than the equivalents in
> > varlena.c. It's probably better if same things remain the same, but
> > there's nothing wrong either.
>
> I did my best to keep the comments consistent between varlena.c and
> bytea.c. I don't think they are going to be that consistent in the
> long term anyway, so not sure how much effort we should invest into
> this.
I see plenty of random differences by diff'ing the relevant functions
beteween the two. In a large, old codebase, consistency is one of the
most important things to maintain. Also, it takes hardly any time at
all to copy something that doesn't need changing.
After looking more closely, some of the differences actually make
things worse for maintenance IMO, so I think we need to be more
thoughtful. For example in *_abbrev_convert:
- * Maintain approximate cardinality of both abbreviated keys
and original,
- * authoritative keys using HyperLogLog. Used as cheap
insurance against
- * the worst case, where we do many string transformations for no saving
- * in full strcoll()-based comparisons. These statistics are used by
- * varstr_abbrev_abort().
- *
- * First, Hash key proper, or a significant fraction of it.
Mix in length
- * in order to compensate for cases where differences are past
- * PG_CACHE_LINE_SIZE bytes, so as to limit the overhead of hashing.
+ * Maintain approximate cardinality of both abbreviated keys
and original
+ * keys using HyperLogLog.
The first part of the comment explained why we do this at all. The
bytea type does not use strcoll, so it's possible that a future patch
may decide to skip cardinality estimation entirely. It's obviously not
the responsibility of this refactoring patch to investigate that, but
throwing the reason out makes it harder for someone to discover that
bytea could do something different here. In this case, maybe we could
point to varstr_abbrev_convert instead of repeating the whole comment.
/*
- * Clamp cardinality estimates to at least one distinct value. While
- * NULLs are generally disregarded, if only NULL values were
seen so far,
- * that might misrepresent costs if we failed to clamp.
+ * Clamp cardinality estimates to at least one distinct value.
*/
The old comment explained why we clamp, but the new comment just says
what the code is doing, and it's obvious what the code is doing.
- /*
- * In the worst case all abbreviated keys are identical, while
at the same
- * time there are differences within full key strings not captured in
- * abbreviations.
- */
Why is this gone? I could go on, but hopefully you get my point. If
it's short, just keep it, and adjust as necessary. If it's long, point
to varlena.c if the idea is the same.
Back to the actual patch:
- * More generally, it's okay that bytea callers can have NUL bytes in
- * strings because abbreviated cmp need not make a distinction between
- * terminating NUL bytes, and NUL bytes representing actual NULs in the
- * authoritative representation. Hopefully a comparison at or past one
- * abbreviated key's terminating NUL byte will resolve the comparison
- * without consulting the authoritative representation; specifically, some
- * later non-NUL byte in the longer string can resolve the comparison
- * against a subsequent terminating NUL in the shorter string. There will
- * usually be what is effectively a "length-wise" resolution there and
- * then.
- *
- * If that doesn't work out -- if all bytes in the longer string
- * positioned at or past the offset of the smaller string's (first)
- * terminating NUL are actually representative of NUL bytes in the
- * authoritative binary string (perhaps with some *terminating* NUL bytes
- * towards the end of the longer string iff it happens to still be small)
- * -- then an authoritative tie-breaker will happen, and do the right
- * thing: explicitly consider string length.
Why is this gone -- AFAICT it's still true for bytea, no matter what
file it's located in?
--
John Naylor
Amazon Web Services
pgsql-hackers by date: