Re: Proposal: PqSendBuffer removal - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: Proposal: PqSendBuffer removal
Date
Msg-id CAMsr+YGTYrNYSV=iTeVDrMeYSmy_dMJfabqHzoNUtOMz5b9e9Q@mail.gmail.com
Whole thread Raw
In response to Re: Proposal: PqSendBuffer removal  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
faifaiOn Sat, 7 Mar 2020 at 02:45, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On March 5, 2020 1:23:21 PM PST, Aleksei Ivanov <iv.alekseii@gmail.com> wrote:
> >Thank you for your reply!
> >
> >Yes, you are right there will be a separate call to send the data, but
> >is
> >copying data each time more costly operation than just one syscall?
>
> Yes, it's very likely to be more expensive to execute a syscall in a lot of cases. They've gotten a lot more
expensivewith all the security issues. 

Good to know.

> >Besides, if we already have a ready message packet to be sent why
> >should we
> >wait?
>
> In a lot of cases we'll send a number of small messages after each other. We don't want to send those out separately,
that'djust increase overhead. 

Right. We presently set `TCP_NODELAY` to disable Nagle's algorithm, so
we'd tend to generate these messages as separate packets as well as
seperate syscalls, making it doubly undesirable.

We can dynamically twiddle TCP_NODELAY like many other applications do
to optimise the wire packets, but not the syscalls. It'd actually cost
extra syscalls.

> But in some paths/workloads the copy is quite noticable. I've mused before whether we could extend StringInfo to
handlecases like this. E.g. by having StringInfo have two lengths. One that is the offset to the start of the allocated
memory(0 for plain StringInfos), and one for the length of the string being built. 
>
> Then we could get a StringInfo pointing directly to the current insertion point in the send buffer.  To support
growingit, enlargeStringInfo would first subtract the offset to the start of the allocation, and then reallocate that. 
>
> I can imagine that bring useful in a number of places. And because there only would be additional overhead when
actuallygrowing the StringInfo, I don't think the cost would be measurable. 

That sounds pretty sensible as it'd be minimally intrusive.

I've wondered whether we can further optimise some uses by having
libpq-be manage an iovec for us instead, much like we support iovec
for shm_mq. Instead of a StringInfo we'd use an iovec wrapped by a
libpq-managed struct. libpq would reserve the first few entries in the
managed iovec for the message header. Variants of pq_sendint64(...)
etc would add entries to the iovec and could be inlined since they'd
just be convenience routines. The message would be flushed by having
libpq call writev() on the iovec container.

We'd want a wrapper struct for the iovec so we could have libpq keep a
cursor for the next entry in the iovec. For libpq-fe it'd also contain
a map of which iovec entries need to be free()d; for libpq-be we'd
probably palloc(), maybe with a child memory context. To support a
stack-allocated iovec for when we know all the message fields in
advance we'd have an init function that takes the address  of the
preallocated iovec and its length limit.

We could support  We could also support a libpq-wrapped iovec where
libpq can realloc it if it fills.
stack-allocated iovec - so the caller would be responsible for
managing the max size


That way we can do zero-copy scatter-gather I/O for messages that
don't require binary-to-text-format transformations etc.

BTW, if we change StringInfo, I'd like to also official bless the
usage pattern where we wrap a buffer in a StringInfo so we can use
pq_getmsgint64 etc on it. Add a initConstStringInfo(StringInfo si,
const char * buf, size_t buflen) or something that assigns the
StringInfo values and sets maxlen = -1.  The only in-core usage I see
for this so far is in src/backend/replication/logical/worker.c but
it's used extremely heavily in pglogical etc. It'd just be a
convenience function that blesses and documents existing usage.

But like Tom I really want to first come back to the evidence. Why
should we bother? Are we solving an actual problem here? PostgreSQL is
immensely memory-allocation-happy and copy-happy. Shouldn't we be more
interested in things like reducing the cost of multiple copies and
transform passes of Datum values? Especially since that's an actual
operational pain point when you're working with multi-hundred-megabyte
bytea or text fields.

Can you come up with some profiling/performance numbers that track
time spent on memory copying in the areas you propose to target, plus
malloc overheads? With a tool like systemtap or perf it should not be
overly difficult to do so by making targeted probes that filter based
on callstack, or on file / line-range or function.



pgsql-hackers by date:

Previous
From: Andy Fan
Date:
Subject: Re: Index Skip Scan
Next
From: Michael Paquier
Date:
Subject: Re: Add an optional timeout clause to isolationtester step.