Re: Proposal: PqSendBuffer removal - Mailing list pgsql-hackers
From: Craig Ringer
Subject: Re: Proposal: PqSendBuffer removal
Msg-id: CAMsr+YGTYrNYSV=iTeVDrMeYSmy_dMJfabqHzoNUtOMz5b9e9Q@mail.gmail.com
In response to: Re: Proposal: PqSendBuffer removal (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Sat, 7 Mar 2020 at 02:45, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On March 5, 2020 1:23:21 PM PST, Aleksei Ivanov <iv.alekseii@gmail.com> wrote:
> >Thank you for your reply!
> >
> >Yes, you are right there will be a separate call to send the data, but
> >is copying data each time more costly operation than just one syscall?
>
> Yes, it's very likely to be more expensive to execute a syscall in a lot of cases. They've gotten a lot more expensive with all the security issues.

Good to know.

> >Besides, if we already have a ready message packet to be sent why
> >should we wait?
>
> In a lot of cases we'll send a number of small messages after each other. We don't want to send those out separately, that'd just increase overhead.

Right. We presently set TCP_NODELAY to disable Nagle's algorithm, so we'd tend to generate these messages as separate packets as well as separate syscalls, making it doubly undesirable. We could dynamically twiddle TCP_NODELAY like many other applications do to optimise the wire packets, but that doesn't help with the syscalls; it'd actually cost extra syscalls.

> But in some paths/workloads the copy is quite noticeable. I've mused before whether we could extend StringInfo to handle cases like this. E.g. by having StringInfo have two lengths. One that is the offset to the start of the allocated memory (0 for plain StringInfos), and one for the length of the string being built.
>
> Then we could get a StringInfo pointing directly to the current insertion point in the send buffer. To support growing it, enlargeStringInfo would first subtract the offset to the start of the allocation, and then reallocate that.
>
> I can imagine that being useful in a number of places. And because there only would be additional overhead when actually growing the StringInfo, I don't think the cost would be measurable.

That sounds pretty sensible, as it'd be minimally intrusive.
I've wondered whether we could further optimise some uses by having libpq-be manage an iovec for us instead, much like we support iovec for shm_mq. Instead of a StringInfo we'd use an iovec wrapped by a libpq-managed struct. libpq would reserve the first few entries in the managed iovec for the message header. Variants of pq_sendint64(...) etc would add entries to the iovec, and could be inlined since they'd just be convenience routines. The message would be flushed by having libpq call writev() on the iovec container.

We'd want a wrapper struct for the iovec so libpq could keep a cursor for the next entry. For libpq-fe it'd also contain a map of which iovec entries need to be free()d; for libpq-be we'd probably palloc(), maybe with a child memory context.

To support a stack-allocated iovec for when we know all the message fields in advance, we'd have an init function that takes the address of the preallocated iovec and its length limit; the caller would be responsible for managing the max size. We could also support a libpq-wrapped iovec where libpq can realloc it if it fills.

That way we could do zero-copy scatter-gather I/O for messages that don't require binary-to-text-format transformations etc.

BTW, if we change StringInfo, I'd like to also officially bless the usage pattern where we wrap a buffer in a StringInfo so we can use pq_getmsgint64 etc on it. Add an initConstStringInfo(StringInfo si, const char *buf, size_t buflen) or something that assigns the StringInfo values and sets maxlen = -1. The only in-core usage I see for this so far is in src/backend/replication/logical/worker.c, but it's used extremely heavily in pglogical etc. It'd just be a convenience function that blesses and documents existing usage.

But like Tom I really want to first come back to the evidence. Why should we bother? Are we solving an actual problem here? PostgreSQL is immensely memory-allocation-happy and copy-happy.
Shouldn't we be more interested in things like reducing the cost of multiple copies and transform passes over Datum values? Especially since that's an actual operational pain point when you're working with multi-hundred-megabyte bytea or text fields. Can you come up with some profiling/performance numbers that track time spent on memory copying in the areas you propose to target, plus malloc overheads? With a tool like systemtap or perf it should not be overly difficult to do so by making targeted probes that filter based on call stack, or on file/line-range or function.