Re: Streaming I/O, vectored I/O (WIP) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Streaming I/O, vectored I/O (WIP)
Date
Msg-id CA+hUKG+ZLXp1jk+ML11GedOUKw9mEzAie3nmY9041CygNpd=9w@mail.gmail.com
Whole thread Raw
In response to Re: Streaming I/O, vectored I/O (WIP)  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Streaming I/O, vectored I/O (WIP)
List pgsql-hackers
On Thu, Nov 30, 2023 at 12:16 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 29/11/2023 21:39, Thomas Munro wrote:
> > One thing I wasn't 100% happy with was the treatment of ENOSPC.  A few
> > callers believe that short writes set errno: they error out with a
> > message including %m.  We have historically set errno = ENOSPC inside
> > FileWrite() if the write size was unexpectedly small AND the kernel
> > didn't set errno to a non-zero value (having set it to zero ourselves
> > earlier).  In FileWriteV(), I didn't want to do that because it is
> > expensive to compute the total write size from the vector array and we
> > managed to measure an effect due to that in some workloads.
> >
> > Note that the smgr patch actually handles short writes by continuing,
> > instead of raising an error.  Short writes do already occur in the
> > wild on various systems for various rare technical reasons other than
> > ENOSPC I have heard (imagine transient failure to acquire some
> > temporary memory that the kernel chooses not to wait for, stuff like
> > that, though certainly many people and programs believe they should
> > not happen[1]), and it seems like a good idea to actually handle them
> > as our write sizes increase and the probability of short writes might
> > presumably increase.
>
> Maybe we should bite the bullet and always retry short writes in
> FileWriteV(). Is that what you meant by "handling them"?
> If the total size is expensive to calculate, how about passing it as an
> extra argument? Presumably it is cheap for the callers to calculate at
> the same time that they build the iovec array?

It's cheap for md.c, because it already has nblocks_this_segment.
That's one reason I chose to put the retry there.  If we push it down
to fd.c in order to be able to help other callers, you're right that
we could pass in the total size (and I guess assert that it's
correct), but that is sort of annoyingly redundant and further from
the interface we're wrapping.

There is another problem with pushing it down to fd.c, though.
Suppose you try to write 8192 bytes, and the kernel says "you wrote
4096 bytes" so your loop goes around again with the second half the
data and now the kernel says "-1, ENOSPC".  What are you going to do?
fd.c doesn't raise errors for I/O failure, it fails with -1 and errno,
so you'd either have to return -1, ENOSPC (converting short writes
into actual errors, a lie because you did write some data), or return
4096 (and possibly also set errno = ENOSPC as we have always done).
So you can't really handle this problem at this level, can you?
Unless you decide that fd.c should get into the business of raising
errors for I/O failures, which would be a bit of a departure.

That's why I did the retry higher up in md.c.

> > With the previous version of the patch, we'd have to change a couple
> > of other callers not to believe that short writes are errors and set
> > errno (callers are inconsistent on this point).  I don't really love
> > that we have "fake" system errors but I also want to stay focused
> > here, so in this new version V3 I tried a new approach: I realised I
> > can just always set errno without needing the total size, so that
> > (undocumented) aspect of the interface doesn't change.  The point
> > being that it doesn't matter if you clobber errno with a bogus value
> > when the write was non-short.  Thoughts?
>
> Feels pretty ugly, but I don't see anything outright wrong with that.

Cool.  I would consider cleaning up all the callers and get rid of
this ENOSPC stuff in independent work, but I didn't want discussion of
that (eg what external/extension code knows about this API?) to derail
THIS project, hence desire to preserve existing behaviour.



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: logical decoding and replication of sequences, take 2
Next
From: Tomas Vondra
Date:
Subject: Re: Parallel CREATE INDEX for BRIN indexes