On Tue, May 26, 2020 at 10:59 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Mon, May 25, 2020 at 12:49:45PM -0700, Jeff Davis wrote:
> >Do you think the difference in IO patterns is due to a difference in
> >handling reads vs. writes in the kernel? Or do you think that 128
> >blocks is not enough to amortize the cost of a seek for that device?
>
> I don't know. I kinda imagined it was due to the workers interfering
> with each other, but that should affect the sort the same way, right?
> I don't have any data to support this, at the moment - I can repeat
> the iosnoop tests and analyze the data, of course.
About the reads vs writes question: I know that reading and writing
two interleaved sequential "streams" through the same fd confuses the
read-ahead/write-behind heuristics on FreeBSD UFS (I mean: w(1),
r(42), w(2), r(43), w(3), r(44), ...) so the performance is terrible
on spinning media. Andrew Gierth reported that as a problem for
sequential scans that are also writing back hint bits, and for vacuum.
However, in a quick test on a Linux 4.19 XFS system, using a program
to generate interleaved read and write streams 1MB apart, I could see
that it was still happily issuing larger clustered I/Os. I have no
clue about other operating systems. That said, even on Linux, reads and
writes still have to compete for scant IOPS on slow-seek media (albeit
hopefully in larger clustered I/Os)...
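For concreteness, a minimal sketch of that kind of interleaved-stream
test (invented file name, block size and stream distance; not the exact
program I used):

/*
 * Interleave a sequential write stream and a sequential read stream
 * 1MB apart through the same fd, i.e. w(0), r(128), w(1), r(129), ...
 * for 8KB blocks.  "testfile" must already exist and be larger than
 * NBLOCKS * BLOCK_SIZE + GAP (e.g. created with dd) so the reads hit
 * real data.  Watch the device with iostat/blktrace to see whether
 * the kernel still clusters the I/Os.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 8192
#define GAP (1024 * 1024)
#define NBLOCKS 4096

int
main(void)
{
    char    buf[BLOCK_SIZE];
    int     fd = open("testfile", O_RDWR);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));
    for (long i = 0; i < NBLOCKS; i++)
    {
        off_t   woff = (off_t) i * BLOCK_SIZE;

        if (pwrite(fd, buf, BLOCK_SIZE, woff) != BLOCK_SIZE ||
            pread(fd, buf, BLOCK_SIZE, woff + GAP) != BLOCK_SIZE)
        {
            perror("io");
            return 1;
        }
    }
    return close(fd) ? 1 : 0;
}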
Jumping over large interleaved chunks with no prefetching from other
tapes *must* produce stalls, though... and if you crank up the
read-ahead size to be a decent percentage of the contiguous chunk size, I
guess you must also waste I/O bandwidth on unwanted data past the end
of each chunk, no?
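Something like this hypothetical sketch is what I have in mind by
explicit prefetching, assuming the reader somehow knows where each
tape's next contiguous chunk lives (invented names; nothing like this
exists in logtape.c today):

#include <fcntl.h>

/* Hypothetical per-tape state: the next contiguous chunk to read. */
typedef struct TapeChunk
{
    int     fd;
    off_t   next_start;     /* offset of the chunk in the file */
    off_t   next_len;       /* its length in bytes */
} TapeChunk;

/*
 * Hint the kernel about every tape's next chunk before consuming any
 * of them.  POSIX_FADV_WILLNEED is non-blocking on Linux, so the
 * resulting reads can overlap instead of stalling one tape at a time.
 */
static void
prefetch_all_tapes(TapeChunk *tapes, int ntapes)
{
    for (int i = 0; i < ntapes; i++)
        (void) posix_fadvise(tapes[i].fd, tapes[i].next_start,
                             tapes[i].next_len, POSIX_FADV_WILLNEED);
}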
In an off-list chat with Jeff about whether Hash Join should use
logtape.c for its partitions too, the first thought I had was that to
be competitive with separate files, perhaps you'd need to write out a
list of block ranges for each tape (rather than just next pointers on
each block), so that you have the visibility required to control
prefetching explicitly. I guess that would be a bit like the list of
physical extents that Linux commands like filefrag(8) and xfs_bmap(8)
can show you for regular files. (Other thoughts included worrying
about how to make it allocate and stream blocks in parallel queries,
...!?#$)
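To make the block-range idea concrete, here's a minimal sketch of the
kind of per-tape extent list I mean (invented names, not actual
logtape.c structures):

/*
 * Each tape records the block ranges it owns, rather than each block
 * pointing only at its successor.  That's analogous to the physical
 * extent list filefrag(8) prints for a regular file, and it gives a
 * reader enough visibility to drive prefetching itself.
 */
typedef struct TapeExtent
{
    long    start;          /* first block number of the extent */
    long    nblocks;        /* number of consecutive blocks */
} TapeExtent;

typedef struct TapeExtentList
{
    int         nextents;
    TapeExtent  extents[];  /* nextents entries, in tape order */
} TapeExtentList;

/*
 * Map a tape-relative block number to a physical block number by
 * walking the extent list; returns -1 past the end of the tape.
 * Knowing the extents also tells you exactly which physical blocks to
 * prefetch next, with no unwanted read-ahead past a chunk's end.
 */
static long
tape_block_to_physical(const TapeExtentList *list, long block)
{
    for (int i = 0; i < list->nextents; i++)
    {
        if (block < list->extents[i].nblocks)
            return list->extents[i].start + block;
        block -= list->extents[i].nblocks;
    }
    return -1;
}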