Re: [HACKERS] Tuplesort merge pre-reading - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: [HACKERS] Tuplesort merge pre-reading
Date
Msg-id CAH2-WznrO1XQ5F3Mb+mWyrE_aY5DJWOFh=ePbw1BVi1=JoG9sQ@mail.gmail.com
In response to Re: [HACKERS] Tuplesort merge pre-reading  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Tuplesort merge pre-reading  (Peter Geoghegan <pg@bowt.ie>)
Re: [HACKERS] Tuplesort merge pre-reading  (Robert Haas <robertmhaas@gmail.com>)
Re: [HACKERS] Tuplesort merge pre-reading  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On Thu, Apr 13, 2017 at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm fairly sure that the point was exactly what it said, ie improve
> locality of access within the temp file by sequentially reading as many
> tuples in a row as we could, rather than grabbing one here and one there.
>
> It may be that the work you and Peter G. have been doing have rendered
> that question moot.  But I'm a bit worried that the reason you're not
> seeing any effect is that you're only testing situations with zero seek
> penalty (ie your laptop's disk is an SSD).  Back then I would certainly
> have been testing with temp files on spinning rust, and I fear that this
> may still be an issue in that sort of environment.

I actually think Heikki's work here would particularly help on
spinning rust, especially when less memory is available. He
specifically justified it on the basis of it resulting in a more
sequential read pattern, particularly when multiple passes are
required.

> The larger picture to be drawn from that thread is that we were seeing
> very different performance characteristics on different platforms.
> The specific issue that Tatsuo-san reported seemed like it might be
> down to weird read-ahead behavior in a 90s-vintage Linux kernel ...
> but the point that this stuff can be environment-dependent is still
> something to take to heart.

BTW, I'm skeptical of Heikki's idea of killing polyphase
merge itself at this point. I think that keeping most tapes active per
pass is useful now that our memory accounting involves handing over an
even share to each maybe-active tape for every merge pass, something
established by Heikki's work on external sorting.
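
To make the "even share" point concrete, here is a rough sketch of the
arithmetic, not the actual tuplesort.c accounting. The function name, the
parameters, and the 8192-byte block size are assumptions made for
illustration only.

```python
def per_tape_buffer_blocks(work_mem_bytes, active_tapes, block_size=8192):
    """Blocks of read buffer each tape gets under an even split.

    Illustrative only: block_size=8192 is an assumed block size, and the
    real accounting subtracts other overheads before dividing.
    """
    if active_tapes < 1:
        raise ValueError("need at least one active tape")
    total_blocks = work_mem_bytes // block_size
    # Every maybe-active tape gets the same share, with a floor of one block.
    return max(1, total_blocks // active_tapes)


if __name__ == "__main__":
    mem = 64 * 1024 * 1024  # hypothetical 64MB memory budget
    for tapes in (7, 500):
        print(tapes, per_tape_buffer_blocks(mem, tapes))
```

Under these assumptions, fewer tapes per pass means a larger buffer per
tape, and so longer sequential reads from each tape before switching.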

Interestingly enough, I think that Knuth was pretty much spot on with
his "sweet spot" of 7 tapes, even if you have modern hardware. Commit
df700e6 (where the sweet spot of merge order 7 was no longer always
used) was effective because it masked certain overheads that we
experience when doing multiple passes, overheads that Heikki and I
mostly removed. This was confirmed by Robert's testing of my merge
order cap work for commit fc19c18, where he found that using 7 tapes
was only slightly worse than using many hundreds of tapes. If we could
somehow be completely effective in making access to logical tapes
perfectly sequential, then 7 tapes would probably be noticeably
*faster*, due to CPU caching effects.
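
The trade-off between merge order and number of passes can be estimated
with a back-of-the-envelope sketch. This models a plain balanced merge
where every pass merges groups of up to merge_order runs; it is a
simplification and does not model polyphase run distribution or anything
in tuplesort.c. The function name and run counts are hypothetical.

```python
def merge_passes(initial_runs, merge_order):
    """Passes needed to reduce initial_runs to one run, assuming each
    pass merges groups of up to merge_order runs (balanced merge model)."""
    if initial_runs < 1 or merge_order < 2:
        raise ValueError("need initial_runs >= 1 and merge_order >= 2")
    passes = 0
    runs = initial_runs
    while runs > 1:
        runs = -(-runs // merge_order)  # ceiling division
        passes += 1
    return passes


if __name__ == "__main__":
    for order in (7, 500):
        print(order, merge_passes(1000, order))
```

In this model a low merge order costs extra passes over the data, which
is the cost that has to be weighed against larger per-tape buffers and
any CPU caching benefit.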

Knuth was completely correct to say that it basically makes no
difference once more than 7 tapes are used to merge, because he didn't
have logtape.c fragmentation to worry about.

-- 
Peter Geoghegan

VMware vCenter Server
https://www.vmware.com/


