Re: Seq scans roadmap - Mailing list pgsql-hackers
From: Luke Lonergan
Subject: Re: Seq scans roadmap
Msg-id: C3E62232E3BCF24CBA20D72BFDCB6BF803EB6A65@MI8NYCMAIL08.Mi8.com
In response to: Re: Seq scans roadmap (Heikki Linnakangas <heikki@enterprisedb.com>)
Responses: Re: Seq scans roadmap; Re: Seq scans roadmap
List: pgsql-hackers
Heikki,

> That's interesting. Care to share the results of the
> experiments you ran? I was thinking of running tests of my
> own with varying table sizes.

Yah - it may take a while - you might get there faster. There are some interesting effects to look at between I/O cache performance and PG bufcache, and at those speeds the only tool I've found that actually measures scan rate in PG is VACUUM. "SELECT COUNT(*)" measures CPU consumption in the aggregation node, not scan rate. Note that the copy from I/O cache to PG bufcache is where the L2 effect is seen.

> The main motivation here is to avoid the sudden drop in
> performance when a table grows big enough not to fit in RAM.
> See attached diagram for what I mean. Maybe you're right and
> the effect isn't that bad in practice.

There are going to be two performance drops: the first when the table doesn't fit into the PG bufcache, the second when it doesn't fit in bufcache + I/O cache. The second is severe; the first is almost insignificant (for common queries).

> How is that different from what I described?

My impression of your descriptions is that they overvalue the case where there are multiple scanners of a large (> 1x bufcache) table such that they can share the "first load" of the bufcache, e.g. your 10% benefit for table = 10x bufcache argument. I think this is an uncommon workload; rather, there are normally many small tables and several large tables, such that sharing the PG bufcache is irrelevant to the query speed.

> Yeah I remember the discussion on the L2 cache a while back.
>
> What do you mean with using readahead inside the heapscan?
> Starting an async read request?

Nope - just reading N buffers ahead for seqscans. Subsequent calls use previously read pages. The objective is to issue contiguous reads to the OS in sizes greater than the PG page size (which is much smaller than what is needed for fast sequential I/O).
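[Editor's illustration: the "read N buffers ahead" idea above can be sketched as a small readahead layer in C. This is not PostgreSQL code; the struct, names, and the 16-page window are assumptions chosen for illustration. The point it demonstrates is that single-page requests from the scan are served from a window filled by one large contiguous read, so the number of OS-level reads drops by a factor of N.]

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE 8192          /* PG page size */
#define READAHEAD_PAGES 16      /* N pages per physical read (assumed value) */

/* Hypothetical readahead layer: the seqscan still asks for one page at
 * a time, but each miss triggers one large contiguous read of up to
 * READAHEAD_PAGES pages; subsequent calls are served from the window. */
typedef struct
{
    const char *file;           /* simulated table file */
    long        file_pages;     /* total pages in the file */
    char        window[READAHEAD_PAGES * PAGE_SIZE];
    long        window_start;   /* first page currently in the window */
    long        window_len;     /* pages valid in the window */
    long        io_calls;       /* large reads issued (for accounting) */
} ReadAhead;

static void
ra_init(ReadAhead *ra, const char *file, long pages)
{
    ra->file = file;
    ra->file_pages = pages;
    ra->window_start = 0;
    ra->window_len = 0;
    ra->io_calls = 0;
}

/* Return a pointer to page `pageno`, refilling the window with one
 * contiguous multi-page read when the page is not already buffered. */
static const char *
ra_read_page(ReadAhead *ra, long pageno)
{
    if (pageno < ra->window_start ||
        pageno >= ra->window_start + ra->window_len)
    {
        long    n = ra->file_pages - pageno;

        if (n > READAHEAD_PAGES)
            n = READAHEAD_PAGES;
        /* one big read replaces n single-page reads */
        memcpy(ra->window, ra->file + pageno * PAGE_SIZE, n * PAGE_SIZE);
        ra->window_start = pageno;
        ra->window_len = n;
        ra->io_calls++;
    }
    return ra->window + (pageno - ra->window_start) * PAGE_SIZE;
}
```

With a 16-page window, a sequential scan of 64 pages issues only 4 large reads instead of 64 page-sized ones; the real win is that each OS read is now large enough for the kernel and disk to stream sequentially.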
> > The modifications you suggest here may have the following
> > properties:
> > - don't pollute bufcache for seqscan of tables > 1 x bufcache
> > - for tables > 1 x bufcache use a ringbuffer for I/O that is ~ 32KB
> >   to minimize L2 cache pollution
>
> So the difference is that you don't want 3A (the take
> advantage of pages already in buffer cache) strategy at all,
> and want the buffer ring strategy to kick in earlier instead.
> Am I reading you correctly?

Yes, I think the ring buffer strategy should be used when the table size is > 1 x bufcache, and the ring buffer should be of a fixed size smaller than L2 cache (32KB - 128KB seems to work well).

- Luke
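[Editor's illustration: the ring buffer strategy described above, sketched in C under stated assumptions. The names, the 4-slot ring (4 x 8KB = 32KB, the low end of Luke's suggested L2-friendly range), and the `use_ring` threshold are all hypothetical, not PostgreSQL's actual implementation. The sketch shows the two decisions in the proposal: a size test against the buffer cache, and round-robin reuse of a fixed set of buffer slots so a large scan never touches more than ring-sized memory.]

```c
#include <assert.h>

#define PAGE_SIZE 8192
#define RING_SLOTS 4            /* 4 x 8KB = 32KB, under typical L2 (assumed) */

/* Hypothetical ring strategy: a seqscan of a big table reuses the same
 * small set of buffer slots round-robin instead of cycling the whole
 * buffer cache through its pages. */
typedef struct
{
    int     slots[RING_SLOTS];  /* buffer ids dedicated to this scan */
    int     next;               /* next slot to hand out */
} RingBuffer;

/* Luke's threshold: use the ring only when the table exceeds 1 x the
 * PG buffer cache. */
static int
use_ring(long table_pages, long bufcache_pages)
{
    return table_pages > bufcache_pages;
}

static void
ring_init(RingBuffer *ring, const int *buf_ids)
{
    for (int i = 0; i < RING_SLOTS; i++)
        ring->slots[i] = buf_ids[i];
    ring->next = 0;
}

/* Pick the buffer for the scan's next page: cycle through the fixed
 * slots, so the scan's cache footprint stays at RING_SLOTS pages. */
static int
ring_next_buffer(RingBuffer *ring)
{
    int     buf = ring->slots[ring->next];

    ring->next = (ring->next + 1) % RING_SLOTS;
    return buf;
}
```

Because the scan keeps rewriting the same 32KB of buffers, both the PG buffer cache and the CPU's L2 cache stay available for other work, which is the pollution-avoidance property listed in the quoted requirements.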