Thread: seq scan cache vs. index cache smackdown
Hi All,

I have boiled my situation down to the following simple case (postgres version 7.3):

* Query 1 is doing a sequential scan over a table (courtesy of field ILIKE 'foo%') and index joins to a few others
* Query 2 is doing a functional index scan over the same table (lower(field) LIKE 'foo%') and index joins to a few others
* neither query has an ORDER BY clause
* for the purpose of testing, both queries are designed to return the same result set

Obviously Q2 is faster than Q1, but if I ever run them both at the same time (let's say I run two of Q1 and one of Q2 at the same time) then Q2 consistently returns WORSE times than Q1 (EXPLAIN ANALYZE confirms that it is using the index).

My assumption is that the sequential scan is blowing the index from any cache it might live in, and simultaneously stealing all the disk I/O that is needed to access the index on disk (the table has 200,000 rows).

If I simplify the case to not do the index joins (i.e. operate on the one table only) the situation is not as dramatic, but similar.

My thoughts are:

1) kill the sequential scan - but unfortunately I don't have direct control over that code
2) change the way the server allocates/prioritizes different caches - I don't know enough about how postgres caches work to do this (if it's possible)
3) try it on postgres 7.4 - possible, but migrating the system to 7.4 in production will be hard because the above code that I am not responsible for has a lot of (slightly wacky) implicit date casts
4) ask the fine people on the mailing list for other suggestions!

--
Mark Aufflick
e mark@pumptheory.com
w www.pumptheory.com (work)
w mark.aufflick.com (personal)
p +61 438 700 647
f +61 2 9436 4737
Hi,

I think there was some discussion about seq scans messing up the cache, and talk about doing something about it, but I don't think it has been addressed yet. Maybe worth a troll through the archives.

It is certainly true that in many situations a seq scan is preferable to using an index. I have been testing a situation here on two versions of the same database; one of the databases is much bigger than the other (artificially bloated for testing purposes). Some of the query plans change to use seq scans on the big database where they used indexes on the little database - but either way, in *single user* testing the performance is fine.

My concern is that this kind of testing has very little relevance to the real world of multiuser processing, where contention for the cache becomes an issue. It may be that, at least in the current situation, postgres is giving too much weight to seq scans based on single-user, straight-line performance comparisons.

If your assumption is correct, then addressing that might help, though it's bound to have its compromises too...

regards
Iain
> My concern is that this kind of testing has very little relevance to the
> real world of multiuser processing, where contention for the cache becomes
> an issue. It may be that, at least in the current situation, postgres is
> giving too much weight to seq scans based on single-user, straight-line
> performance comparisons.

To be fair, a large index scan can easily throw the buffers out of whack as well. An index scan on 0.1% of a table with 1 billion tuples will have a similar impact on the buffers as a sequential scan of a table with 1 million tuples.

Any solution fixing the buffers should probably not take into consideration the method being performed (do you really want to skip caching a sequential scan of a 2-tuple table because it didn't use an index?) but the volume of data involved as compared to the size of the cache.

I've often wondered if a single 1GB toasted tuple could wipe out the buffers. I would suppose that TOAST doesn't bypass them.

--
Rod Taylor <pg@rbt.ca>
The world rejoiced as mark@pumptheory.com (Mark Aufflick) wrote:
> My assumption is that the sequential scan is blowing the index from
> any cache it might live in, and simultaneously stealing all the disk
> I/O that is needed to access the index on disk (the table has 200,000
> rows).

There's something to be said for that...

> My thoughts are:
>
> 1) kill the sequential scan - but unfortunately I don't have direct
> control over that code

This is a good choice, if plausible...

> 2) change the way the server allocates/prioritizes different caches -
> I don't know enough about how postgres caches work to do this (if it's
> possible)

That's what the 8.0 cache changes did... Patent claim issues are leading to some changes to the prioritization, which is liable to change in 8.0.something and in 8.1.

> 3) try it on postgres 7.4 - possible, but migrating the system to 7.4
> in production will be hard because the above code that I am not
> responsible for has a lot of (slightly wacky) implicit date casts

Moving to 7.4 wouldn't materially change the situation; you'd have to go all the way to version 8.

--
(format nil "~S@~S" "cbbrowne" "gmail.com")
http://linuxdatabases.info/~cbbrowne/postgresql.html
Rules of the Evil Overlord #32. "I will not fly into a rage and kill a
messenger who brings me bad news just to illustrate how evil I really am.
Good messengers are hard to come by." <http://www.eviloverlord.com/>
Hi Rod,

> Any solution fixing the buffers should probably not take into
> consideration the method being performed (do you really want to skip
> caching a sequential scan of a 2-tuple table because it didn't use an
> index?) but the volume of data involved as compared to the size of the
> cache.

Yes, in fact indexes aren't so different to tables really in that regard. It sounds like version 8 may help out anyway.

regards
Iain
Mark Aufflick <mark@pumptheory.com> writes:

> Obviously Q2 is faster than Q1,

That's not really obvious at all. If there are lots of records being returned, the index might not be faster than a sequential scan.

> My assumption is that the sequential scan is blowing the index from any
> cache it might live in, and simultaneously stealing all the disk I/O that
> is needed to access the index on disk (the table has 200,000 rows).

It kind of sounds to me like you've lowered random_page_cost to reflect the fact that your indexes are nearly always completely cached. But when they're not, this unrealistic random_page_cost causes indexes to be used when they're no longer faster.

Perhaps you should post an EXPLAIN ANALYZE of your Q1 and Q2 (the latter preferably with and without enable_indexscan, but since it's a join you may not be able to get precisely the comparable plan without just that one index scan).

> 2) change the way the server allocates/prioritizes different caches - I
> don't know enough about how postgres caches work to do this (if it's
> possible)

Postgres keeps one set of shared buffers, not separate pools. Normally you only allocate a small amount of your memory for Postgres and let the OS handle disk caching. What is your shared_buffers set to, and how much memory do you have?

> 3) try it on postgres 7.4 - possible, but migrating the system to 7.4 in
> production will be hard because the above code that I am not responsible
> for has a lot of (slightly wacky) implicit date casts

I can't think of any 7.4 changes that would affect this directly, but there were certainly plenty of changes that had broad effects. You never know.

8.0, on the other hand, has a new algorithm that specifically tries to protect against the shared buffers being blown out by a sequential scan. But that will only help if it's the shared buffers being thrashed that's hurting you, not the entire OS file system cache.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> 8.0, on the other hand, has a new algorithm that specifically tries to
> protect against the shared buffers being blown out by a sequential
> scan. But that will only help if it's the shared buffers being
> thrashed that's hurting you, not the entire OS file system cache.

Something we ought to think about sometime: what are the performance implications of the real-world situation that we have another level of caching sitting underneath us? AFAIK all the theoretical studies we've looked at consider only a single level of caching.

But for example, if our buffer management algorithm recognizes an index page as being heavily hit and therefore keeps it in cache for a long time, then when it does fall out of cache you can be sure it's going to need to be read from disk when it's next used, because the OS-level buffer cache has not seen a call for that page in a long time. Contrariwise, a page that we think is only on the fringe of usefulness is going to stay in the OS cache because we repeatedly drop it and then have to ask for it again.

I have no idea how to model this situation, but it seems like it needs some careful thought.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Something we ought to think about sometime: what are the performance
> implications of the real-world situation that we have another level of
> caching sitting underneath us?

It seems inevitable that Postgres will eventually eliminate that redundant layer of buffering. Since mmap is not workable, that means using O_DIRECT to read table and index data.

Every other database eventually goes this direction, and for good reason. Having two layers of caching and buffering is inherently inefficient. It also makes it impossible for Postgres to offer any application-specific hints to the cache replacement algorithms.

In that world you would configure Postgres much like you configure Oracle, with shared_buffers taking up as much of your memory as you can afford, and the OS file system cache kept entirely out of the loop.

> But for example, if our buffer management algorithm recognizes an index
> page as being heavily hit and therefore keeps it in cache for a long time,
> then when it does fall out of cache you can be sure it's going to need to
> be read from disk when it's next used, because the OS-level buffer cache
> has not seen a call for that page in a long time. Contrariwise, a page
> that we think is only on the fringe of usefulness is going to stay in the
> OS cache because we repeatedly drop it and then have to ask for it again.

Hum. Is it clear that that's bad? By the same logic, it's the ones on the fringe that you're likely to have to read again anyway. The ones that are being heavily used are likely not to have to be read again anyway.

--
greg
> It seems inevitable that Postgres will eventually eliminate that
> redundant layer of buffering. Since mmap is not workable, that means
> using O_DIRECT to read table and index data.

What about going the other way and simply letting the O/S do all the caching? How bad (or good) would the performance really be?

Merlin
Tom, Greg, Merlin,

> But for example, if our buffer management algorithm recognizes an index
> page as being heavily hit and therefore keeps it in cache for a long time,
> then when it does fall out of cache you can be sure it's going to need to
> be read from disk when it's next used, because the OS-level buffer cache
> has not seen a call for that page in a long time. Contrariwise, a page
> that we think is only on the fringe of usefulness is going to stay in the
> OS cache because we repeatedly drop it and then have to ask for it again.

Now you can see why other DBMSs don't use the OS disk cache. There are other issues as well; for example, as long as we use the OS disk cache, we can't eliminate checkpoint spikes, at least on Linux. No matter what we do with the bgwriter, fsyncing the OS disk cache causes heavy system activity.

> It seems inevitable that Postgres will eventually eliminate that
> redundant layer of buffering. Since mmap is not workable, that means
> using O_DIRECT to read table and index data.

Why is mmap not workable? It would require far-reaching changes to our code -- certainly -- but I don't think it can be eliminated from consideration.

> What about going the other way and simply letting the O/S do all the
> caching? How bad (or good) would the performance really be?

Pretty bad. You can simulate this easily by turning your shared_buffers way down...

--
Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
> Why is mmap not workable?

We can't control write order. There are other equally bad problems, but that one alone eliminates it from consideration. See past discussions.

			regards, tom lane
Josh Berkus wrote:
> Now you can see why other DBMSs don't use the OS disk cache. There are
> other issues as well; for example, as long as we use the OS disk cache,
> we can't eliminate checkpoint spikes, at least on Linux. No matter what
> we do with the bgwriter, fsyncing the OS disk cache causes heavy system
> activity.

MS SQL Server uses the O/S disk cache... the database is very tightly integrated with the O/S. Write performance is one of the few things SQL Server can do better than most other databases despite running on a mid-grade kernel and a low-grade filesystem... what does that say? ReadFileScatter() and ReadFileGather() were added to the win32 API specifically for SQL Server... this is somewhat analogous to transaction-based writing such as in Reiser4.

I'm not arguing MS SQL Server is better in any way; IIRC they are still using table locks (!).

>> It seems inevitable that Postgres will eventually eliminate that
>> redundant layer of buffering. Since mmap is not workable, that means
>> using O_DIRECT to read table and index data.

IMO, the O_DIRECT argument makes assumptions about storage and O/S technology that are moving targets. Not sure about mmap().

Merlin
Josh Berkus <josh@agliodbs.com> writes:
> Why is mmap not workable? It would require far-reaching changes to our
> code -- certainly -- but I don't think it can be eliminated from
> consideration.

Fundamentally, because there is no facility for being notified by the OS before a page is written to disk. And there's no way to prevent a page from being written to disk (mlock just prevents it from being flushed from memory, not from being synced to disk). So there's no way to guarantee the WAL will be written before the buffer is synced to disk.

Maybe it could be done by writing and syncing the WAL independently before the shared buffer is written to at all, but that would be a completely different model. And it would mean locking the shared buffer until the sync is done, and require a private copy of the shared buffer, necessitating more copying than the double buffering does in the first place.

--
greg
> Josh Berkus wrote:
>> Now you can see why other DBMSs don't use the OS disk cache. There are
>> other issues as well; for example, as long as we use the OS disk cache,
>> we can't eliminate checkpoint spikes, at least on Linux. No matter what
>> we do with the bgwriter, fsyncing the OS disk cache causes heavy system
>> activity.
>
> MS SQL Server uses the O/S disk cache...

No, it doesn't. They open all files with FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING. SQL Server scales the size of its buffer cache dynamically with the system, but it uses its own buffer cache.

> the database is very tightly integrated with the O/S.

That it is.

> Write performance is one of the few things SQL Server can do better than
> most other databases despite running on a mid-grade kernel and a low-grade
> filesystem... what does that say? ReadFileScatter() and ReadFileGather()
> were added to the win32 API specifically for SQL Server... this is
> somewhat analogous to transaction-based writing such as in Reiser4.

(Those are ReadFileScatter and WriteFileGather.)

I don't think that's correct either. Scatter/Gather I/O is used so SQL Server can issue reads for several blocks from disk into its own buffer cache with a single syscall, even if these buffers are not sequential. It did make significant performance improvements when they added it, though.

(For those not knowing - it's ReadFile/WriteFile where you pass an array of "this many bytes to this address" as parameters.)

> I'm not arguing MS SQL Server is better in any way; IIRC they are still
> using table locks (!).

Not at all. They use row-level locks, escalated to page level, then escalated to table level. It has been that way since 7.0. In <= 6.5 they had page-level and table-level locks. I think possibly back in 4.2 (this is 16-bit days on OS/2) they had only table-level locks, but that's a long time ago. They don't do MVCC, though.

(I'm not saying it's better either. At some things it is, at many it is not.)

//Magnus
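For reference, a minimal sketch of opening a file with those flags (purely an illustration of the Win32 API mentioned above, not code taken from SQL Server). With FILE_FLAG_NO_BUFFERING, every transfer must be a multiple of the volume sector size and use sector-aligned buffers:

    #include <windows.h>

    /* Open a data file unbuffered and write-through, bypassing the OS
     * file cache entirely; the application must do its own caching.
     * Returns INVALID_HANDLE_VALUE on failure. */
    HANDLE open_unbuffered(const wchar_t *path)
    {
        return CreateFileW(path,
                           GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ,
                           NULL,                    /* default security */
                           OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                           NULL);                   /* no template file */
    }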
In the last exciting episode, merlin.moncure@rcsonline.com ("Merlin Moncure") wrote:
>> It seems inevitable that Postgres will eventually eliminate that
>> redundant layer of buffering. Since mmap is not workable, that
>> means using O_DIRECT to read table and index data.
>
> What about going the other way and simply letting the O/S do all the
> caching? How bad (or good) would the performance really be?

I'm going to see about taking this story to OLS (Ottawa Linux Symposium) in July and will see what hearing I can get. There are historically some commonalities in the way this situation is regarded, in that there was _long_ opposition to the notion of having unbuffered disk devices.

If there's more "story" that definitely needs to be taken, let me know...

--
output = reverse("moc.enworbbc" "@" "enworbbc")
http://linuxdatabases.info/info/slony.html
Rules of the Evil Overlord #90. "I will not design my Main Control Room so
that every workstation is facing away from the door."
<http://www.eviloverlord.com/>
In the 'wishful hand waving' department:

Read the index -> determine the (tuple id, page) pairs to hit in the table -> for each of these, tell the OS 'I'm gonna need these' via a NON-BLOCKING call. Non-blocking because you feed the information to the OS as you read the index, streaming it.

Meanwhile, the OS accumulates the requests in an internal FIFO, reorganizes them according to the order best suited to good disk head movements, then reads them in clusters, and calls a callback inside the application when it has data available. Or the application polls it once in a while to get a bucketload of pages. The 'I'm gonna need these'() syscall would also sometimes return 'hey, I'm full, read the pages I have here waiting for you before asking for new ones'.

A flag would tell the OS if the application wanted the results in any order, or with order preserved. Without order preservation, if the application has requested the same page twice with different tuple ids, the OS would call the callback only once, giving it a list of the tuple ids associated with that page.

It involves a tradeoff between memory and performance: as the size of the FIFO increases, the likelihood of good contiguous disk reading increases. However, the memory structure would only contain page numbers and tuple ids, so it could be pretty small. Returning the results in order would also need more memory.

It could be made very generic if instead of 'tuple id' you read 'opaque application data', and instead of 'page' you read '(offset, length)'.

This structure actually exists already in the Linux kernel; it's called the elevator or something, but it works for scheduling reads between threads.

You can also read 'internal not yet developed postgres cache manager' instead of OS if you don't feel like talking kernel developers into implementing this thing.
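A minimal sketch of what the non-blocking "I'm gonna need these" hint can look like from the application side today, assuming a POSIX system with posix_fadvise(); this covers only the readahead-hint half of the idea, with none of the callbacks or reordering guarantees described above:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192   /* assume an 8 kB table block */

    /* Hint that one block of an already-open heap file will be needed
     * soon; the kernel queues readahead and returns immediately. */
    static void prefetch_block(int fd, long blkno)
    {
        int rc = posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                               POSIX_FADV_WILLNEED);
        if (rc != 0)   /* posix_fadvise returns an errno value directly */
            fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
    }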
PFC <lists@boutiquenumerique.com> writes:
> You can also read 'internal not yet developed postgres cache manager'
> instead of OS if you don't feel like talking kernel developers into
> implementing this thing.

It exists already; it's called aio. But there are a *lot* of details you skipped over. And as always, the devil is in the details.

--
greg
Josh Berkus wrote:
> Now you can see why other DBMSs don't use the OS disk cache. ...
> ... as long as we use the OS disk cache, we can't eliminate checkpoint
> spikes, at least on Linux.

Wouldn't the VM settings, like the ones under /proc/sys/vm and/or the commit=XXX mount option if using ext3, be a good place to control this? It seems that if you wanted to, by setting /proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_expire_centisecs very low, you'd be constantly flushing dirty pages.

Has anyone experimented with these kinds of values:

  /proc/sys/vm/dirty_ratio               /* the generator of dirty data writes back at this ratio */
  /proc/sys/vm/dirty_background_ratio    /* start background writeback */
  /proc/sys/vm/dirty_writeback_centisecs /* the interval between [some style of] writebacks */
  /proc/sys/vm/dirty_expire_centisecs    /* the number of centiseconds that data is allowed to remain dirty */

I tried these to work around the opposite kind of problem... on a laptop running Linux under VMware, I wanted to avoid having it do writes quickly, to make each individual transaction go faster at the expense of a big spike in I/O that the sales guy would trigger explicitly before talking for a while. Setting each of those very high and using a commit=600 mount option made the whole demo run with very little I/O except for the explicit sync; but I never took the time to understand which setting mattered to me or why.
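A small sketch of setting the two knobs mentioned above so the kernel starts writing dirty pages back almost immediately (assumes Linux 2.6 and root privileges; the same effect comes from echoing the values into the /proc files from a shell):

    #include <stdio.h>

    /* Write one integer value into a /proc/sys/vm knob. */
    static void set_vm_knob(const char *name, long value)
    {
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
        f = fopen(path, "w");
        if (f == NULL) {
            perror(path);
            return;
        }
        fprintf(f, "%ld\n", value);
        fclose(f);
    }

    int main(void)
    {
        set_vm_knob("dirty_background_ratio", 1);    /* start background writeback almost at once */
        set_vm_knob("dirty_expire_centisecs", 500);  /* treat data older than 5 s as flushable */
        return 0;
    }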
Magnus Hagander wrote:
> I don't think that's correct either. Scatter/Gather I/O is used so SQL
> Server can issue reads for several blocks from disk into its own buffer
> cache with a single syscall, even if these buffers are not sequential.
> It did make significant performance improvements when they added it,
> though.
>
> (For those not knowing - it's ReadFile/WriteFile where you pass an array
> of "this many bytes to this address" as parameters.)

Isn't that like the BSD writev()/readv() that Linux supports also? Is that something we should be using on Unix if it is supported by the OS?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
>> (For those not knowing - it's ReadFile/WriteFile where you pass an array
>> of "this many bytes to this address" as parameters.)
>
> Isn't that like the BSD writev()/readv() that Linux supports also? Is
> that something we should be using on Unix if it is supported by the OS?

Yes, they certainly seem very similar. The win32 functions are explicitly designed for async I/O (they were, after all, created specifically for SQL Server), so they put harder requirements on the parameters. Specifically, they transfer complete system pages only, and each pointer has to point to only one page. On a file opened without buffering, they will also write all the buffers out and then wait for I/O completion from the device, instead of waiting once for each buffer. Not sure what the writev/readv ones do (not clear from my Linux man page).

Now whether this is something we could make use of - I'll leave that up to those who know the buffer manager a lot better than I do.

//Magnus
> Isn't that like the BSD writev()/readv() that Linux supports also? Is
> that something we should be using on Unix if it is supported by the OS?

readv and writev are in the Single Unix Specification... and yes, they are basically just like the win32 versions, except that they are synchronous (and therefore better, IMO). On some systems they might just be implemented as a loop inside the library, or even as a macro.

http://www.opengroup.org/onlinepubs/007908799/xsh/sysuio.h.html

On operating systems that optimize vectored read operations, it's pretty reasonable to assume good or even great performance gains, in addition to (or instead of) the recent changes to xlog.c to group writes together for a file... it just takes things one step further.

Is there a reason why readv/writev have not been considered in the past?

Merlin
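For reference, a minimal sketch of readv() as specified on that page: one syscall reads consecutive bytes of the file, starting at the current offset, into several separate buffers:

    #include <sys/types.h>
    #include <sys/uio.h>

    /* Fill two separate buffers from consecutive file data in one call. */
    ssize_t read_two_buffers(int fd, char *buf_a, char *buf_b, size_t len)
    {
        struct iovec iov[2];

        iov[0].iov_base = buf_a;   /* first len bytes land here */
        iov[0].iov_len  = len;
        iov[1].iov_base = buf_b;   /* next len bytes land here  */
        iov[1].iov_len  = len;

        return readv(fd, iov, 2);  /* total bytes read, or -1 on error */
    }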
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > Is there a reason why readv/writev have not been considered in the past? Lack of portability, and lack of obvious usefulness that would justify dealing with the lack of portability. I don't think there's any value in trying to write ordinary buffers this way; making the buffer manager able to write multiple buffers at once sounds like a great deal of complexity and deadlock risk in return for not much. It might be an alternative to the existing proposed patch for writing multiple WAL buffers at once, but frankly I consider that patch a waste of effort. In real scenarios you seldom get to write more than one WAL page without a forced sync occurring because someone committed. Even if *your* transaction is long, any other backend committing a small transaction still fsyncs. On top of that, the bgwriter will be flushing WAL in order to maintain the write-ahead rule any time it dumps a dirty buffer. I have a personal to-do item to make the bgwriter explicitly responsible for writing completed WAL pages as part of its duties, but I haven't done anything about it because I think that it will write lots of such pages without any explicit code, thanks to the bufmgr's LSN interlock. Even if it doesn't get it done that way, the simplest answer is to add a little bit of code to make sure bgwriter generally does the writes, and then we don't care. If you want to experiment with writev, feel free, but I'll want to see demonstrable performance benefits before any such code actually goes in. regards, tom lane
Merlin Moncure wrote:
> readv and writev are in the Single Unix Specification... and yes ...
> On some systems they might just be implemented as a loop inside the
> library, or even as a macro.

You sure? Requirements like this one, from http://www.opengroup.org/onlinepubs/007908799/xsh/write.html:

  "Write requests of {PIPE_BUF} bytes or less will not be interleaved with
   data from other processes doing writes on the same pipe."

make me think that it couldn't be just a macro; and if it were a loop in the library, it seems it'd still have to make sure it's done with a single write system call.

(Yeah, I know that requirement is just for pipes; and I suppose they could write a loop for normal files and a different special case for pipes; but I'd be surprised.)
>> (For those not knowing - it's ReadFile/WriteFile where you pass an array
>> of "this many bytes to this address" as parameters.)
>
> Isn't that like the BSD writev()/readv() that Linux supports also? Is
> that something we should be using on Unix if it is supported by the OS?

Nope. readv()/writev() read/write from/to the file sequentially, to/from a list of buffers in memory. The Windows calls read/write at random file offsets, to/from a list of buffers.