Re: Making all nbtree entries unique by having heap TIDs participate in comparisons - Mailing list pgsql-hackers
From | Peter Geoghegan
Subject | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Date |
Msg-id | CAH2-Wz=L8G68UAWrW_UTvac-cb_SmSP-ReNzV6LHoKYmx1uE-Q@mail.gmail.com
In response to | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons (Peter Geoghegan <pg@bowt.ie>)
Responses | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
| Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
| Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
List | pgsql-hackers
On Sun, Mar 10, 2019 at 5:17 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The regression that I mentioned earlier isn't in pgbench type
> workloads (even when the distribution is something more interesting
> than the uniform distribution default). It is only in workloads with
> lots of page splits and lots of index churn, where we get most of the
> benefit of the patch, but also where the costs are most apparent.
> Hopefully it can be fixed, but if not I'm inclined to think that it's
> a price worth paying. This certainly still needs further analysis and
> discussion, though. This revision of the patch does not attempt to
> address that problem in any way.

I believe that I've figured out what's going on here. At first, I thought that this regression was due to the cycles that have been added to page splits, but that doesn't seem to be the case at all. Nothing that I did to make page splits faster helped (e.g. temporarily going back to doing them "bottom up" made no difference). CPU utilization was consistently slightly *higher* with the master branch (the patch spent slightly more CPU time idle). I now believe that the problem is LWLock/buffer lock contention on index pages, and that that's an inherent cost with a minority of write-heavy, high-contention workloads -- a cost that we should just accept.

Making the orderline primary key about 40% smaller increases contention when BenchmarkSQL is run with this particular configuration. The latency for the NEW_ORDER transaction went from ~4ms average on master to ~5ms average with the patch, while the latency for other types of transactions was either unchanged or improved. It's noticeable, but not that noticeable.

This needs to be put in context. The final transactions per minute for this configuration was 250,000, with a total of only 100 warehouses. What this boils down to is that the throughput per warehouse is about 8000% of the maximum valid NOPM specified by the TPC-C spec [1]. In other words, the database is too small relative to the machine, by a huge amount, making the result totally and utterly invalid if you go on what the TPC-C spec says. This exaggerates the LWLock/buffer lock contention on index pages.

TPC-C is supposed to simulate a real use case with a plausible configuration, but the details here are totally unrealistic. For example, there are 3 million customers here (there are always 30k customers per warehouse). 250k TPM means that there were about 112k new orders per minute. It's hard to imagine a population of 3 million customers making 112k orders per minute. That's over 20 million orders in the first 3-hour-long run that I got these numbers from. Each of these orders has an average of about 10 line items. These people must be very busy, and must have an awful lot of storage space in their homes!

(There are various other factors here, such as skew, and the details will never be completely realistic anyway, but you take my point. TPC-C is *designed* to be a realistic distillation of a real use case, going so far as to require usable GUI input terminals when evaluating a formal benchmark submission.)

The benchmark that I posted in mid-February [2] (which showed better performance across the board) was much closer to what the TPC-C spec requires -- that was only ~400% of maximum valid NOPM (the BenchmarkSQL HTML reports will tell you this if you download the archive I posted), and had 2,000 warehouses. TPC-C is *supposed* to be I/O bound, and I/O bound workloads are what the patch helps with the most.
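To spell out the arithmetic behind that ~8000% figure, here is a rough back-of-the-envelope sketch. It assumes the TPC-C spec's usual ceiling of about 12.86 new orders per minute per warehouse, plus the standard 45% NEW_ORDER share of the transaction mix (both come from the spec, not from my benchmark output):

    -- Rough NOPM math for the 100 warehouse / 250k TPM run described above.
    -- Assumes the spec's ~12.86 tpmC-per-warehouse ceiling and the standard
    -- 45% NEW_ORDER share of the transaction mix.
    SELECT 250000 * 0.45                          AS observed_nopm,    -- ~112,500
           100 * 12.86                            AS max_valid_nopm,   -- ~1,286
           round((250000 * 0.45) / (100 * 12.86)) AS times_over_limit; -- ~87x, i.e. roughly 8000%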
The general idea with TPC-C's NOPM is that you're required to increase the number of warehouses as throughput increases. This stops you from getting an unrealistically favorable result by churning through a small amount of data from the same few warehouses. The only benchmark that I ran that actually satisfied TPC-C's NOPM requirements had a total of 7,000 warehouses, and was almost a full terabyte in size on the master branch. This was run on an i3.4xlarge high I/O AWS EC2 instance. That was substantially I/O bound, and had an improvement in throughput that was very similar to the mid-February results that came from my home server -- we see a ~7.5% increase in transaction throughput after a few hours.

I attach a graph of block device reads/writes for the second 4-hour run of this same 7,000 warehouse benchmark (master and patch). This shows a substantial reduction in I/O according to OS-level instrumentation. (Note that the same FS/logical block device was used for both WAL and database files.)

In conclusion: I think that this regression is a cost worth accepting. The regression in throughput is relatively small (2% - 3%), and the NEW_ORDER transaction seems to be the only problem (NEW_ORDER happens to account for 45% of all transactions with TPC-C, and inserts into the largest index by far, without reading much). The patch overtakes master after a few hours anyway -- the patch will still win after about 6 hours, once the database gets big enough, despite all the contention.

As I said, I think that we see a regression *because* the indexes are much smaller, not in spite of the fact that they're smaller. The TPC-C/BenchmarkSQL indexes never fail to be about 40% smaller than they are on master, no matter the details, even after many hours. I'm not seeing the problem when pgbench is run with a small scale factor but with a high client count. pgbench doesn't have the benefit of much smaller indexes, so it also doesn't bear any cost when contention is ramped up. The pgbench_accounts primary key (which is by far the largest index) is *precisely* the same size as it is on master, though the other indexes do seem to be a lot smaller (they were already tiny to begin with). OTOH, the TPC-C NEW_ORDER transaction does a lot of straight inserts, localized by warehouse, with skewed access.

[1] https://youtu.be/qYeRHK6oq7g?t=1340
[2] https://www.postgresql.org/message-id/CAH2-WzmsK-1qVR8xC86DXv8U0cHwfPcuH6hhA740fCeEu3XsVg@mail.gmail.com

--
Peter Geoghegan