Reduce pinning in btree indexes - Mailing list pgsql-hackers
From | Kevin Grittner |
---|---|
Subject | Reduce pinning in btree indexes |
Date | |
Msg-id | 721615179.3351449.1423959585771.JavaMail.yahoo@mail.yahoo.com |
Responses | Re: Reduce pinning in btree indexes |
List | pgsql-hackers |
We have a customer who was unable to migrate from a well-known commercial database product to pg because they have a very large software base that holds cursors open for very long periods of time, preventing GlobalXmin values from advancing and leading to bloat. They have a standard test that exercises hundreds of client connections for days at a time, and saw disk space and CPU usage climbing to unacceptable levels[1]. The other database product had no such problems. We suggested the obvious solutions that we always suggest on the community lists, and they had perfectly good reasons why those were not viable for them -- we either needed to prevent bloat under their current software or they could not migrate. What they wanted was what happened in the other database product -- if a snapshot got sufficiently old that cleaning up the MVCC data was a problem *and* the snapshot was used again *and* it read a page which had been modified far enough back that it was not possible to return correct data, then they wanted to receive a "snapshot too old" error.

I wrote a patch to do that, but they didn't see much improvement, because all (auto)vacuum processes blocked indefinitely on the btree index pages where a cursor was parked. So they still got massive bloat and consequent problems.

It seemed odd to me that a btree implementation based on the Lehman & Yao techniques would block anything, because the concept is that it reads everything it needs off the index page and pauses "between" pages. We sort of do that, but the "interlock" between vacuum processes and other index usages basically involves holding a pin on the page we just read until we are done using the values we read off that page, and treating the pin as a lighter lock than a shared (or READ) lock. A pin only conflicts with a "super exclusive" lock, which consists of getting an exclusive lock only once there are no other processes holding a pin on the page. This ensures that at the end of a vacuum pass over a btree index there are no TIDs in process-local memory from before the start of the pass, and consequently no scan can read a new tuple assigned to the same TID value after the rest of the vacuum phases run. So a pin on a btree page can block a vacuum process indefinitely.

Interestingly, the btree README points out that using the old TID with a new tuple poses no hazard for a scan using an MVCC snapshot, because the new tuple would not be visible to a snapshot created that long ago. The customer's cursors which were causing the problems all use MVCC snapshots, so I went looking to see whether we couldn't take advantage of this fact. For the most part it seemed we were OK if we dropped pins along with the READ lock for a btree scan which used an MVCC snapshot.

I found that the LP_DEAD hinting would be a problem with an old TID, but I figured we could work around that by storing the page LSN into the scan position structure when the page contents were read, and only doing the hinting if the LSN still matched when we were ready to do it. That wouldn't work for an index which is not WAL-logged, so the patch still holds pins for those. Robert pointed out that the visibility information for an index-only scan isn't checked while the index page READ lock is held, so those scans also still hold the pins.

There was also a problem with the fact that the code conflated the notion of having a valid scan position with holding a pin on a block in the index. Those two concepts needed to be teased apart, which I did using some new macros and figuring out which one to use where. Without a pin on the block, we also needed to remember which block we had been processing in order to do the LP_DEAD hinting; for that, the block number was added to the structure which tracks scan position.
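To give a flavor of that separation, the idea is roughly along these lines (just an illustrative sketch, not the patch itself; the actual macro names and details may differ):

    /*
     * Illustrative sketch only: a scan position is "valid" if we remember
     * which block it refers to, independent of whether we still hold a pin
     * on that block's buffer.  currPage here is the block number described
     * above as being added to the scan position structure.
     */
    #define BTScanPosIsPinned(scanpos) \
        BufferIsValid((scanpos).buf)

    #define BTScanPosIsValid(scanpos) \
        BlockNumberIsValid((scanpos).currPage)

    #define BTScanPosUnpinIfPinned(scanpos) \
        do { \
            if (BTScanPosIsPinned(scanpos)) \
            { \
                ReleaseBuffer((scanpos).buf); \
                (scanpos).buf = InvalidBuffer; \
            } \
        } while (0)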
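Similarly, the LSN-based guard on LP_DEAD hinting described above might look something like this (again just a sketch; the helper function is hypothetical, and the lsn and currPage fields are the additions to the scan position structure I mentioned):

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "storage/bufmgr.h"

    /*
     * Sketch, not the patch: revisit a page whose pin we dropped and apply
     * LP_DEAD hints only if the page is provably unchanged since we read it.
     * killedItems holds the offsets we remembered while reading the page.
     */
    static void
    hint_killed_items_if_unchanged(Relation rel, BTScanOpaque so,
                                   OffsetNumber *killedItems, int numKilled)
    {
        Buffer      buf;
        Page        page;
        int         i;

        /* Re-pin and share-lock the page we remembered by block number. */
        buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
        page = BufferGetPage(buf);

        /*
         * If the page LSN has moved since we read the contents, the page may
         * have been vacuumed and its line pointers reused, so our remembered
         * offsets can no longer be trusted; skip the hinting entirely.
         */
        if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
        {
            for (i = 0; i < numKilled; i++)
                ItemIdMarkDead(PageGetItemId(page, killedItems[i]));

            MarkBufferDirtyHint(buf, true);
        }

        _bt_relbuf(rel, buf);
    }

Since the LSN comparison is only meaningful for WAL-logged indexes, this is also why the patch keeps holding pins for indexes that are not WAL-logged.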
Finally, there was an "optimization" for marking buffer position for possible restore that was incompatible with releasing the pin. I use quotes because the optimization adds overhead to every move to the next page in order to set some variables in a structure when a mark is requested, instead of running two fairly small memcpy() statements. The customer's two-day benchmark showed no performance hit, and looking at the code I would be amazed if the optimization yielded a measurable benefit. In general, optimizing a mark operation by adding overhead to every step through a scan seems dubious.

At some point we could consider building on this patch to recheck index conditions for heap access when a non-MVCC snapshot is used, to check the visibility map for referenced heap pages when the TIDs are read for an index-only scan, and/or to skip LP_DEAD hinting for non-WAL-logged indexes. But all of those are speculative future work; this is a conservative implementation that simply doesn't modify pinning where there are any confounding factors.

In combination with the snapshot-too-old patch, the 2-day customer benchmark ran without problematic levels of bloat, controlling disk space growth and keeping CPU usage level. Without the other patch, this patch would at least allow autovacuum to maintain statistics, which probably makes it worthwhile on its own, but it will probably not have a very big impact on bloat unless both are applied.

At least two README files and a lot of comments need to be modified before commit to reflect the change in approach, if this is accepted by the community. Since this work has been very recent I haven't had time to do that yet.

git diff --stat tells me this patch has:

 4 files changed, 208 insertions(+), 128 deletions(-)

... and the related "snapshot too old" patch has:

 16 files changed, 145 insertions(+), 19 deletions(-)

Given that these are going into the last CF and the related patch is really still at the "proof of concept" phase, it may be too late to consider them for PostgreSQL 9.5, but I would still like a serious review now, both in case people feel strongly about controlling bloat in the face of old snapshots, and to know whether to pursue this just for the Postgres Plus Advanced Server fork or for the community.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] This test maintains a certain transaction rate, with user think times on the connections, so performance isn't measured by a tps rate but by how much CPU is consumed to maintain that rate.